warc

package module
v0.0.0-...-96250af Latest
Published: Oct 4, 2021 License: AGPL-3.0 Imports: 21 Imported by: 1

README

warc

warc is an implementation of ISO 28500 1.0, the Web ARChive (WARC) specification. It provides readers, writers, and structs for working with WARC records.

From the spec:

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

Affero General Public License v3

Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

We use GitHub issues for tracking bugs and feature requests, and Pull Requests (PRs) for submitting changes.

Usage

import "github.com/datatogether/warc"

Documentation

Overview

Package warc is an implementation of ISO 28500 1.0, the Web ARChive (WARC) specification. It provides readers, writers, and structs for working with WARC records.

From the spec:

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

Index

Constants

const (
	// An identifier assigned to the current record that is globally unique for
	// its period of intended use. No identifier scheme is mandated by this
	// specification, but each record-id shall be a legal URI and clearly
	// indicate a documented and registered scheme to which it conforms (e.g.,
	// via a URI scheme prefix such as "http:" or "urn:"). Care should be taken
	// to ensure that this value is written with no internal whitespace.
	FieldNameWARCRecordID = "WARC-Record-ID"
	// The number of octets in the block, similar to [RFC2616]. If no block is
	// present, a value of '0' (zero) shall be used.
	FieldNameContentLength = "Content-Length"
	// A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ,
	// described in the W3C profile of ISO8601 [W3CDTF]. The timestamp shall
	// represent the instant that data capture for record creation began.
	// Multiple records written as part of a single capture event (see section
	// 5.7) shall use the same WARC-Date, even though the times of their
	// writing will not be exactly synchronized.
	FieldNameWARCDate = "WARC-Date"
	// The type of WARC record: one of 'warcinfo', 'response', 'resource',
	// 'request', 'metadata', 'revisit', 'conversion', or 'continuation'. Other
	// types of WARC records may be defined in extensions of the core format.
	// Types are further described in WARC Record Types.
	// A WARC file need not contain any particular record types, though
	// starting all WARC files with a "warcinfo" record is recommended.
	FieldNameWARCType = "WARC-Type"
	// The MIME type [RFC2045] of the information contained in the record's
	// block. For example, in HTTP request and response records, this would be
	// 'application/http' as per section 19.1 of [RFC2616] (or
	// 'application/http; msgtype=request' and 'application/http;
	// msgtype=response' respectively). In particular, the content-type is not
	// the value of the HTTP Content-Type header in an HTTP response but a MIME
	// type to describe the full archived HTTP message (hence
	// 'application/http' if the block contains request or response headers).
	FieldNameContentType = "Content-Type"
	// The WARC-Record-IDs of any records created as part of the same capture
	// event as the current record. A capture event comprises the information
	// automatically gathered by a retrieval against a single target-URI; for
	// example, it might be represented by a 'response' or 'revisit' record
	// plus its associated 'request' record.
	// This field may be used to associate records of types 'request',
	// 'response', 'resource', 'metadata', and 'revisit' with one another when
	// they arise from a single capture event. (When so used, any
	// WARC-Concurrent-To association shall be considered bidirectional even if
	// the header only appears on one record.) The WARC-Concurrent-To field
	// shall not be used in 'warcinfo', 'conversion', and 'continuation'
	// records.
	FieldNameWARCConcurrentTo = "WARC-Concurrent-To"
	// An optional parameter indicating the algorithm name and calculated value
	// of a digest applied to the full block of the record.
	// An example is a SHA-1 labelled Base32 ([RFC3548]) value:
	// WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
	FieldNameWARCBlockDigest = "WARC-Block-Digest"
	// An optional parameter indicating the algorithm name and calculated value
	// of a digest applied to the payload referred to or contained by the
	// record - which is not necessarily equivalent to the record block.
	// The payload of an application/http block is its 'entity-body' (per
	// [RFC2616]). In contrast to WARC-Block-Digest, the WARC-Payload-Digest
	// field may also be used for data not actually present in the current
	// record block, for example when a block is left off in accordance with a
	// 'revisit' profile (see 'revisit'), or when a record is segmented (the
	// WARC-Payload-Digest recorded in the first segment of a segmented record
	// shall be the digest of the payload of the logical record).
	FieldNameWARCPayloadDigest = "WARC-Payload-Digest"
	// The numeric Internet address contacted to retrieve any included content.
	// An IPv4 address shall be written as a "dotted quad"; an IPv6 address
	// shall be written as per [RFC1884]. For an HTTP retrieval, this will be
	// the IP address used at retrieval time corresponding to the hostname in
	// the record's target-URI.
	FieldNameWARCIPAddress = "WARC-IP-Address"
	// The WARC-Refers-To field may be used to associate a 'metadata' record to
	// another record it describes. The WARC-Refers-To field may also be used
	// to associate a record of type 'revisit' or 'conversion' with the
	// preceding record which helped determine the present record content. The
	// WARC-Refers-To field shall not be used in 'warcinfo', 'response',
	// 'resource', 'request', and 'continuation' records.
	FieldNameWARCRefersTo = "WARC-Refers-To"
	// The original URI whose capture gave rise to the information content in
	// this record. In the context of web harvesting, this is the URI that was
	// the target of a crawler's retrieval request. For a 'revisit' record, it
	// is the URI that was the target of a retrieval request.  Indirectly, such
	// as for a 'metadata', or 'conversion' record, it is a copy of the
	// WARC-Target-URI appearing in the original record to which the newer
	// record pertains. The URI in this value shall be properly escaped
	// according to [RFC3986] and written with no internal whitespace.
	FieldNameWARCTargetURI = "WARC-Target-URI"
	// For practical reasons, writers of the WARC format may place limits on
	// the time or storage allocated to archiving a single resource. As a
	// result, only a truncated portion of the original resource may be
	// available for saving into a WARC record.
	//
	// Any record may indicate that truncation of its content block has
	// occurred and give the reason with a 'WARC-Truncated' field.
	FieldNameWARCTruncated = "WARC-Truncated"
	// When present, indicates the WARC-Record-ID of the associated 'warcinfo'
	// record for this record. Typically, the Warcinfo-ID parameter is used
	// when the context of the applicable 'warcinfo' record is unavailable,
	// such as after distributing single records into separate WARC files. WARC
	// writing applications (such as web crawlers) may choose to always record
	// this parameter.
	FieldNameWARCWarcinfoID = "WARC-Warcinfo-ID"
	// The WARC-Filename field may be used in 'warcinfo' type records and shall
	// not be used for other record types.
	FieldNameWARCFilename = "WARC-Filename"
	// A URI signifying the kind of analysis and handling applied in a
	// 'revisit' record. (Like an XML namespace, the URI may, but need not,
	// return human-readable or machine-readable documentation.) If reading
	// software does not recognize the given URI as a supported kind of
	// handling, it shall not attempt to interpret the associated record block.
	FieldNameWARCProfile = "WARC-Profile"
	// The content-type of the record's payload as determined by an independent
	// check. This string shall not be arrived at by blindly promoting an HTTP
	// Content-Type value up from a record block into the WARC header without
	// direct analysis of the payload, as such values may often be unreliable.
	FieldNameWARCIdentifiedPayloadType = "WARC-Identified-Payload-Type"
	// Reports the current record's relative ordering in a sequence of
	// segmented records.
	// In the first segment of any record that is completed in one or more
	// later 'continuation' WARC records, this parameter is mandatory. Its
	// value there is "1". In a 'continuation' record, this parameter is also
	// mandatory. Its value is the sequence number of the current segment in
	// the logical whole record, increasing by 1 in each next segment.
	FieldNameWARCSegmentNumber = "WARC-Segment-Number"
	// Identifies the starting record in a series of segmented records whose
	// content blocks are reassembled to obtain a logically complete content
	// block.
	// This field is mandatory on all 'continuation' records, and shall not be
	// used in other records. See the section below, Record segmentation, for
	// full details on the use of WARC record segmentation.
	FieldNameWARCSegmentOriginID = "WARC-Segment-Origin-ID"
	// In the final record of a segmented series, reports the total length of
	// all segment content blocks when concatenated together.
	// This field is mandatory on the last 'continuation' record of a series,
	// and shall not be used elsewhere.
	FieldNameWARCSegmentTotalLength = "WARC-Segment-Total-Length"
)

Named fields within a WARC record provide information about the current record, and allow additional per-record information. WARC both reuses appropriate headers from other standards and defines new headers, all beginning "WARC-", for WARC-specific purposes.

WARC named fields of the same type shall not be repeated in the same WARC record (for example, a WARC record shall not have several WARC-Date or several WARC-Target-URI), except as noted (e.g., WARC-Concurrent-To).
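To make the layout of named fields concrete, here is a minimal sketch of how one record is serialized under WARC 1.0: a version line, one "Name: value" line per field, an empty line, the block, and two trailing CRLFs. The field type and the encodeRecord helper are hypothetical names used only for illustration; they are not part of this package, which exposes Record.Write for this purpose.

```go
package main

import (
	"bytes"
	"fmt"
)

// field is a hypothetical name/value pair. A slice of fields keeps the
// output order stable, which a Go map would not.
type field struct{ name, value string }

// encodeRecord lays out a single WARC/1.0 record. Content-Length is derived
// from the block so the header cannot disagree with the data.
func encodeRecord(fields []field, block []byte) []byte {
	var buf bytes.Buffer
	buf.WriteString("WARC/1.0\r\n")
	for _, f := range fields {
		fmt.Fprintf(&buf, "%s: %s\r\n", f.name, f.value)
	}
	fmt.Fprintf(&buf, "Content-Length: %d\r\n", len(block))
	buf.WriteString("\r\n")     // empty line ends the header
	buf.Write(block)            // the content block, byte for byte
	buf.WriteString("\r\n\r\n") // two CRLFs end the record
	return buf.Bytes()
}

func main() {
	rec := encodeRecord([]field{
		{"WARC-Type", "resource"},
		{"WARC-Record-ID", "<urn:uuid:00000000-0000-4000-8000-000000000000>"},
		{"WARC-Date", "2021-10-04T00:00:00Z"},
		{"Content-Type", "text/plain"},
	}, []byte("hello"))
	fmt.Printf("%q\n", rec)
}
```

Each field name appears once, matching the rule that named fields are not repeated within a record.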

const TimeFormat = "2006-01-02T15:04:05Z"

TimeFormat is time.RFC3339 with the timezone offset replaced by a literal "Z".
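Because the trailing Z in this layout is a literal character and not a zone token, a time has to be converted to UTC before formatting, or the output will carry a misleading Z. The sketch below repeats the layout string in a local constant so it runs with the standard library alone; formatWARCDate, parseWARCDate, and roundTrip are hypothetical helpers, and the zero-time fallback mirrors what is documented for Record.Date.

```go
package main

import (
	"fmt"
	"time"
)

// timeFormat repeats the layout string documented for warc.TimeFormat.
const timeFormat = "2006-01-02T15:04:05Z"

// formatWARCDate converts t to UTC first, because the layout prints a
// literal "Z" regardless of the zone t carries.
func formatWARCDate(t time.Time) string {
	return t.UTC().Format(timeFormat)
}

// parseWARCDate returns the zero time.Time when s does not match the layout.
func parseWARCDate(s string) time.Time {
	t, err := time.Parse(timeFormat, s)
	if err != nil {
		return time.Time{}
	}
	return t
}

// roundTrip parses and re-formats a timestamp.
func roundTrip(s string) string {
	return formatWARCDate(parseWARCDate(s))
}

func main() {
	zone := time.FixedZone("UTC+2", 2*60*60)
	local := time.Date(2021, 10, 4, 14, 30, 0, 0, zone)
	fmt.Println(formatWARCDate(local))               // 2021-10-04T12:30:00Z
	fmt.Println(parseWARCDate("yesterday").IsZero()) // true
}
```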

Variables

This section is empty.

Functions

func CanonicalKey

func CanonicalKey(key string) string

CanonicalKey conforms keys to the CanonicalMIMEHeaderKey form (Capitals-For-First-Letter-Separated-By-Dashes) for general input, with exceptions for the capitalized "WARC" header keys. The WARC 1.0 spec calls for case-insensitive header keys, but the spec's token diagrams list headers with a specific case, so the package accepts any case on read and writes records that match the spec's token diagrams.
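The following sketch shows one way such a rule could work; it is an illustration, not the package's actual implementation. The canonicalKey function and the warcSpellings table are hypothetical, and the table lists only a few of the FieldName constants. Anything not in the table falls back to textproto.CanonicalMIMEHeaderKey from the standard library.

```go
package main

import (
	"fmt"
	"net/textproto"
	"strings"
)

// warcSpellings maps lower-cased field names to the spelling used by the
// FieldName constants. Only a handful of entries are listed in this sketch.
var warcSpellings = map[string]string{
	"warc-record-id":  "WARC-Record-ID",
	"warc-date":       "WARC-Date",
	"warc-type":       "WARC-Type",
	"warc-ip-address": "WARC-IP-Address",
	"warc-target-uri": "WARC-Target-URI",
}

// canonicalKey accepts a key in any case. Known WARC fields get their
// spec spelling; everything else gets the usual MIME header form.
func canonicalKey(key string) string {
	if spelled, ok := warcSpellings[strings.ToLower(key)]; ok {
		return spelled
	}
	return textproto.CanonicalMIMEHeaderKey(key)
}

func main() {
	fmt.Println(canonicalKey("warc-record-id")) // WARC-Record-ID
	fmt.Println(canonicalKey("content-length")) // Content-Length
}
```

The exception table is needed because the plain MIME form of "warc-record-id" would be "Warc-Record-Id", which does not match the spelling in the spec.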

func CountWriter

func CountWriter(w io.Writer) io.WriteSeeker

CountWriter implements a limited version of io.Seeker around the provided Writer. It only supports offset == 0 and whence == io.SeekCurrent or io.SeekEnd, and returns the current number of written bytes in both cases.
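A wrapper with the behaviour described above can be sketched in a few lines. The countingWriter type and the countAfter helper below are hypothetical and are not the package's code; they only illustrate how a byte counter can answer Seek(0, io.SeekCurrent) for a stream that cannot really seek.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// countingWriter forwards writes to w and remembers how many bytes went out.
type countingWriter struct {
	w io.Writer
	n int64
}

// Compile-time check that the sketch satisfies io.WriteSeeker.
var _ io.WriteSeeker = (*countingWriter)(nil)

func (c *countingWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	c.n += int64(n)
	return n, err
}

// Seek only answers "where am I?"; any request that would move the
// position is rejected.
func (c *countingWriter) Seek(offset int64, whence int) (int64, error) {
	if offset != 0 || (whence != io.SeekCurrent && whence != io.SeekEnd) {
		return 0, errors.New("countingWriter: only Seek(0, SeekCurrent or SeekEnd) is supported")
	}
	return c.n, nil
}

// countAfter writes the chunks to a discarding writer and reports the
// position afterwards.
func countAfter(chunks ...string) int64 {
	cw := &countingWriter{w: io.Discard}
	for _, chunk := range chunks {
		io.WriteString(cw, chunk)
	}
	pos, _ := cw.Seek(0, io.SeekCurrent)
	return pos
}

func main() {
	fmt.Println(countAfter("WARC/1.0\r\n", "WARC-Type: warcinfo\r\n"))
}
```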

func NewRequestResponseRecords

func NewRequestResponseRecords(info CaptureHelper, req *http.Request, resp *http.Response) (Record, Record, error)

NewRequestResponseRecords creates a new request/response record pair for the provided HTTP request and response.

Make sure to provide the request Body in the CaptureHelper so it can be read from again. The response Body should not yet have been used; if the caller needs the body, replace it with an ioutil.NopCloser(io.TeeReader) (the caller is then responsible for calling body.Close()).
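The tee pattern mentioned above can be sketched with the standard library alone. teeBody and teeDemo are hypothetical helpers; io.NopCloser is the current home of ioutil.NopCloser, and the io.Copy call merely stands in for whatever consumes the response body (here, that would be NewRequestResponseRecords).

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// teeBody swaps resp.Body for a reader that copies every byte it yields
// into the returned buffer. The original body is handed back because the
// NopCloser wrapper will not close it; that is now the caller's job.
func teeBody(resp *http.Response) (seen *bytes.Buffer, orig io.Closer) {
	seen = &bytes.Buffer{}
	orig = resp.Body
	resp.Body = io.NopCloser(io.TeeReader(resp.Body, seen))
	return seen, orig
}

// teeDemo builds a response by hand, lets a consumer drain it, and returns
// what the caller's buffer captured along the way.
func teeDemo(body string) string {
	resp := &http.Response{Body: io.NopCloser(strings.NewReader(body))}
	seen, orig := teeBody(resp)
	defer orig.Close()

	io.Copy(io.Discard, resp.Body) // stand-in for the record constructor
	return seen.String()
}

func main() {
	fmt.Println(teeDemo("<html>hi</html>"))
}
```

The buffer only fills as the body is read, so it is complete once the consumer has drained the body.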

func NewUUID

func NewUUID() string

NewUUID generates a new version 4 UUID.

func Sanitize

func Sanitize(contentSniff string, body []byte) (sanitized []byte, err error)

Sanitize removes any data from a warc record body that may interfere with parsing

func Sha1Digest

func Sha1Digest(data []byte) string

Sha1Digest calculates the SHA-1 checksum of a slice of bytes.
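The comment on FieldNameWARCBlockDigest gives a SHA-1 value labelled "sha1:" and encoded as Base32 as an example digest. The sketch below produces a value of that shape with the standard library. Whether Sha1Digest returns exactly this encoding is not stated here, so treat labelledSHA1 as a hypothetical stand-in.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"fmt"
)

// labelledSHA1 returns "sha1:" followed by the Base32 form of the digest.
// A SHA-1 sum is 20 bytes, which encodes to exactly 32 Base32 characters
// with no padding.
func labelledSHA1(data []byte) string {
	sum := sha1.Sum(data)
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	fmt.Println("WARC-Block-Digest:", labelledSHA1([]byte("hello")))
}
```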

func WriteHTTPHeaders

func WriteHTTPHeaders(w io.Writer, headers http.Header) error

WriteHTTPHeaders writes all HTTP headers to an io.Writer, separated by newlines. It is used to add HTTP headers to a record.

func WriteRecords

func WriteRecords(w io.Writer, records Records) error

WriteRecords calls Write on each record, writing to w. Deprecated: see the Writer type.

func WriteRequestMethodAndHeaders

func WriteRequestMethodAndHeaders(w io.Writer, req *http.Request) error

WriteRequestMethodAndHeaders calls req.Write(w). Deprecated: see NewRequestResponseRecords.

Types

type CaptureHelper

type CaptureHelper struct {
	WarcinfoID string
	RemoteAddr string

	// The request body will need to be read multiple times, so please provide
	// one of the following.  (note: bytes.Reader and strings.Reader are
	// ReadSeekers.)
	ReqBodyReadSeeker  io.ReadSeeker
	ReqBodyBytesBuffer *bytes.Buffer
}

CaptureHelper is used by the NewRequestResponseRecords function. Additional fields may be added in the future.

func (*CaptureHelper) DialContext

func (c *CaptureHelper) DialContext(dialer *net.Dialer) func(ctx context.Context, network, addr string) (net.Conn, error)

DialContext returns a wrapper around net.DialContext that saves the connected-to IP in the CaptureHelper.

type Header

type Header map[string]string

Header mimics net/http's Header type, but with plain string values. Use the Get and Set methods instead of accessing the map directly.

func (Header) Get

func (h Header) Get(key string) string

Get a key from the header map

func (Header) Set

func (h Header) Set(key, value string)

Set a key on the header map

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader parses WARC records from an underlying scanner. Create a new reader with NewReader.

func NewReader

func NewReader(r io.Reader) (*Reader, error)

NewReader creates a new WARC reader from an io.Reader. Always use NewReader instead of manually allocating a Reader.

func (*Reader) Read

func (r *Reader) Read() (Record, error)

Read reads the next record. It returns io.EOF to signal that there are no more records.

func (*Reader) ReadAll

func (r *Reader) ReadAll() (records Records, err error)

ReadAll consumes the entire reader, returning a slice of records.

type Record

type Record struct {
	Format  RecordFormat
	Type    RecordType
	Headers Header
	Content *bytes.Buffer
}

A Record consists of a version indicator (e.g. WARC/1.0), zero or more headers, and possibly a content block. Upgrades to specific types of records can be done using type assertions and/or the Type field.

func UnmarshalRecord

func UnmarshalRecord(data []byte) (Record, error)

UnmarshalRecord reads a single record from data

func (*Record) Body

func (r *Record) Body() ([]byte, error)

Body returns a record's body with any HTTP headers omitted
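For an application/http block, omitting the HTTP headers comes down to finding the empty line that ends them. The sketch below assumes the headers end at the first CRLF CRLF sequence and that a block without such a separator is all body; both are assumptions of this illustration, and splitHTTPBlock is a hypothetical helper, not the package's code.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitHTTPBlock separates the HTTP message head (start line plus headers)
// from the entity body at the first empty line.
func splitHTTPBlock(block []byte) (head, body []byte) {
	sep := []byte("\r\n\r\n")
	i := bytes.Index(block, sep)
	if i < 0 {
		// No header terminator found: treat the whole block as body.
		return nil, block
	}
	return block[:i], block[i+len(sep):]
}

func main() {
	block := []byte("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello")
	head, body := splitHTTPBlock(block)
	fmt.Printf("head=%q\nbody=%q\n", head, body)
}
```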

func (*Record) Bytes

func (r *Record) Bytes() ([]byte, error)

Bytes returns the record formatted as a byte slice

func (*Record) ContentLength

func (r *Record) ContentLength() int

ContentLength of content block in bytes, returns 0 if Content-Length header is missing or invalid
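The fallback described above (0 for a missing or invalid header) can be sketched as follows. contentLength is a hypothetical helper working on a plain map; treating negative values as invalid is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// contentLength parses the Content-Length value and falls back to 0 when
// the header is absent, not a number, or negative.
func contentLength(headers map[string]string) int {
	n, err := strconv.Atoi(strings.TrimSpace(headers["Content-Length"]))
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	fmt.Println(contentLength(map[string]string{"Content-Length": "42"}))  // 42
	fmt.Println(contentLength(map[string]string{"Content-Length": "n/a"})) // 0
}
```

Because 0 is also a legal length, callers that need to tell "empty block" from "bad header" have to inspect the header value themselves.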

func (*Record) Date

func (r *Record) Date() time.Time

Date gives the time.Time of record creation. It returns the zero time if no WARC-Date header is present or if the header holds an invalid timestamp.

func (*Record) ID

func (r *Record) ID() string

ID gives the ID for this record.

func (*Record) SetBody

func (r *Record) SetBody(body []byte) error

SetBody sets the body of the record, leaving any written HTTP headers in the record.

func (*Record) TargetURI

func (r *Record) TargetURI() string

TargetURI is a convenience method for getting the uri that this record is targeting

func (*Record) Write

func (r *Record) Write(w io.Writer) error

Write this record to the given writer.

Automatically handles the Content-Length and WARC-Type headers, as well as WARC-Block-Digest for Response and Revisit records.

type RecordFormat

type RecordFormat int

RecordFormat distinguishes different formats for records. It exists to allow later support of ARC files, should that be needed.

const (
	// RecordFormatWarc default is the Warc Format 1.0
	RecordFormatWarc RecordFormat = iota
	// RecordFormatUnknown represents an unknown / errored record format
	RecordFormatUnknown
)

func (RecordFormat) String

func (r RecordFormat) String() string

type RecordType

type RecordType int

RecordType enumerates different types of WARC Records

const (
	// RecordTypeUnknown is the default type of record, which shouldn't be
	// accepted by anything that wants to know a type of record.
	RecordTypeUnknown RecordType = iota
	// RecordTypeWarcInfo describes the records that follow it, up through end
	// of file, end of input, or until next 'warcinfo' record. Typically, this
	// appears once and at the beginning of a WARC file. For a web archive, it
	// often contains information about the web crawl which generated the
	// following records.
	// The format of this descriptive record block may vary, though the use of
	// the "application/warc-fields" content-type is recommended. Allowable
	// fields include, but are not limited to, all [DCMI] plus the following
	// field definitions. All fields are optional.
	RecordTypeWarcInfo
	// RecordTypeResponse should contain a complete scheme-specific response,
	// including network protocol information where possible. The exact
	// contents of a 'response' record are determined not just by the record
	// type but also by the URI scheme of the record's target-URI, as described
	// below.
	RecordTypeResponse
	// RecordTypeResource contains a resource, without full protocol response
	// information. For example: a file directly retrieved from a locally
	// accessible repository or the result of a networked retrieval where the
	// protocol information has been discarded. The exact contents of a
	// 'resource' record are determined not just by the record type but also by
	// the URI scheme of the record's target-URI, as described below.
	// For all 'resource' records, the payload is defined as the record block.
	// A 'resource' record, with a synthesized target-URI, may also be used to
	// archive other artefacts of a harvesting process inside WARC files.
	RecordTypeResource
	// RecordTypeRequest holds the details of a complete scheme-specific
	// request, including network protocol information where possible. The
	// exact contents of a 'request' record are determined not just by the
	// record type but also by the URI scheme of the record's target-URI, as
	// described below.
	RecordTypeRequest
	// RecordTypeMetadata contains content created in order to further
	// describe, explain, or accompany a harvested resource, in ways not
	// covered by other record types. A 'metadata' record will almost always
	// refer to another record of another type, with that other record holding
	// original harvested or transformed content. (However, it is allowable for
	// a 'metadata' record to refer to any record type, including other
	// 'metadata' records.) Any number of metadata records may reference one
	// specific other record.
	// The format of the metadata record block may vary. The
	// "application/warc-fields" format, defined earlier, may be used.
	// Allowable fields include all [DCMI] plus the following field
	// definitions. All fields are optional.
	RecordTypeMetadata
	// RecordTypeRevisit describes the revisitation of content already
	// archived, and might include only an abbreviated content body which has
	// to be interpreted relative to a previous record. Most typically, a
	// 'revisit' record is used instead of a 'response' or 'resource' record to
	// indicate that the content visited was either a complete or substantial
	// duplicate of material previously archived.
	// Using a 'revisit' record instead of another type is optional, for when
	// benefits of reduced storage size or improved cross-referencing of
	// material are desired.
	RecordTypeRevisit
	// RecordTypeConversion shall contain an alternative version of another
	// record's content that was created as the result of an archival process.
	// Typically, this is used to hold content transformations that maintain
	// viability of content after widely available rendering tools for the
	// originally stored format disappear. As needed, the original content may
	// be migrated (transformed) to a more viable format in order to keep the
	// information usable with current tools while minimizing loss of
	// information (intellectual content, look and feel, etc). Any number of
	// 'conversion' records may be created that reference a specific source
	// record, which may itself contain transformed content. Each
	// transformation should result in a freestanding, complete record, with no
	// dependency on survival of the original record.
	// Metadata records may be used to further describe transformation records.
	// Wherever practical, a 'conversion' record should contain a
	// 'WARC-Refers-To' field to identify the prior material converted.
	RecordTypeConversion
	// RecordTypeContinuation blocks from 'continuation' records must be appended to
	// corresponding prior record block(s) (e.g., from other WARC files) to
	// create the logically complete full-sized original record. That is,
	// 'continuation' records are used when a record that would otherwise cause
	// a WARC file size to exceed a desired limit is broken into segments. A
	// continuation record shall contain the named fields
	// 'WARC-Segment-Origin-ID' and 'WARC-Segment-Number', and the last
	// 'continuation' record of a series shall contain a
	// 'WARC-Segment-Total-Length' field. The full details of WARC record
	// segmentation are described in the below section Record Segmentation. See
	// also annex C.8 below for an example of a 'continuation' record.
	RecordTypeContinuation
)

func ParseRecordType

func ParseRecordType(s string) RecordType

ParseRecordType parses a RecordType from a string

func (RecordType) String

func (r RecordType) String() string

String makes RecordType satisfy the fmt.Stringer interface.
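An iota-based enumeration with a matching name table is a common way to get both directions (string to type and type to string). The sketch below follows the constant order listed above; the lower-case identifiers, the "unknown" label, and the exact-match parsing are assumptions of this illustration and may differ from the package's behaviour.

```go
package main

import "fmt"

type recordType int

const (
	typeUnknown recordType = iota
	typeWarcInfo
	typeResponse
	typeResource
	typeRequest
	typeMetadata
	typeRevisit
	typeConversion
	typeContinuation
)

// typeNames is indexed by recordType; the names are the WARC-Type values
// listed in the spec.
var typeNames = [...]string{
	typeUnknown:      "unknown",
	typeWarcInfo:     "warcinfo",
	typeResponse:     "response",
	typeResource:     "resource",
	typeRequest:      "request",
	typeMetadata:     "metadata",
	typeRevisit:      "revisit",
	typeConversion:   "conversion",
	typeContinuation: "continuation",
}

// String makes recordType a fmt.Stringer. Out-of-range values print as
// "unknown".
func (t recordType) String() string {
	if t < 0 || int(t) >= len(typeNames) {
		return typeNames[typeUnknown]
	}
	return typeNames[t]
}

// parseRecordType maps a WARC-Type value to its constant and falls back to
// typeUnknown for anything it does not recognize.
func parseRecordType(s string) recordType {
	for i, name := range typeNames {
		if name == s {
			return recordType(i)
		}
	}
	return typeUnknown
}

func main() {
	fmt.Println(parseRecordType("revisit"))         // revisit
	fmt.Println(parseRecordType("something-else")) // unknown
}
```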

type Records

type Records []*Record

Records provides utility functions for slices of records.

A WARC format file is the simple concatenation of one or more WARC records. The first record usually describes the records to follow. In general, record content is either the direct result of a retrieval attempt — web pages, inline images, URL redirection information, DNS hostname lookup results, standalone files, etc. — or is synthesized material (e.g., metadata, transformed content) that provides additional information about archived content.

func UnmarshalRecords

func UnmarshalRecords(data []byte) (Records, error)

UnmarshalRecords reads a slice of records from a slice of bytes

func (Records) FilterTypes

func (rs Records) FilterTypes(types ...RecordType) Records

FilterTypes returns all records whose type matches one of the provided RecordTypes.
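Filtering by type is a plain pass over the slice. The sketch below uses a hypothetical record struct with a string type so it runs without the package; it keeps the input order and returns the matching pointers without copying the records.

```go
package main

import "fmt"

// record is a hypothetical stand-in carrying only what the filter needs.
type record struct {
	typ string
	uri string
}

// filterTypes returns the records whose type is one of types, preserving
// their original order.
func filterTypes(rs []*record, types ...string) []*record {
	want := make(map[string]bool, len(types))
	for _, t := range types {
		want[t] = true
	}
	var out []*record
	for _, r := range rs {
		if want[r.typ] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rs := []*record{
		{typ: "warcinfo"},
		{typ: "request", uri: "http://example.com/"},
		{typ: "response", uri: "http://example.com/"},
	}
	for _, r := range filterTypes(rs, "request", "response") {
		fmt.Println(r.typ, r.uri)
	}
}
```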

func (Records) RemoveTargetURIRecords

func (rs Records) RemoveTargetURIRecords(uri string) (recs Records)

RemoveTargetURIRecords returns a Records slice with all records that refer to uri removed

func (Records) TargetURIRecord

func (rs Records) TargetURIRecord(uri string, types ...RecordType) *Record

TargetURIRecord returns a record matching uri optionally filtered by a list of record types. There are a number of "gotchas" if multiple record types of the same url are in the list. TODO - eliminate "gotchas"

type Writer

type Writer struct {

	// RecordCallback will be called after each record is written to the file.
	// If a WriteSeeker was not provided, the provided positions will be
	// invalid.
	RecordCallback func(r *Record, startPos, endPos int64)
	// contains filtered or unexported fields
}

Writer provides functionality for writing WARC files in compressed and uncompressed formats.

To construct a Writer, call NewWriterCompressed or NewWriterRaw.

func NewWriterCompressed

func NewWriterCompressed(rawFile io.WriteSeeker, cmprsWriter *gzip.Writer) (*Writer, error)

NewWriterCompressed initializes a WARC Writer writing to a compressed stream. The first parameter should be the "backing stream" of the compression. The second parameter is a compress/gzip writer writing to the rawFile parameter.

Seek will only be called with whence == io.SeekCurrent and offset == 0.

See also CountWriter() if you need a "fake" Seek implementation.
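The reason for passing both the backing stream and the gzip writer is that positions have to be measured on the raw file, not on the uncompressed data. The sketch below shows the idea with the standard library only: each chunk is written as its own gzip member, and Seek(0, io.SeekCurrent) on the raw file gives the offsets before and after. Whether this package's Writer compresses each record as a separate member is not stated here, so that part is an assumption; writeMember and secondMember are hypothetical helpers.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

// writeMember compresses data as one complete gzip member and reports the
// raw-file offsets at which that member starts and ends.
func writeMember(raw io.WriteSeeker, gz *gzip.Writer, data []byte) (start, end int64, err error) {
	if start, err = raw.Seek(0, io.SeekCurrent); err != nil {
		return 0, 0, err
	}
	gz.Reset(raw) // begin a fresh member on the same backing stream
	if _, err = gz.Write(data); err != nil {
		return 0, 0, err
	}
	if err = gz.Close(); err != nil { // flushes and writes the gzip trailer
		return 0, 0, err
	}
	end, err = raw.Seek(0, io.SeekCurrent)
	return start, end, err
}

// secondMember writes two members to a temporary file, then seeks to the
// recorded offset of the second one and decodes only that member.
func secondMember() (string, error) {
	f, err := os.CreateTemp("", "sketch-*.warc.gz")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	gz := gzip.NewWriter(f)
	if _, _, err = writeMember(f, gz, []byte("first")); err != nil {
		return "", err
	}
	start, _, err := writeMember(f, gz, []byte("second"))
	if err != nil {
		return "", err
	}

	if _, err = f.Seek(start, io.SeekStart); err != nil {
		return "", err
	}
	zr, err := gzip.NewReader(f)
	if err != nil {
		return "", err
	}
	zr.Multistream(false) // stop at the end of this member
	data, err := io.ReadAll(zr)
	return string(data), err
}

func main() {
	fmt.Println(secondMember())
}
```

Offsets recorded this way are what make random access into a compressed archive possible, which is also the kind of information RecordCallback receives.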

func NewWriterRaw

func NewWriterRaw(out io.Writer) (*Writer, error)

NewWriterRaw initializes a WARC Writer writing to an uncompressed stream. If the provided Writer implements io.Seeker, the RecordCallback will be available. If the provided Writer implements interface{Flush() error}, it will be flushed after every written Record.

See also CountWriter() if you need a "fake" Seek implementation.

func (*Writer) Close

func (w *Writer) Close() error

Close cleans up any resources the warc.Writer might be holding on to.

func (*Writer) WriteRecord

func (w *Writer) WriteRecord(rec *Record) (startPos, endPos int64, err error)

WriteRecord adds the record to the WARC file and returns the file offsets the record was written at.

No processing is done to the Record contents beyond that mentioned in Record.Write. Clients that want extra processing (e.g. setting the WARC-Warcinfo-ID header) are encouraged to create a wrapper.
