Documentation ¶
Overview ¶
Package warc is an implementation of ISO 28500 1.0, the WARC (Web ARChive) specification. It provides readers, writers, and structs for working with WARC records. From the spec: The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.
Index ¶
- Constants
- func CanonicalKey(key string) string
- func CountWriter(w io.Writer) io.WriteSeeker
- func NewRequestResponseRecords(info CaptureHelper, req *http.Request, resp *http.Response) (Record, Record, error)
- func NewUUID() string
- func Sanitize(contentSniff string, body []byte) (sanitized []byte, err error)
- func Sha1Digest(data []byte) string
- func WriteHTTPHeaders(w io.Writer, headers http.Header) error
- func WriteRecords(w io.Writer, records Records) error
- func WriteRequestMethodAndHeaders(w io.Writer, req *http.Request) error
- type CaptureHelper
- type Header
- type Reader
- type Record
- func (r *Record) Body() ([]byte, error)
- func (r *Record) Bytes() ([]byte, error)
- func (r *Record) ContentLength() int
- func (r *Record) Date() time.Time
- func (r *Record) ID() string
- func (r *Record) SetBody(body []byte) error
- func (r *Record) TargetURI() string
- func (r *Record) Write(w io.Writer) error
- type RecordFormat
- type RecordType
- type Records
- type Writer
Constants ¶
const (
	// An identifier assigned to the current record that is globally unique for
	// its period of intended use. No identifier scheme is mandated by this
	// specification, but each record-id shall be a legal URI and clearly
	// indicate a documented and registered scheme to which it conforms (e.g.,
	// via a URI scheme prefix such as "http:" or "urn:"). Care should be taken
	// to ensure that this value is written with no internal whitespace.
	FieldNameWARCRecordID = "WARC-Record-ID"
	// The number of octets in the block, similar to [RFC2616]. If no block is
	// present, a value of '0' (zero) shall be used.
	FieldNameContentLength = "Content-Length"
	// A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ,
	// described in the W3C profile of ISO8601 [W3CDTF]. The timestamp shall
	// represent the instant that data capture for record creation began.
	// Multiple records written as part of a single capture event (see section
	// 5.7) shall use the same WARC-Date, even though the times of their
	// writing will not be exactly synchronized.
	FieldNameWARCDate = "WARC-Date"
	// The type of WARC record: one of 'warcinfo', 'response', 'resource',
	// 'request', 'metadata', 'revisit', 'conversion', or 'continuation'. Other
	// types of WARC records may be defined in extensions of the core format.
	// Types are further described in WARC Record Types.
	// A WARC file need not contain any particular record types, though
	// starting all WARC files with a "warcinfo" record is recommended.
	FieldNameWARCType = "WARC-Type"
	// The MIME type [RFC2045] of the information contained in the record's
	// block. For example, in HTTP request and response records, this would be
	// 'application/http' as per section 19.1 of [RFC2616] (or
	// 'application/http; msgtype=request' and 'application/http;
	// msgtype=response' respectively). In particular, the content-type is not
	// the value of the HTTP Content-Type header in an HTTP response but a MIME
	// type to describe the full archived HTTP message (hence
	// 'application/http' if the block contains request or response headers).
	FieldNameContentType = "Content-Type"
	// The WARC-Record-IDs of any records created as part of the same capture
	// event as the current record. A capture event comprises the information
	// automatically gathered by a retrieval against a single target-URI; for
	// example, it might be represented by a 'response' or 'revisit' record
	// plus its associated 'request' record.
	// This field may be used to associate records of types 'request',
	// 'response', 'resource', 'metadata', and 'revisit' with one another when
	// they arise from a single capture event (When so used, any
	// WARC-Concurrent-To association shall be considered bidirectional even if
	// the header only appears on one record.) The WARC-Concurrent-To field
	// shall not be used in 'warcinfo', 'conversion', and 'continuation'
	// records.
	FieldNameWARCConcurrentTo = "WARC-Concurrent-To"
	// An optional parameter indicating the algorithm name and calculated value
	// of a digest applied to the full block of the record.
	// An example is a SHA-1 labelled Base32 ([RFC3548]) value:
	// WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ
	FieldNameWARCBlockDigest = "WARC-Block-Digest"
	// An optional parameter indicating the algorithm name and calculated value
	// of a digest applied to the payload referred to or contained by the
	// record - which is not necessarily equivalent to the record block.
	// The payload of an application/http block is its 'entity-body' (per
	// [RFC2616]). In contrast to WARC-Block-Digest, the WARC-Payload-Digest
	// field may also be used for data not actually present in the current
	// record block, for example when a block is left off in accordance with a
	// 'revisit' profile (see 'revisit'), or when a record is segmented (the
	// WARC-Payload-Digest recorded in the first segment of a segmented record
	// shall be the digest of the payload of the logical record).
	FieldNameWARCPayloadDigest = "WARC-Payload-Digest"
	// The numeric Internet address contacted to retrieve any included content.
	// An IPv4 address shall be written as a "dotted quad"; an IPv6 address
	// shall be written as per [RFC1884]. For an HTTP retrieval, this will be
	// the IP address used at retrieval time corresponding to the hostname in
	// the record's target-URI.
	FieldNameWARCIPAddress = "WARC-IP-Address"
	// The WARC-Refers-To field may be used to associate a 'metadata' record to
	// another record it describes. The WARC-Refers-To field may also be used
	// to associate a record of type 'revisit' or 'conversion' with the
	// preceding record which helped determine the present record content. The
	// WARC-Refers-To field shall not be used in 'warcinfo', 'response',
	// 'resource', 'request', and 'continuation' records.
	FieldNameWARCRefersTo = "WARC-Refers-To"
	// The original URI whose capture gave rise to the information content in
	// this record. In the context of web harvesting, this is the URI that was
	// the target of a crawler's retrieval request. For a 'revisit' record, it
	// is the URI that was the target of a retrieval request. Indirectly, such
	// as for a 'metadata', or 'conversion' record, it is a copy of the
	// WARC-Target-URI appearing in the original record to which the newer
	// record pertains. The URI in this value shall be properly escaped
	// according to [RFC3986] and written with no internal whitespace.
	FieldNameWARCTargetURI = "WARC-Target-URI"
	// For practical reasons, writers of the WARC format may place limits on
	// the time or storage allocated to archiving a single resource. As a
	// result, only a truncated portion of the original resource may be
	// available for saving into a WARC record.
	//
	// Any record may indicate that truncation of its content block has
	// occurred and give the reason with a 'WARC-Truncated' field.
	FieldNameWARCTruncated = "WARC-Truncated"
	// When present, indicates the WARC-Record-ID of the associated 'warcinfo'
	// record for this record. Typically, the Warcinfo-ID parameter is used
	// when the context of the applicable 'warcinfo' record is unavailable,
	// such as after distributing single records into separate WARC files. WARC
	// writing applications (such as web crawlers) may choose to always record
	// this parameter.
	FieldNameWARCWarcinfoID = "WARC-Warcinfo-ID"
	// The WARC-Filename field may be used in 'warcinfo' type records and shall
	// not be used for other record types.
	FieldNameWARCFilename = "WARC-Filename"
	// A URI signifying the kind of analysis and handling applied in a
	// 'revisit' record. (Like an XML namespace, the URI may, but need not,
	// return human-readable or machine-readable documentation.) If reading
	// software does not recognize the given URI as a supported kind of
	// handling, it shall not attempt to interpret the associated record block.
	FieldNameWARCProfile = "WARC-Profile"
	// The content-type of the record's payload as determined by an independent
	// check. This string shall not be arrived at by blindly promoting an HTTP
	// Content-Type value up from a record block into the WARC header without
	// direct analysis of the payload, as such values may often be unreliable.
	FieldNameWARCIdentifiedPayloadType = "WARC-Identified-Payload-Type"
	// Reports the current record's relative ordering in a sequence of
	// segmented records.
	// In the first segment of any record that is completed in one or more
	// later 'continuation' WARC records, this parameter is mandatory. Its
	// value there is "1". In a 'continuation' record, this parameter is also
	// mandatory. Its value is the sequence number of the current segment in
	// the logical whole record, increasing by 1 in each next segment.
	FieldNameWARCSegmentNumber = "WARC-Segment-Number"
	// Identifies the starting record in a series of segmented records whose
	// content blocks are reassembled to obtain a logically complete content
	// block.
	// This field is mandatory on all 'continuation' records, and shall not be
	// used in other records. See the section below, Record segmentation, for
	// full details on the use of WARC record segmentation.
	FieldNameWARCSegmentOriginID = "WARC-Segment-Origin-ID"
	// In the final record of a segmented series, reports the total length of
	// all segment content blocks when concatenated together.
	// This field is mandatory on the last 'continuation' record of a series,
	// and shall not be used elsewhere.
	FieldNameWARCSegmentTotalLength = "WARC-Segment-Total-Length"
)
Named fields within a WARC record provide information about the current record, and allow additional per-record information. WARC both reuses appropriate headers from other standards and defines new headers, all beginning "WARC-", for WARC-specific purposes.
WARC named fields of the same type shall not be repeated in the same WARC record (for example, a WARC record shall not have several WARC-Date or several WARC-Target-URI), except as noted (e.g., WARC-Concurrent-To).
const TimeFormat = "2006-01-02T15:04:05Z"
TimeFormat is time.RFC3339, but with the numeric zone offset replaced by a literal Z.
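A minimal sketch of how a layout like this behaves with the standard time package. The constant is copied locally and the helper names formatWARCDate and parseWARCDate are hypothetical, not part of this package. Because the trailing Z in the layout is a literal character rather than a zone verb, the sketch converts to UTC before formatting.

```go
package main

import (
	"fmt"
	"time"
)

// timeFormat is a local copy of the layout documented above.
const timeFormat = "2006-01-02T15:04:05Z"

// formatWARCDate (hypothetical helper) renders t in the layout. The "Z" is
// written verbatim, so t is converted to UTC first to keep the value honest.
func formatWARCDate(t time.Time) string {
	return t.UTC().Format(timeFormat)
}

// parseWARCDate (hypothetical helper) returns the zero time.Time when s does
// not match the layout.
func parseWARCDate(s string) time.Time {
	t, err := time.Parse(timeFormat, s)
	if err != nil {
		return time.Time{}
	}
	return t
}

func main() {
	est := time.FixedZone("EST", -5*60*60)
	t := time.Date(2020, time.January, 2, 3, 4, 5, 0, est)
	fmt.Println(formatWARCDate(t))                    // 2020-01-02T08:04:05Z
	fmt.Println(parseWARCDate("not a date").IsZero()) // true
}
```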
Variables ¶
This section is empty.
Functions ¶
func CanonicalKey ¶
CanonicalKey conforms keys to CanonicalMIMEHeaderKey form (Capitals-For-First-Letter-Separated-By-Dashes) for any general input, with exceptions for capitalized "WARC" header keys. The WARC 1.0 spec calls for case-insensitive header keys, but the spec token diagrams list headers as being case-sensitive, so we honor any case on read but write records that match the spec token diagrams.
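One way such a canonicalization could work, sketched with net/textproto. The function canonicalKey and the specSpelling table are illustrative assumptions, not this package's implementation; the table only covers field names listed in the Constants section whose spelling differs from the MIME form beyond the WARC prefix.

```go
package main

import (
	"fmt"
	"net/textproto"
	"strings"
)

// specSpelling (hypothetical) maps MIME-canonical names to the spelling used
// by the field-name constants above, for names ending in ID, URI, or IP.
var specSpelling = map[string]string{
	"Warc-Record-Id":         "WARC-Record-ID",
	"Warc-Target-Uri":        "WARC-Target-URI",
	"Warc-Ip-Address":        "WARC-IP-Address",
	"Warc-Warcinfo-Id":       "WARC-Warcinfo-ID",
	"Warc-Segment-Origin-Id": "WARC-Segment-Origin-ID",
}

// canonicalKey (hypothetical) accepts any case and returns the spec spelling.
func canonicalKey(key string) string {
	k := textproto.CanonicalMIMEHeaderKey(key)
	if fixed, ok := specSpelling[k]; ok {
		return fixed
	}
	if strings.HasPrefix(k, "Warc-") {
		return "WARC-" + k[len("Warc-"):]
	}
	return k
}

func main() {
	for _, k := range []string{"warc-record-id", "WARC-DATE", "content-type"} {
		fmt.Println(canonicalKey(k))
	}
}
```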
func CountWriter ¶
func CountWriter(w io.Writer) io.WriteSeeker
CountWriter implements a limited version of io.Seeker around the provided Writer. It only supports offset == 0 and whence == io.SeekCurrent or io.SeekEnd, and returns the current number of written bytes in both cases.
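The behavior described above can be approximated with a small wrapper type. This is a sketch under assumptions: countingWriter and demo are hypothetical names, and the real function may differ in its error handling.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// countingWriter (hypothetical) counts bytes written to w and reports that
// count from a restricted Seek.
type countingWriter struct {
	w io.Writer
	n int64
}

func (c *countingWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	c.n += int64(n)
	return n, err
}

// Seek only answers "where am I?": offset must be 0 and whence must be
// io.SeekCurrent or io.SeekEnd. Anything else is rejected.
func (c *countingWriter) Seek(offset int64, whence int) (int64, error) {
	if offset != 0 || (whence != io.SeekCurrent && whence != io.SeekEnd) {
		return 0, errors.New("countingWriter: unsupported seek")
	}
	return c.n, nil
}

// demo writes ten bytes, then tries one supported and one unsupported seek.
func demo() (pos int64, seekErr error) {
	cw := &countingWriter{w: io.Discard}
	io.WriteString(cw, "WARC/1.0\r\n")
	pos, _ = cw.Seek(0, io.SeekCurrent)
	_, seekErr = cw.Seek(0, io.SeekStart)
	return pos, seekErr
}

func main() {
	pos, err := demo()
	fmt.Println(pos, err)
}
```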
func NewRequestResponseRecords ¶
func NewRequestResponseRecords(info CaptureHelper, req *http.Request, resp *http.Response) (Record, Record, error)
NewRequestResponseRecords creates a new request/response record pair for the provided HTTP request and response.
Make sure to provide the request Body in the CaptureHelper so it can be read from again. The response Body should not yet have been used; if the caller needs the body, replace it with an ioutil.NopCloser(io.TeeReader) (the caller is then responsible for calling body.Close()).
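The body-preserving pattern mentioned above might look like the following sketch. The helper teeBody is hypothetical, and io.NopCloser is used here as the Go 1.16+ equivalent of ioutil.NopCloser. The original body is handed back so the caller can close it.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// teeBody (hypothetical) swaps resp.Body for a reader that copies every byte
// read from it into the returned buffer. The original body is returned
// because the NopCloser wrapper will not close it.
func teeBody(resp *http.Response) (*bytes.Buffer, io.Closer) {
	orig := resp.Body
	buf := &bytes.Buffer{}
	resp.Body = io.NopCloser(io.TeeReader(orig, buf))
	return buf, orig
}

// demo simulates a consumer draining the body and shows the retained copy.
func demo() (read, copied string) {
	resp := &http.Response{Body: io.NopCloser(strings.NewReader("hello"))}
	buf, orig := teeBody(resp)
	defer orig.Close()
	data, _ := io.ReadAll(resp.Body)
	return string(data), buf.String()
}

func main() {
	read, copied := demo()
	fmt.Println(read, copied)
}
```

The buffer only holds what has actually been read through the wrapped body, so it is complete only after the consumer has drained it.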
func Sha1Digest ¶
Sha1Digest calculates the SHA-1 digest of a slice of bytes.
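The exact return format is not stated here. As an illustration only, the sketch below produces the labelled Base32 form shown in the WARC-Block-Digest comment ("sha1:" followed by a Base32 value); the function name labelledSHA1 is hypothetical and the real Sha1Digest may encode its result differently.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"fmt"
)

// labelledSHA1 (hypothetical) returns "sha1:" plus the Base32 encoding of the
// SHA-1 sum. A 20-byte sum encodes to exactly 32 characters with no padding.
func labelledSHA1(data []byte) string {
	sum := sha1.Sum(data)
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	fmt.Println(labelledSHA1([]byte("hello world")))
}
```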
func WriteHTTPHeaders ¶
WriteHTTPHeaders writes all HTTP headers to an io.Writer, separated by newlines. It is used to add HTTP headers to a record.
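For comparison, the standard library's http.Header.Write already emits headers in wire format, one "Key: value" line per value, sorted by key and terminated with CRLF. The sketch below uses it with hypothetical helper names; whether WriteHTTPHeaders produces exactly this output is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// headerLines (hypothetical) renders h in HTTP wire format.
func headerLines(h http.Header) (string, error) {
	var buf bytes.Buffer
	if err := h.Write(&buf); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// demo builds a small header set; Set canonicalizes the key spelling.
func demo() string {
	h := http.Header{}
	h.Set("content-type", "text/html")
	h.Set("accept", "*/*")
	s, _ := headerLines(h)
	return s
}

func main() {
	fmt.Printf("%q\n", demo())
}
```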
func WriteRecords ¶
WriteRecords calls Write on each record, writing to w. Deprecated: use the Writer type instead.
Types ¶
type CaptureHelper ¶
type CaptureHelper struct {
	WarcinfoID string
	RemoteAddr string
	// The request body will need to be read multiple times, so please provide
	// one of the following. (note: bytes.Reader and strings.Reader are
	// ReadSeekers.)
	ReqBodyReadSeeker  io.ReadSeeker
	ReqBodyBytesBuffer *bytes.Buffer
}
CaptureHelper is used by the NewRequestResponseRecords() function. Additional fields may be added in the future.
type Header ¶
Header mimics net/http's Header type, but with plain string values. Users should use the Get and Set methods instead of accessing the map directly.
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader parses WARC records from an underlying scanner. Create a new reader with NewReader.
func NewReader ¶
NewReader creates a new WARC reader from an io.Reader. Always use NewReader instead of manually allocating a Reader.
type Record ¶
type Record struct {
	Format  RecordFormat
	Type    RecordType
	Headers Header
	Content *bytes.Buffer
}
A Record consists of a version indicator (e.g., WARC/1.0), zero or more headers, and possibly a content block. Upgrades to specific types of records can be done using type assertions and/or the Type method.
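To make the layout concrete, here is a sketch of how a record with that shape serializes under the WARC 1.0 grammar: a version line, header lines, a blank line, the content block, then two CRLFs. The record type and its bytes method are stand-ins invented for this example, not the package's Record; Content-Length is derived from the block here, so it should not also appear in the map.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"strconv"
)

// record (hypothetical) is a simplified stand-in for the Record struct.
type record struct {
	headers map[string]string
	content []byte
}

// bytes serializes the record: version, sorted headers, Content-Length,
// blank line, block, and the trailing CRLF CRLF record separator.
func (r record) bytes() []byte {
	var buf bytes.Buffer
	buf.WriteString("WARC/1.0\r\n")
	keys := make([]string, 0, len(r.headers))
	for k := range r.headers {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map order is random; sort for stable output
	for _, k := range keys {
		buf.WriteString(k + ": " + r.headers[k] + "\r\n")
	}
	buf.WriteString("Content-Length: " + strconv.Itoa(len(r.content)) + "\r\n")
	buf.WriteString("\r\n")
	buf.Write(r.content)
	buf.WriteString("\r\n\r\n")
	return buf.Bytes()
}

func main() {
	r := record{
		headers: map[string]string{"WARC-Type": "resource"},
		content: []byte("hi"),
	}
	fmt.Printf("%q\n", r.bytes())
}
```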
func UnmarshalRecord ¶
UnmarshalRecord reads a single record from data.
func (*Record) ContentLength ¶
ContentLength returns the length of the content block in bytes, or 0 if the Content-Length header is missing or invalid.
func (*Record) Date ¶
Date gives the time.Time of record creation. It returns the zero time if no WARC-Date header is present or if the header is not a valid timestamp.
func (*Record) SetBody ¶
SetBody sets the body of the record, leaving any HTTP headers already written to the record in place.
type RecordFormat ¶
type RecordFormat int
RecordFormat distinguishes different formats for records. It exists to allow later support of ARC files, should we need to add it.
const (
	// RecordFormatWarc default is the Warc Format 1.0
	RecordFormatWarc RecordFormat = iota
	// RecordFormatUnknown represents unknown / errored record format
	RecordFormatUnknown
)
func (RecordFormat) String ¶
func (r RecordFormat) String() string
type RecordType ¶
type RecordType int
RecordType enumerates different types of WARC Records
const (
	// RecordTypeUnknown is the default type of record, which shouldn't be
	// accepted by anything that wants to know a type of record.
	RecordTypeUnknown RecordType = iota
	// RecordTypeWarcInfo describes the records that follow it, up through end
	// of file, end of input, or until next 'warcinfo' record. Typically, this
	// appears once and at the beginning of a WARC file. For a web archive, it
	// often contains information about the web crawl which generated the
	// following records.
	// The format of this descriptive record block may vary, though the use of
	// the "application/warc-fields" content-type is recommended. Allowable
	// fields include, but are not limited to, all [DCMI] plus the following
	// field definitions. All fields are optional.
	RecordTypeWarcInfo
	// RecordTypeResponse should contain a complete scheme-specific response,
	// including network protocol information where possible. The exact
	// contents of a 'response' record are determined not just by the record
	// type but also by the URI scheme of the record's target-URI, as described
	// below.
	RecordTypeResponse
	// RecordTypeResource contains a resource, without full protocol response
	// information. For example: a file directly retrieved from a locally
	// accessible repository or the result of a networked retrieval where the
	// protocol information has been discarded. The exact contents of a
	// 'resource' record are determined not just by the record type but also by
	// the URI scheme of the record's target-URI, as described below.
	// For all 'resource' records, the payload is defined as the record block.
	// A 'resource' record, with a synthesized target-URI, may also be used to
	// archive other artefacts of a harvesting process inside WARC files.
	RecordTypeResource
	// RecordTypeRequest holds the details of a complete scheme-specific
	// request, including network protocol information where possible. The
	// exact contents of a 'request' record are determined not just by the
	// record type but also by the URI scheme of the record's target-URI, as
	// described below.
	RecordTypeRequest
	// RecordTypeMetadata contains content created in order to further
	// describe, explain, or accompany a harvested resource, in ways not
	// covered by other record types. A 'metadata' record will almost always
	// refer to another record of another type, with that other record holding
	// original harvested or transformed content. (However, it is allowable for
	// a 'metadata' record to refer to any record type, including other
	// 'metadata' records.) Any number of metadata records may reference one
	// specific other record.
	// The format of the metadata record block may vary. The
	// "application/warc-fields" format, defined earlier, may be used.
	// Allowable fields include all [DCMI] plus the following field
	// definitions. All fields are optional.
	RecordTypeMetadata
	// RecordTypeRevisit describes the revisitation of content already
	// archived, and might include only an abbreviated content body which has
	// to be interpreted relative to a previous record. Most typically, a
	// 'revisit' record is used instead of a 'response' or 'resource' record to
	// indicate that the content visited was either a complete or substantial
	// duplicate of material previously archived.
	// Using a 'revisit' record instead of another type is optional, for when
	// benefits of reduced storage size or improved cross-referencing of
	// material are desired.
	RecordTypeRevisit
	// RecordTypeConversion shall contain an alternative version of another
	// record's content that was created as the result of an archival process.
	// Typically, this is used to hold content transformations that maintain
	// viability of content after widely available rendering tools for the
	// originally stored format disappear. As needed, the original content may
	// be migrated (transformed) to a more viable format in order to keep the
	// information usable with current tools while minimizing loss of
	// information (intellectual content, look and feel, etc). Any number of
	// 'conversion' records may be created that reference a specific source
	// record, which may itself contain transformed content. Each
	// transformation should result in a freestanding, complete record, with no
	// dependency on survival of the original record.
	// Metadata records may be used to further describe transformation records.
	// Wherever practical, a 'conversion' record should contain a
	// 'WARC-Refers-To' field to identify the prior material converted.
	RecordTypeConversion
	// RecordTypeContinuation blocks from 'continuation' records must be appended to
	// corresponding prior record block(s) (e.g., from other WARC files) to
	// create the logically complete full-sized original record. That is,
	// 'continuation' records are used when a record that would otherwise cause
	// a WARC file size to exceed a desired limit is broken into segments. A
	// continuation record shall contain the named fields
	// 'WARC-Segment-Origin-ID' and 'WARC-Segment-Number', and the last
	// 'continuation' record of a series shall contain a
	// 'WARC-Segment-Total-Length' field. The full details of WARC record
	// segmentation are described in the below section Record Segmentation. See
	// also annex C.8 below for an example of a 'continuation' record.
	RecordTypeContinuation
)
func ParseRecordType ¶
func ParseRecordType(s string) RecordType
ParseRecordType parses a RecordType from a string
func (RecordType) String ¶
func (r RecordType) String() string
RecordType satisfies the stringer interface
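A sketch of the usual iota-plus-Stringer pattern that a type like this follows, mirroring the constant order listed above and the type names from the WARC-Type field description. The lower-case identifiers are local stand-ins, and returning "unknown" for unrecognized values is an assumption, not documented behavior.

```go
package main

import "fmt"

// recordType (hypothetical) mirrors the enumeration order documented above.
type recordType int

const (
	typeUnknown recordType = iota
	typeWarcInfo
	typeResponse
	typeResource
	typeRequest
	typeMetadata
	typeRevisit
	typeConversion
	typeContinuation
)

// typeNames holds the WARC-Type values from the field description.
var typeNames = map[recordType]string{
	typeWarcInfo:     "warcinfo",
	typeResponse:     "response",
	typeResource:     "resource",
	typeRequest:      "request",
	typeMetadata:     "metadata",
	typeRevisit:      "revisit",
	typeConversion:   "conversion",
	typeContinuation: "continuation",
}

// String makes recordType satisfy fmt.Stringer.
func (t recordType) String() string {
	if name, ok := typeNames[t]; ok {
		return name
	}
	return "unknown" // assumed label for unrecognized values
}

// parseRecordType returns typeUnknown for any unrecognized string.
func parseRecordType(s string) recordType {
	for t, name := range typeNames {
		if name == s {
			return t
		}
	}
	return typeUnknown
}

func main() {
	fmt.Println(parseRecordType("revisit"), parseRecordType("bogus"))
}
```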
type Records ¶
type Records []*Record
Records provides utility functions for slices of records.
A WARC format file is the simple concatenation of one or more WARC records. The first record usually describes the records to follow. In general, record content is either the direct result of a retrieval attempt — web pages, inline images, URL redirection information, DNS hostname lookup results, standalone files, etc. — or is synthesized material (e.g., metadata, transformed content) that provides additional information about archived content.
func UnmarshalRecords ¶
UnmarshalRecords reads a slice of records from a slice of bytes
func (Records) FilterTypes ¶
func (rs Records) FilterTypes(types ...RecordType) Records
FilterTypes returns all records whose type matches one of the provided RecordTypes.
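The filtering idea can be sketched over a simplified record type. Here rec, filterTypes, and sample are hypothetical stand-ins using plain strings in place of Record and RecordType; the real method may preserve order or handle an empty type list differently.

```go
package main

import "fmt"

// rec (hypothetical) is a reduced record carrying only a type and a URI.
type rec struct {
	typ string
	uri string
}

// filterTypes keeps records whose type equals any of the given types,
// preserving input order. With no types given it returns nothing.
func filterTypes(rs []rec, types ...string) []rec {
	var out []rec
	for _, r := range rs {
		for _, t := range types {
			if r.typ == t {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

// sample returns a small mixed list for demonstration.
func sample() []rec {
	return []rec{
		{"warcinfo", ""},
		{"request", "http://example.com/"},
		{"response", "http://example.com/"},
		{"metadata", "http://example.com/"},
	}
}

func main() {
	for _, r := range filterTypes(sample(), "request", "response") {
		fmt.Println(r.typ, r.uri)
	}
}
```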
func (Records) RemoveTargetURIRecords ¶
RemoveTargetURIRecords returns a Records slice with all records that refer to uri removed
func (Records) TargetURIRecord ¶
func (rs Records) TargetURIRecord(uri string, types ...RecordType) *Record
TargetURIRecord returns a record matching uri optionally filtered by a list of record types. There are a number of "gotchas" if multiple record types of the same url are in the list. TODO - eliminate "gotchas"
type Writer ¶
type Writer struct {
	// RecordCallback will be called after each record is written to the file.
	// If a WriteSeeker was not provided, the provided positions will be
	// invalid.
	RecordCallback func(r *Record, startPos, endPos int64)
	// contains filtered or unexported fields
}
Writer provides functionality for writing WARC files in compressed and uncompressed formats.
To construct a Writer, call NewWriterCompressed or NewWriterRaw.
func NewWriterCompressed ¶
NewWriterCompressed initializes a WARC Writer writing to a compressed stream. The first parameter should be the "backing stream" that receives the compressed bytes. The second parameter is a compress/gzip writer that writes to that backing stream (the rawFile parameter).
Seek will only be called with whence == io.SeekCurrent and offset == 0.
See also CountWriter() if you need a "fake" Seek implementation.
func NewWriterRaw ¶
NewWriterRaw initializes a WARC Writer writing to an uncompressed stream. If the provided Writer implements io.Seeker, the RecordCallback will be available. If the provided Writer implements interface{Flush() error}, it will be flushed after every written Record.
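Optional capabilities like these are typically discovered with type assertions on the io.Writer. The sketch below shows that pattern in isolation; flushIfPossible, currentOffset, and demo are hypothetical helpers, not part of this package.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// flushIfPossible flushes w when it exposes Flush() error, and reports
// whether it did.
func flushIfPossible(w io.Writer) (bool, error) {
	if f, ok := w.(interface{ Flush() error }); ok {
		return true, f.Flush()
	}
	return false, nil
}

// currentOffset asks w for its position when it also implements io.Seeker.
func currentOffset(w io.Writer) (int64, bool) {
	s, ok := w.(io.Seeker)
	if !ok {
		return 0, false
	}
	pos, err := s.Seek(0, io.SeekCurrent)
	return pos, err == nil
}

// demo writes through a bufio.Writer, which can flush but cannot seek.
func demo() (before, after int, flushed, seekable bool) {
	var sink bytes.Buffer
	bw := bufio.NewWriter(&sink)
	io.WriteString(bw, "abc")
	before = sink.Len() // still buffered inside bw
	flushed, _ = flushIfPossible(bw)
	after = sink.Len()
	_, seekable = currentOffset(bw)
	return before, after, flushed, seekable
}

func main() {
	fmt.Println(demo())
}
```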
See also CountWriter() if you need a "fake" Seek implementation.
func (*Writer) WriteRecord ¶
WriteRecord adds the record to the WARC file and returns the file offsets the record was written at.
No processing is done to the Record contents beyond that described in Record.Write. Clients that want extra processing (e.g., setting the WARC-Warcinfo-ID header) are encouraged to create a wrapper.