Documentation ¶
Overview ¶
Package gowarc provides a framework for handling WARC files, enabling their parsing, creation, and validation.
WARC Overview ¶
The WARC format offers a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is used to build applications for harvesting, managing, accessing, mining and exchanging content.
For more details, visit the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
WARC record creation ¶
The WarcRecordBuilder, initialized via NewRecordBuilder, is the primary tool for creating WARC records. By default, the WarcRecordBuilder generates a record id and calculates the 'Content-Length' and 'WARC-Block-Digest'.
Use WarcFileWriter, initialized with NewWarcFileWriter, to write WARC files.
WARC record parsing ¶
To parse single WARC records, use the Unmarshaler initialized with NewUnmarshaler.
To read entire WARC files, employ the WarcFileReader initialized through NewWarcFileReader.
Validation and repair ¶
The gowarc package supports validation during both the creation and parsing of WARC records. To control the scope of validation and how validation errors are handled, set the appropriate options on the WarcRecordBuilder, Unmarshaler, or WarcFileReader.
Index ¶
- Constants
- Variables
- type Block
- type HeaderFieldError
- type HttpRequestBlock
- type HttpResponseBlock
- type Marshaler
- type PatternNameGenerator
- type PayloadBlock
- type ProtocolHeaderBlock
- type RecordType
- type RevisitRef
- type SyntaxError
- type Unmarshaler
- type Validation
- type WarcFields
- func (wf *WarcFields) Add(name string, value string)
- func (wf *WarcFields) AddId(name, value string)
- func (wf *WarcFields) AddInt(name string, value int)
- func (wf *WarcFields) AddInt64(name string, value int64)
- func (wf *WarcFields) AddTime(name string, value time.Time)
- func (wf *WarcFields) CanonicalHeaderKey(s string) string
- func (wf *WarcFields) Delete(key string)
- func (wf *WarcFields) Get(key string) string
- func (wf *WarcFields) GetAll(name string) []string
- func (wf *WarcFields) GetId(name string) string
- func (wf *WarcFields) GetInt(key string) (int, error)
- func (wf *WarcFields) GetInt64(name string) (int64, error)
- func (wf *WarcFields) GetTime(name string) (time.Time, error)
- func (wf *WarcFields) Has(name string) bool
- func (wf *WarcFields) Set(name string, value string)
- func (wf *WarcFields) SetId(name, value string)
- func (wf *WarcFields) SetInt(name string, value int)
- func (wf *WarcFields) SetInt64(name string, value int64)
- func (wf *WarcFields) SetTime(name string, value time.Time)
- func (wf *WarcFields) Sort()
- func (wf *WarcFields) String() string
- func (wf *WarcFields) Write(w io.Writer) (bytesWritten int64, err error)
- type WarcFieldsBlock
- type WarcFileNameGenerator
- type WarcFileReader
- type WarcFileWriter
- type WarcFileWriterOption
- func WithAddWarcConcurrentToHeader(addConcurrentHeader bool) WarcFileWriterOption
- func WithAfterFileCreationHook(f func(fileName string, size int64, warcInfoId string) error) WarcFileWriterOption
- func WithBeforeFileCreationHook(f func(fileName string) error) WarcFileWriterOption
- func WithCompressedFileSuffix(suffix string) WarcFileWriterOption
- func WithCompression(compress bool) WarcFileWriterOption
- func WithCompressionLevel(gzipLevel int) WarcFileWriterOption
- func WithExpectedCompressionRatio(ratio float64) WarcFileWriterOption
- func WithFileNameGenerator(generator WarcFileNameGenerator) WarcFileWriterOption
- func WithFlush(flush bool) WarcFileWriterOption
- func WithMarshaler(marshaler Marshaler) WarcFileWriterOption
- func WithMaxConcurrentWriters(count int) WarcFileWriterOption
- func WithMaxFileSize(size int64) WarcFileWriterOption
- func WithOpenFileSuffix(suffix string) WarcFileWriterOption
- func WithRecordOptions(opts ...WarcRecordOption) WarcFileWriterOption
- func WithSegmentation() WarcFileWriterOption
- func WithWarcInfoFunc(f func(recordBuilder WarcRecordBuilder) error) WarcFileWriterOption
- type WarcRecord
- type WarcRecordBuilder
- type WarcRecordOption
- func WithAddMissingContentLength(addMissingContentLength bool) WarcRecordOption
- func WithAddMissingDigest(addMissingDigest bool) WarcRecordOption
- func WithAddMissingRecordId(addMissingRecordId bool) WarcRecordOption
- func WithBlockErrorPolicy(policy errorPolicy) WarcRecordOption
- func WithBufferMaxMemBytes(size int64) WarcRecordOption
- func WithBufferTmpDir(dir string) WarcRecordOption
- func WithDefaultDigestAlgorithm(defaultDigestAlgorithm string) WarcRecordOption
- func WithDefaultDigestEncoding(defaultDigestEncoding digestEncoding) WarcRecordOption
- func WithFixContentLength(fixContentLength bool) WarcRecordOption
- func WithFixDigest(fixDigest bool) WarcRecordOption
- func WithFixSyntaxErrors(fixSyntaxErrors bool) WarcRecordOption
- func WithFixWarcFieldsBlockErrors(fixWarcFieldsBlockErrors bool) WarcRecordOption
- func WithNoValidation() WarcRecordOption
- func WithRecordIdFunc(recordIdFunc func() (string, error)) WarcRecordOption
- func WithSkipParseBlock() WarcRecordOption
- func WithSpecViolationPolicy(policy errorPolicy) WarcRecordOption
- func WithStrictValidation() WarcRecordOption
- func WithSyntaxErrorPolicy(policy errorPolicy) WarcRecordOption
- func WithUnknownRecordTypePolicy(policy errorPolicy) WarcRecordOption
- func WithVersion(version *WarcVersion) WarcRecordOption
- type WarcVersion
- type WriteResponse
Constants ¶
const (
	Base16 digestEncoding = 1
	Base32 digestEncoding = 2
	Base64 digestEncoding = 3
)
const (
	// WARC header field name constants
	ContentLength             = "Content-Length"
	ContentType               = "Content-Type"
	WarcBlockDigest           = "WARC-Block-Digest"
	WarcConcurrentTo          = "WARC-Concurrent-To"
	WarcDate                  = "WARC-Date"
	WarcFilename              = "WARC-Filename"
	WarcIPAddress             = "WARC-IP-Address"
	WarcIdentifiedPayloadType = "WARC-Identified-Payload-Type"
	WarcPayloadDigest         = "WARC-Payload-Digest"
	WarcProfile               = "WARC-Profile"
	WarcRecordID              = "WARC-Record-ID"
	WarcRefersTo              = "WARC-Refers-To"
	WarcRefersToDate          = "WARC-Refers-To-Date"
	WarcRefersToTargetURI     = "WARC-Refers-To-Target-URI"
	WarcSegmentNumber         = "WARC-Segment-Number"
	WarcSegmentOriginID       = "WARC-Segment-Origin-ID"
	WarcSegmentTotalLength    = "WARC-Segment-Total-Length"
	WarcTargetURI             = "WARC-Target-URI"
	WarcTruncated             = "WARC-Truncated"
	WarcType                  = "WARC-Type"
	WarcWarcinfoID            = "WARC-Warcinfo-ID"
	WarcPageID                = "WARC-Page-ID"       // Browsertrix extension field
	WarcResourceType          = "WARC-Resource-Type" // Browsertrix extension field
	WarcJSONMetadata          = "WARC-JSON-Metadata" // Browsertrix extension field
)
const (
	ErrIgnore errorPolicy = 0 // Ignore the given error.
	ErrWarn   errorPolicy = 1 // Ignore given error, but submit a warning.
	ErrFail   errorPolicy = 2 // Fail on given error.
)
const (
	// Well known content types
	ApplicationWarcFields = "application/warc-fields"
	ApplicationHttp       = "application/http"
)
const (
	// Well known revisit profiles
	ProfileIdenticalPayloadDigestV1_1 = "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"
	ProfileServerNotModifiedV1_1      = "http://netpreserve.org/warc/1.1/revisit/server-not-modified"
	ProfileIdenticalPayloadDigestV1_0 = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"
	ProfileServerNotModifiedV1_0      = "http://netpreserve.org/warc/1.0/revisit/server-not-modified"
)
Variables ¶
var (
	// WARC versions
	V1_0 = &WarcVersion{id: 1, txt: "1.0", major: 1, minor: 0} // WARC 1.0
	V1_1 = &WarcVersion{id: 2, txt: "1.1", major: 1, minor: 1} // WARC 1.1
)
Functions ¶
This section is empty.
Types ¶
type Block ¶
type Block interface {
	// RawBytes returns the bytes of the Block
	RawBytes() (io.Reader, error)
	BlockDigest() string
	Size() int64
	IsCached() bool
	Cache() error
	io.Closer
}
Block is the interface used to represent the content of a WARC record as specified by the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-content-block
A Block might be cached or non-cached. Calling RawBytes or BlockDigest more than once will fail if the block is not cached.
NOTE: Blocks are not required to be thread safe.
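The cached versus non-cached behaviour described above can be pictured with a small stand-alone sketch. The `onceBlock` type below is hypothetical and is not gowarc's implementation; it only assumes that an uncached block hands out its underlying reader once, while a cached block keeps the bytes in memory and can be read repeatedly.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// onceBlock is a hypothetical stand-in for a Block; it is not gowarc's type.
type onceBlock struct {
	src      io.Reader // underlying one-shot source
	cached   []byte    // filled by Cache
	isCached bool
	consumed bool
}

// Cache reads the whole source into memory so the content can be re-read.
func (b *onceBlock) Cache() error {
	if b.isCached {
		return nil
	}
	if b.consumed {
		return errors.New("content already consumed")
	}
	data, err := io.ReadAll(b.src)
	if err != nil {
		return err
	}
	b.cached, b.isCached = data, true
	return nil
}

// RawBytes returns a fresh reader when cached; otherwise it hands out the
// source exactly once and fails on later calls.
func (b *onceBlock) RawBytes() (io.Reader, error) {
	if b.isCached {
		return bytes.NewReader(b.cached), nil
	}
	if b.consumed {
		return nil, errors.New("content already consumed")
	}
	b.consumed = true
	return b.src, nil
}

// readTwice reports whether a second RawBytes call succeeds.
func readTwice(cache bool) bool {
	b := &onceBlock{src: bytes.NewBufferString("payload")}
	if cache {
		if err := b.Cache(); err != nil {
			return false
		}
	}
	if _, err := b.RawBytes(); err != nil {
		return false
	}
	_, err := b.RawBytes()
	return err == nil
}

func main() {
	fmt.Println("uncached second read ok:", readTwice(false))
	fmt.Println("cached second read ok:", readTwice(true))
}
```

With a real Block, the practical consequence is to call Cache before reading if the content is needed more than once.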
type HeaderFieldError ¶
type HeaderFieldError struct {
// contains filtered or unexported fields
}
HeaderFieldError is used for violations of the WARC header specification.
func (*HeaderFieldError) Error ¶
func (e *HeaderFieldError) Error() string
type HttpRequestBlock ¶
type HttpRequestBlock interface {
	PayloadBlock
	ProtocolHeaderBlock
	HttpRequestLine() string
	HttpHeader() *http.Header
}
type HttpResponseBlock ¶
type HttpResponseBlock interface {
	PayloadBlock
	ProtocolHeaderBlock
	HttpStatusLine() string
	HttpStatusCode() int
	HttpHeader() *http.Header
}
type Marshaler ¶
type Marshaler interface {
Marshal(w io.Writer, record WarcRecord, maxSize int64) (WarcRecord, int64, error)
}
Marshaler is the interface that wraps the Marshal function.
Marshal converts a WARC record to its serialized form and returns the size of the marshalled record or any error encountered.
Depending on implementation, Marshal might return a WarcRecord which is the continuation of the record being written. See the description of record segmentation at https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-segmentation
func NewMarshaler ¶
func NewMarshaler() Marshaler
type PatternNameGenerator ¶
type PatternNameGenerator struct {
	Directory string // Directory to store warcfiles. Defaults to the empty string
	Prefix    string // Prefix available to be used in pattern. Defaults to the empty string
	Serial    int32  // Serial number available for use in pattern. It is atomically increased with every generated file name.
	Pattern   string // Pattern for generated file name. Defaults to: "%{prefix}s%{ts}s-%04{serial}d-%{hostOrIp}s.%{ext}s"
	Extension string // Extension for file name. Defaults to: "warc"
	// contains filtered or unexported fields
}
PatternNameGenerator implements the WarcFileNameGenerator.
New filenames are generated based on a pattern which defaults to the recommendation in the WARC 1.1 standard (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-c-informative-warc-file-size-and-name-recommendations). The pattern syntax resembles that of Go's fmt package (https://pkg.go.dev/fmt), but allows named fields in curly braces. The available predefined names are:
- prefix - content of the Prefix field
- ext - content of the Extension field
- ts - current time as 14-digit GMT Time-stamp
- serial - atomically increased serial number for every generated file name. Initial value is 0 if Serial field is not set
- ip - primary IP address of the node
- host - host name of the node
- hostOrIp - host name of the node, falling back to IP address if host name could not be resolved
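One way to picture how such a named pattern could be expanded is to rewrite each `{name}` into fmt's explicit argument index `[n]` and then call `fmt.Sprintf`. This is a sketch under that assumption only; how PatternNameGenerator expands patterns internally is not documented here, and `expandPattern`, `timestamp14`, and `example` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// namedField matches a name in curly braces, e.g. {serial}.
var namedField = regexp.MustCompile(`\{(\w+)\}`)

// expandPattern rewrites each {name} into fmt's explicit argument index [n]
// and formats the values. names and values must be in the same order.
func expandPattern(pattern string, names []string, values []interface{}) string {
	index := make(map[string]int, len(names))
	for i, n := range names {
		index[n] = i + 1 // fmt argument indexes are one-based
	}
	format := namedField.ReplaceAllStringFunc(pattern, func(m string) string {
		if i, ok := index[m[1:len(m)-1]]; ok {
			return fmt.Sprintf("[%d]", i)
		}
		return m // leave unknown names untouched
	})
	return fmt.Sprintf(format, values...)
}

// timestamp14 formats t as a 14-digit UTC timestamp (yyyymmddHHMMSS).
func timestamp14(t time.Time) string {
	return t.UTC().Format("20060102150405")
}

// example expands the documented default pattern with fixed sample values.
func example() string {
	ts := timestamp14(time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC))
	names := []string{"prefix", "ts", "serial", "hostOrIp", "ext"}
	values := []interface{}{"crawl-", ts, 7, "node1", "warc"}
	return expandPattern("%{prefix}s%{ts}s-%04{serial}d-%{hostOrIp}s.%{ext}s", names, values)
}

func main() {
	fmt.Println(example()) // crawl-20240102030405-0007-node1.warc
}
```

The sample prefix, host name, and serial are made up for illustration; with the real generator they come from the struct fields and the node's environment.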
func (*PatternNameGenerator) NewWarcfileName ¶
func (g *PatternNameGenerator) NewWarcfileName() (string, string)
NewWarcfileName returns a directory (might be the empty string for current directory) and a file name
type PayloadBlock ¶
PayloadBlock is a Block with a well-defined payload.
Ref: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-payload
type ProtocolHeaderBlock ¶
type ProtocolHeaderBlock interface {
	// ProtocolHeaderBytes returns the raw bytes from the protocol's header.
	ProtocolHeaderBytes() []byte
}
ProtocolHeaderBlock is a Block with a well-defined protocol header, e.g. an HTTP response.
type RecordType ¶
type RecordType uint16
RecordType represents the type of a WARC record.
const (
	// WARC record types
	Warcinfo     RecordType = 1
	Response     RecordType = 2
	Resource     RecordType = 4
	Request      RecordType = 8
	Metadata     RecordType = 16
	Revisit      RecordType = 32
	Conversion   RecordType = 64
	Continuation RecordType = 128
)
func (RecordType) String ¶
func (rt RecordType) String() string
String returns a string representation of the record type.
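The record type values are distinct powers of two, so a set of types can be held in a single value and tested with a bitwise AND. The sketch below uses a local `recordType` stand-in with the same numeric values; whether gowarc itself accepts combined values anywhere is an assumption not confirmed by this page.

```go
package main

import "fmt"

// recordType mirrors the documented RecordType values; it is a local stand-in.
type recordType uint16

const (
	warcinfo     recordType = 1 << iota // 1
	response                            // 2
	resource                            // 4
	request                             // 8
	metadata                            // 16
	revisit                             // 32
	conversion                          // 64
	continuation                        // 128
)

// matches reports whether rt is one of the types set in mask.
func matches(rt, mask recordType) bool {
	return rt&mask != 0
}

func main() {
	wanted := response | revisit
	fmt.Println(matches(response, wanted), matches(metadata, wanted)) // true false
}
```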
type RevisitRef ¶
type SyntaxError ¶
type SyntaxError struct {
// contains filtered or unexported fields
}
SyntaxError is used for syntactical errors like wrong line endings
func (*SyntaxError) Error ¶
func (e *SyntaxError) Error() string
func (*SyntaxError) Unwrap ¶
func (e *SyntaxError) Unwrap() error
type Unmarshaler ¶
type Unmarshaler interface {
Unmarshal(b *bufio.Reader) (WarcRecord, int64, *Validation, error)
}
Unmarshaler is the interface implemented by types that can unmarshal a WARC record. A new instance of Unmarshaler is created by calling NewUnmarshaler. NewUnmarshaler accepts a number of options that can be used to control the unmarshalling process. See WarcRecordOption for details.
Unmarshal parses the WARC record from the given reader and returns:
- The parsed WARC record. If an error occurred during the parsing, the returned WARC record might be nil.
- The offset value indicating the number of characters that have been discarded until the start of a new record is found.
- A pointer to a Validation object that stores any errors or warnings encountered during the parsing process. The validation object is only populated if the error policy is set to ErrWarn or ErrFail.
- The standard error object in Go. If no error occurred during the parsing, this object is nil. Otherwise, it contains details about the encountered error.
If the reader contains multiple records, Unmarshal parses the first record and returns. If the reader contains no records, Unmarshal returns an io.EOF error.
Example ¶
data := bytes.NewBufferString(" WARC/1.1\r\n" +
	"WARC-Date: 2017-03-06T04:03:53Z\r\n" +
	"WARC-Record-ID: <urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>\r\n" +
	"WARC-Filename: temp-20170306040353.warc.gz\r\n" +
	"WARC-Type: warcinfo\r\n" +
	"Content-Type: application/warc-fields\r\n" +
	"Warc-Block-Digest: sha1:af4d582b4ffc017d07a947d841e392a821f754f3\r\n" +
	"Content-Length: 34\r\n" +
	"\r\n" +
	"format: WARC File Format 1.1\r\n" +
	"\r\n\r\n")
input := bufio.NewReader(data)

// Create a new unmarshaler
unmarshaler := gowarc.NewUnmarshaler(gowarc.WithSpecViolationPolicy(gowarc.ErrWarn), gowarc.WithSyntaxErrorPolicy(gowarc.ErrWarn))
wr, off, validation, err := unmarshaler.Unmarshal(input)
if err == nil {
	fmt.Printf("Offset: %d, %s\n%s", off, wr, validation)
}
Output:
Offset: 2, WARC record: version: WARC/1.1, type: warcinfo, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
gowarc: Validation errors:
  1: gowarc: record was found 2 bytes after expected offset
  2: block: wrong digest: expected sha1:af4d582b4ffc017d07a947d841e392a821f754f3, computed: sha1:8a936f9fd60d664cf95b1ffb40f1c4093e65bb40
  3: too few bytes in end of record marker. Expected "\r\n\r\n", was ""
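The "wrong digest" message in the output above compares the declared WARC-Block-Digest with one computed over the block bytes. A minimal stand-alone sketch of such a labelled SHA-1 digest, using only the standard library, could look as follows. The helper names are hypothetical, and the comparison is a plain string match; gowarc's own digest handling (algorithm selection, encodings, case handling) may differ.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"encoding/hex"
	"fmt"
)

// sha1Hex returns the SHA-1 of block as "sha1:" plus lowercase hex (Base16).
func sha1Hex(block []byte) string {
	sum := sha1.Sum(block)
	return "sha1:" + hex.EncodeToString(sum[:])
}

// sha1Base32 returns the SHA-1 of block as "sha1:" plus standard Base32.
func sha1Base32(block []byte) string {
	sum := sha1.Sum(block)
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

// digestMatches compares a declared hex digest with one computed from block.
func digestMatches(declared string, block []byte) bool {
	return declared == sha1Hex(block)
}

func main() {
	block := []byte("format: WARC File Format 1.1\r\n")
	fmt.Println(sha1Hex(block))
	fmt.Println(sha1Base32(block))
	fmt.Println(digestMatches("sha1:af4d582b4ffc017d07a947d841e392a821f754f3", block))
}
```

A SHA-1 sum is 20 bytes, so the hex form has 40 characters and the Base32 form has 32 characters with no padding.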
func NewUnmarshaler ¶
func NewUnmarshaler(opts ...WarcRecordOption) Unmarshaler
type Validation ¶
type Validation []error
Validation contains validation results.
func (*Validation) Error ¶
func (v *Validation) Error() string
func (*Validation) String ¶
func (v *Validation) String() string
func (*Validation) Valid ¶
func (v *Validation) Valid() bool
Valid returns true if no validation errors were found.
type WarcFields ¶
type WarcFields []*nameValue
WarcFields represents the key value pairs in a WARC-record header.
It is also used for representing the record block of records with content-type "application/warc-fields".
All key-manipulating functions take case-insensitive keys and modify them to their canonical form.
func (*WarcFields) Add ¶
func (wf *WarcFields) Add(name string, value string)
Add adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
func (*WarcFields) AddId ¶
func (wf *WarcFields) AddId(name, value string)
AddId adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
The value is surrounded with '<' and '>' if not already present.
func (*WarcFields) AddInt ¶
func (wf *WarcFields) AddInt(name string, value int)
AddInt adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
func (*WarcFields) AddInt64 ¶
func (wf *WarcFields) AddInt64(name string, value int64)
AddInt64 adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
func (*WarcFields) AddTime ¶
func (wf *WarcFields) AddTime(name string, value time.Time)
AddTime adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.
The value is converted to RFC 3339 format.
func (*WarcFields) CanonicalHeaderKey ¶
func (wf *WarcFields) CanonicalHeaderKey(s string) string
func (*WarcFields) Delete ¶
func (wf *WarcFields) Delete(key string)
Delete deletes the values associated with key. The key is case-insensitive.
func (*WarcFields) Get ¶
func (wf *WarcFields) Get(key string) string
Get gets the first value associated with the given key. It is case-insensitive. If the key doesn't exist or there are no values associated with the key, Get returns the empty string. To access multiple values of a key, use GetAll.
func (*WarcFields) GetAll ¶
func (wf *WarcFields) GetAll(name string) []string
GetAll returns all values associated with the given key. It is case-insensitive.
func (*WarcFields) GetId ¶
func (wf *WarcFields) GetId(name string) string
GetId is like Get, but removes the surrounding '<' and '>' from the field value.
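The angle-bracket handling described for the Id methods can be sketched with two small helpers. `wrapID` and `unwrapID` are hypothetical names used only for illustration; edge cases such as a value carrying just one of the two brackets are handled here by assumption, and the library may treat them differently.

```go
package main

import (
	"fmt"
	"strings"
)

// wrapID surrounds id with '<' and '>' unless both are already present.
func wrapID(id string) string {
	if strings.HasPrefix(id, "<") && strings.HasSuffix(id, ">") {
		return id
	}
	return "<" + id + ">"
}

// unwrapID removes a leading '<' and a trailing '>' if present.
func unwrapID(id string) string {
	return strings.TrimSuffix(strings.TrimPrefix(id, "<"), ">")
}

func main() {
	id := "urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008"
	fmt.Println(wrapID(id))           // <urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>
	fmt.Println(unwrapID(wrapID(id))) // urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
}
```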
func (*WarcFields) GetInt ¶
func (wf *WarcFields) GetInt(key string) (int, error)
GetInt is like Get, but converts the field value to int.
func (*WarcFields) GetInt64 ¶
func (wf *WarcFields) GetInt64(name string) (int64, error)
GetInt64 is like Get, but converts the field value to int64.
func (*WarcFields) GetTime ¶
func (wf *WarcFields) GetTime(name string) (time.Time, error)
GetTime is like Get, but converts the field value to time.Time. The field is expected to be in RFC 3339 format.
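The RFC 3339 conversion mentioned for the time-valued methods corresponds to the standard library's `time.RFC3339` layout. The sketch below shows a round trip with hypothetical helpers; normalizing to UTC before formatting is an assumption made here, not something this page states about gowarc.

```go
package main

import (
	"fmt"
	"time"
)

// formatWarcDate renders t in RFC 3339 format, normalized to UTC (assumption).
func formatWarcDate(t time.Time) string {
	return t.UTC().Format(time.RFC3339)
}

// normalize parses an RFC 3339 value and re-renders it in UTC.
// It returns the empty string if the value cannot be parsed.
func normalize(s string) string {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		return ""
	}
	return formatWarcDate(t)
}

func main() {
	fmt.Println(normalize("2017-03-06T04:03:53Z"))      // 2017-03-06T04:03:53Z
	fmt.Println(normalize("2017-03-06T05:03:53+01:00")) // 2017-03-06T04:03:53Z
}
```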
func (*WarcFields) Has ¶
func (wf *WarcFields) Has(name string) bool
Has returns true if the field exists. This can be used to distinguish a missing field from a field whose value is the empty string.
func (*WarcFields) Set ¶
func (wf *WarcFields) Set(name string, value string)
Set sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
func (*WarcFields) SetId ¶
func (wf *WarcFields) SetId(name, value string)
SetId sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
The value is surrounded with '<' and '>' if not already present.
func (*WarcFields) SetInt ¶
func (wf *WarcFields) SetInt(name string, value int)
SetInt sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
func (*WarcFields) SetInt64 ¶
func (wf *WarcFields) SetInt64(name string, value int64)
SetInt64 sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
func (*WarcFields) SetTime ¶
func (wf *WarcFields) SetTime(name string, value time.Time)
SetTime sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.
The value is converted to RFC 3339 format.
func (*WarcFields) Sort ¶
func (wf *WarcFields) Sort()
Sort sorts the fields in lexicographical order.
Only field names are sorted. Order of values for a repeated field is kept as is.
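Sorting by name while keeping the relative order of values for a repeated field is what a stable sort provides. The following sketch illustrates that property with a hypothetical `field` type and `sort.SliceStable`; it is not the library's code, and the byte-wise, case-sensitive comparison is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// field is a hypothetical name/value pair standing in for a header entry.
type field struct{ name, value string }

// sortFields orders fields by name only; equal names keep their input order.
func sortFields(fields []field) {
	sort.SliceStable(fields, func(i, j int) bool {
		return fields[i].name < fields[j].name
	})
}

// render joins the fields as "name=value" pairs for easy inspection.
func render(fields []field) string {
	parts := make([]string, len(fields))
	for i, f := range fields {
		parts[i] = f.name + "=" + f.value
	}
	return strings.Join(parts, "; ")
}

func main() {
	fs := []field{
		{"WARC-Type", "response"},
		{"WARC-Concurrent-To", "<b>"},
		{"Content-Length", "10"},
		{"WARC-Concurrent-To", "<a>"},
	}
	sortFields(fs)
	// The two WARC-Concurrent-To values stay in their original order: <b>, then <a>.
	fmt.Println(render(fs))
}
```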
func (*WarcFields) String ¶
func (wf *WarcFields) String() string
type WarcFieldsBlock ¶
type WarcFieldsBlock interface {
	Block
	WarcFields() *WarcFields
}
type WarcFileNameGenerator ¶
type WarcFileNameGenerator interface {
	// NewWarcfileName returns a directory (might be the empty string for current directory) and a file name
	NewWarcfileName() (string, string)
}
WarcFileNameGenerator is the interface that wraps the NewWarcfileName function.
type WarcFileReader ¶
type WarcFileReader struct {
// contains filtered or unexported fields
}
WarcFileReader is used to read WARC files. Use NewWarcFileReader to create a new instance.
func NewWarcFileReader ¶
func NewWarcFileReader(filename string, offset int64, opts ...WarcRecordOption) (*WarcFileReader, error)
NewWarcFileReader creates a new WarcFileReader from the supplied filename. If offset is > 0, the reader will start reading from that offset. The WarcFileReader can be configured with options. See WarcRecordOption.
Example ¶
reader, err := gowarc.NewWarcFileReader("test.warc.gz", 0, gowarc.WithStrictValidation())
if err != nil {
	fmt.Println("Error creating warc reader:", err)
	return
}
// Release the underlying file when done.
defer func() { _ = reader.Close() }()

for {
	record, _, _, err := reader.Next()
	if err == io.EOF {
		break
	}
	if err != nil {
		fmt.Println("Error reading record:", err)
		return
	}
	fmt.Println("Record type:", record.Type().String())
	fmt.Println("Record version:", record.Version())
	// Do more with record as per needs
}
Output:
func NewWarcFileReaderFromStream ¶
func NewWarcFileReaderFromStream(r io.Reader, offset int64, opts ...WarcRecordOption) (*WarcFileReader, error)
NewWarcFileReaderFromStream creates a new WarcFileReader from the supplied io.Reader. The WarcFileReader can be configured with options. See WarcRecordOption.
It is the responsibility of the caller to close the io.Reader.
func (*WarcFileReader) Close ¶
func (wf *WarcFileReader) Close() error
Close closes the WarcFileReader.
func (*WarcFileReader) Next ¶
func (wf *WarcFileReader) Next() (WarcRecord, int64, *Validation, error)
Next reads the next WarcRecord from the WarcFileReader. The method also provides the offset at which the record is found within the file.
The validation and error values that Next produces depend on the errorPolicy options that have been set on the WarcFileReader:
ErrIgnore: This setting ignores all errors. A WarcRecord and its offset are returned without any validation. An error is only returned if the file is so badly formatted that nothing meaningful can be parsed.
ErrWarn: Similar to ErrIgnore, this setting returns a WarcRecord and its offset. However, the record is validated and all validation errors are collected in a Validation object which can then be examined.
ErrFail: If this is set, the method will return an error in the case of a validation error, and WarcRecord might be nil.
Mixed Policies: It's possible to set different error policies for different types of errors with the following options: WithSyntaxErrorPolicy, WithSpecViolationPolicy and WithUnknownRecordTypePolicy. The return values of Next would be a mix of the aforementioned scenarios based on the policies set.
When at end of file, returned offset is equal to length of file, WarcRecord is nil and err is io.EOF.
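The three policies can be summarized as a small dispatch: ignore drops a problem, warn collects it, and fail stops at the first one. The sketch below models that with local stand-in types whose numeric order follows ErrIgnore, ErrWarn, and ErrFail; it is an illustration of the described behaviour, not gowarc's validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// policy is a local stand-in for the unexported errorPolicy type.
type policy int

const (
	ignore policy = iota // like ErrIgnore
	warn                 // like ErrWarn
	fail                 // like ErrFail
)

// validate applies one policy to a list of problems. It returns the collected
// warnings and the first fatal error, if any.
func validate(p policy, problems []error) (warnings []error, err error) {
	for _, problem := range problems {
		switch p {
		case warn:
			warnings = append(warnings, problem)
		case fail:
			return warnings, problem
		}
	}
	return warnings, nil
}

// summarize runs validate on two sample problems and describes the outcome.
func summarize(p policy) string {
	problems := []error{
		errors.New("wrong digest"),
		errors.New("too few bytes in end of record marker"),
	}
	warnings, err := validate(p, problems)
	return fmt.Sprintf("warnings=%d failed=%t", len(warnings), err != nil)
}

func main() {
	for _, p := range []policy{ignore, warn, fail} {
		fmt.Println(summarize(p))
	}
}
```

With mixed policies, each class of problem (syntax error, spec violation, unknown record type) would be routed through its own policy in the same way.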
type WarcFileWriter ¶
type WarcFileWriter struct {
// contains filtered or unexported fields
}
WarcFileWriter is used to write WARC files. Use NewWarcFileWriter to create a new instance.
The WarcFileWriter writes to one or more files simultaneously. The number of files is controlled by the WithMaxConcurrentWriters option. The WarcFileWriter will create a new file when the current file size exceeds the value set by the WithMaxFileSize option. File names are generated by the WarcFileNameGenerator set by the WithFileNameGenerator option. The WarcFileWriter will add a Warcinfo record to each file if the WithWarcInfoFunc option is set.
func NewWarcFileWriter ¶
func NewWarcFileWriter(opts ...WarcFileWriterOption) *WarcFileWriter
NewWarcFileWriter creates a new WarcFileWriter with the supplied options.
Example ¶
nameGenerator := &gowarc.PatternNameGenerator{Directory: "directory-name"}
w := gowarc.NewWarcFileWriter(gowarc.WithFileNameGenerator(nameGenerator))
defer func() { w.Close() }()

builder := gowarc.NewRecordBuilder(gowarc.Response, gowarc.WithStrictValidation())
_, err := builder.WriteString("HTTP/1.1 200 OK\r\nDate: Tue, 19 Sep 2016 17:18:40 GMT\r\nContent-Length: 19 ....")
if err != nil {
	panic(err)
}
builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")

if wr, _, err := builder.Build(); err == nil {
	w.Write(wr)
}
Output:
func (*WarcFileWriter) Close ¶
func (w *WarcFileWriter) Close() error
Close closes the current file(s) being written to and then releases all resources used by the WarcFileWriter.
Calling Write after Close will panic.
func (*WarcFileWriter) Rotate ¶
func (w *WarcFileWriter) Rotate() error
Rotate closes the current files being written to.
A call to Write after Rotate creates new files.
func (*WarcFileWriter) String ¶
func (w *WarcFileWriter) String() string
func (*WarcFileWriter) Write ¶
func (w *WarcFileWriter) Write(record ...WarcRecord) []WriteResponse
Write marshals one or more WarcRecords to file.
If more than one is written, then those will be written sequentially to the same file if size permits. If the writer was created with the WithAddWarcConcurrentToHeader option, each record will have cross-reference headers.
Returns a slice with one WriteResponse for each record written.
type WarcFileWriterOption ¶
type WarcFileWriterOption interface {
// contains filtered or unexported methods
}
WarcFileWriterOption configures how to write WARC files.
func WithAddWarcConcurrentToHeader ¶
func WithAddWarcConcurrentToHeader(addConcurrentHeader bool) WarcFileWriterOption
WithAddWarcConcurrentToHeader configures if records written in the same call to Write should have WARC-Concurrent-To headers added for cross-reference.
defaults to false
func WithAfterFileCreationHook ¶
func WithAfterFileCreationHook(f func(fileName string, size int64, warcInfoId string) error) WarcFileWriterOption
WithAfterFileCreationHook sets a function to be called after a new file is created.
The function receives the file name of the new file, the size of the file and the WARC-Warcinfo-ID.
func WithBeforeFileCreationHook ¶
func WithBeforeFileCreationHook(f func(fileName string) error) WarcFileWriterOption
WithBeforeFileCreationHook sets a function to be called before a new file is created.
The function receives the file name of the new file.
func WithCompressedFileSuffix ¶
func WithCompressedFileSuffix(suffix string) WarcFileWriterOption
WithCompressedFileSuffix sets a suffix to be added after the name generated by the WarcFileNameGenerator if compression is on.
defaults to ".gz"
func WithCompression ¶
func WithCompression(compress bool) WarcFileWriterOption
WithCompression sets if writer should write gzip compressed WARC files.
defaults to true
func WithCompressionLevel ¶
func WithCompressionLevel(gzipLevel int) WarcFileWriterOption
WithCompressionLevel sets the gzip level (1-9) to use for compression.
defaults to 5
func WithExpectedCompressionRatio ¶
func WithExpectedCompressionRatio(ratio float64) WarcFileWriterOption
WithExpectedCompressionRatio sets the expected reduction in size when using compression.
This value is used to decide if a record will fit into a Warcfile's MaxFileSize when using compression since it's not possible to know this before the record is written. If the value is far from the actual size reduction, an under- or overfilled file might be the result.
defaults to .5 (half the uncompressed size)
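To make the role of the ratio concrete, here is a minimal sketch of a "will it fit" check. It assumes the writer estimates a record's on-disk size as the uncompressed size multiplied by the ratio and compares the result against the remaining room; the exact rule used by WarcFileWriter is not documented here, and `fits` is a hypothetical helper.

```go
package main

import "fmt"

// fits estimates whether a record of uncompressedSize bytes fits into a file
// that already holds currentSize bytes, given maxSize and an expected ratio
// (0.5 means the record is expected to shrink to half its size).
func fits(currentSize, uncompressedSize, maxSize int64, ratio float64) bool {
	estimated := int64(float64(uncompressedSize) * ratio)
	return currentSize+estimated <= maxSize
}

func main() {
	fmt.Println(fits(900, 150, 1000, 0.5)) // true: 900 + 75 <= 1000
	fmt.Println(fits(900, 300, 1000, 0.5)) // false: 900 + 150 > 1000
}
```

If real records compress worse than the ratio predicts, files end up larger than MaxFileSize; if they compress better, files are rotated earlier than necessary.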
func WithFileNameGenerator ¶
func WithFileNameGenerator(generator WarcFileNameGenerator) WarcFileWriterOption
WithFileNameGenerator sets the WarcFileNameGenerator to use for generating new Warc file names.
Default is to use a PatternNameGenerator with the default pattern.
func WithFlush ¶
func WithFlush(flush bool) WarcFileWriterOption
WithFlush sets if writer should commit each record to stable storage.
defaults to false
func WithMarshaler ¶
func WithMarshaler(marshaler Marshaler) WarcFileWriterOption
WithMarshaler sets the Warc record marshaler to use.
defaults to defaultMarshaler
func WithMaxConcurrentWriters ¶
func WithMaxConcurrentWriters(count int) WarcFileWriterOption
WithMaxConcurrentWriters sets the maximum number of Warc files that can be written simultaneously.
defaults to one
func WithMaxFileSize ¶
func WithMaxFileSize(size int64) WarcFileWriterOption
WithMaxFileSize sets the max size of the Warc file before creating a new one.
defaults to 1 GiB
func WithOpenFileSuffix ¶
func WithOpenFileSuffix(suffix string) WarcFileWriterOption
WithOpenFileSuffix sets a suffix to be added to the file name while the file is open for writing.
The suffix is automatically removed when the file is closed.
defaults to ".open"
func WithRecordOptions ¶
func WithRecordOptions(opts ...WarcRecordOption) WarcFileWriterOption
WithRecordOptions sets the options to use for creating WarcInfo records.
See WithWarcInfoFunc
func WithSegmentation ¶
func WithSegmentation() WarcFileWriterOption
WithSegmentation sets if writer should use segmentation for large WARC records.
defaults to false
func WithWarcInfoFunc ¶
func WithWarcInfoFunc(f func(recordBuilder WarcRecordBuilder) error) WarcFileWriterOption
WithWarcInfoFunc sets a warcinfo-record generator function to be called for every new WARC-file created.
The function receives a WarcRecordBuilder which is prepopulated with WARC-Record-ID, WARC-Type, WARC-Date and Content-Type. After the submitted function returns, Content-Length and WARC-Block-Digest fields are calculated.
When this option is set, records written to the warcfile will have the WARC-Warcinfo-ID automatically set to point to the generated warcinfo record.
Use WithRecordOptions to modify the options used to create the WarcInfo record.
defaults to nil (no generation of warcinfo record)
type WarcRecord ¶
type WarcRecord interface {
	// Version returns the WARC version of the record.
	Version() *WarcVersion
	// Type returns the WARC record type.
	Type() RecordType
	// WarcHeader returns the WARC header fields.
	WarcHeader() *WarcFields
	// Block returns the content block of the record.
	Block() Block
	// RecordId returns the WARC-Record-ID header field.
	RecordId() string
	// ContentLength returns the Content-Length header field.
	ContentLength() (int64, error)
	// Date returns the WARC-Date header field.
	Date() (time.Time, error)
	// String returns a string representation of the record.
	String() string
	// Closer closes the record and releases any resources associated with it.
	io.Closer
	// ToRevisitRecord takes RevisitRef referencing the record we want to make a revisit of and returns a revisit record.
	ToRevisitRecord(ref *RevisitRef) (WarcRecord, error)
	// RevisitRef extracts a RevisitRef from the current record if it is a revisit record.
	RevisitRef() (*RevisitRef, error)
	// CreateRevisitRef creates a RevisitRef which references the current record.
	//
	// The RevisitRef might be used by another record's ToRevisitRecord to create a revisit record referencing this record.
	CreateRevisitRef(profile string) (*RevisitRef, error)
	// Merge merges this record with its referenced record(s)
	//
	// It is implemented only for revisit records, but this function will be enhanced to also support segmented records.
	Merge(record ...WarcRecord) (WarcRecord, error)
	// ValidateDigest validates block and payload digests if present.
	//
	// If option FixDigest is set, an invalid or missing digest will be corrected in the header.
	// Digest validation requires the whole content block to be read. As a side effect the Content-Length field is also validated
	// and if option FixContentLength is set, a wrong content length will be corrected in the header.
	//
	// If the record is not cached, it might not be possible to read any content from this record after validation.
	//
	// The result is dependent on the SpecViolationPolicy option:
	//   ErrIgnore: only fatal errors are returned.
	//   ErrWarn: all errors found will be added to the Validation.
	//   ErrFail: the first error is returned and no more validation is done.
	ValidateDigest(validation *Validation) error
}
WarcRecord is the interface implemented by types that can represent a WARC record. A new instance of WarcRecord is created by a WarcRecordBuilder.
type WarcRecordBuilder ¶
type WarcRecordBuilder interface {
	io.Writer
	io.StringWriter
	io.ReaderFrom
	io.Closer
	AddWarcHeader(name string, value string)
	AddWarcHeaderInt(name string, value int)
	AddWarcHeaderInt64(name string, value int64)
	AddWarcHeaderTime(name string, value time.Time)
	Build() (WarcRecord, *Validation, error)
	Size() int64
	SetRecordType(recordType RecordType)
}
func NewRecordBuilder ¶
func NewRecordBuilder(recordType RecordType, opts ...WarcRecordOption) WarcRecordBuilder
NewRecordBuilder initializes a WarcRecordBuilder used for creating a new record.
WarcRecordBuilder implements io.Writer for adding the content block. recordType may be 0, in which case SetRecordType or AddWarcHeader(WarcType, "myRecordType") must be called before Build.
When you have finished adding headers and writing content, call Build on the WarcRecordBuilder to create a WarcRecord.
Example ¶
builder := gowarc.NewRecordBuilder(gowarc.Response)
_, err := builder.WriteString("HTTP/1.1 200 OK\nDate: Tue, 19 Sep 2016 17:18:40 GMT\nServer: Apache/2.0.54 (Ubuntu)\n" +
	"Last-Modified: Mon, 16 Jun 2013 22:28:51 GMT\nETag: \"3e45-67e-2ed02ec0\"\nAccept-Ranges: bytes\n" +
	"Content-Length: 19\nConnection: close\nContent-Type: text/plain\n\nThis is the content")
if err != nil {
	panic(err)
}
builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
builder.AddWarcHeader(gowarc.ContentLength, "257")
builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")
builder.AddWarcHeader(gowarc.WarcBlockDigest, "sha1:B285747AD7CC57AA74BCE2E30B453C8D1CB71BA4")
if wr, v, err := builder.Build(); err == nil {
	fmt.Println(wr, v)
}
Output: WARC record: version: WARC/1.1, type: response, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
type WarcRecordOption ¶
type WarcRecordOption interface {
// contains filtered or unexported methods
}
WarcRecordOption configures validation, marshaling and unmarshaling of WARC records.
func WithAddMissingContentLength ¶
func WithAddMissingContentLength(addMissingContentLength bool) WarcRecordOption
WithAddMissingContentLength sets whether a missing Content-Length header should be calculated.
defaults to true
func WithAddMissingDigest ¶
func WithAddMissingDigest(addMissingDigest bool) WarcRecordOption
WithAddMissingDigest sets whether missing block digest and, where applicable, payload digest header fields should be calculated.
Only fields which can be generated automatically are added. That includes WarcRecordID, ContentLength, BlockDigest and PayloadDigest.
defaults to true
func WithAddMissingRecordId ¶
func WithAddMissingRecordId(addMissingRecordId bool) WarcRecordOption
WithAddMissingRecordId sets whether a missing WARC-Record-ID header should be generated.
defaults to true
func WithBlockErrorPolicy ¶
func WithBlockErrorPolicy(policy errorPolicy) WarcRecordOption
WithBlockErrorPolicy sets the policy for handling errors in block parsing.
For most records, the block is the content fetched from the original source, so errors in it should normally be ignored.
defaults to ErrIgnore
func WithBufferMaxMemBytes ¶
func WithBufferMaxMemBytes(size int64) WarcRecordOption
WithBufferMaxMemBytes sets the maximum amount of memory a buffer is allowed to use before overflowing to disk.
defaults to 1 MiB
func WithBufferTmpDir ¶
func WithBufferTmpDir(dir string) WarcRecordOption
WithBufferTmpDir sets the directory to use for temporary files.
If not set or dir is the empty string then the default directory for temporary files is used (see os.TempDir).
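How the two buffer options interact can be illustrated with a standalone sketch of a writer that keeps a bounded number of bytes in memory and sends the rest to a temporary file. This is not gowarc's implementation: spillBuffer and its fields are hypothetical names, and the only behaviour taken from the text above is the memory limit and the fallback to os.TempDir when the directory is empty (which os.CreateTemp also does).

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// spillBuffer is a hypothetical type used only for illustration.
type spillBuffer struct {
	maxMem int64        // plays the role of the BufferMaxMemBytes setting
	tmpDir string       // plays the role of the BufferTmpDir setting; "" means os.TempDir()
	mem    bytes.Buffer // bytes kept in memory
	file   *os.File     // overflow file, created lazily
}

// Write fills the in-memory part first and spills the remainder to disk.
func (b *spillBuffer) Write(p []byte) (int, error) {
	written := 0
	if room := b.maxMem - int64(b.mem.Len()); room > 0 {
		n := len(p)
		if int64(n) > room {
			n = int(room)
		}
		b.mem.Write(p[:n])
		written = n
		p = p[n:]
	}
	if len(p) == 0 {
		return written, nil
	}
	if b.file == nil {
		// An empty directory makes os.CreateTemp use os.TempDir().
		f, err := os.CreateTemp(b.tmpDir, "spill-*")
		if err != nil {
			return written, err
		}
		b.file = f
	}
	n, err := b.file.Write(p)
	return written + n, err
}

// onDisk reports whether any data has overflowed to a temporary file.
func (b *spillBuffer) onDisk() bool { return b.file != nil }

// Close removes the temporary file, if one was created.
func (b *spillBuffer) Close() error {
	if b.file == nil {
		return nil
	}
	name := b.file.Name()
	if err := b.file.Close(); err != nil {
		return err
	}
	return os.Remove(name)
}

func main() {
	b := &spillBuffer{maxMem: 8}
	defer b.Close()
	if _, err := b.Write([]byte("hello world!")); err != nil {
		panic(err)
	}
	fmt.Println("in memory:", b.mem.Len(), "spilled to disk:", b.onDisk())
}
```

A small memory limit keeps many concurrent records cheap at the cost of disk I/O for large blocks; a large limit does the opposite.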
func WithDefaultDigestAlgorithm ¶
func WithDefaultDigestAlgorithm(defaultDigestAlgorithm string) WarcRecordOption
WithDefaultDigestAlgorithm sets which algorithm to use for digest generation.
Valid values: 'md5', 'sha1', 'sha256' and 'sha512'.
defaults to sha1
func WithDefaultDigestEncoding ¶
func WithDefaultDigestEncoding(defaultDigestEncoding digestEncoding) WarcRecordOption
WithDefaultDigestEncoding sets which encoding to use for digest generation.
Valid values: Base16, Base32 and Base64.
defaults to Base32
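To make the default combination of sha1 and Base32 concrete, the following standalone sketch computes a digest string of that shape with the standard library only. The function name sha1Base32Digest is hypothetical, and the "algorithm:value" layout is assumed from the WarcBlockDigest value in the NewRecordBuilder example; the exact string gowarc emits may differ.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"fmt"
)

// sha1Base32Digest is a hypothetical helper, not part of gowarc. It returns
// the SHA-1 of content, Base32-encoded and labelled with the algorithm name.
func sha1Base32Digest(content []byte) string {
	sum := sha1.Sum(content)
	// A SHA-1 sum is 20 bytes, which Base32 encodes to 32 characters
	// without padding.
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	fmt.Println(sha1Base32Digest([]byte("This is the content")))
}
```

Base32 yields a shorter value than Base16 (32 characters instead of 40 for SHA-1) while remaining case-insensitive, unlike Base64.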
func WithFixContentLength ¶
func WithFixContentLength(fixContentLength bool) WarcRecordOption
WithFixContentLength sets whether a ContentLength header whose value does not match the actual content length should be set to the real value.
This will not have any impact if SpecViolationPolicy is ErrIgnore.
defaults to true
func WithFixDigest ¶
func WithFixDigest(fixDigest bool) WarcRecordOption
WithFixDigest sets whether a BlockDigest or PayloadDigest header whose value does not match the actual content should be recalculated.
This will not have any impact if SpecViolationPolicy is ErrIgnore.
defaults to true
func WithFixSyntaxErrors ¶
func WithFixSyntaxErrors(fixSyntaxErrors bool) WarcRecordOption
WithFixSyntaxErrors sets whether to attempt to fix syntax errors when they are detected.
This will not have any impact if SyntaxErrorPolicy is ErrIgnore.
defaults to true
func WithFixWarcFieldsBlockErrors ¶
func WithFixWarcFieldsBlockErrors(fixWarcFieldsBlockErrors bool) WarcRecordOption
WithFixWarcFieldsBlockErrors sets whether to attempt to fix syntax errors in a warcfields block when they are detected.
A warcfields block is typically generated by a web crawler. An error in this context suggests a potential bug in the crawler's WARC writer.
defaults to false
func WithNoValidation ¶
func WithNoValidation() WarcRecordOption
WithNoValidation sets the parser to do as little validation as possible.
This option is for parsing as fast as possible and being as lenient as possible. Settings implied by this option are:
SyntaxErrorPolicy = ErrIgnore
SpecViolationPolicy = ErrIgnore
UnknownRecordPolicy = ErrIgnore
SkipParseBlock = true
func WithRecordIdFunc ¶
func WithRecordIdFunc(recordIdFunc func() (string, error)) WarcRecordOption
WithRecordIdFunc sets a function for generating WARC-Record-ID if AddMissingRecordId is true.
Expected output is a valid URI without the surrounding '<' and '>' as described in the WARC spec (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-id-mandatory)
defaults to generating uuid
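A function with the func() (string, error) shape expected by WithRecordIdFunc could look like the sketch below. The name newURNUUID is hypothetical and the code uses only the standard library; it builds a random (version 4) UUID URN and, as required above, returns it without the surrounding '<' and '>'.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newURNUUID is a hypothetical generator, not part of gowarc. It returns a
// random version 4 UUID as a URN, without surrounding angle brackets.
func newURNUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version nibble to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("urn:uuid:%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newURNUUID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Because the signature matches, such a function could be passed directly as the recordIdFunc argument; any other scheme works as well, provided the result is a valid URI.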
func WithSkipParseBlock ¶
func WithSkipParseBlock() WarcRecordOption
WithSkipParseBlock sets the parser to skip detection of known block types.
This implies that no payload digest can be computed.
func WithSpecViolationPolicy ¶
func WithSpecViolationPolicy(policy errorPolicy) WarcRecordOption
WithSpecViolationPolicy sets the policy for handling violations of the WARC specification in WARC records.
defaults to ErrWarn
func WithStrictValidation ¶
func WithStrictValidation() WarcRecordOption
WithStrictValidation sets the parser to fail on the first error or violation of the WARC specification.
Settings implied by this option are:
SyntaxErrorPolicy = ErrFail
SpecViolationPolicy = ErrFail
UnknownRecordPolicy = ErrFail
SkipParseBlock = false
func WithSyntaxErrorPolicy ¶
func WithSyntaxErrorPolicy(policy errorPolicy) WarcRecordOption
WithSyntaxErrorPolicy sets the policy for handling syntax errors in WARC records.
defaults to ErrWarn
func WithUnknownRecordTypePolicy ¶
func WithUnknownRecordTypePolicy(policy errorPolicy) WarcRecordOption
WithUnknownRecordTypePolicy sets the policy for handling unknown record types.
defaults to ErrWarn
func WithVersion ¶
func WithVersion(version *WarcVersion) WarcRecordOption
WithVersion sets the WARC version to use for new records.
defaults to WARC/1.1
type WarcVersion ¶
type WarcVersion struct {
// contains filtered or unexported fields
}
WarcVersion represents a WARC specification version.
For record creation, only WARC 1.0 and 1.1 are supported; they are represented by the constants V1_0 and V1_1. When a record is parsed, the WarcVersion takes the version value found in the record itself.
func (*WarcVersion) Major ¶
func (v *WarcVersion) Major() uint8
func (*WarcVersion) Minor ¶
func (v *WarcVersion) Minor() uint8
func (*WarcVersion) String ¶
func (v *WarcVersion) String() string
String returns a string representation of the WARC version in the format used by WARC files, e.g. 'WARC/1.0' or 'WARC/1.1'.