gowarc

Version: v2.0.1
Published: Oct 31, 2024 License: Apache-2.0 Imports: 32 Imported by: 11

Documentation

Overview

Package gowarc provides a framework for handling WARC files, enabling their parsing, creation, and validation.

WARC Overview

The WARC format offers a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is used to build applications for harvesting, managing, accessing, mining and exchanging content.

For more details, visit the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

WARC record creation

The WarcRecordBuilder, initialized via NewRecordBuilder, is the primary tool for creating WARC records. By default, the WarcRecordBuilder generates a record id and calculates the 'Content-Length' and 'WARC-Block-Digest'.

Use WarcFileWriter, initialized with NewWarcFileWriter, to write WARC files.

WARC record parsing

To parse single WARC records, use the Unmarshaler initialized with NewUnmarshaler.

To read entire WARC files, employ the WarcFileReader initialized through NewWarcFileReader.

Validation and repair

The gowarc package supports validation during both the creation and parsing of WARC records. Control over the scope of validation and the handling of validation errors can be achieved by setting the appropriate options in the WarcRecordBuilder, Unmarshaler, or WarcFileReader.

Constants

const (
	Base16 digestEncoding = 1
	Base32 digestEncoding = 2
	Base64 digestEncoding = 3
)
const (
	// WARC header field name constants
	ContentLength             = "Content-Length"
	ContentType               = "Content-Type"
	WarcBlockDigest           = "WARC-Block-Digest"
	WarcConcurrentTo          = "WARC-Concurrent-To"
	WarcDate                  = "WARC-Date"
	WarcFilename              = "WARC-Filename"
	WarcIPAddress             = "WARC-IP-Address"
	WarcIdentifiedPayloadType = "WARC-Identified-Payload-Type"
	WarcPayloadDigest         = "WARC-Payload-Digest"
	WarcProfile               = "WARC-Profile"
	WarcRecordID              = "WARC-Record-ID"
	WarcRefersTo              = "WARC-Refers-To"
	WarcRefersToDate          = "WARC-Refers-To-Date"
	WarcRefersToTargetURI     = "WARC-Refers-To-Target-URI"
	WarcSegmentNumber         = "WARC-Segment-Number"
	WarcSegmentOriginID       = "WARC-Segment-Origin-ID"
	WarcSegmentTotalLength    = "WARC-Segment-Total-Length"
	WarcTargetURI             = "WARC-Target-URI"
	WarcTruncated             = "WARC-Truncated"
	WarcType                  = "WARC-Type"
	WarcWarcinfoID            = "WARC-Warcinfo-ID"
	WarcPageID                = "WARC-Page-ID"       // Browsertrix extension field
	WarcResourceType          = "WARC-Resource-Type" // Browsertrix extension field
	WarcJSONMetadata          = "WARC-JSON-Metadata" // Browsertrix extension field
)
const (
	ErrIgnore errorPolicy = 0 // Ignore the given error.
	ErrWarn   errorPolicy = 1 // Ignore given error, but submit a warning.
	ErrFail   errorPolicy = 2 // Fail on given error.
)
const (
	// Well known content types
	ApplicationWarcFields = "application/warc-fields"
	ApplicationHttp       = "application/http"
)
const (
	// Well known revisit profiles
	ProfileIdenticalPayloadDigestV1_1 = "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"
	ProfileServerNotModifiedV1_1      = "http://netpreserve.org/warc/1.1/revisit/server-not-modified"
	ProfileIdenticalPayloadDigestV1_0 = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"
	ProfileServerNotModifiedV1_0      = "http://netpreserve.org/warc/1.0/revisit/server-not-modified"
)

Variables

var (
	// WARC versions
	V1_0 = &WarcVersion{id: 1, txt: "1.0", major: 1, minor: 0} // WARC 1.0
	V1_1 = &WarcVersion{id: 2, txt: "1.1", major: 1, minor: 1} // WARC 1.1
)

Functions

This section is empty.

Types

type Block

type Block interface {
	// RawBytes returns the bytes of the Block
	RawBytes() (io.Reader, error)
	BlockDigest() string
	Size() int64
	IsCached() bool
	Cache() error
	io.Closer
}

Block is the interface used to represent the content of a WARC record as specified by the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-content-block

A Block might be cached or non-cached. Calling RawBytes or BlockDigest more than once will fail if the block is not cached.

NOTE: Blocks are not required to be thread safe.

type HeaderFieldError

type HeaderFieldError struct {
	// contains filtered or unexported fields
}

HeaderFieldError is used for violations of WARC header specification

func (*HeaderFieldError) Error

func (e *HeaderFieldError) Error() string

type HttpRequestBlock

type HttpRequestBlock interface {
	PayloadBlock
	ProtocolHeaderBlock
	HttpRequestLine() string
	HttpHeader() *http.Header
}

type HttpResponseBlock

type HttpResponseBlock interface {
	PayloadBlock
	ProtocolHeaderBlock
	HttpStatusLine() string
	HttpStatusCode() int
	HttpHeader() *http.Header
}

type Marshaler

type Marshaler interface {
	Marshal(w io.Writer, record WarcRecord, maxSize int64) (WarcRecord, int64, error)
}

Marshaler is the interface that wraps the Marshal function.

Marshal converts a WARC record to its serialized form and returns the size of the marshalled record or any error encountered.

Depending on implementation, Marshal might return a WarcRecord which is the continuation of the record being written. See the description of record segmentation at https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-segmentation

func NewMarshaler

func NewMarshaler() Marshaler

type PatternNameGenerator

type PatternNameGenerator struct {
	Directory string // Directory to store warcfiles. Defaults to the empty string
	Prefix    string // Prefix available to be used in pattern. Defaults to the empty string
	Serial    int32  // Serial number available for use in pattern. It is atomically increased with every generated file name.
	Pattern   string // Pattern for generated file name. Defaults to: "%{prefix}s%{ts}s-%04{serial}d-%{hostOrIp}s.%{ext}s"
	Extension string // Extension for file name. Defaults to: "warc"
	// contains filtered or unexported fields
}

PatternNameGenerator implements the WarcFileNameGenerator.

New filenames are generated based on a pattern which defaults to the recommendation in the WARC 1.1 standard (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-c-informative-warc-file-size-and-name-recommendations). The pattern syntax is like that of Go's fmt package (https://pkg.go.dev/fmt), but allows for named fields in curly braces. The available predefined names are:

  • prefix - content of the Prefix field
  • ext - content of the Extension field
  • ts - current time as 14-digit GMT Time-stamp
  • serial - atomically increased serial number for every generated file name. Initial value is 0 if Serial field is not set
  • ip - primary IP address of the node
  • host - host name of the node
  • hostOrIp - host name of the node, falling back to IP address if host name could not be resolved
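To make the pattern syntax concrete, the following sketch expands the default pattern using only the standard library. It is not the library's implementation: expandPattern, the regular expression, and the sample values ("crawl-", "node1") are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// namedVerb matches fmt-style verbs that carry a name in curly braces,
// for example %{ts}s or %04{serial}d.
var namedVerb = regexp.MustCompile(`%([^{%]*)\{(\w+)\}([a-zA-Z])`)

// expandPattern is a hypothetical helper: it rewrites every named verb into a
// plain fmt verb and formats the value registered under that name.
func expandPattern(pattern string, values map[string]any) string {
	return namedVerb.ReplaceAllStringFunc(pattern, func(m string) string {
		parts := namedVerb.FindStringSubmatch(m)
		flags, name, verb := parts[1], parts[2], parts[3]
		return fmt.Sprintf("%"+flags+verb, values[name])
	})
}

func main() {
	// A fixed time keeps the example deterministic; ts is a 14-digit GMT timestamp.
	ts := time.Date(2024, 1, 31, 12, 0, 0, 0, time.UTC).Format("20060102150405")

	name := expandPattern("%{prefix}s%{ts}s-%04{serial}d-%{hostOrIp}s.%{ext}s", map[string]any{
		"prefix":   "crawl-",
		"ts":       ts,
		"serial":   int32(7),
		"hostOrIp": "node1",
		"ext":      "warc",
	})
	fmt.Println(name)
}
```

In this sketch the flags between `%` and `{` (such as `04`) are passed through unchanged, which is how the zero-padded serial number is produced.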

func (*PatternNameGenerator) NewWarcfileName

func (g *PatternNameGenerator) NewWarcfileName() (string, string)

NewWarcfileName returns a directory (might be the empty string for current directory) and a file name

type PayloadBlock

type PayloadBlock interface {
	Block
	PayloadBytes() (io.Reader, error)
	PayloadDigest() string
}

PayloadBlock is a Block with a well-defined payload.

Ref: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-payload

type ProtocolHeaderBlock

type ProtocolHeaderBlock interface {
	// ProtocolHeaderBytes returns the raw bytes from the protocol's header.
	ProtocolHeaderBytes() []byte
}

ProtocolHeaderBlock is a Block with a well-defined protocol header, such as an HTTP response.

type RecordType

type RecordType uint16

RecordType represents the type of a WARC record.

const (
	// WARC record types
	Warcinfo     RecordType = 1
	Response     RecordType = 2
	Resource     RecordType = 4
	Request      RecordType = 8
	Metadata     RecordType = 16
	Revisit      RecordType = 32
	Conversion   RecordType = 64
	Continuation RecordType = 128
)
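The record type values are distinct powers of two, so several types could in principle be combined into one bitmask. Whether gowarc itself relies on this is an assumption; the sketch below uses a local recordType type that merely mirrors the numeric values listed above.

```go
package main

import "fmt"

// recordType is a local stand-in with the same numeric values as the
// constants above; it is not gowarc.RecordType.
type recordType uint16

const (
	warcinfo     recordType = 1
	response     recordType = 2
	resource     recordType = 4
	request      recordType = 8
	metadata     recordType = 16
	revisit      recordType = 32
	conversion   recordType = 64
	continuation recordType = 128
)

// matches reports whether t is one of the types contained in mask.
func matches(mask, t recordType) bool {
	return mask&t != 0
}

func main() {
	// Select the record types that carry harvested content.
	wanted := response | resource | revisit

	fmt.Println("response wanted:", matches(wanted, response))
	fmt.Println("request wanted:", matches(wanted, request))
}
```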

func (RecordType) String

func (rt RecordType) String() string

String returns a string representation of the record type.

type RevisitRef

type RevisitRef struct {
	Profile        string
	TargetRecordId string
	TargetUri      string
	TargetDate     string
}

type SyntaxError

type SyntaxError struct {
	// contains filtered or unexported fields
}

SyntaxError is used for syntactical errors like wrong line endings

func (*SyntaxError) Error

func (e *SyntaxError) Error() string

func (*SyntaxError) Unwrap

func (e *SyntaxError) Unwrap() error

type Unmarshaler

type Unmarshaler interface {
	Unmarshal(b *bufio.Reader) (WarcRecord, int64, *Validation, error)
}

Unmarshaler is the interface implemented by types that can unmarshal a WARC record. A new instance of Unmarshaler is created by calling NewUnmarshaler. NewUnmarshaler accepts a number of options that can be used to control the unmarshalling process. See WarcRecordOption for details.

Unmarshal parses the WARC record from the given reader and returns:

  • The parsed WARC record. If an error occurred during the parsing, the returned WARC record might be nil.
  • The offset value indicating the number of bytes that were discarded before the start of a new record was found.
  • A pointer to a Validation object that stores any errors or warnings encountered during the parsing process. The validation object is only populated if the error specification is set to ErrWarn or ErrFail.
  • The standard error object in Go. If no error occurred during the parsing, this object is nil. Otherwise, it contains details about the encountered error.

If the reader contains multiple records, Unmarshal parses the first record and returns. If the reader contains no records, Unmarshal returns an io.EOF error.

Example
data := bytes.NewBufferString("  WARC/1.1\r\n" +
	"WARC-Date: 2017-03-06T04:03:53Z\r\n" +
	"WARC-Record-ID: <urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>\r\n" +
	"WARC-Filename: temp-20170306040353.warc.gz\r\n" +
	"WARC-Type: warcinfo\r\n" +
	"Content-Type: application/warc-fields\r\n" +
	"Warc-Block-Digest: sha1:af4d582b4ffc017d07a947d841e392a821f754f3\r\n" +
	"Content-Length: 34\r\n" +
	"\r\n" +
	"format: WARC File Format 1.1\r\n" +
	"\r\n\r\n")
input := bufio.NewReader(data)

// Create a new unmarshaler
unmarshaler := gowarc.NewUnmarshaler(gowarc.WithSpecViolationPolicy(gowarc.ErrWarn), gowarc.WithSyntaxErrorPolicy(gowarc.ErrWarn))
wr, off, validation, err := unmarshaler.Unmarshal(input)
if err == nil {
	fmt.Printf("Offset: %d, %s\n%s", off, wr, validation)
}
Output:

Offset: 2, WARC record: version: WARC/1.1, type: warcinfo, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
gowarc: Validation errors:
  1: gowarc: record was found 2 bytes after expected offset
  2: block: wrong digest: expected sha1:af4d582b4ffc017d07a947d841e392a821f754f3, computed: sha1:8a936f9fd60d664cf95b1ffb40f1c4093e65bb40
  3: too few bytes in end of record marker. Expected "\r\n\r\n", was ""

func NewUnmarshaler

func NewUnmarshaler(opts ...WarcRecordOption) Unmarshaler

type Validation

type Validation []error

Validation contains validation results.

func (*Validation) Error

func (v *Validation) Error() string

func (*Validation) String

func (v *Validation) String() string

func (*Validation) Valid

func (v *Validation) Valid() bool

Valid returns true if no validation errors were found.

type WarcFields

type WarcFields []*nameValue

WarcFields represents the key value pairs in a WARC-record header.

It is also used for representing the record block of records with content-type "application/warc-fields".

All key-manipulating functions take case-insensitive keys and modify them to their canonical form.
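As a rough illustration of case-insensitive keys with a canonical form, the sketch below maps any spelling of a known field name to the spelling used by the header constants. The lookup table and the canonical function are assumptions for this example; how gowarc canonicalizes names internally, and what it does with unknown names, is not stated here.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalNames maps lower-cased field names to the spelling used by the
// header field constants. Only a few names are listed for brevity.
var canonicalNames = map[string]string{
	"content-length":    "Content-Length",
	"warc-block-digest": "WARC-Block-Digest",
	"warc-ip-address":   "WARC-IP-Address",
	"warc-record-id":    "WARC-Record-ID",
}

// canonical returns the canonical spelling of a known field name.
// Returning unknown names unchanged is an assumption of this sketch.
func canonical(name string) string {
	if c, ok := canonicalNames[strings.ToLower(name)]; ok {
		return c
	}
	return name
}

func main() {
	fmt.Println(canonical("Warc-Block-Digest"))
	fmt.Println(canonical("warc-record-id"))
	fmt.Println(canonical("X-Custom"))
}
```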

func (*WarcFields) Add

func (wf *WarcFields) Add(name string, value string)

Add adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

func (*WarcFields) AddId

func (wf *WarcFields) AddId(name, value string)

AddId adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

The value is surrounded with '<' and '>' if not already present.

func (*WarcFields) AddInt

func (wf *WarcFields) AddInt(name string, value int)

AddInt adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

func (*WarcFields) AddInt64

func (wf *WarcFields) AddInt64(name string, value int64)

AddInt64 adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

func (*WarcFields) AddTime

func (wf *WarcFields) AddTime(name string, value time.Time)

AddTime adds the key, value pair to the header. It appends to any existing values associated with key. The key is case-insensitive.

The value is converted to RFC 3339 format.

func (*WarcFields) CanonicalHeaderKey

func (wf *WarcFields) CanonicalHeaderKey(s string) string

func (*WarcFields) Delete

func (wf *WarcFields) Delete(key string)

Delete deletes the values associated with key. The key is case-insensitive.

func (*WarcFields) Get

func (wf *WarcFields) Get(key string) string

Get gets the first value associated with the given key. It is case-insensitive. If the key doesn't exist or there are no values associated with the key, Get returns the empty string. To access multiple values of a key, use GetAll.

func (*WarcFields) GetAll

func (wf *WarcFields) GetAll(name string) []string

GetAll returns all values associated with the given key. It is case-insensitive.

func (*WarcFields) GetId

func (wf *WarcFields) GetId(name string) string

GetId is like Get, but removes the surrounding '<' and '>' from the field value.
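The angle-bracket handling described for AddId, SetId and GetId can be sketched with two small helpers. wrapID and unwrapID are hypothetical names used only in this example; they show the described behaviour, not the library's code.

```go
package main

import (
	"fmt"
	"strings"
)

// wrapID surrounds the value with '<' and '>' unless both are already present.
func wrapID(value string) string {
	if strings.HasPrefix(value, "<") && strings.HasSuffix(value, ">") {
		return value
	}
	return "<" + value + ">"
}

// unwrapID removes one surrounding '<' and '>' pair, if present.
func unwrapID(value string) string {
	return strings.TrimSuffix(strings.TrimPrefix(value, "<"), ">")
}

func main() {
	id := "urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008"
	stored := wrapID(id)
	fmt.Println(stored)
	fmt.Println(unwrapID(stored) == id)
}
```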

func (*WarcFields) GetInt

func (wf *WarcFields) GetInt(key string) (int, error)

GetInt is like Get, but converts the field value to int.

func (*WarcFields) GetInt64

func (wf *WarcFields) GetInt64(name string) (int64, error)

GetInt64 is like Get, but converts the field value to int64.

func (*WarcFields) GetTime

func (wf *WarcFields) GetTime(name string) (time.Time, error)

GetTime is like Get, but converts the field value to time.Time. The field is expected to be in RFC 3339 format.

func (*WarcFields) Has

func (wf *WarcFields) Has(name string) bool

Has returns true if field exists. This can be used to separate a missing field from a field for which value is the empty string.

func (*WarcFields) Set

func (wf *WarcFields) Set(name string, value string)

Set sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

func (*WarcFields) SetId

func (wf *WarcFields) SetId(name, value string)

SetId sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

The value is surrounded with '<' and '>' if not already present.

func (*WarcFields) SetInt

func (wf *WarcFields) SetInt(name string, value int)

SetInt sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

func (*WarcFields) SetInt64

func (wf *WarcFields) SetInt64(name string, value int64)

SetInt64 sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

func (*WarcFields) SetTime

func (wf *WarcFields) SetTime(name string, value time.Time)

SetTime sets the header entries associated with key to the single element value. It replaces any existing values associated with key. The key is case-insensitive.

The value is converted to RFC 3339 format.

func (*WarcFields) Sort

func (wf *WarcFields) Sort()

Sort sorts the fields in lexicographical order.

Only field names are sorted. Order of values for a repeated field is kept as is.
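Sorting by name while keeping repeated values in their original order is what a stable sort provides. The sketch below shows the idea with sort.SliceStable on a local field type; it is an illustration of the described ordering, not the library's implementation, and the byte-wise string comparison is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// field is a local stand-in for a name/value pair.
type field struct {
	name  string
	value string
}

// sortFields orders fields by name only. Because the sort is stable,
// values of a repeated field keep their relative order.
func sortFields(fields []field) {
	sort.SliceStable(fields, func(i, j int) bool {
		return fields[i].name < fields[j].name
	})
}

func main() {
	fields := []field{
		{"WARC-Type", "response"},
		{"WARC-Concurrent-To", "<urn:uuid:2>"},
		{"Content-Length", "257"},
		{"WARC-Concurrent-To", "<urn:uuid:1>"},
	}
	sortFields(fields)
	for _, f := range fields {
		fmt.Printf("%s: %s\n", f.name, f.value)
	}
}
```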

func (*WarcFields) String

func (wf *WarcFields) String() string

func (*WarcFields) Write

func (wf *WarcFields) Write(w io.Writer) (bytesWritten int64, err error)

Write writes the fields to w and returns the number of bytes written. Note that its signature differs from the Write method of io.Writer.

type WarcFieldsBlock

type WarcFieldsBlock interface {
	Block
	WarcFields() *WarcFields
}

type WarcFileNameGenerator

type WarcFileNameGenerator interface {
	// NewWarcfileName returns a directory (might be the empty string for current directory) and a file name
	NewWarcfileName() (string, string)
}

WarcFileNameGenerator is the interface that wraps the NewWarcfileName function.

type WarcFileReader

type WarcFileReader struct {
	// contains filtered or unexported fields
}

WarcFileReader is used to read WARC files. Use NewWarcFileReader to create a new instance.

func NewWarcFileReader

func NewWarcFileReader(filename string, offset int64, opts ...WarcRecordOption) (*WarcFileReader, error)

NewWarcFileReader creates a new WarcFileReader from the supplied filename. If offset is > 0, the reader will start reading from that offset. The WarcFileReader can be configured with options. See WarcRecordOption.

Example
reader, err := gowarc.NewWarcFileReader("test.warc.gz", 0, gowarc.WithStrictValidation())
if err != nil {
	fmt.Println("Error creating warc reader:", err)
	return
}
defer reader.Close() // release the underlying file when done

for {
	record, _, _, err := reader.Next()
	if err == io.EOF {
		break
	}
	if err != nil {
		fmt.Println("Error reading record:", err)
		return
	}
	fmt.Println("Record type:", record.Type().String())
	fmt.Println("Record version:", record.Version())
	// Do more with record as per needs
}

func NewWarcFileReaderFromStream

func NewWarcFileReaderFromStream(r io.Reader, offset int64, opts ...WarcRecordOption) (*WarcFileReader, error)

NewWarcFileReaderFromStream creates a new WarcFileReader from the supplied io.Reader. The WarcFileReader can be configured with options. See WarcRecordOption.

It is the responsibility of the caller to close the io.Reader.

func (*WarcFileReader) Close

func (wf *WarcFileReader) Close() error

Close closes the WarcFileReader.

func (*WarcFileReader) Next

func (wf *WarcFileReader) Next() (WarcRecord, int64, *Validation, error)

Next reads the next WarcRecord from the WarcFileReader. The method also provides the offset at which the record is found within the file.

The validation and error values that Next produces depend on the errorPolicy options that have been set on the WarcFileReader:

  • ErrIgnore: This setting ignores all errors. A WarcRecord and its offset are returned without any validation. An error is only returned if the file is so badly formatted that nothing meaningful can be parsed.

  • ErrWarn: Similar to ErrIgnore, this setting returns a WarcRecord and its offset. However, the record is validated and all validation errors are collected in a Validation object which can then be examined.

  • ErrFail: If this is set, the method will return an error in the case of a validation error, and WarcRecord might be nil.

  • Mixed Policies: It's possible to set different error policies for different types of errors with the following options: WithSyntaxErrorPolicy, WithSpecViolationPolicy and WithUnknownRecordTypePolicy. The return values of Next would be a mix of the aforementioned scenarios based on the policies set.

When at end of file, returned offset is equal to length of file, WarcRecord is nil and err is io.EOF.

type WarcFileWriter

type WarcFileWriter struct {
	// contains filtered or unexported fields
}

WarcFileWriter is used to write WARC files. Use NewWarcFileWriter to create a new instance.

The WarcFileWriter writes to one or more files simultaneously. The number of files is controlled by the WithMaxConcurrentWriters option. The WarcFileWriter will create a new file when the current file size exceeds the value set by the WithMaxFileSize option. File names are generated by the WarcFileNameGenerator set by the WithFileNameGenerator option. The WarcFileWriter will add a Warcinfo record to each file if the WithWarcInfoFunc option is set.

func NewWarcFileWriter

func NewWarcFileWriter(opts ...WarcFileWriterOption) *WarcFileWriter

NewWarcFileWriter creates a new WarcFileWriter with the supplied options.

Example
nameGenerator := &gowarc.PatternNameGenerator{Directory: "directory-name"}

w := gowarc.NewWarcFileWriter(gowarc.WithFileNameGenerator(nameGenerator))
defer func() {
	w.Close()
}()

builder := gowarc.NewRecordBuilder(gowarc.Response, gowarc.WithStrictValidation())
_, err := builder.WriteString("HTTP/1.1 200 OK\r\nDate: Tue, 19 Sep 2016 17:18:40 GMT\r\nContent-Length: 19 ....")
if err != nil {
	panic(err)
}
builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")

if wr, _, err := builder.Build(); err == nil {
	w.Write(wr)
}

func (*WarcFileWriter) Close

func (w *WarcFileWriter) Close() error

Close closes the current file(s) being written to and then releases all resources used by the WarcFileWriter.

Calling Write after Close will panic.

func (*WarcFileWriter) Rotate

func (w *WarcFileWriter) Rotate() error

Rotate closes the current files being written to.

A call to Write after Rotate creates new files.

func (*WarcFileWriter) String

func (w *WarcFileWriter) String() string

func (*WarcFileWriter) Write

func (w *WarcFileWriter) Write(record ...WarcRecord) []WriteResponse

Write marshals one or more WarcRecords to file.

If more than one is written, then those will be written sequentially to the same file if size permits. If the writer was created with the WithAddWarcConcurrentToHeader option, each record will have cross-reference headers.

Returns a slice with one WriteResponse for each record written.

type WarcFileWriterOption

type WarcFileWriterOption interface {
	// contains filtered or unexported methods
}

WarcFileWriterOption configures how to write WARC files.
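Option types like this one typically follow Go's functional options pattern: a constructor starts from defaults and lets each option modify a private configuration. The sketch below shows that pattern in isolation. All names in it are hypothetical; the defaults mirror the documented ones (compression on, gzip level 5, 1 GiB maximum file size, ".open" suffix), but the structure of gowarc's internal configuration is an assumption.

```go
package main

import "fmt"

// writerConfig is a hypothetical private configuration struct.
type writerConfig struct {
	compress    bool
	gzipLevel   int
	maxFileSize int64
	openSuffix  string
}

// writerOption has only an unexported method, so other packages can pass
// options around but cannot define their own.
type writerOption interface {
	apply(*writerConfig)
}

// optionFunc adapts a plain function to the writerOption interface.
type optionFunc func(*writerConfig)

func (f optionFunc) apply(c *writerConfig) { f(c) }

func withCompression(compress bool) writerOption {
	return optionFunc(func(c *writerConfig) { c.compress = compress })
}

func withMaxFileSize(size int64) writerOption {
	return optionFunc(func(c *writerConfig) { c.maxFileSize = size })
}

// newWriterConfig starts from the defaults and applies the options in order.
func newWriterConfig(opts ...writerOption) writerConfig {
	c := writerConfig{compress: true, gzipLevel: 5, maxFileSize: 1 << 30, openSuffix: ".open"}
	for _, o := range opts {
		o.apply(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newWriterConfig())
	fmt.Printf("%+v\n", newWriterConfig(withCompression(false), withMaxFileSize(100<<20)))
}
```

A design consequence of this pattern is that new options can be added later without changing the constructor's signature.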

func WithAddWarcConcurrentToHeader

func WithAddWarcConcurrentToHeader(addConcurrentHeader bool) WarcFileWriterOption

WithAddWarcConcurrentToHeader configures if records written in the same call to Write should have WARC-Concurrent-To headers added for cross-reference.

defaults to false

func WithAfterFileCreationHook

func WithAfterFileCreationHook(f func(fileName string, size int64, warcInfoId string) error) WarcFileWriterOption

WithAfterFileCreationHook sets a function to be called after a new file is created.

The function receives the file name of the new file, the size of the file and the WARC-Warcinfo-ID.

func WithBeforeFileCreationHook

func WithBeforeFileCreationHook(f func(fileName string) error) WarcFileWriterOption

WithBeforeFileCreationHook sets a function to be called before a new file is created.

The function receives the file name of the new file.

func WithCompressedFileSuffix

func WithCompressedFileSuffix(suffix string) WarcFileWriterOption

WithCompressedFileSuffix sets a suffix to be added after the name generated by the WarcFileNameGenerator if compression is on.

defaults to ".gz"

func WithCompression

func WithCompression(compress bool) WarcFileWriterOption

WithCompression sets if writer should write gzip compressed WARC files.

defaults to true

func WithCompressionLevel

func WithCompressionLevel(gzipLevel int) WarcFileWriterOption

WithCompressionLevel sets the gzip level (1-9) to use for compression.

defaults to 5

func WithExpectedCompressionRatio

func WithExpectedCompressionRatio(ratio float64) WarcFileWriterOption

WithExpectedCompressionRatio sets the expected reduction in size when using compression.

This value is used to decide if a record will fit into a Warcfile's MaxFileSize when using compression since it's not possible to know this before the record is written. If the value is far from the actual size reduction, an under- or overfilled file might be the result.

defaults to .5 (half the uncompressed size)
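The role of the ratio can be shown with a small calculation. The function below is a sketch of how such an estimate might be used to decide whether a record fits; fitsInFile and its parameters are hypothetical, and the exact check performed by the writer is an assumption.

```go
package main

import "fmt"

// fitsInFile estimates whether a record still fits into the current file.
// With compression on, the on-disk size is unknown before writing, so the
// uncompressed size is scaled by the expected compression ratio.
func fitsInFile(currentSize, recordSize, maxFileSize int64, compress bool, ratio float64) bool {
	estimate := recordSize
	if compress {
		estimate = int64(float64(recordSize) * ratio)
	}
	return currentSize+estimate <= maxFileSize
}

func main() {
	const maxFileSize = 1000

	// 800 bytes already written, a 300 byte record, expected to shrink to 150.
	fmt.Println(fitsInFile(800, 300, maxFileSize, true, 0.5))

	// The same record without compression would overflow the file.
	fmt.Println(fitsInFile(800, 300, maxFileSize, false, 0.5))
}
```

If the real reduction is much smaller than the configured ratio, a check of this kind accepts records that end up overfilling the file, which matches the caveat above.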

func WithFileNameGenerator

func WithFileNameGenerator(generator WarcFileNameGenerator) WarcFileWriterOption

WithFileNameGenerator sets the WarcFileNameGenerator to use for generating new Warc file names.

Default is to use a PatternNameGenerator with the default pattern.

func WithFlush

func WithFlush(flush bool) WarcFileWriterOption

WithFlush sets if writer should commit each record to stable storage.

defaults to false

func WithMarshaler

func WithMarshaler(marshaler Marshaler) WarcFileWriterOption

WithMarshaler sets the Warc record marshaler to use.

defaults to defaultMarshaler

func WithMaxConcurrentWriters

func WithMaxConcurrentWriters(count int) WarcFileWriterOption

WithMaxConcurrentWriters sets the maximum number of Warc files that can be written simultaneously.

defaults to one

func WithMaxFileSize

func WithMaxFileSize(size int64) WarcFileWriterOption

WithMaxFileSize sets the max size of the Warc file before creating a new one.

defaults to 1 GiB

func WithOpenFileSuffix

func WithOpenFileSuffix(suffix string) WarcFileWriterOption

WithOpenFileSuffix sets a suffix to be added to the file name while the file is open for writing.

The suffix is automatically removed when the file is closed.

defaults to ".open"

func WithRecordOptions

func WithRecordOptions(opts ...WarcRecordOption) WarcFileWriterOption

WithRecordOptions sets the options to use for creating WarcInfo records.

See WithWarcInfoFunc

func WithSegmentation

func WithSegmentation() WarcFileWriterOption

WithSegmentation sets if writer should use segmentation for large WARC records.

defaults to false

func WithWarcInfoFunc

func WithWarcInfoFunc(f func(recordBuilder WarcRecordBuilder) error) WarcFileWriterOption

WithWarcInfoFunc sets a warcinfo-record generator function to be called for every new WARC-file created.

The function receives a WarcRecordBuilder which is prepopulated with WARC-Record-ID, WARC-Type, WARC-Date and Content-Type. After the submitted function returns, Content-Length and WARC-Block-Digest fields are calculated.

When this option is set, records written to the warcfile will have the WARC-Warcinfo-ID automatically set to point to the generated warcinfo record.

Use WithRecordOptions to modify the options used to create the WarcInfo record.

defaults to nil (no generation of warcinfo record)

type WarcRecord

type WarcRecord interface {
	// Version returns the WARC version of the record.
	Version() *WarcVersion

	// Type returns the WARC record type.
	Type() RecordType

	// WarcHeader returns the WARC header fields.
	WarcHeader() *WarcFields

	// Block returns the content block of the record.
	Block() Block

	// RecordId returns the WARC-Record-ID header field.
	RecordId() string

	// ContentLength returns the Content-Length header field.
	ContentLength() (int64, error)

	// Date returns the WARC-Date header field.
	Date() (time.Time, error)

	// String returns a string representation of the record.
	String() string

	// Closer closes the record and releases any resources associated with it.
	io.Closer

	// ToRevisitRecord takes RevisitRef referencing the record we want to make a revisit of and returns a revisit record.
	ToRevisitRecord(ref *RevisitRef) (WarcRecord, error)

	// RevisitRef extracts a RevisitRef from the current record if it is a revisit record.
	RevisitRef() (*RevisitRef, error)

	// CreateRevisitRef creates a RevisitRef which references the current record.
	//
	// The RevisitRef might be used by another record's ToRevisitRecord to create a revisit record referencing this record.
	CreateRevisitRef(profile string) (*RevisitRef, error)

	// Merge merges this record with its referenced record(s)
	//
	// It is implemented only for revisit records, but this function will be enhanced to also support segmented records.
	Merge(record ...WarcRecord) (WarcRecord, error)

	// ValidateDigest validates block and payload digests if present.
	//
	// If option FixDigest is set, an invalid or missing digest will be corrected in the header.
	// Digest validation requires the whole content block to be read. As a side effect the Content-Length field is also validated
	// and if option FixContentLength is set, a wrong content length will be corrected in the header.
	//
	// If the record is not cached, it might not be possible to read any content from this record after validation.
	//
	// The result is dependent on the SpecViolationPolicy option:
	//   ErrIgnore: only fatal errors are returned.
	//   ErrWarn: all errors found will be added to the Validation.
	//   ErrFail: the first error is returned and no more validation is done.
	ValidateDigest(validation *Validation) error
}

WarcRecord is the interface implemented by types that can represent a WARC record. A new instance of WarcRecord is created by a WarcRecordBuilder.

type WarcRecordBuilder

type WarcRecordBuilder interface {
	io.Writer
	io.StringWriter
	io.ReaderFrom
	io.Closer
	AddWarcHeader(name string, value string)
	AddWarcHeaderInt(name string, value int)
	AddWarcHeaderInt64(name string, value int64)
	AddWarcHeaderTime(name string, value time.Time)
	Build() (WarcRecord, *Validation, error)
	Size() int64
	SetRecordType(recordType RecordType)
}

func NewRecordBuilder

func NewRecordBuilder(recordType RecordType, opts ...WarcRecordOption) WarcRecordBuilder

NewRecordBuilder initializes a WarcRecordBuilder used for creating a new record.

WarcRecordBuilder implements io.Writer for adding the content block. recordType might be 0, but then SetRecordType or AddWarcHeader(WarcType, "myRecordType") must be called before Build is called.

When finished with adding headers and writing content, call Build on the WarcRecordBuilder to create a WarcRecord.

Example
builder := gowarc.NewRecordBuilder(gowarc.Response)
_, err := builder.WriteString("HTTP/1.1 200 OK\nDate: Tue, 19 Sep 2016 17:18:40 GMT\nServer: Apache/2.0.54 (Ubuntu)\n" +
	"Last-Modified: Mon, 16 Jun 2013 22:28:51 GMT\nETag: \"3e45-67e-2ed02ec0\"\nAccept-Ranges: bytes\n" +
	"Content-Length: 19\nConnection: close\nContent-Type: text/plain\n\nThis is the content")
if err != nil {
	panic(err)
}
builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
builder.AddWarcHeader(gowarc.ContentLength, "257")
builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")
builder.AddWarcHeader(gowarc.WarcBlockDigest, "sha1:B285747AD7CC57AA74BCE2E30B453C8D1CB71BA4")

if wr, v, err := builder.Build(); err == nil {
	fmt.Println(wr, v)
}
Output:

WARC record: version: WARC/1.1, type: response, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008

type WarcRecordOption

type WarcRecordOption interface {
	// contains filtered or unexported methods
}

WarcRecordOption configures validation, marshaling and unmarshaling of WARC records.

func WithAddMissingContentLength

func WithAddMissingContentLength(addMissingContentLength bool) WarcRecordOption

WithAddMissingContentLength sets if missing Content-Length header should be calculated.

defaults to true

func WithAddMissingDigest

func WithAddMissingDigest(addMissingDigest bool) WarcRecordOption

WithAddMissingDigest sets if missing Block digest and, where applicable, Payload digest header fields should be calculated.

Only fields which can be generated automatically are added. That includes WarcRecordID, ContentLength, BlockDigest and PayloadDigest.

defaults to true

func WithAddMissingRecordId

func WithAddMissingRecordId(addMissingRecordId bool) WarcRecordOption

WithAddMissingRecordId sets if missing WARC-Record-ID header should be generated.

defaults to true

func WithBlockErrorPolicy

func WithBlockErrorPolicy(policy errorPolicy) WarcRecordOption

WithBlockErrorPolicy sets the policy for handling errors in block parsing.

For most records this is the content fetched from the original source and errors here should be ignored.

defaults to ErrIgnore

func WithBufferMaxMemBytes

func WithBufferMaxMemBytes(size int64) WarcRecordOption

WithBufferMaxMemBytes sets the maximum amount of memory a buffer is allowed to use before overflowing to disk.

defaults to 1 MiB

func WithBufferTmpDir

func WithBufferTmpDir(dir string) WarcRecordOption

WithBufferTmpDir sets the directory to use for temporary files.

If not set or dir is the empty string then the default directory for temporary files is used (see os.TempDir).
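The two buffer options describe a buffer that holds data in memory up to a limit and then overflows to a temporary file. The following is a minimal sketch of that idea using only the standard library. The spillBuffer type and its methods are hypothetical and are not how gowarc necessarily implements its buffer; the sketch only relies on the fact that os.CreateTemp falls back to os.TempDir when the directory is the empty string.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// spillBuffer is a hypothetical type, not part of gowarc. It keeps data in
// memory until maxMem bytes would be exceeded; from then on, new data goes
// to a temporary file.
type spillBuffer struct {
	maxMem int64
	tmpDir string // "" makes os.CreateTemp use os.TempDir()
	mem    bytes.Buffer
	file   *os.File
}

// Write stores p in memory if it fits, otherwise in the temporary file.
func (b *spillBuffer) Write(p []byte) (int, error) {
	if b.file == nil && int64(b.mem.Len()+len(p)) <= b.maxMem {
		return b.mem.Write(p)
	}
	if b.file == nil {
		f, err := os.CreateTemp(b.tmpDir, "warcbuf-*")
		if err != nil {
			return 0, err
		}
		b.file = f
	}
	return b.file.Write(p)
}

// OnDisk reports whether any data has overflowed to a temporary file.
func (b *spillBuffer) OnDisk() bool { return b.file != nil }

// Close closes and removes the temporary file, if one was created.
func (b *spillBuffer) Close() error {
	if b.file == nil {
		return nil
	}
	closeErr := b.file.Close()
	removeErr := os.Remove(b.file.Name())
	if closeErr != nil {
		return closeErr
	}
	return removeErr
}

func main() {
	b := &spillBuffer{maxMem: 8} // tiny limit so the overflow is easy to see
	defer b.Close()
	for _, chunk := range []string{"12345678", "9"} {
		if _, err := b.Write([]byte(chunk)); err != nil {
			panic(err)
		}
		fmt.Println(len(chunk), "bytes written, on disk:", b.OnDisk())
	}
}
```

The sketch covers writing only; a complete buffer would also need to read back the in-memory part followed by the on-disk part.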

func WithDefaultDigestAlgorithm

func WithDefaultDigestAlgorithm(defaultDigestAlgorithm string) WarcRecordOption

WithDefaultDigestAlgorithm sets which algorithm to use for digest generation.

Valid values: 'md5', 'sha1', 'sha256' and 'sha512'.

defaults to sha1

func WithDefaultDigestEncoding

func WithDefaultDigestEncoding(defaultDigestEncoding digestEncoding) WarcRecordOption

WithDefaultDigestEncoding sets which encoding to use for digest generation.

Valid values: Base16, Base32 and Base64.

defaults to Base32
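To make the algorithm and encoding choices concrete, the sketch below computes a labelled SHA-1 digest of the form algorithm:value with the standard library in each of the three encodings. The function labelledDigest and its string encoding names are hypothetical; gowarc selects the encoding through the Base16, Base32 and Base64 constants instead. Upper-case hexadecimal for Base16 is an assumption based on the builder example on this page, and the exact output format of gowarc may differ.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"strings"
)

// labelledDigest returns "sha1:<value>" for data, where <value> is the SHA-1
// sum in the named encoding. The encoding names are hypothetical stand-ins
// for gowarc's Base16, Base32 and Base64 constants.
func labelledDigest(data []byte, encoding string) (string, error) {
	sum := sha1.Sum(data)
	var value string
	switch encoding {
	case "base16":
		// Upper case is an assumption, not confirmed gowarc behaviour.
		value = strings.ToUpper(hex.EncodeToString(sum[:]))
	case "base32":
		value = base32.StdEncoding.EncodeToString(sum[:])
	case "base64":
		value = base64.StdEncoding.EncodeToString(sum[:])
	default:
		return "", fmt.Errorf("unknown digest encoding %q", encoding)
	}
	return "sha1:" + value, nil
}

func main() {
	for _, enc := range []string{"base16", "base32", "base64"} {
		d, err := labelledDigest([]byte("This is the content"), enc)
		if err != nil {
			panic(err)
		}
		fmt.Println(enc, d)
	}
}
```

A SHA-1 sum is 20 bytes, so the Base32 form is always 32 characters with no padding, which is one reason it is a compact choice for header values.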

func WithFixContentLength

func WithFixContentLength(fixContentLength bool) WarcRecordOption

WithFixContentLength sets whether a Content-Length header whose value does not match the actual content length should be set to the real value.

This has no effect if SpecViolationPolicy is ErrIgnore.

defaults to true

func WithFixDigest

func WithFixDigest(fixDigest bool) WarcRecordOption

WithFixDigest sets whether a BlockDigest header or a PayloadDigest header whose value does not match the actual content should be recalculated.

This has no effect if SpecViolationPolicy is ErrIgnore.

defaults to true

func WithFixSyntaxErrors

func WithFixSyntaxErrors(fixSyntaxErrors bool) WarcRecordOption

WithFixSyntaxErrors sets whether to attempt to fix syntax errors when they are detected.

This has no effect if SyntaxErrorPolicy is ErrIgnore.

defaults to true

func WithFixWarcFieldsBlockErrors

func WithFixWarcFieldsBlockErrors(fixWarcFieldsBlockErrors bool) WarcRecordOption

WithFixWarcFieldsBlockErrors sets whether to attempt to fix syntax errors in a warcfields block when they are detected.

A warcfields block is typically generated by a web crawler. An error in this context suggests a potential bug in the crawler's WARC writer.

defaults to false

func WithNoValidation

func WithNoValidation() WarcRecordOption

WithNoValidation sets the parser to do as little validation as possible.

This option is for parsing as fast as possible and being as lenient as possible. Settings implied by this option are:

SyntaxErrorPolicy = ErrIgnore
SpecViolationPolicy = ErrIgnore
UnknownRecordPolicy = ErrIgnore
SkipParseBlock = true

func WithRecordIdFunc

func WithRecordIdFunc(recordIdFunc func() (string, error)) WarcRecordOption

WithRecordIdFunc sets a function for generating WARC-Record-ID if AddMissingRecordId is true.

Expected output is a valid URI without the surrounding '<' and '>' as described in the WARC spec (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-id-mandatory)

defaults to generating a UUID
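A function with the signature func() (string, error) can be written with the standard library alone. The sketch below returns a random version 4 UUID as a urn:uuid URI without the surrounding '<' and '>'. The name newURNUUID is hypothetical, and whether gowarc's own default produces version 4 UUIDs is not stated here; this is only one way to satisfy the documented contract.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newURNUUID returns a random (version 4) UUID as a urn:uuid URI,
// without surrounding angle brackets. Its signature matches the parameter
// type documented for WithRecordIdFunc.
func newURNUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant
	return fmt.Sprintf("urn:uuid:%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newURNUUID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Since the signature matches, such a function could presumably be passed as gowarc.WithRecordIdFunc(newURNUUID) when constructing a builder.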

func WithSkipParseBlock

func WithSkipParseBlock() WarcRecordOption

WithSkipParseBlock sets the parser to skip detection of known block types.

This implies that no payload digest can be computed.

func WithSpecViolationPolicy

func WithSpecViolationPolicy(policy errorPolicy) WarcRecordOption

WithSpecViolationPolicy sets the policy for handling violations of the WARC specification in WARC records.

defaults to ErrWarn

func WithStrictValidation

func WithStrictValidation() WarcRecordOption

WithStrictValidation sets the parser to fail on first error or violation of WARC specification.

Settings implied by this option are:

SyntaxErrorPolicy = ErrFail
SpecViolationPolicy = ErrFail
UnknownRecordPolicy = ErrFail
SkipParseBlock = false

func WithSyntaxErrorPolicy

func WithSyntaxErrorPolicy(policy errorPolicy) WarcRecordOption

WithSyntaxErrorPolicy sets the policy for handling syntax errors in WARC records.

defaults to ErrWarn

func WithUnknownRecordTypePolicy

func WithUnknownRecordTypePolicy(policy errorPolicy) WarcRecordOption

WithUnknownRecordTypePolicy sets the policy for handling unknown record types.

defaults to ErrWarn

func WithVersion

func WithVersion(version *WarcVersion) WarcRecordOption

WithVersion sets the WARC version to use for new records.

defaults to WARC/1.1

type WarcVersion

type WarcVersion struct {
	// contains filtered or unexported fields
}

WarcVersion represents a WARC specification version.

For record creation, only WARC 1.0 and 1.1 are supported, represented by the constants V1_0 and V1_1. When a record is parsed, the WarcVersion takes the version value found in the record itself.

func (*WarcVersion) Major

func (v *WarcVersion) Major() uint8

func (*WarcVersion) Minor

func (v *WarcVersion) Minor() uint8

func (*WarcVersion) String

func (v *WarcVersion) String() string

String returns a string representation of the WARC version in the format used by WARC files, i.e. 'WARC/1.0' or 'WARC/1.1'.
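The relationship between the version line of a record and the Major, Minor and String accessors can be sketched as follows. The version type and parseVersion function are hypothetical and standalone; they are not gowarc's parser, and they accept any numeric major and minor rather than only 1.0 and 1.1.

```go
package main

import "fmt"

// version is a hypothetical stand-in for WarcVersion.
type version struct {
	major, minor uint8
}

func (v *version) Major() uint8 { return v.major }
func (v *version) Minor() uint8 { return v.minor }

// String formats the version as it appears on the first line of a record.
func (v *version) String() string {
	return fmt.Sprintf("WARC/%d.%d", v.major, v.minor)
}

// parseVersion reads a line such as "WARC/1.1". The round-trip comparison
// rejects input with trailing or otherwise unexpected characters.
func parseVersion(line string) (*version, error) {
	v := &version{}
	if _, err := fmt.Sscanf(line, "WARC/%d.%d", &v.major, &v.minor); err != nil {
		return nil, fmt.Errorf("invalid version line %q: %w", line, err)
	}
	if v.String() != line {
		return nil, fmt.Errorf("invalid version line %q", line)
	}
	return v, nil
}

func main() {
	v, err := parseVersion("WARC/1.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(v, v.Major(), v.Minor())
}
```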

type WriteResponse

type WriteResponse struct {
	FileName     string // filename
	FileOffset   int64  // the offset in file
	BytesWritten int64  // number of uncompressed bytes written
	Err          error  // error, if any
}
