warc

package
v0.0.0-...-a9ba9a5
Published: Aug 4, 2019 License: GPL-2.0 Imports: 10 Imported by: 2

Documentation

Index

Constants

This section is empty.

Variables

var CONTENT_TYPES map[string]string = map[string]string{
	"warcinfo": "application/warc-fields",
	"response": "application/http; msgtype=response",
	"request":  "application/http; msgtype=request",
	"metadata": "application/warc-fields",
}
var KNOWN_HEADERS map[string]string = map[string]string{
	"type":           "WARC-Type",
	"date":           "WARC-Date",
	"record_id":      "WARC-Record-ID",
	"ip_address":     "WARC-IP-Address",
	"target_uri":     "WARC-Target-URI",
	"warcinfo_id":    "WARC-Warcinfo-ID",
	"request_uri":    "WARC-Request-URI",
	"content_type":   "Content-Type",
	"content_length": "Content-Length",
}
var RE_HEADER *regexp.Regexp = regexp.MustCompile("([a-zA-Z_\\-]+): *(.*)\r\n")
var RE_VERSION *regexp.Regexp = regexp.MustCompile("WARC/(\\d+.\\d+)\r\n")
var SUPPORTED_VERSIONS map[string]bool = map[string]bool{"1.0": true}
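
As an illustration of how these patterns behave, the sketch below copies RE_VERSION, RE_HEADER and SUPPORTED_VERSIONS verbatim into local variables and applies them to sample lines. The helper names parseVersion and parseHeader are hypothetical and are not part of this package; how the package itself uses the patterns is not shown on this page. Note that the `.` in RE_VERSION is unescaped, so it matches any character; in this sketch the SUPPORTED_VERSIONS lookup still rejects such input.

```go
package main

import (
	"fmt"
	"regexp"
)

// Local copies of the documented variables, so the example runs
// without importing the package.
var (
	reVersion = regexp.MustCompile("WARC/(\\d+.\\d+)\r\n")
	reHeader  = regexp.MustCompile("([a-zA-Z_\\-]+): *(.*)\r\n")
	supported = map[string]bool{"1.0": true}
)

// parseVersion (hypothetical helper) extracts the version from a line
// such as "WARC/1.0\r\n" and reports whether it is a supported version.
func parseVersion(line string) (version string, ok bool) {
	m := reVersion.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], supported[m[1]]
}

// parseHeader (hypothetical helper) splits a line such as
// "WARC-Type: response\r\n" into its name and value.
func parseHeader(line string) (name, value string, ok bool) {
	m := reHeader.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	v, ok := parseVersion("WARC/1.0\r\n")
	fmt.Println(v, ok) // 1.0 true

	// The unescaped dot lets "1x0" through the pattern, but the
	// version table does not list it.
	v, ok = parseVersion("WARC/1x0\r\n")
	fmt.Println(v, ok) // 1x0 false

	name, value, _ := parseHeader("WARC-Type: response\r\n")
	fmt.Println(name, value) // WARC-Type response
}
```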

Functions

This section is empty.

Types

type WARCFile

type WARCFile struct {
	// contains filtered or unexported fields
}

func NewWARCFile

func NewWARCFile(reader io.ReadCloser) (*WARCFile, error)

Creates a new WARCFile. The input should be a handle to a gzipped WARC file.
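
To make "a handle to a gzipped WARC file" concrete, the sketch below builds such a handle in memory using only the standard library and reads the first line back out of it. It does not call this package: gzipped, firstLine and demo are hypothetical helpers, and the assumption that decompression happens on the consumer side is inferred from the gzip.Reader parameter of NewWARCReader, not confirmed by the source. On disk, the value returned by os.Open would serve as the io.ReadCloser instead.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipped returns data compressed as a single gzip member.
func gzipped(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// firstLine decompresses the handle and returns its first line,
// closing the handle when done.
func firstLine(handle io.ReadCloser) (string, error) {
	defer handle.Close()
	zr, err := gzip.NewReader(handle)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	return bufio.NewReader(zr).ReadString('\n')
}

// demo builds an in-memory gzipped record with an empty content block
// and returns the first decompressed line.
func demo() (string, error) {
	raw := []byte("WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n")
	compressed, err := gzipped(raw)
	if err != nil {
		return "", err
	}
	// io.NopCloser turns any io.Reader into an io.ReadCloser.
	handle := io.NopCloser(bytes.NewReader(compressed))
	return firstLine(handle)
}

func main() {
	line, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", line) // "WARC/1.0\r\n"
}
```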

func (*WARCFile) Close

func (wf *WARCFile) Close() error

func (*WARCFile) GetReader

func (wf *WARCFile) GetReader() *WARCReader

func (*WARCFile) ReadRecord

func (wf *WARCFile) ReadRecord() (*WARCRecord, error)

type WARCHeader

type WARCHeader struct {
	*utils.CIStringMap
	// contains filtered or unexported fields
}

The WARCHeader object represents the headers of a WARC record. It provides a dictionary-like interface for accessing the headers.

The following mandatory fields are also accessible through getter methods:

  • h.GetRecordId() == h.Get("WARC-Record-ID")
  • h.GetContentLength() == h.Get("Content-Length") // converted to int
  • h.GetDate() == h.Get("WARC-Date")
  • h.GetType() == h.Get("WARC-Type")

:params headers: map[string]string of headers.
:params defaults: If true, important headers like WARC-Record-ID, WARC-Date, Content-Type and Content-Length are initialized automatically if not already present.

TODO: add the defaults param back for read/write
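
The relationship between the dictionary-like lookup and the typed getters can be sketched with a small stand-in type. Everything below is hypothetical: the header type, the two-value Get signature, the lower-casing of keys, and the fallback to 0 for a missing Content-Length are assumptions made for illustration, not the behavior of WARCHeader or utils.CIStringMap.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// header is a hypothetical stand-in for WARCHeader: a map keyed by
// lower-cased header names, so lookups ignore case.
type header struct {
	fields map[string]string
}

// newHeader copies the given headers, normalizing the keys.
func newHeader(headers map[string]string) *header {
	h := &header{fields: make(map[string]string, len(headers))}
	for k, v := range headers {
		h.fields[strings.ToLower(k)] = v
	}
	return h
}

// Get returns the value stored for name, ignoring case.
func (h *header) Get(name string) (string, bool) {
	v, ok := h.fields[strings.ToLower(name)]
	return v, ok
}

// GetType mirrors h.Get("WARC-Type").
func (h *header) GetType() string {
	v, _ := h.Get("WARC-Type")
	return v
}

// GetContentLength mirrors h.Get("Content-Length"), converted to int.
// A missing or malformed value yields 0 in this sketch.
func (h *header) GetContentLength() int {
	v, _ := h.Get("Content-Length")
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0
	}
	return n
}

func main() {
	h := newHeader(map[string]string{
		"WARC-Type":      "response",
		"Content-Length": "1024",
	})
	fmt.Println(h.GetType(), h.GetContentLength()) // response 1024

	v, ok := h.Get("warc-type") // same entry, different case
	fmt.Println(v, ok)          // response true
}
```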

func NewWARCHeader

func NewWARCHeader(headers map[string]string) *WARCHeader

TODO: restore 'defaults' arg for read/write

func (*WARCHeader) GetContentLength

func (wh *WARCHeader) GetContentLength() int

The Content-Length header as int.

func (*WARCHeader) GetDate

func (wh *WARCHeader) GetDate() string

The value of WARC-Date header.

func (*WARCHeader) GetRecordId

func (wh *WARCHeader) GetRecordId() string

The value of WARC-Record-ID header.

func (*WARCHeader) GetType

func (wh *WARCHeader) GetType() string

The value of WARC-Type header.

func (*WARCHeader) String

func (wh *WARCHeader) String() string

func (*WARCHeader) WriteTo

func (wh *WARCHeader) WriteTo(f io.Writer)

Writes this header to a file, in the format specified by WARC.
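
In the WARC format, a record header consists of a version line, one "Name: value" line per field, and a terminating blank line, each ended by CRLF. The sketch below writes that layout with the standard library. writeHeader and renderHeader are hypothetical helpers, and sorting the field names is a choice made here only to get deterministic output; the field order and error handling of the actual WriteTo are not documented on this page.

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"strings"
)

// writeHeader (hypothetical helper) writes a version line, the fields
// in sorted order, and the blank line that ends a WARC header block.
func writeHeader(w io.Writer, version string, fields map[string]string) error {
	if _, err := fmt.Fprintf(w, "WARC/%s\r\n", version); err != nil {
		return err
	}
	names := make([]string, 0, len(fields))
	for name := range fields {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order for the example only
	for _, name := range names {
		if _, err := fmt.Fprintf(w, "%s: %s\r\n", name, fields[name]); err != nil {
			return err
		}
	}
	_, err := io.WriteString(w, "\r\n")
	return err
}

// renderHeader returns the serialized header as a string.
func renderHeader(version string, fields map[string]string) (string, error) {
	var sb strings.Builder
	if err := writeHeader(&sb, version, fields); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderHeader("1.0", map[string]string{
		"WARC-Type":      "warcinfo",
		"Content-Length": "0",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", out)
}
```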

type WARCReader

type WARCReader struct {
	// contains filtered or unexported fields
}

func NewWARCReader

func NewWARCReader(filehandle io.Reader, gzipfile *gzip.Reader) *WARCReader

func (*WARCReader) Expect

func (wr *WARCReader) Expect(reader *bufio.Reader, expectedLine string, message string) error

func (*WARCReader) Iterate

func (wr *WARCReader) Iterate(callback func(*WARCRecord, error))

func (*WARCReader) ReadHeader

func (wr *WARCReader) ReadHeader(reader *bufio.Reader) (*WARCHeader, error)

func (*WARCReader) ReadRecord

func (wr *WARCReader) ReadRecord() (*WARCRecord, error)

type WARCRecord

type WARCRecord struct {
	// contains filtered or unexported fields
}

The WARCRecord object represents a WARC Record.

func NewWARCRecord

func NewWARCRecord(header *WARCHeader, payload *utils.FilePart, headers map[string]string) *WARCRecord

Creates a new WARC record.

func (*WARCRecord) Get

func (wr *WARCRecord) Get(name string) (string, bool)

func (*WARCRecord) GetChecksum

func (wr *WARCRecord) GetChecksum() string

func (*WARCRecord) GetDate

func (wr *WARCRecord) GetDate() string

UTC timestamp of the record.

func (*WARCRecord) GetHeader

func (wr *WARCRecord) GetHeader() *WARCHeader

func (*WARCRecord) GetIpAddress

func (wr *WARCRecord) GetIpAddress() string

The IP address of the host contacted to retrieve the content of this record. This value is available from the WARC-IP-Address header.

func (*WARCRecord) GetPayload

func (wr *WARCRecord) GetPayload() *utils.FilePart

func (*WARCRecord) GetType

func (wr *WARCRecord) GetType() string

The record type, as given by the WARC-Type header.

func (*WARCRecord) GetUrl

func (wr *WARCRecord) GetUrl() string

The value of the WARC-Target-URI header if the record is of type "response".

func (*WARCRecord) Offset

func (wr *WARCRecord) Offset() int

Offset of this record in the WARC file from which it was read. TODO: not yet implemented; currently hard-coded to -1.

func (*WARCRecord) Set

func (wr *WARCRecord) Set(name string, value string)
