zap

package
v0.6.0
Published: Jan 5, 2018 License: Apache-2.0 Imports: 20 Imported by: 0

README

zap file format

The file is written in the reverse of the order in which we typically access the data. This lets us write it in one pass, since later sections of the file need the file offsets of things we've already written.

Current usage:

  • mmap the entire file
  • crc-32 bytes and version are at a fixed position at the end of the file
  • reading remainder of footer could be version specific
  • remainder of footer gives us:
    • 3 important offsets (docValue, fields index and stored data index)
    • 2 important values (number of docs and chunk factor)
  • field data is processed once and memoized onto the heap so that we never have to go back to disk for it
  • access to stored data by doc number means first navigating to the stored data index, then reading the entry at a fixed offset into that slice, which gives us the actual address of the data. The first bytes of that section tell us the size of the data, so we know where it ends.
  • access to all other indexed data follows this pattern:
    • first know the field name -> convert to id
    • next navigate to term dictionary for that field
      • some operations stop here and do dictionary ops
    • next use dictionary to navigate to posting list for a specific term
    • walk posting list
    • if necessary, walk posting details as we go
    • if location info is desired, consult location bitmap to see if it is there

stored fields section

  • for each document
    • preparation phase:
      • produce a slice of metadata bytes and data bytes
      • produce these slices in field id order
      • field value is appended to the data slice
      • metadata slice is govarint encoded with the following values for each field value
        • field id (uint16)
        • field type (byte)
        • field value start offset in uncompressed data slice (uint64)
        • field value length (uint64)
        • field number of array positions (uint64)
        • one additional value for each array position (uint64)
      • compress the data slice using snappy
    • file writing phase:
      • remember the start offset for this document
      • write out meta data length (varint uint64)
      • write out compressed data length (varint uint64)
      • write out the metadata bytes
      • write out the compressed data bytes
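
The preparation phase above can be sketched as follows. This is only an illustration: `fieldValue`, `putUvarint` and `encodeDoc` are hypothetical names, the standard library's uvarint encoding stands in for govarint, and the snappy compression step is left out so the example depends on nothing outside the standard library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// fieldValue is a hypothetical in-memory form of one stored field value.
type fieldValue struct {
	FieldID  uint16
	Type     byte
	Value    []byte
	ArrayPos []uint64
}

// putUvarint appends v to dst using the standard library's uvarint
// encoding (an assumed stand-in for govarint).
func putUvarint(dst []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(dst, tmp[:n]...)
}

// encodeDoc builds the metadata slice and the uncompressed data slice for
// one document. Callers are assumed to pass values sorted by field id.
func encodeDoc(values []fieldValue) (meta, data []byte) {
	for _, v := range values {
		meta = putUvarint(meta, uint64(v.FieldID))
		meta = putUvarint(meta, uint64(v.Type))
		meta = putUvarint(meta, uint64(len(data))) // start offset in data slice
		meta = putUvarint(meta, uint64(len(v.Value)))
		meta = putUvarint(meta, uint64(len(v.ArrayPos)))
		for _, p := range v.ArrayPos {
			meta = putUvarint(meta, p)
		}
		data = append(data, v.Value...)
	}
	// A real writer would compress data with snappy at this point.
	return meta, data
}

func main() {
	meta, data := encodeDoc([]fieldValue{
		{FieldID: 0, Type: 't', Value: []byte("doc-1")},
		{FieldID: 1, Type: 't', Value: []byte("hello"), ArrayPos: []uint64{2}},
	})
	fmt.Printf("meta=%v data=%q\n", meta, data)
}
```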

stored fields idx

  • for each document
    • write start offset (remembered from previous section) of stored data (big endian uint64)

With this index and a known document number, we have direct access to all the stored field data.
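
A minimal sketch of that direct access, assuming the whole file is available as a byte slice and the stored index offset has already been read from the footer. `storedRecord`, `appendRecord` and `buildExample` are hypothetical helpers, and the "compressed" bytes are opaque placeholders rather than real snappy output.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// storedRecord returns the metadata and compressed data bytes for docNum.
func storedRecord(data []byte, storedIndexOffset, docNum uint64) (meta, compressed []byte, err error) {
	size := uint64(len(data))
	idx := storedIndexOffset + 8*docNum
	if idx+8 > size {
		return nil, nil, errors.New("stored index entry out of range")
	}
	pos := binary.BigEndian.Uint64(data[idx : idx+8])
	if pos > size {
		return nil, nil, errors.New("stored data offset out of range")
	}
	metaLen, n := binary.Uvarint(data[pos:])
	if n <= 0 {
		return nil, nil, errors.New("bad metadata length")
	}
	pos += uint64(n)
	dataLen, n := binary.Uvarint(data[pos:])
	if n <= 0 {
		return nil, nil, errors.New("bad data length")
	}
	pos += uint64(n)
	if pos+metaLen+dataLen > size {
		return nil, nil, errors.New("stored record out of range")
	}
	return data[pos : pos+metaLen], data[pos+metaLen : pos+metaLen+dataLen], nil
}

// appendRecord writes one document in the order described by the
// stored fields section: both lengths, then metadata, then data.
func appendRecord(file, meta, compressed []byte) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(meta)))
	file = append(file, tmp[:n]...)
	n = binary.PutUvarint(tmp[:], uint64(len(compressed)))
	file = append(file, tmp[:n]...)
	file = append(file, meta...)
	return append(file, compressed...)
}

// buildExample writes two records followed by their index.
func buildExample() (file []byte, storedIndexOffset uint64) {
	offsets := []uint64{uint64(len(file))}
	file = appendRecord(file, []byte("m0"), []byte("data-0"))
	offsets = append(offsets, uint64(len(file)))
	file = appendRecord(file, []byte("meta1"), []byte("d1"))
	storedIndexOffset = uint64(len(file))
	for _, off := range offsets {
		var b [8]byte
		binary.BigEndian.PutUint64(b[:], off)
		file = append(file, b[:]...)
	}
	return file, storedIndexOffset
}

func main() {
	file, idx := buildExample()
	meta, compressed, err := storedRecord(file, idx, 1)
	fmt.Printf("%q %q %v\n", meta, compressed, err)
}
```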

posting details (freq/norm) section

  • for each posting list
    • produce a slice containing multiple consecutive chunks (each chunk is govarint stream)
    • produce a slice remembering offsets of where each chunk starts
    • preparation phase:
      • for each hit in the posting list
      • if this hit is in the next chunk, close out the encoding of the last chunk and record the start offset of the next
      • encode term frequency (uint64)
      • encode norm factor (float32)
    • file writing phase:
      • remember start position for this posting list details
      • write out number of chunks that follow (varint uint64)
      • write out length of each chunk (each a varint uint64)
      • write out the byte slice containing all the chunk data

If you know the doc number you're interested in, this format lets you jump to the correct chunk (docNum/chunkFactor) directly and then seek within that chunk until you find it.
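
The chunk jump can be illustrated with a little arithmetic. The sketch below assumes the chunk lengths have already been decoded into a slice; `chunkBounds` is a hypothetical helper, and the returned range is relative to the start of the chunk data.

```go
package main

import "fmt"

// chunkBounds returns the byte range of the chunk that would hold docNum.
// ok is false when the doc number falls past the last chunk.
func chunkBounds(chunkLens []uint64, docNum, chunkFactor uint64) (start, end uint64, ok bool) {
	if chunkFactor == 0 {
		return 0, 0, false
	}
	chunk := docNum / chunkFactor
	if chunk >= uint64(len(chunkLens)) {
		return 0, 0, false
	}
	// Sum the lengths of all preceding chunks to find where this one starts.
	for i := uint64(0); i < chunk; i++ {
		start += chunkLens[i]
	}
	return start, start + chunkLens[chunk], true
}

func main() {
	lens := []uint64{10, 0, 25}
	start, end, ok := chunkBounds(lens, 250, 100)
	fmt.Println(start, end, ok)
}
```

Within the returned range a reader would still decode entries one at a time until it reaches the hit for the wanted document.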

posting details (location) section

  • for each posting list
    • produce a slice containing multiple consecutive chunks (each chunk is govarint stream)
    • produce a slice remembering offsets of where each chunk starts
    • preparation phase:
      • for each hit in the posting list
      • if this hit is in the next chunk, close out the encoding of the last chunk and record the start offset of the next
      • encode field (uint16)
      • encode field pos (uint64)
      • encode field start (uint64)
      • encode field end (uint64)
      • encode number of array positions to follow (uint64)
      • encode each array position (each uint64)
    • file writing phase:
      • remember start position for this posting list details
      • write out number of chunks that follow (varint uint64)
      • write out length of each chunk (each a varint uint64)
      • write out the byte slice containing all the chunk data

If you know the doc number you're interested in, this format lets you jump to the correct chunk (docNum/chunkFactor) directly and then seek within that chunk until you find it.
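
To make the per-location layout concrete, here is a rough round trip of a single location entry. It is a sketch under assumptions: the `location` type and helper names are hypothetical, and the standard library's uvarint encoding stands in for the govarint stream.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// location is a hypothetical mirror of the values encoded per location.
type location struct {
	Field           uint16
	Pos, Start, End uint64
	ArrayPos        []uint64
}

// encodeLocation writes the values in the order listed above.
func encodeLocation(buf *bytes.Buffer, l location) {
	var tmp [binary.MaxVarintLen64]byte
	put := func(v uint64) {
		n := binary.PutUvarint(tmp[:], v)
		buf.Write(tmp[:n])
	}
	put(uint64(l.Field))
	put(l.Pos)
	put(l.Start)
	put(l.End)
	put(uint64(len(l.ArrayPos)))
	for _, p := range l.ArrayPos {
		put(p)
	}
}

// decodeLocation reads one location; the first read error is returned.
func decodeLocation(r *bytes.Reader) (location, error) {
	var l location
	var err error
	read := func() uint64 {
		if err != nil {
			return 0
		}
		var v uint64
		v, err = binary.ReadUvarint(r)
		return v
	}
	l.Field = uint16(read())
	l.Pos = read()
	l.Start = read()
	l.End = read()
	n := read()
	for i := uint64(0); i < n && err == nil; i++ {
		l.ArrayPos = append(l.ArrayPos, read())
	}
	return l, err
}

// roundTrip encodes l and decodes it again.
func roundTrip(l location) (location, error) {
	var buf bytes.Buffer
	encodeLocation(&buf, l)
	return decodeLocation(bytes.NewReader(buf.Bytes()))
}

func main() {
	out, err := roundTrip(location{Field: 3, Pos: 1, Start: 0, End: 300, ArrayPos: []uint64{4, 500}})
	fmt.Printf("%+v %v\n", out, err)
}
```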

bitmaps of hits with location info

  • for each posting list
    • preparation phase:
      • encode the roaring bitmap (indicating which hits have location details indexed) to bytes (so we know the length)
    • file writing phase:
      • remember the start position for this bitmap
      • write length of encoded roaring bitmap
      • write the serialized roaring bitmap data

postings list section

  • for each posting list
    • preparation phase:
      • encode roaring bitmap posting list to bytes (so we know the length)
    • file writing phase:
      • remember the start position for this posting list
      • write freq/norm details offset (remembered from previous, as varint uint64)
      • write location details offset (remembered from previous, as varint uint64)
      • write location bitmap offset (remembered from previous, as varint uint64)
      • write length of encoded roaring bitmap
      • write the serialized roaring bitmap data
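
A sketch of writing and reading one postings list record in this layout. Two things are assumptions: the text does not say how the bitmap length is encoded, so it is treated as a varint like the three offsets, and the serialized roaring bitmap is kept as opaque bytes so the example needs only the standard library. `postingsRecord`, `appendPostings` and `readPostings` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// postingsRecord is a hypothetical view of one postings list record.
type postingsRecord struct {
	FreqOffset, LocOffset, LocBitmapOffset uint64
	Bitmap                                 []byte // serialized roaring bitmap, opaque here
}

// appendPostings writes the three offsets, the bitmap length, then the bitmap.
func appendPostings(file []byte, rec postingsRecord) []byte {
	var tmp [binary.MaxVarintLen64]byte
	header := []uint64{rec.FreqOffset, rec.LocOffset, rec.LocBitmapOffset, uint64(len(rec.Bitmap))}
	for _, v := range header {
		n := binary.PutUvarint(tmp[:], v)
		file = append(file, tmp[:n]...)
	}
	return append(file, rec.Bitmap...)
}

// readPostings parses the record that starts at pos.
func readPostings(data []byte, pos uint64) (postingsRecord, error) {
	var rec postingsRecord
	if pos > uint64(len(data)) {
		return rec, errors.New("postings offset out of range")
	}
	var vals [4]uint64
	for i := range vals {
		v, n := binary.Uvarint(data[pos:])
		if n <= 0 {
			return rec, errors.New("truncated postings header")
		}
		vals[i] = v
		pos += uint64(n)
	}
	end := pos + vals[3]
	if end < pos || end > uint64(len(data)) {
		return rec, errors.New("bitmap out of range")
	}
	rec = postingsRecord{
		FreqOffset:      vals[0],
		LocOffset:       vals[1],
		LocBitmapOffset: vals[2],
		Bitmap:          data[pos:end],
	}
	return rec, nil
}

func main() {
	file := appendPostings([]byte("xx"), postingsRecord{FreqOffset: 100, LocOffset: 2000, Bitmap: []byte{1, 2, 3}})
	rec, err := readPostings(file, 2)
	fmt.Printf("%+v %v\n", rec, err)
}
```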

dictionary

  • for each field
    • preparation phase:
      • encode vellum FST with dictionary data pointing to file offset of posting list (remembered from previous)
    • file writing phase:
      • remember the start position of this persistDictionary
      • write length of vellum data (varint uint64)
      • write out vellum data

fields section

  • for each field
    • file writing phase:
      • remember start offset for each field
      • write dictionary address (remembered from previous) (varint uint64)
      • write length of field name (varint uint64)
      • write field name bytes

fields idx

  • for each field
    • file writing phase:
      • write big endian uint64 of start offset for each field

NOTE: currently we don't know or record the length of this fields index. Instead we rely on the fact that we know it immediately precedes a footer of known size.
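
Taking that note at face value, the number of fields can be derived from the distance between the fields index offset and the start of the footer. The sketch below assumes exactly that layout; `fieldOffsets` is a hypothetical helper and `footerSize` repeats the arithmetic of the package's FooterSize constant.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// footerSize repeats the sum used by the FooterSize constant.
const footerSize = 4 + 4 + 4 + 8 + 8 + 8 + 8

// fieldOffsets reads every big endian uint64 between fieldsIndexOffset
// and the start of the footer.
func fieldOffsets(data []byte, fieldsIndexOffset uint64) ([]uint64, error) {
	if len(data) < footerSize {
		return nil, errors.New("file shorter than footer")
	}
	end := uint64(len(data) - footerSize)
	if fieldsIndexOffset > end || (end-fieldsIndexOffset)%8 != 0 {
		return nil, errors.New("fields index does not line up with footer")
	}
	offsets := make([]uint64, 0, (end-fieldsIndexOffset)/8)
	for pos := fieldsIndexOffset; pos < end; pos += 8 {
		offsets = append(offsets, binary.BigEndian.Uint64(data[pos:pos+8]))
	}
	return offsets, nil
}

func main() {
	// 4 bytes of body, two index entries, then a zeroed footer.
	data := make([]byte, 4+2*8+footerSize)
	binary.BigEndian.PutUint64(data[4:], 0)
	binary.BigEndian.PutUint64(data[12:], 2)
	offsets, err := fieldOffsets(data, 4)
	fmt.Println(offsets, err)
}
```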

fields DocValue

  • for each field
    • preparation phase:
      • produce a slice containing multiple consecutive chunks, where each chunk is composed of a meta section followed by compressed columnar field data
      • produce a slice remembering the length of each chunk
    • file writing phase:
      • remember the start position of the first field's DocValue data (this offset is recorded in the footer)
      • write out number of chunks that follow (varint uint64)
      • write out length of each chunk (each a varint uint64)
      • write out the byte slice containing all the chunk data

NOTE: currently the meta header inside each chunk gives the location offsets and sizes of the data pertaining to a given docID, and any read operation leverages that meta information to extract the document-specific data from the file.

footer section

  • file writing phase
    • write number of docs (big endian uint64)
    • write stored field index location (big endian uint64)
    • write field index location (big endian uint64)
    • write field docValue location (big endian uint64)
    • write out chunk factor (big endian uint32)
    • write out version (big endian uint32)
    • write out file CRC of everything preceding this (big endian uint32)
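
Because every footer value has a fixed width, the footer can be decoded by slicing the last bytes of the file. The following is a minimal sketch of that idea; the `footer` struct and both helpers are hypothetical, and the CRC is stored as given rather than computed or verified.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// footerSize follows the write order: four uint64 values, then three uint32 values.
const footerSize = 8 + 8 + 8 + 8 + 4 + 4 + 4

// footer is a hypothetical struct for this sketch.
type footer struct {
	NumDocs           uint64
	StoredIndexOffset uint64
	FieldsIndexOffset uint64
	DocValueOffset    uint64
	ChunkFactor       uint32
	Version           uint32
	CRC               uint32
}

// encodeFooter lays the values out in the order they are written.
func encodeFooter(f footer) []byte {
	b := make([]byte, footerSize)
	binary.BigEndian.PutUint64(b[0:8], f.NumDocs)
	binary.BigEndian.PutUint64(b[8:16], f.StoredIndexOffset)
	binary.BigEndian.PutUint64(b[16:24], f.FieldsIndexOffset)
	binary.BigEndian.PutUint64(b[24:32], f.DocValueOffset)
	binary.BigEndian.PutUint32(b[32:36], f.ChunkFactor)
	binary.BigEndian.PutUint32(b[36:40], f.Version)
	binary.BigEndian.PutUint32(b[40:44], f.CRC)
	return b
}

// decodeFooter reads the footer from the last footerSize bytes of file.
func decodeFooter(file []byte) (footer, error) {
	if len(file) < footerSize {
		return footer{}, errors.New("file shorter than footer")
	}
	b := file[len(file)-footerSize:]
	return footer{
		NumDocs:           binary.BigEndian.Uint64(b[0:8]),
		StoredIndexOffset: binary.BigEndian.Uint64(b[8:16]),
		FieldsIndexOffset: binary.BigEndian.Uint64(b[16:24]),
		DocValueOffset:    binary.BigEndian.Uint64(b[24:32]),
		ChunkFactor:       binary.BigEndian.Uint32(b[32:36]),
		Version:           binary.BigEndian.Uint32(b[36:40]),
		CRC:               binary.BigEndian.Uint32(b[40:44]),
	}, nil
}

func main() {
	file := append([]byte("segment body"), encodeFooter(footer{NumDocs: 3, ChunkFactor: 1024})...)
	f, err := decodeFooter(file)
	fmt.Printf("%+v %v\n", f, err)
}
```

Since the CRC and version sit last, a reader can check them first and only then decide how to interpret the remaining footer bytes.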

Documentation

Constants

const FooterSize = 4 + 4 + 4 + 8 + 8 + 8 + 8

FooterSize is the size of the footer record in bytes: crc + ver + chunk + field offset + stored offset + num docs + docValueOffset

Variables

This section is empty.

Functions

func Merge

func Merge(segments []*Segment, drops []*roaring.Bitmap, path string,
	chunkFactor uint32) ([][]uint64, error)

Merge takes a slice of zap segments and bit masks describing which documents from them may be dropped, and creates a new segment containing the remaining data. This new segment is built at the specified path, with the provided chunkFactor.

func Open

func Open(path string) (segment.Segment, error)

Open returns a zap implementation of a segment

func PersistSegment

func PersistSegment(memSegment *mem.Segment, path string, chunkFactor uint32) (err error)

PersistSegment takes the in-memory segment and persists it to the specified path in the zap file format.

Types

type CountHashWriter

type CountHashWriter struct {
	// contains filtered or unexported fields
}

CountHashWriter is a wrapper around a Writer which counts the number of bytes which have been written

func NewCountHashWriter

func NewCountHashWriter(w io.Writer) *CountHashWriter

NewCountHashWriter returns a CountHashWriter which wraps the provided Writer

func (*CountHashWriter) Count

func (c *CountHashWriter) Count() int

Count returns the number of bytes written

func (*CountHashWriter) Sum32

func (c *CountHashWriter) Sum32() uint32

Sum32 returns the CRC-32 hash of the content written to this writer

func (*CountHashWriter) Write

func (c *CountHashWriter) Write(b []byte) (int, error)

Write writes the provided bytes to the wrapped writer and counts the bytes
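
For intuition, a writer with this behaviour could look roughly like the sketch below. It is not the package's implementation: `countingCRCWriter` and `demo` are hypothetical, and the choice of the IEEE CRC-32 polynomial is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/crc32"
	"io"
)

// countingCRCWriter is a hypothetical stand-in for CountHashWriter.
type countingCRCWriter struct {
	w   io.Writer
	n   int
	crc uint32
}

// Write forwards b to the wrapped writer and folds the bytes that were
// actually written into the running count and checksum.
func (c *countingCRCWriter) Write(b []byte) (int, error) {
	n, err := c.w.Write(b)
	c.crc = crc32.Update(c.crc, crc32.IEEETable, b[:n])
	c.n += n
	return n, err
}

// Count returns the number of bytes written so far.
func (c *countingCRCWriter) Count() int { return c.n }

// Sum32 returns the CRC-32 of everything written so far.
func (c *countingCRCWriter) Sum32() uint32 { return c.crc }

// demo writes two pieces and returns the count, the running checksum,
// and the checksum of the same bytes computed in one shot.
func demo() (count int, sum, want uint32) {
	var buf bytes.Buffer
	w := &countingCRCWriter{w: &buf}
	io.WriteString(w, "hello ")
	io.WriteString(w, "world")
	return w.Count(), w.Sum32(), crc32.ChecksumIEEE(buf.Bytes())
}

func main() {
	count, sum, want := demo()
	fmt.Println(count, sum == want)
}
```

Counting as it writes is what lets a single-pass writer remember the start offset of each section without seeking.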

type Dictionary

type Dictionary struct {
	// contains filtered or unexported fields
}

Dictionary is the zap representation of the term dictionary

func (*Dictionary) Iterator

func (d *Dictionary) Iterator() segment.DictionaryIterator

Iterator returns an iterator for this dictionary

func (*Dictionary) PostingsList

func (d *Dictionary) PostingsList(term string, except *roaring.Bitmap) (segment.PostingsList, error)

PostingsList returns the postings list for the specified term

func (*Dictionary) PrefixIterator

func (d *Dictionary) PrefixIterator(prefix string) segment.DictionaryIterator

PrefixIterator returns an iterator which only visits terms having the specified prefix

func (*Dictionary) RangeIterator

func (d *Dictionary) RangeIterator(start, end string) segment.DictionaryIterator

RangeIterator returns an iterator which only visits terms between the start and end terms. NOTE: bleve.index API specifies the end is inclusive.

type DictionaryIterator

type DictionaryIterator struct {
	// contains filtered or unexported fields
}

DictionaryIterator is an iterator for term dictionary

func (*DictionaryIterator) Next

func (i *DictionaryIterator) Next() (*index.DictEntry, error)

Next returns the next entry in the dictionary

type Location

type Location struct {
	// contains filtered or unexported fields
}

Location represents the location of a single occurrence

func (*Location) ArrayPositions

func (l *Location) ArrayPositions() []uint64

ArrayPositions returns the array position vector associated with this occurrence

func (*Location) End

func (l *Location) End() uint64

End returns the end byte offset of this occurrence

func (*Location) Field

func (l *Location) Field() string

Field returns the name of the field (useful in composite fields to know which original field the value came from)

func (*Location) Pos

func (l *Location) Pos() uint64

Pos returns the 1-based phrase position of this occurrence

func (*Location) Start

func (l *Location) Start() uint64

Start returns the start byte offset of this occurrence

type MetaData

type MetaData struct {
	DocID    uint64 // docid of the data inside the chunk
	DocDvLoc uint64 // starting offset for a given docid
	DocDvLen uint64 // length of data inside the chunk for the given docid
}

MetaData represents the data information inside a chunk.

type Posting

type Posting struct {
	// contains filtered or unexported fields
}

Posting is a single entry in a postings list

func (*Posting) Frequency

func (p *Posting) Frequency() uint64

Frequency returns the frequency of occurrence of this term in this doc/field

func (*Posting) Locations

func (p *Posting) Locations() []segment.Location

Locations returns the location information for each occurrence

func (*Posting) Norm

func (p *Posting) Norm() float64

Norm returns the normalization factor for this posting

func (*Posting) Number

func (p *Posting) Number() uint64

Number returns the document number of this posting in this segment

type PostingsIterator

type PostingsIterator struct {
	// contains filtered or unexported fields
}

PostingsIterator provides a way to iterate through the postings list

func (*PostingsIterator) Next

func (i *PostingsIterator) Next() (segment.Posting, error)

Next returns the next posting on the postings list, or nil at the end

type PostingsList

type PostingsList struct {
	// contains filtered or unexported fields
}

PostingsList is an in-memory representation of a postings list

func (*PostingsList) Count

func (p *PostingsList) Count() uint64

Count returns the number of items on this postings list

func (*PostingsList) Iterator

func (p *PostingsList) Iterator() segment.PostingsIterator

Iterator returns an iterator for this postings list

type Segment

type Segment struct {
	// contains filtered or unexported fields
}

Segment implements the segment.Segment interface on top of the zap file format

func (*Segment) AddRef

func (s *Segment) AddRef()

func (*Segment) CRC

func (s *Segment) CRC() uint32

CRC returns the CRC value stored in the file footer

func (*Segment) ChunkFactor

func (s *Segment) ChunkFactor() uint32

ChunkFactor returns the chunk factor in the file footer

func (*Segment) Close

func (s *Segment) Close() (err error)

Close releases all resources associated with this segment

func (*Segment) Count

func (s *Segment) Count() uint64

Count returns the number of documents in this segment.

func (*Segment) Data

func (s *Segment) Data() []byte

Data returns the underlying mmapped data slice

func (*Segment) DecRef

func (s *Segment) DecRef() (err error)

func (*Segment) DictAddr

func (s *Segment) DictAddr(field string) (uint64, error)

DictAddr is a helper function to compute the file offset where the dictionary is stored for the specified field.

func (*Segment) Dictionary

func (s *Segment) Dictionary(field string) (segment.TermDictionary, error)

Dictionary returns the term dictionary for the specified field

func (*Segment) DocNumbers

func (s *Segment) DocNumbers(ids []string) (*roaring.Bitmap, error)

DocNumbers returns a bitset corresponding to the doc numbers of all the provided _id strings

func (*Segment) DocValueOffset

func (s *Segment) DocValueOffset() uint64

DocValueOffset returns the docValue offset in the file footer

func (*Segment) Fields

func (s *Segment) Fields() []string

Fields returns the field names used in this segment

func (*Segment) FieldsIndexOffset

func (s *Segment) FieldsIndexOffset() uint64

FieldsIndexOffset returns the fields index offset in the file footer

func (*Segment) NumDocs

func (s *Segment) NumDocs() uint64

NumDocs returns the number of documents in the file footer

func (*Segment) Path

func (s *Segment) Path() string

Path returns the path of this segment on disk

func (*Segment) SizeInBytes

func (s *Segment) SizeInBytes() uint64

func (*Segment) StoredIndexOffset

func (s *Segment) StoredIndexOffset() uint64

StoredIndexOffset returns the stored value index offset in the file footer

func (*Segment) Version

func (s *Segment) Version() uint32

Version returns the file version in the file footer

func (*Segment) VisitDocument

func (s *Segment) VisitDocument(num uint64, visitor segment.DocumentFieldValueVisitor) error

VisitDocument invokes the DocumentFieldValueVisitor for each stored field for the specified doc number

func (*Segment) VisitDocumentFieldTerms

func (s *Segment) VisitDocumentFieldTerms(localDocNum uint64, fields []string,
	visitor index.DocumentFieldTermVisitor) error

VisitDocumentFieldTerms is an implementation of the UnInvertIndex interface
