docs

package

v0.15.0-rc.5 Published: Apr 5, 2020 License: Apache-2.0 Imports: 8 Imported by: 4

README ¶

Documents

Two files are used to represent the documents in a segment. The data file contains the data for each document in the segment. The index file contains, for each document, its corresponding offset in the data file.

Data File

The data file contains the fields for each document. The documents are stored serially.

┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │      Document 1       │ │
│ ├───────────────────────┤ │
│ │          ...          │ │
│ ├───────────────────────┤ │
│ │      Document n       │ │
│ └───────────────────────┘ │
└───────────────────────────┘

Document

Each document is composed of an ID and its fields. The ID is a sequence of valid UTF-8 bytes. It is encoded as its length in bytes, written as a variable-sized unsigned integer (uvarint), followed by the bytes of the ID itself. The fields follow the ID: the number of fields is encoded first as a uvarint, and then the fields themselves are encoded.

┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │     Length of ID      │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │          ID           │ │
│ │        (bytes)        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │   Number of Fields    │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │        Field 1        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │          ...          │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │        Field n        │ │
│ │                       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
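As an illustration of the layout above, here is a minimal sketch of a document encoder built only on the Go standard library. The `document` and `field` types and the helper names are hypothetical stand-ins, not the package's own types, and the per-field layout (length-prefixed name, then length-prefixed value) follows the Field section of this README.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// field and document are hypothetical stand-ins for the package's doc types.
type field struct {
	name, value []byte
}

type document struct {
	id     []byte
	fields []field
}

// putUvarint appends v to buf as a variable-sized unsigned integer.
func putUvarint(buf *bytes.Buffer, v uint64) {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	buf.Write(tmp[:n])
}

// putBytes appends a length-prefixed byte sequence to buf.
func putBytes(buf *bytes.Buffer, b []byte) {
	putUvarint(buf, uint64(len(b)))
	buf.Write(b)
}

// encodeDocument serializes d as: ID, number of fields, then each field
// as a length-prefixed name followed by a length-prefixed value.
func encodeDocument(d document) []byte {
	var buf bytes.Buffer
	putBytes(&buf, d.id)
	putUvarint(&buf, uint64(len(d.fields)))
	for _, f := range d.fields {
		putBytes(&buf, f.name)
		putBytes(&buf, f.value)
	}
	return buf.Bytes()
}

func main() {
	d := document{
		id:     []byte("a1"),
		fields: []field{{name: []byte("city"), value: []byte("nyc")}},
	}
	// Print the encoded bytes in hex for inspection.
	fmt.Printf("% x\n", encodeDocument(d))
}
```

Because every variable-length part carries its own length prefix, a reader can walk the encoded document from front to back without any delimiter bytes.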

Field

Each field is composed of a name and a value, each of which is a sequence of valid UTF-8 bytes. Both are stored the same way: the length in bytes is encoded as a variable-sized unsigned integer, followed by the actual bytes. The name is encoded first and the value second.

┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │  Length of Field Name │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │      Field Name       │ │
│ │        (bytes)        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │ Length of Field Value │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │      Field Value      │ │
│ │        (bytes)        │ │
│ │                       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
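Reading a field back is the mirror image of the layout above. The following sketch decodes one field from a byte slice using `binary.Uvarint` from the standard library; `readBytes` and `readField` are hypothetical helper names and are not part of this package's API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errInvalid = errors.New("invalid or truncated encoding")

// readBytes decodes one length-prefixed byte sequence from buf and returns
// the sequence together with the unread remainder of buf.
func readBytes(buf []byte) (val, rest []byte, err error) {
	n, size := binary.Uvarint(buf)
	if size <= 0 {
		return nil, nil, errInvalid
	}
	buf = buf[size:]
	if uint64(len(buf)) < n {
		return nil, nil, errInvalid
	}
	return buf[:n], buf[n:], nil
}

// readField decodes a field: the name first, then the value.
func readField(buf []byte) (name, value, rest []byte, err error) {
	if name, buf, err = readBytes(buf); err != nil {
		return nil, nil, nil, err
	}
	if value, buf, err = readBytes(buf); err != nil {
		return nil, nil, nil, err
	}
	return name, value, buf, nil
}

func main() {
	encoded := []byte{4, 'c', 'i', 't', 'y', 3, 'n', 'y', 'c'}
	name, value, rest, err := readField(encoded)
	fmt.Println(string(name), string(value), len(rest), err)
}
```

Returning the unread remainder lets a caller decode consecutive fields by feeding `rest` back into `readField`.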

Index File

The index file contains, for each postings ID in the segment, the offset of the corresponding document in the data file. The base postings ID is stored at the start of the file as a little-endian uint64. Following it are the actual offsets.

┌───────────────────────────┐
│            Base           │
│          (uint64)         │
├───────────────────────────┤
│                           │
│                           │
│          Offsets          │
│                           │
│                           │
└───────────────────────────┘

Offsets

The offsets are stored serially, starting from the offset for the base postings ID. Each offset is a little-endian uint64. Because each offset has a fixed size, the offset for a given postings ID can be located directly by calculating its index relative to the start of the offsets. An offset equal to the maximum value of a uint64 indicates that there is no corresponding document for a given postings ID.

┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │       Offset 1        │ │
│ │       (uint64)        │ │
│ ├───────────────────────┤ │
│ │          ...          │ │
│ ├───────────────────────┤ │
│ │       Offset n        │ │
│ │       (uint64)        │ │
│ └───────────────────────┘ │
└───────────────────────────┘

Documentation ¶

Index ¶

type DataReader
- func NewDataReader(data []byte) *DataReader
- func (r *DataReader) Read(offset uint64) (doc.Document, error)
type DataWriter
- func NewDataWriter(w io.Writer) *DataWriter
- func (w *DataWriter) Reset(wr io.Writer)
- func (w *DataWriter) Write(d doc.Document) (int, error)
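type IndexReader
- func NewIndexReader(data []byte) (*IndexReader, error)
- func (r *IndexReader) Base() postings.ID
- func (r *IndexReader) Len() int
- func (r *IndexReader) Read(id postings.ID) (uint64, error)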
type IndexWriter
- func NewIndexWriter(w io.Writer) *IndexWriter
- func (w *IndexWriter) Reset(wr io.Writer)
- func (w *IndexWriter) Write(id postings.ID, offset uint64) error
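type SliceReader
- func NewSliceReader(offset postings.ID, docs []doc.Document) *SliceReader
- func (r *SliceReader) Base() postings.ID
- func (r *SliceReader) Len() int
- func (r *SliceReader) Read(id postings.ID) (doc.Document, error)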

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type DataReader ¶

type DataReader struct {
	// contains filtered or unexported fields
}

DataReader is a reader for the data file for documents.

func NewDataReader ¶

func NewDataReader(data []byte) *DataReader

NewDataReader returns a new DataReader.

func (*DataReader) Read ¶

func (r *DataReader) Read(offset uint64) (doc.Document, error)

type DataWriter ¶

type DataWriter struct {
	// contains filtered or unexported fields
}

DataWriter writes the data file for documents.

func NewDataWriter ¶

func NewDataWriter(w io.Writer) *DataWriter

NewDataWriter returns a new DataWriter.

func (*DataWriter) Reset ¶

func (w *DataWriter) Reset(wr io.Writer)

Reset resets the DataWriter.

func (*DataWriter) Write ¶

func (w *DataWriter) Write(d doc.Document) (int, error)

type IndexReader ¶

type IndexReader struct {
	// contains filtered or unexported fields
}

IndexReader is a reader for the index file for documents.

func NewIndexReader ¶

func NewIndexReader(data []byte) (*IndexReader, error)

NewIndexReader returns a new IndexReader.

func (*IndexReader) Base ¶

func (r *IndexReader) Base() postings.ID

Base returns the base postings ID.

func (*IndexReader) Len ¶

func (r *IndexReader) Len() int

Len returns the number of postings IDs.

func (*IndexReader) Read ¶

func (r *IndexReader) Read(id postings.ID) (uint64, error)

type IndexWriter ¶

type IndexWriter struct {
	// contains filtered or unexported fields
}

IndexWriter is a writer for the index file for documents.

func NewIndexWriter ¶

func NewIndexWriter(w io.Writer) *IndexWriter

NewIndexWriter returns a new IndexWriter.

func (*IndexWriter) Reset ¶

func (w *IndexWriter) Reset(wr io.Writer)

Reset resets the IndexWriter.

func (*IndexWriter) Write ¶

func (w *IndexWriter) Write(id postings.ID, offset uint64) error

Write writes the offset for an id. IDs must be written in increasing order but can be non-contiguous.

type SliceReader ¶ added in v0.5.0

type SliceReader struct {
	// contains filtered or unexported fields
}

SliceReader is a docs slice reader for use with documents stored in memory.

func NewSliceReader ¶ added in v0.5.0

func NewSliceReader(offset postings.ID, docs []doc.Document) *SliceReader

NewSliceReader returns a new docs slice reader.

func (*SliceReader) Base ¶ added in v0.5.0

func (r *SliceReader) Base() postings.ID

Base returns the postings ID base offset of the slice reader.

func (*SliceReader) Len ¶ added in v0.5.0

func (r *SliceReader) Len() int

Len returns the number of documents in the slice reader.

func (*SliceReader) Read ¶ added in v0.5.0

func (r *SliceReader) Read(id postings.ID) (doc.Document, error)

Read returns a document from the docs slice reader.
