Documentation ¶
Index ¶
- Constants
- Variables
- func LZ4Compress(bytes []byte, out DataOutput, ht *LZ4HashTable) error
- func LZ4Decompress(compressed DataInput, decompressedLen int, dest []byte) (length int, err error)
- func LZ4Decompressor(in DataInput, originalLength, offset, length int, bytes []byte) (res []byte, err error)
- type CompressingStoredFieldsFormat
- func (format *CompressingStoredFieldsFormat) FieldsReader(d store.Directory, si *model.SegmentInfo, fn model.FieldInfos, ...) (r StoredFieldsReader, err error)
- func (format *CompressingStoredFieldsFormat) FieldsWriter(d store.Directory, si *model.SegmentInfo, ctx store.IOContext) (w StoredFieldsWriter, err error)
- func (format *CompressingStoredFieldsFormat) String() string
- type CompressingStoredFieldsIndexReader
- type CompressingStoredFieldsReader
- type CompressingStoredFieldsWriter
- func (w *CompressingStoredFieldsWriter) Abort()
- func (w *CompressingStoredFieldsWriter) Close() error
- func (w *CompressingStoredFieldsWriter) Finish(fis model.FieldInfos, numDocs int) (err error)
- func (w *CompressingStoredFieldsWriter) FinishDocument() error
- func (w *CompressingStoredFieldsWriter) StartDocument() error
- func (w *CompressingStoredFieldsWriter) WriteField(info *model.FieldInfo, field model.IndexableField) error
- type CompressingTermVectorsFormat
- func (vf *CompressingTermVectorsFormat) VectorsReader(d store.Directory, segmentInfo *model.SegmentInfo, ...) (spi.TermVectorsReader, error)
- func (vf *CompressingTermVectorsFormat) VectorsWriter(d store.Directory, segmentInfo *model.SegmentInfo, context store.IOContext) (spi.TermVectorsWriter, error)
- type CompressionMode
- type CompressionModeDefaults
- type Compressor
- type DataInput
- type DataOutput
- type Decompressor
- type GrowableByteArrayDataOutput
- type LZ4HashTable
- type StoredFieldsIndexWriter
Constants ¶
const (
	MIN_MATCH     = 4       // minimum length of a match
	MAX_DISTANCE  = 1 << 16 // maximum distance of a reference
	LAST_LITERALS = 5       // the last 5 bytes must be encoded as literals
)
const (
	STRING         = 0x00
	BYTE_ARR       = 0x01
	NUMERIC_INT    = 0x02
	NUMERIC_FLOAT  = 0x03
	NUMERIC_LONG   = 0x04
	NUMERIC_DOUBLE = 0x05
)
const (
	CODEC_SFX_IDX      = "Index"
	CODEC_SFX_DAT      = "Data"
	VERSION_START      = 0
	VERSION_BIG_CHUNKS = 1
	VERSION_CHECKSUM   = 2
	VERSION_CURRENT    = VERSION_CHECKSUM
)
const BLOCK_SIZE = 1024
number of chunks to serialize at once
const BUFFER_REUSE_THRESHOLD = 1 << 15
Do not reuse the decompression buffer when there is more than 32KB to decompress
const (
COMPRESSION_MODE_FAST = CompressionModeDefaults(1)
)
const MAX_DOCUMENTS_PER_CHUNK = 128
hard limit on the maximum number of documents per chunk
const (
MEMORY_USAGE = 14
)
Variables ¶
var (
	TYPE_BITS = packed.BitsRequired(NUMERIC_DOUBLE)
	TYPE_MASK = int(packed.MaxValue(TYPE_BITS))
)
Functions ¶
func LZ4Compress ¶
func LZ4Compress(bytes []byte, out DataOutput, ht *LZ4HashTable) error
Compress bytes into out using at most 16KB of memory. ht shouldn't be shared across threads but can safely be reused.
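A minimal calling sketch; the helper below is illustrative rather than part of the package API, and it only assumes the DataOutput and LZ4HashTable types declared in this package:

// compressBlock is a hypothetical helper showing the calling convention.
// ht may be reused between calls to amortize allocation, but it must not
// be shared across goroutines.
func compressBlock(data []byte, out DataOutput, ht *LZ4HashTable) error {
	return LZ4Compress(data, out, ht)
}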
func LZ4Decompress ¶
func LZ4Decompress(compressed DataInput, decompressedLen int, dest []byte) (length int, err error)
Decompress at least decompressedLen bytes into dest. Note that dest must be large enough to hold all decompressed data, which means you need to know the total decompressed length up front.
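A hedged usage sketch; decompressBlock is illustrative only and assumes the caller already knows the decompressed length, as required above:

// decompressBlock is a hypothetical helper around LZ4Decompress.
func decompressBlock(in DataInput, decompressedLen int) ([]byte, error) {
	dest := make([]byte, decompressedLen) // must be large enough for all decompressed data
	n, err := LZ4Decompress(in, decompressedLen, dest)
	if err != nil {
		return nil, err
	}
	return dest[:n], nil
}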
Types ¶
type CompressingStoredFieldsFormat ¶
type CompressingStoredFieldsFormat struct {
// contains filtered or unexported fields
}
A StoredFieldsFormat that is very similar to Lucene40StoredFieldsFormat but compresses documents in chunks in order to improve the compression ratio.
For a chunk size of chunkSize bytes, this StoredFieldsFormat does not support documents larger than (2^31 - chunkSize) bytes. In case this is a problem, you should use another format, such as Lucene40StoredFieldsFormat.
For optimal performance, you should use a MergePolicy that returns segments that have the biggest byte size first.
func NewCompressingStoredFieldsFormat ¶
func NewCompressingStoredFieldsFormat(formatName, segmentSuffix string, compressionMode CompressionMode, chunkSize int) *CompressingStoredFieldsFormat
Create a new CompressingStoredFieldsFormat
formatName is the name of the format. This name will be used in the file formats to perform CheckHeader().
segmentSuffix is the segment suffix. This suffix is added to the resulting file name only if it is not the empty string.

The compressionMode parameter allows you to choose between compression algorithms that have various compression and decompression speeds, so that you can pick the one that best fits your indexing and searching throughput. You should never instantiate two CompressingStoredFieldsFormats that have the same name but different CompressionModes.

chunkSize is the minimum byte size of a chunk of documents. A value of 1 can make sense if there is redundancy across fields. In that case, both performance and compression ratio should be better than with Lucene40StoredFieldsFormat with compressed fields. Higher values of chunkSize should improve the compression ratio but will require more memory at indexing time and might make document loading a little slower (depending on the size of your OS cache compared to the size of your index).
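A minimal construction sketch using the documented constructor; the format name and chunk size are illustrative values, not ones mandated by this package:

func newExampleStoredFieldsFormat() *CompressingStoredFieldsFormat {
	// An empty segmentSuffix adds nothing to the file names; COMPRESSION_MODE_FAST
	// is the CompressionMode exported by this package.
	return NewCompressingStoredFieldsFormat(
		"ExampleStoredFields",
		"",
		COMPRESSION_MODE_FAST,
		1<<14, // chunkSize in bytes (16KB)
	)
}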
func (*CompressingStoredFieldsFormat) FieldsReader ¶
func (format *CompressingStoredFieldsFormat) FieldsReader(d store.Directory, si *model.SegmentInfo, fn model.FieldInfos, ctx store.IOContext) (r StoredFieldsReader, err error)
func (*CompressingStoredFieldsFormat) FieldsWriter ¶
func (format *CompressingStoredFieldsFormat) FieldsWriter(d store.Directory, si *model.SegmentInfo, ctx store.IOContext) (w StoredFieldsWriter, err error)
func (*CompressingStoredFieldsFormat) String ¶
func (format *CompressingStoredFieldsFormat) String() string
type CompressingStoredFieldsIndexReader ¶
type CompressingStoredFieldsIndexReader struct {
// contains filtered or unexported fields
}
Random-access reader for CompressingStoredFieldsIndexWriter
func (*CompressingStoredFieldsIndexReader) Clone ¶
func (r *CompressingStoredFieldsIndexReader) Clone() *CompressingStoredFieldsIndexReader
type CompressingStoredFieldsReader ¶
type CompressingStoredFieldsReader struct {
// contains filtered or unexported fields
}
StoredFieldsReader impl for CompressingStoredFieldsFormat
func (*CompressingStoredFieldsReader) Clone ¶
func (r *CompressingStoredFieldsReader) Clone() StoredFieldsReader
func (*CompressingStoredFieldsReader) Close ¶
func (r *CompressingStoredFieldsReader) Close() (err error)
Close the underlying IndexInputs
func (*CompressingStoredFieldsReader) VisitDocument ¶
func (r *CompressingStoredFieldsReader) VisitDocument(docID int, visitor StoredFieldVisitor) error
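A short reading sketch; visitDocs is illustrative, and the reader and StoredFieldVisitor (the type named in the signature above) are assumed to be obtained from the surrounding index machinery:

// visitDocs is a hypothetical helper that loads several stored documents
// through the same reader by delegating to VisitDocument.
func visitDocs(r *CompressingStoredFieldsReader, docIDs []int, visitor StoredFieldVisitor) error {
	for _, id := range docIDs {
		if err := r.VisitDocument(id, visitor); err != nil {
			return err
		}
	}
	return nil
}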
type CompressingStoredFieldsWriter ¶
type CompressingStoredFieldsWriter struct {
// contains filtered or unexported fields
}
StoredFieldsWriter impl for CompressingStoredFieldsFormat
func NewCompressingStoredFieldsWriter ¶
func NewCompressingStoredFieldsWriter(dir store.Directory, si *model.SegmentInfo, segmentSuffix string, ctx store.IOContext, formatName string, compressionMode CompressionMode, chunkSize int) (*CompressingStoredFieldsWriter, error)
func (*CompressingStoredFieldsWriter) Abort ¶
func (w *CompressingStoredFieldsWriter) Abort()
func (*CompressingStoredFieldsWriter) Close ¶
func (w *CompressingStoredFieldsWriter) Close() error
func (*CompressingStoredFieldsWriter) Finish ¶
func (w *CompressingStoredFieldsWriter) Finish(fis model.FieldInfos, numDocs int) (err error)
func (*CompressingStoredFieldsWriter) FinishDocument ¶
func (w *CompressingStoredFieldsWriter) FinishDocument() error
func (*CompressingStoredFieldsWriter) StartDocument ¶
func (w *CompressingStoredFieldsWriter) StartDocument() error
func (*CompressingStoredFieldsWriter) WriteField ¶
func (w *CompressingStoredFieldsWriter) WriteField(info *model.FieldInfo, field model.IndexableField) error
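A sketch of the per-document call sequence on the writer; storeDocument is illustrative, and the parallel infos/fields slices are an assumption about how the caller tracks field metadata. Finish and Close are then expected once per segment, after all documents have been written.

// storeDocument is a hypothetical helper showing the expected lifecycle:
// StartDocument, one WriteField per stored field, then FinishDocument.
func storeDocument(w *CompressingStoredFieldsWriter, infos []*model.FieldInfo, fields []model.IndexableField) error {
	if err := w.StartDocument(); err != nil {
		return err
	}
	for i, f := range fields {
		if err := w.WriteField(infos[i], f); err != nil {
			return err
		}
	}
	return w.FinishDocument()
}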
type CompressingTermVectorsFormat ¶
type CompressingTermVectorsFormat struct {
// contains filtered or unexported fields
}
A TermVectorsFormat that compresses chunks of documents together in order to improve the compression ratio.
func NewCompressingTermVectorsFormat ¶
func NewCompressingTermVectorsFormat(formatName, segmentSuffix string, compressionMode CompressionMode, chunkSize int) *CompressingTermVectorsFormat
Create a new CompressingTermVectorsFormat
formatName is the name of the format. This name will be used in the file formats to perform codec header checks.
The compressionMode parameter allows you to choose between compression algorithms that have various compression and decompression speeds so that you can pick the one that best fits your indexing and searching throughput. You should never instantiate two CompressingTermVectorsFormats that have the same name but different CompressionModes.
chunkSize is the minimum byte size of a chunk of documents. Higher values of chunkSize should improve the compression ratio but will require more memory at indexing time and might make document loading a little slower (depending on the size of your OS cache compared to the size of your index).
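A minimal construction sketch, mirroring the stored-fields example above with illustrative values:

func newExampleTermVectorsFormat() *CompressingTermVectorsFormat {
	// The format name and 4KB chunk size are illustrative; a given format name
	// must always be paired with the same CompressionMode.
	return NewCompressingTermVectorsFormat("ExampleTermVectors", "", COMPRESSION_MODE_FAST, 1<<12)
}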
func (*CompressingTermVectorsFormat) VectorsReader ¶
func (vf *CompressingTermVectorsFormat) VectorsReader(d store.Directory, segmentInfo *model.SegmentInfo, fieldsInfos model.FieldInfos, context store.IOContext) (spi.TermVectorsReader, error)
func (*CompressingTermVectorsFormat) VectorsWriter ¶
func (vf *CompressingTermVectorsFormat) VectorsWriter(d store.Directory, segmentInfo *model.SegmentInfo, context store.IOContext) (spi.TermVectorsWriter, error)
type CompressionMode ¶
type CompressionMode interface {
	NewCompressor() Compressor
	NewDecompressor() Decompressor
}
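A brief sketch of using a CompressionMode as a factory; compressWith is illustrative, and COMPRESSION_MODE_FAST is the mode exported by this package:

// compressWith is a hypothetical helper: it obtains a Compressor from the
// mode and invokes it (Compressor is a func type, so it is called directly).
func compressWith(mode CompressionMode, data []byte, out DataOutput) error {
	compress := mode.NewCompressor()
	return compress(data, out)
}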
type CompressionModeDefaults ¶
type CompressionModeDefaults int
func (CompressionModeDefaults) NewCompressor ¶
func (m CompressionModeDefaults) NewCompressor() Compressor
func (CompressionModeDefaults) NewDecompressor ¶
func (m CompressionModeDefaults) NewDecompressor() Decompressor
type Compressor ¶
type Compressor func(bytes []byte, out DataOutput) error
Compress bytes into out. It is the responsibility of the compressor to add all necessary information so that a Decompressor will know when to stop decompressing bytes from the stream.
type DataOutput ¶
type Decompressor ¶
type Decompressor func(in DataInput, originalLength, offset, length int, bytes []byte) (buf []byte, err error)
Decompress bytes that were stored between offset and offset+length in the original stream, reading from the compressed stream 'in' into 'bytes'. The length of the returned slice (len(buf)) must be equal to 'length'. Implementations of this method are free to resize 'bytes' depending on their needs.
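As an illustration of the Compressor and Decompressor contracts, here is a hedged sketch of a matched pair that stores bytes uncompressed. It assumes WriteBytes(buf []byte) error on DataOutput and ReadBytes(buf []byte) error on DataInput (mirroring Lucene's data I/O abstractions); neither method is spelled out in these docs.

// rawCompressor simply copies the input; no extra framing is added in this
// simplified sketch because the matching Decompressor is handed originalLength
// by its caller.
var rawCompressor Compressor = func(bytes []byte, out DataOutput) error {
	return out.WriteBytes(bytes) // assumption: DataOutput has WriteBytes
}

// rawDecompressor reads originalLength raw bytes back and returns exactly the
// [offset:offset+length] window, resizing 'bytes' if it is too small.
var rawDecompressor Decompressor = func(in DataInput, originalLength, offset, length int, bytes []byte) ([]byte, error) {
	if cap(bytes) < originalLength {
		bytes = make([]byte, originalLength)
	}
	bytes = bytes[:originalLength]
	if err := in.ReadBytes(bytes); err != nil { // assumption: DataInput has ReadBytes
		return nil, err
	}
	return bytes[offset : offset+length], nil // len(result) must equal 'length'
}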
type GrowableByteArrayDataOutput ¶
type GrowableByteArrayDataOutput struct {
	*util.DataOutputImpl
	// contains filtered or unexported fields
}
A DataOutput that can be used to build a []byte
func (*GrowableByteArrayDataOutput) WriteByte ¶
func (out *GrowableByteArrayDataOutput) WriteByte(b byte) error
func (*GrowableByteArrayDataOutput) WriteBytes ¶
func (out *GrowableByteArrayDataOutput) WriteBytes(b []byte) error
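A small illustrative helper that uses only the two methods documented here; how the output instance itself is constructed is not shown in these docs:

// appendRecord is a hypothetical helper that appends a one-byte tag followed
// by a payload to an already-constructed GrowableByteArrayDataOutput.
func appendRecord(out *GrowableByteArrayDataOutput, tag byte, payload []byte) error {
	if err := out.WriteByte(tag); err != nil {
		return err
	}
	return out.WriteBytes(payload)
}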
type LZ4HashTable ¶
type LZ4HashTable struct {
// contains filtered or unexported fields
}
type StoredFieldsIndexWriter ¶
type StoredFieldsIndexWriter struct {
// contains filtered or unexported fields
}
Efficient index format for block-based Codecs.
This writer generates a file which can be loaded into memory using memory-efficient data structures to quickly locate the block that contains any document.

In order to have a compact in-memory representation, for every block of 1024 chunks, this index computes the average number of bytes per chunk and, for every chunk, only stores the difference between ${chunk number} * ${average length of a chunk} and the actual start offset of the chunk.
Data is written as follows:
- PackedIntsVersion, <Block>^BlockCount, BlocksEndMarker
- PackedIntsVersion --> VERSION_CURRENT as a vint
- BlocksEndMarker --> 0 as a vint, this marks the end of blocks since blocks are not allowed to start with 0
- Block --> BlockChunks, <DocBases>, <StartPointers>
- BlockChunks --> a vint which is the number of chunks encoded in the block
- DocBases --> DocBase, AvgChunkDocs, BitsPerDocBaseDelta, DocBaseDeltas
- DocBase --> first document ID of the block of chunks, as a vint
- AvgChunkDocs --> average number of documents in a single chunk, as a vint
- BitsPerDocBaseDelta --> number of bits required to represent a delta from the average using ZigZag encoding
- DocBaseDeltas --> packed array of BlockChunks elements of BitsPerDocBaseDelta bits each, representing the deltas from the average doc base using ZigZag encoding.
- StartPointers --> StartPointerBase, AvgChunkSize, BitsPerStartPointerDelta, StartPointerDeltas
- StartPointerBase --> the first start pointer of the block, as a vint64
- AvgChunkSize --> the average size of a chunk of compressed documents, as a vint64
- BitsPerStartPointerDelta --> number of bits required to represent a delta from the average using ZigZag encoding
- StartPointerDeltas --> packed array of BlockChunks elements of BitsPerStartPointerDelta bits each, representing the deltas from the average start pointer using ZigZag encoding
Notes ¶
- For any block, the doc base of the n-th chunk can be restored with DocBase + AvgChunkDocs * n + DocBaseDeltas[n].
- For any block, the start pointer of the n-th chunk can be restored with StartPointerBase + AvgChunkSize * n + StartPointerDeltas[n].
- Once data is loaded into memory, you can look up the start pointer of any document by performing two binary searches: a first one based on the values of DocBase in order to find the right block, and then inside the block based on DocBaseDeltas (by reconstructing the doc bases for every chunk).
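A hedged Go rendering of the two reconstruction formulas above; the delta slices are assumed to have already been unpacked from their packed representation:

// chunkDocBase and chunkStartPointer are illustrative helpers, not package API.
func chunkDocBase(docBase, avgChunkDocs, n int, docBaseDeltas []int) int {
	return docBase + avgChunkDocs*n + docBaseDeltas[n]
}

func chunkStartPointer(startPointerBase, avgChunkSize int64, n int, startPointerDeltas []int64) int64 {
	return startPointerBase + avgChunkSize*int64(n) + startPointerDeltas[n]
}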
func NewStoredFieldsIndexWriter ¶
func NewStoredFieldsIndexWriter(indexOutput store.IndexOutput) (*StoredFieldsIndexWriter, error)
func (*StoredFieldsIndexWriter) Close ¶
func (w *StoredFieldsIndexWriter) Close() error