compressing

package
v0.0.3
Published: Dec 24, 2020 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Index

Constants

const (
	MIN_MATCH     = 4       // minimum length of a match
	MAX_DISTANCE  = 1 << 16 // maximum distance of a reference
	LAST_LITERALS = 5       // the last 5 bytes must be encoded as literals
)
const (
	STRING         = 0x00
	BYTE_ARR       = 0x01
	NUMERIC_INT    = 0x02
	NUMERIC_FLOAT  = 0x03
	NUMERIC_LONG   = 0x04
	NUMERIC_DOUBLE = 0x05
)
const (
	CODEC_SFX_IDX      = "Index"
	CODEC_SFX_DAT      = "Data"
	VERSION_START      = 0
	VERSION_BIG_CHUNKS = 1
	VERSION_CHECKSUM   = 2
	VERSION_CURRENT    = VERSION_CHECKSUM
)
const BLOCK_SIZE = 1024

number of chunks to serialize at once

const BUFFER_REUSE_THRESHOLD = 1 << 15

Do not reuse the decompression buffer when there is more than 32 KB to decompress

const (
	COMPRESSION_MODE_FAST = CompressionModeDefaults(1)
)
const MAX_DOCUMENTS_PER_CHUNK = 128

hard limit on the maximum number of documents per chunk

const (
	MEMORY_USAGE = 14
)

Variables

Functions

func LZ4Compress

func LZ4Compress(bytes []byte, out DataOutput, ht *LZ4HashTable) error

Compress bytes into out using at most 16KB of memory. ht shouldn't be shared across threads but can safely be reused.

func LZ4Decompress

func LZ4Decompress(compressed DataInput, decompressedLen int, dest []byte) (length int, err error)

Decompress at least decompressedLen bytes into dest. Note that dest must be large enough to hold all decompressed data (meaning that you need to know the total decompressed length).
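
As an illustration of how these two functions fit together, here is a minimal round-trip sketch. It is written as if it lived inside this package, and it assumes that the zero value of LZ4HashTable is usable. byteSliceOutput and byteSliceInput are throwaway in-memory adapters for the DataOutput and DataInput interfaces documented below; they are not types provided by the package.

package compressing

import "fmt"

// byteSliceOutput is a minimal in-memory DataOutput used only for this sketch.
type byteSliceOutput struct{ buf []byte }

func (o *byteSliceOutput) WriteByte(b byte) error    { o.buf = append(o.buf, b); return nil }
func (o *byteSliceOutput) WriteBytes(p []byte) error { o.buf = append(o.buf, p...); return nil }
func (o *byteSliceOutput) WriteInt(i int32) error {
	return o.WriteBytes([]byte{byte(i >> 24), byte(i >> 16), byte(i >> 8), byte(i)})
}
func (o *byteSliceOutput) WriteVInt(i int32) error { // 7-bit variable-length encoding
	u := uint32(i)
	for u >= 0x80 {
		if err := o.WriteByte(byte(u&0x7f) | 0x80); err != nil {
			return err
		}
		u >>= 7
	}
	return o.WriteByte(byte(u))
}
func (o *byteSliceOutput) WriteString(s string) error {
	if err := o.WriteVInt(int32(len(s))); err != nil {
		return err
	}
	return o.WriteBytes([]byte(s))
}

// byteSliceInput is a minimal in-memory DataInput (no bounds checking in this sketch).
type byteSliceInput struct {
	buf []byte
	pos int
}

func (in *byteSliceInput) ReadByte() (byte, error) {
	b := in.buf[in.pos]
	in.pos++
	return b, nil
}
func (in *byteSliceInput) ReadBytes(p []byte) error {
	copy(p, in.buf[in.pos:])
	in.pos += len(p)
	return nil
}

func lz4RoundTripSketch() error {
	data := []byte("abcabcabcabcabcabcabcabcabcabcabcabc") // repetitive input compresses well
	out := &byteSliceOutput{}
	ht := new(LZ4HashTable) // assumption: the zero value is ready to use
	if err := LZ4Compress(data, out, ht); err != nil {
		return err
	}

	// Leave a little slack: decompression may write slightly past decompressedLen.
	dest := make([]byte, len(data)+8)
	if _, err := LZ4Decompress(&byteSliceInput{buf: out.buf}, len(data), dest); err != nil {
		return err
	}
	fmt.Printf("round-tripped %d bytes: %q\n", len(data), dest[:len(data)])
	return nil
}

Because ht can be reused across calls (but must not be shared across goroutines), a long-lived writer would typically keep one hash table per goroutine.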

func LZ4Decompressor

func LZ4Decompressor(in DataInput, originalLength, offset, length int, bytes []byte) (res []byte, err error)

Types

type CompressingStoredFieldsFormat

type CompressingStoredFieldsFormat struct {
	// contains filtered or unexported fields
}

A StoredFieldsFormat that is very similar to Lucene40StoredFieldsFormat but compresses documents in chunks in order to improve the compression ratio.

For a chunk size of chunkSize bytes, this StoredFieldsFormat does not support documents larger than (2^31 - chunkSize) bytes. In case this is a problem, you should use another format, such as Lucene40StoredFieldsFormat.

For optimal performance, you should use a MergePolicy that returns segments that have the biggest byte size first.

func NewCompressingStoredFieldsFormat

func NewCompressingStoredFieldsFormat(formatName, segmentSuffix string,
	compressionMode CompressionMode, chunkSize int) *CompressingStoredFieldsFormat

Create a new CompressingStoredFieldsFormat

formatName is the name of the format. This name will be used in the file formats to perform CheckHeader().

segmentSuffix is the segment suffix. This suffix is added to the resulting file name only if it's not the empty string.

The compressionMode parameter allows you to choose between compression algorithms that have various compression and decompression speeds so that you can pick the one that best fits your indexing and searching throughput. You should never instantiate two CompressingStoredFieldsFormats that have the same name but different CompressionModes.

chunkSize is the minimum byte size of a chunk of documents. A value of 1 can make sense if there is redundancy across fields. In that case, both performance and compression ratio should be better than with Lucene40StoredFieldsFormat with compressed fields.

Higher values of chunkSize should improve the compression ratio but will require more memory at indexing time and might make document loading a little slower (depending on the size of your OS cache compared to the size of your index).
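
For example, a format instance built around the fast default mode might be created like this. This is only a sketch: "MyStoredFields" and the 16 KB chunk size are illustrative values, not values mandated by the package.

func newStoredFieldsFormatSketch() *CompressingStoredFieldsFormat {
	return NewCompressingStoredFieldsFormat(
		"MyStoredFields",      // formatName, later verified by CheckHeader()
		"",                    // segmentSuffix: empty, so nothing is appended to file names
		COMPRESSION_MODE_FAST, // CompressionMode
		1<<14,                 // chunkSize: roughly 16 KB of documents per compressed chunk
	)
}

The resulting format then hands out readers and writers through FieldsReader and FieldsWriter.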

func (*CompressingStoredFieldsFormat) FieldsReader

func (format *CompressingStoredFieldsFormat) FieldsReader(d store.Directory,
	si *model.SegmentInfo, fn model.FieldInfos, ctx store.IOContext) (r StoredFieldsReader, err error)

func (*CompressingStoredFieldsFormat) FieldsWriter

func (format *CompressingStoredFieldsFormat) FieldsWriter(d store.Directory,
	si *model.SegmentInfo, ctx store.IOContext) (w StoredFieldsWriter, err error)

func (*CompressingStoredFieldsFormat) String

func (format *CompressingStoredFieldsFormat) String() string

type CompressingStoredFieldsIndexReader

type CompressingStoredFieldsIndexReader struct {
	// contains filtered or unexported fields
}

Random-access reader for CompressingStoredFieldsIndexWriter

func (*CompressingStoredFieldsIndexReader) Clone

type CompressingStoredFieldsReader

type CompressingStoredFieldsReader struct {
	// contains filtered or unexported fields
}

StoredFieldsReader impl for CompressingStoredFieldsFormat

func (*CompressingStoredFieldsReader) Clone

func (r *CompressingStoredFieldsReader) Clone() StoredFieldsReader

func (*CompressingStoredFieldsReader) Close

func (r *CompressingStoredFieldsReader) Close() (err error)

Close the underlying IndexInputs

func (*CompressingStoredFieldsReader) VisitDocument

func (r *CompressingStoredFieldsReader) VisitDocument(docID int, visitor StoredFieldVisitor) error

type CompressingStoredFieldsWriter

type CompressingStoredFieldsWriter struct {
	// contains filtered or unexported fields
}

StoredFieldsWriter impl for CompressingStoredFieldsFormat

func NewCompressingStoredFieldsWriter

func NewCompressingStoredFieldsWriter(dir store.Directory, si *model.SegmentInfo,
	segmentSuffix string, ctx store.IOContext, formatName string,
	compressionMode CompressionMode, chunkSize int) (*CompressingStoredFieldsWriter, error)

func (*CompressingStoredFieldsWriter) Abort

func (w *CompressingStoredFieldsWriter) Abort()

func (*CompressingStoredFieldsWriter) Close

func (*CompressingStoredFieldsWriter) Finish

func (w *CompressingStoredFieldsWriter) Finish(fis model.FieldInfos, numDocs int) (err error)

func (*CompressingStoredFieldsWriter) FinishDocument

func (w *CompressingStoredFieldsWriter) FinishDocument() error

func (*CompressingStoredFieldsWriter) StartDocument

func (w *CompressingStoredFieldsWriter) StartDocument() error

func (*CompressingStoredFieldsWriter) WriteField

type CompressingTermVectorsFormat

type CompressingTermVectorsFormat struct {
	// contains filtered or unexported fields
}

A TermVectorsFormat that compresses chunks of documents together in order to improve the compression ratio.

func NewCompressingTermVectorsFormat

func NewCompressingTermVectorsFormat(formatName, segmentSuffix string,
	compressionMode CompressionMode, chunkSize int) *CompressingTermVectorsFormat

Create a new CompressingTermVectorsFormat

formatName is the name of the format. This name will be used in the file formats to perform codec header checks.

The compressionMode parameter allows you to choose between compression algorithms that have various compression and decompression speeds so that you can pick the one that best fits your indexing and searching throughput. You should never instantiate two CompressingTermVectorsFormats that have the same name but different CompressionModes.

chunkSize is the minimum byte size of a chunk of documents. Higher values of chunkSize should improve the compression ratio but will require more memory at indexing time and might make document loading a little slower (depending on the size of your OS cache compared to the size of your index).
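
Construction mirrors NewCompressingStoredFieldsFormat; as a sketch (the name and chunk size are again illustrative values):

func newTermVectorsFormatSketch() *CompressingTermVectorsFormat {
	return NewCompressingTermVectorsFormat(
		"MyTermVectors",       // formatName, used for codec header checks
		"",                    // segmentSuffix: empty, so no suffix is added
		COMPRESSION_MODE_FAST, // CompressionMode
		1<<12,                 // chunkSize in bytes
	)
}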

func (*CompressingTermVectorsFormat) VectorsReader

func (vf *CompressingTermVectorsFormat) VectorsReader(d store.Directory,
	segmentInfo *model.SegmentInfo, fieldsInfos model.FieldInfos,
	context store.IOContext) (spi.TermVectorsReader, error)

func (*CompressingTermVectorsFormat) VectorsWriter

func (vf *CompressingTermVectorsFormat) VectorsWriter(d store.Directory,
	segmentInfo *model.SegmentInfo,
	context store.IOContext) (spi.TermVectorsWriter, error)

type CompressionMode

type CompressionMode interface {
	NewCompressor() Compressor
	NewDecompressor() Decompressor
}

type CompressionModeDefaults

type CompressionModeDefaults int

func (CompressionModeDefaults) NewCompressor

func (m CompressionModeDefaults) NewCompressor() Compressor

func (CompressionModeDefaults) NewDecompressor

func (m CompressionModeDefaults) NewDecompressor() Decompressor

type Compressor

type Compressor func(bytes []byte, out DataOutput) error

Compress bytes into out. It is the responsibility of the compressor to add all necessary information so that a Decompressor will know when to stop decompressing bytes from the stream.

type DataInput

type DataInput interface {
	ReadByte() (b byte, err error)
	ReadBytes(buf []byte) error
}

type DataOutput

type DataOutput interface {
	WriteByte(b byte) error
	WriteBytes(buf []byte) error
	WriteInt(i int32) error
	WriteVInt(i int32) error
	WriteString(string) error
}

type Decompressor

type Decompressor func(in DataInput, originalLength, offset, length int, bytes []byte) (buf []byte, err error)

Decompress the bytes that were stored at [offset:offset+length] in the original stream, reading from the compressed stream 'in' into 'bytes'. The length of the returned slice (len(buf)) must be equal to 'length'. Implementations of this method are free to resize 'bytes' depending on their needs.
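
To show how CompressionMode, Compressor and Decompressor fit together, here is a sketch of a mode that stores bytes uncompressed. rawMode is an illustrative name, not a mode shipped with this package, and it relies only on the DataInput and DataOutput methods documented above.

// rawMode is a sketch of a CompressionMode that performs no compression.
type rawMode struct{}

var _ CompressionMode = rawMode{} // compile-time interface check

func (rawMode) NewCompressor() Compressor {
	return func(bytes []byte, out DataOutput) error {
		// Record the length first so the decompressor knows when to stop reading.
		if err := out.WriteVInt(int32(len(bytes))); err != nil {
			return err
		}
		return out.WriteBytes(bytes)
	}
}

func (rawMode) NewDecompressor() Decompressor {
	return func(in DataInput, originalLength, offset, length int, bytes []byte) ([]byte, error) {
		// Read back the VInt length written by the compressor above
		// (it equals originalLength for this no-compression mode).
		var n, shift uint
		for {
			b, err := in.ReadByte()
			if err != nil {
				return nil, err
			}
			n |= uint(b&0x7f) << shift
			if b < 0x80 {
				break
			}
			shift += 7
		}
		if cap(bytes) < int(n) {
			bytes = make([]byte, n)
		}
		bytes = bytes[:n]
		if err := in.ReadBytes(bytes); err != nil {
			return nil, err
		}
		// Return exactly the requested [offset : offset+length] window,
		// as the Decompressor contract requires (len(result) == length).
		return bytes[offset : offset+length], nil
	}
}

A real mode would typically pair something like LZ4Compress with LZ4Decompressor instead of copying the bytes verbatim; COMPRESSION_MODE_FAST is the package's built-in default.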

type GrowableByteArrayDataOutput

type GrowableByteArrayDataOutput struct {
	*util.DataOutputImpl
	// contains filtered or unexported fields
}

A DataOutput that can be used to build a []byte

func (*GrowableByteArrayDataOutput) WriteByte

func (out *GrowableByteArrayDataOutput) WriteByte(b byte) error

func (*GrowableByteArrayDataOutput) WriteBytes

func (out *GrowableByteArrayDataOutput) WriteBytes(b []byte) error

type LZ4HashTable

type LZ4HashTable struct {
	// contains filtered or unexported fields
}

type StoredFieldsIndexWriter

type StoredFieldsIndexWriter struct {
	// contains filtered or unexported fields
}

Efficient index format for block-based Codecs.

This writer generates a file which can be loaded into memory using memory-efficient data structures to quickly locate the block that contains any document.

In order to have a compact in-memory representation, for every block of 1024 chunks, this index computes the average number of bytes per chunk and, for every chunk, only stores the difference between ${chunk number} * ${average length of a chunk} and the actual start offset of the chunk.

Data is written as follows:

  • PackedIntsVersion, <Block>^BlockCount, BlocksEndMarker
  • PackedIntsVersion --> VERSION_CURRENT as a vint
  • BlocksEndMarker --> 0 as a vint, this marks the end of blocks since blocks are not allowed to start with 0
  • Block --> BlockChunks, <DocBases>, <StartPointers>
  • BlockChunks --> a vint which is the number of chunks encoded in the block
  • DocBases --> DocBase, AvgChunkDocs, BitsPerDocBaseDelta, DocBaseDeltas
  • DocBase --> first document ID of the block of chunks, as a vint
  • AvgChunkDocs --> average number of documents in a single chunk, as a vint
  • BitsPerDocBaseDelta --> number of bits required to represent a delta from the average using ZigZag encoding
  • DocBaseDeltas --> packed array of BlockChunks elements of BitsPerDocBaseDelta bits each, representing the deltas from the average doc base using ZigZag encoding
  • StartPointers --> StartPointerBase, AvgChunkSize, BitsPerStartPointerDelta, StartPointerDeltas
  • StartPointerBase --> the first start pointer of the block, as a vint64
  • AvgChunkSize --> the average size of a chunk of compressed documents, as a vint64
  • BitsPerStartPointerDelta --> number of bits required to represent a delta from the average using ZigZag encoding
  • StartPointerDeltas --> packed array of BlockChunks elements of BitsPerStartPointerDelta bits each, representing the deltas from the average start pointer using ZigZag encoding
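
The DocBaseDeltas and StartPointerDeltas above are signed differences from an average, which is why they are ZigZag-encoded before being packed. For reference, the standard ZigZag transform (not code taken from this package) looks like this:

// Standard ZigZag transform: small signed values map to small unsigned ones
// (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...), so they pack into few bits.
func zigZagEncode(v int64) uint64 { return uint64((v << 1) ^ (v >> 63)) }
func zigZagDecode(u uint64) int64 { return int64(u>>1) ^ -int64(u&1) }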

Notes

  • For any block, the doc base of the n-th chunk can be restored with DocBase + AvgChunkDocs * n + DocBaseDeltas[n].
  • For any block, the start pointer of the n-th chunk can be restored with StartPointerBase + AvgChunkSize * n + StartPointerDeltas[n].
  • Once data is loaded into memory, you can look up the start pointer of any document by performing two binary searches: a first one based on the values of DocBase in order to find the right block, and then inside the block based on DocBaseDeltas (by reconstructing the doc bases for every chunk).
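
As a sketch of that reconstruction and of the in-block lookup (the struct and field names below are illustrative, not the package's internal representation; the outer binary search over blocks works the same way and is omitted):

// indexBlock mirrors the per-block values described above, after decoding.
type indexBlock struct {
	docBase          int     // first doc ID of the block
	avgChunkDocs     int     // average number of documents per chunk
	docBaseDeltas    []int   // ZigZag-decoded deltas, one per chunk
	startPointerBase int64   // first start pointer of the block
	avgChunkSize     int64   // average size of a compressed chunk
	startPtrDeltas   []int64 // ZigZag-decoded start-pointer deltas, one per chunk
}

// chunkDocBase restores the doc base of the n-th chunk of the block.
func (b *indexBlock) chunkDocBase(n int) int {
	return b.docBase + b.avgChunkDocs*n + b.docBaseDeltas[n]
}

// chunkStartPointer restores the file offset of the n-th chunk of the block.
func (b *indexBlock) chunkStartPointer(n int) int64 {
	return b.startPointerBase + b.avgChunkSize*int64(n) + b.startPtrDeltas[n]
}

// startPointerOf locates the chunk containing docID inside a block whose
// doc-base range is already known to cover docID.
func (b *indexBlock) startPointerOf(docID int) int64 {
	// Binary search for the last chunk whose doc base is <= docID.
	lo, hi := 0, len(b.docBaseDeltas)-1
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if b.chunkDocBase(mid) <= docID {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return b.chunkStartPointer(lo)
}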

func NewStoredFieldsIndexWriter

func NewStoredFieldsIndexWriter(indexOutput store.IndexOutput) (*StoredFieldsIndexWriter, error)

func (*StoredFieldsIndexWriter) Close

func (w *StoredFieldsIndexWriter) Close() error
