compressing

package
v0.0.3
Published: Dec 24, 2020 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Index

Constants

const (
	MIN_MATCH     = 4       // minimum length of a match
	MAX_DISTANCE  = 1 << 16 // maximum distance of a reference
	LAST_LITERALS = 5       // the last 5 bytes must be encoded as literals
)
const (
	STRING         = 0x00
	BYTE_ARR       = 0x01
	NUMERIC_INT    = 0x02
	NUMERIC_FLOAT  = 0x03
	NUMERIC_LONG   = 0x04
	NUMERIC_DOUBLE = 0x05
)
const (
	CODEC_SFX_IDX      = "Index"
	CODEC_SFX_DAT      = "Data"
	VERSION_START      = 0
	VERSION_BIG_CHUNKS = 1
	VERSION_CHECKSUM   = 2
	VERSION_CURRENT    = VERSION_CHECKSUM
)
const BLOCK_SIZE = 1024

number of chunks to serialize at once

const BUFFER_REUSE_THRESHOLD = 1 << 15

Do not reuse the decompression buffer when there is more than 32 KB to decompress

const (
	COMPRESSION_MODE_FAST = CompressionModeDefaults(1)
)
const MAX_DOCUMENTS_PER_CHUNK = 128

hard limit on the maximum number of documents per chunk

const (
	MEMORY_USAGE = 14
)

Variables

Functions

func LZ4Compress

func LZ4Compress(bytes []byte, out DataOutput, ht *LZ4HashTable) error

Compress bytes into out using at most 16KB of memory. ht shouldn't be shared across threads but can safely be reused.

func LZ4Decompress

func LZ4Decompress(compressed DataInput, decompressedLen int, dest []byte) (length int, err error)

Decompress at least decompressedLen bytes into dest. Note that dest must be large enough to hold all decompressed data (meaning that you need to know the total decompressed length).
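
As an illustration of how these two functions fit together, here is a minimal round-trip sketch. It is written as if it lived inside this package, and it assumes that the zero value of LZ4HashTable is usable. byteSliceOutput and byteSliceInput are throwaway in-memory adapters for the DataOutput and DataInput interfaces documented below; they are not types provided by the package.

package compressing

import "fmt"

// byteSliceOutput is a minimal in-memory DataOutput used only for this sketch.
type byteSliceOutput struct{ buf []byte }

func (o *byteSliceOutput) WriteByte(b byte) error    { o.buf = append(o.buf, b); return nil }
func (o *byteSliceOutput) WriteBytes(p []byte) error { o.buf = append(o.buf, p...); return nil }
func (o *byteSliceOutput) WriteInt(i int32) error {
	return o.WriteBytes([]byte{byte(i >> 24), byte(i >> 16), byte(i >> 8), byte(i)})
}
func (o *byteSliceOutput) WriteVInt(i int32) error { // 7-bit variable-length encoding
	u := uint32(i)
	for u >= 0x80 {
		if err := o.WriteByte(byte(u&0x7f) | 0x80); err != nil {
			return err
		}
		u >>= 7
	}
	return o.WriteByte(byte(u))
}
func (o *byteSliceOutput) WriteString(s string) error {
	if err := o.WriteVInt(int32(len(s))); err != nil {
		return err
	}
	return o.WriteBytes([]byte(s))
}

// byteSliceInput is a minimal in-memory DataInput (no bounds checking in this sketch).
type byteSliceInput struct {
	buf []byte
	pos int
}

func (in *byteSliceInput) ReadByte() (byte, error) {
	b := in.buf[in.pos]
	in.pos++
	return b, nil
}
func (in *byteSliceInput) ReadBytes(p []byte) error {
	copy(p, in.buf[in.pos:])
	in.pos += len(p)
	return nil
}

func lz4RoundTripSketch() error {
	data := []byte("abcabcabcabcabcabcabcabcabcabcabcabc") // repetitive input compresses well
	out := &byteSliceOutput{}
	ht := new(LZ4HashTable) // assumption: the zero value is ready to use
	if err := LZ4Compress(data, out, ht); err != nil {
		return err
	}

	// Leave a little slack: decompression may write slightly past decompressedLen.
	dest := make([]byte, len(data)+8)
	if _, err := LZ4Decompress(&byteSliceInput{buf: out.buf}, len(data), dest); err != nil {
		return err
	}
	fmt.Printf("round-tripped %d bytes: %q\n", len(data), dest[:len(data)])
	return nil
}

Because ht can be reused across calls (but must not be shared across goroutines), a long-lived writer would typically keep one hash table per goroutine.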

func LZ4Decompressor

func LZ4Decompressor(in DataInput, originalLength, offset, length int, bytes []byte) (res []byte, err error)

Types

type CompressingStoredFieldsFormat

type CompressingStoredFieldsFormat struct {
	// contains filtered or unexported fields
}

A StoredFieldsFormat that is very similar to Lucene40StoredFieldsFormat but compresses documents in chunks in order to improve the compression ratio.

For a chunk size of chunkSize bytes, this StoredFieldsFormat does not support documents larger than (2^31 - chunkSize) bytes. In case this is a problem, you should use another format, such as Lucene40StoredFieldsFormat.

For optimal performance, you should use a MergePolicy that returns segments that have the biggest byte size first.

func NewCompressingStoredFieldsFormat

func NewCompressingStoredFieldsFormat(formatName, segmentSuffix string,
	compressionMode CompressionMode, chunkSize int) *CompressingStoredFieldsFormat

Create a new CompressingStoredFieldsFormat

formatName is the name of the format. This name will be used in the file formats to perform CheckHeader().

segmentSuffix is the segment suffix. This suffix is added to the resulting file name only if it's not the empty string.

The compressionMode parameter allows you to choose between compression algorithms that have various compression and decompression speeds so that you can pick the one that best fits your indexing and searching throughput. You should never instantiate two CompressingStoredFieldsFormats that have the same name but different CompressionModes.

chunkSize is the minimum byte size of a chunk of documents. A value of 1 can make sense if there is redundancy across fields. In that case, both performance and compression ratio should be better than with Lucene40StoredFieldsFormat with compressed fields.

Higher values of chunkSize should improve the compression ratio but will require more memory at indexing time and might make document loading a little slower (depending on the size of your OS cache compared to the size of your index).
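
For example, a format instance built around the fast default mode might be created like this. This is only a sketch: "MyStoredFields" and the 16 KB chunk size are illustrative values, not values mandated by the package.

func newStoredFieldsFormatSketch() *CompressingStoredFieldsFormat {
	return NewCompressingStoredFieldsFormat(
		"MyStoredFields",      // formatName, later verified by CheckHeader()
		"",                    // segmentSuffix: empty, so nothing is appended to file names
		COMPRESSION_MODE_FAST, // CompressionMode
		1<<14,                 // chunkSize: roughly 16 KB of documents per compressed chunk
	)
}

The resulting format then hands out readers and writers through FieldsReader and FieldsWriter.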

func (*CompressingStoredFieldsFormat) FieldsReader

func (format *CompressingStoredFieldsFormat) FieldsReader(d store.Directory,
	si *model.SegmentInfo, fn model.FieldInfos, ctx store.IOContext) (r StoredFieldsReader, err error)

func (*CompressingStoredFieldsFormat) FieldsWriter

func (format *CompressingStoredFieldsFormat) FieldsWriter(d store.Directory,
	si *model.SegmentInfo, ctx store.IOContext) (w StoredFieldsWriter, err error)

func (*CompressingStoredFieldsFormat) String

func (format *CompressingStoredFieldsFormat) String() string

type CompressingStoredFieldsIndexReader

type CompressingStoredFieldsIndexReader struct {
	// contains filtered or unexported fields
}

Random-access reader for CompressingStoredFieldsIndexWriter

func (*CompressingStoredFieldsIndexReader) Clone

type CompressingStoredFieldsReader

type CompressingStoredFieldsReader struct {
	// contains filtered or unexported fields
}

StoredFieldsReader impl for CompressingStoredFieldsFormat

func (*CompressingStoredFieldsReader) Clone

func (r *CompressingStoredFieldsReader) Clone() StoredFieldsReader

func (*CompressingStoredFieldsReader) Close

func (r *CompressingStoredFieldsReader) Close() (err error)

Close the underlying IndexInputs

func (*CompressingStoredFieldsReader) VisitDocument

func (r *CompressingStoredFieldsReader) VisitDocument(docID int, visitor StoredFieldVisitor) error

type CompressingStoredFieldsWriter

type CompressingStoredFieldsWriter struct {
	// contains filtered or unexported fields
}

StoredFieldsWriter impl for CompressingStoredFieldsFormat

func NewCompressingStoredFieldsWriter

func NewCompressingStoredFieldsWriter(dir store.Directory, si *model.SegmentInfo,
	segmentSuffix string, ctx store.IOContext, formatName string,
	compressionMode CompressionMode, chunkSize int) (*CompressingStoredFieldsWriter, error)

func (*CompressingStoredFieldsWriter) Abort

func (w *CompressingStoredFieldsWriter) Abort()

func (*CompressingStoredFieldsWriter) Close

func (*CompressingStoredFieldsWriter) Finish

func (w *CompressingStoredFieldsWriter) Finish(fis model.FieldInfos, numDocs int) (err error)

func (*CompressingStoredFieldsWriter) FinishDocument

func (w *CompressingStoredFieldsWriter) FinishDocument() error

func (*CompressingStoredFieldsWriter) StartDocument

func (w *CompressingStoredFieldsWriter) StartDocument() error

func (*CompressingStoredFieldsWriter) WriteField

type CompressingTermVectorsFormat

type CompressingTermVectorsFormat struct {
	// contains filtered or unexported fields
}

A TermVectorsFormat that compresses chunks of documents together in order to improve the compression ratio.

func NewCompressingTermVectorsFormat

func NewCompressingTermVectorsFormat(formatName, segmentSuffix string,
	compressionMode CompressionMode, chunkSize int) *CompressingTermVectorsFormat

Create a new CompressingTermVectorsFormat

formatName is the name of the format. This name will be used in the file formats to perform codec header checks.

The compressionMode parameter allows you to choose between compression algorithms that have various compression and decompression speeds so that you can pick the one that best fits your indexing and searching throughput. You should never instantiate two CompressingTermVectorsFormats that have the same name but different CompressionModes.

chunkSize is the minimum byte size of a chunk of documents. Higher values of chunkSize should improve the compression ratio but will require more memory at indexing time and might make document loading a little slower (depending on the size of your OS cache compared to the size of your index).
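
Construction mirrors NewCompressingStoredFieldsFormat; as a sketch (the name and chunk size are again illustrative values):

func newTermVectorsFormatSketch() *CompressingTermVectorsFormat {
	return NewCompressingTermVectorsFormat(
		"MyTermVectors",       // formatName, used for codec header checks
		"",                    // segmentSuffix: empty, so no suffix is added
		COMPRESSION_MODE_FAST, // CompressionMode
		1<<12,                 // chunkSize in bytes
	)
}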

func (*CompressingTermVectorsFormat) VectorsReader

func (vf *CompressingTermVectorsFormat) VectorsReader(d store.Directory,
	segmentInfo *model.SegmentInfo, fieldsInfos model.FieldInfos,
	context store.IOContext) (spi.TermVectorsReader, error)

func (*CompressingTermVectorsFormat) VectorsWriter

func (vf *CompressingTermVectorsFormat) VectorsWriter(d store.Directory,
	segmentInfo *model.SegmentInfo,
	context store.IOContext) (spi.TermVectorsWriter, error)

type CompressionMode

type CompressionMode interface {
	NewCompressor() Compressor
	NewDecompressor() Decompressor
}

type CompressionModeDefaults

type CompressionModeDefaults int

func (CompressionModeDefaults) NewCompressor

func (m CompressionModeDefaults) NewCompressor() Compressor

func (CompressionModeDefaults) NewDecompressor

func (m CompressionModeDefaults) NewDecompressor() Decompressor

type Compressor

type Compressor func(bytes []byte, out DataOutput) error

Compress bytes into out. It is the responsibility of the compressor to add all necessary information so that a Decompressor will know when to stop decompressing bytes from the stream.

type DataInput

type DataInput interface {
	ReadByte() (b byte, err error)
	ReadBytes(buf []byte) error
}

type DataOutput

type DataOutput interface {
	WriteByte(b byte) error
	WriteBytes(buf []byte) error
	WriteInt(i int32) error
	WriteVInt(i int32) error
	WriteString(string) error
}

type Decompressor

type Decompressor func(in DataInput, originalLength, offset, length int, bytes []byte) (buf []byte, err error)

Decompress the bytes that were stored at [offset:offset+length] in the original stream, reading from the compressed stream 'in' into 'bytes'. The length of the returned slice (len(buf)) must be equal to 'length'. Implementations of this method are free to resize 'bytes' depending on their needs.
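
To show how CompressionMode, Compressor and Decompressor fit together, here is a sketch of a mode that stores bytes uncompressed. rawMode is an illustrative name, not a mode shipped with this package, and it relies only on the DataInput and DataOutput methods documented above.

// rawMode is a sketch of a CompressionMode that performs no compression.
type rawMode struct{}

var _ CompressionMode = rawMode{} // compile-time interface check

func (rawMode) NewCompressor() Compressor {
	return func(bytes []byte, out DataOutput) error {
		// Record the length first so the decompressor knows when to stop reading.
		if err := out.WriteVInt(int32(len(bytes))); err != nil {
			return err
		}
		return out.WriteBytes(bytes)
	}
}

func (rawMode) NewDecompressor() Decompressor {
	return func(in DataInput, originalLength, offset, length int, bytes []byte) ([]byte, error) {
		// Read back the VInt length written by the compressor above
		// (it equals originalLength for this no-compression mode).
		var n, shift uint
		for {
			b, err := in.ReadByte()
			if err != nil {
				return nil, err
			}
			n |= uint(b&0x7f) << shift
			if b < 0x80 {
				break
			}
			shift += 7
		}
		if cap(bytes) < int(n) {
			bytes = make([]byte, n)
		}
		bytes = bytes[:n]
		if err := in.ReadBytes(bytes); err != nil {
			return nil, err
		}
		// Return exactly the requested [offset : offset+length] window,
		// as the Decompressor contract requires (len(result) == length).
		return bytes[offset : offset+length], nil
	}
}

A real mode would typically pair something like LZ4Compress with LZ4Decompressor instead of copying the bytes verbatim; COMPRESSION_MODE_FAST is the package's built-in default.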

type GrowableByteArrayDataOutput

type GrowableByteArrayDataOutput struct {
	*util.DataOutputImpl
	// contains filtered or unexported fields
}

A DataOutput that can be used to build a []byte

func (*GrowableByteArrayDataOutput) WriteByte

func (out *GrowableByteArrayDataOutput) WriteByte(b byte) error

func (*GrowableByteArrayDataOutput) WriteBytes

func (out *GrowableByteArrayDataOutput) WriteBytes(b []byte) error

type LZ4HashTable

type LZ4HashTable struct {
	// contains filtered or unexported fields
}

type StoredFieldsIndexWriter

type StoredFieldsIndexWriter struct {
	// contains filtered or unexported fields
}

Efficient index format for block-based Codecs.

This writer generates a file which can be loaded into memory using memory-efficient data structures to quickly locate the block that contains any document.

In order to have a compact in-memory representation, for every block of 1024 chunks, this index computes the average number of bytes per chunk and, for every chunk, only stores the difference between ${chunk number} * ${average length of a chunk} and the actual start offset of the chunk.

Data is written as follows:

  • PackedIntsVersion, <Block>^BlockCount, BlocksEndMarker
  • PackedIntsVersion --> VERSION_CURRENT as a vint
  • BlocksEndMarker --> 0 as a vint, this marks the end of blocks since blocks are not allowed to start with 0
  • Block --> BlockChunks, <DocBases>, <StartPointers>
  • BlockChunks --> a vint which is the number of chunks encoded in the block
  • DocBases --> DocBase, AvgChunkDocs, BitsPerDocBaseDelta, DocBaseDeltas
  • DocBase --> first document ID of the block of chunks, as a vint
  • AvgChunkDocs --> average number of documents in a single chunk, as a vint
  • BitsPerDocBaseDelta --> number of bits required to represent a delta from the average using ZigZag encoding
  • DocBaseDeltas --> packed array of BlockChunks elements of BitsPerDocBaseDelta bits each, representing the deltas from the average doc base using ZigZag encoding
  • StartPointers --> StartPointerBase, AvgChunkSize, BitsPerStartPointerDelta, StartPointerDeltas
  • StartPointerBase --> the first start pointer of the block, as a vint64
  • AvgChunkSize --> the average size of a chunk of compressed documents, as a vint64
  • BitsPerStartPointerDelta --> number of bits required to represent a delta from the average using ZigZag encoding
  • StartPointerDeltas --> packed array of BlockChunks elements of BitsPerStartPointerDelta bits each, representing the deltas from the average start pointer using ZigZag encoding
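
The DocBaseDeltas and StartPointerDeltas above are signed differences from an average, which is why they are ZigZag-encoded before being packed. For reference, the standard ZigZag transform (not code taken from this package) looks like this:

// Standard ZigZag transform: small signed values map to small unsigned ones
// (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...), so they pack into few bits.
func zigZagEncode(v int64) uint64 { return uint64((v << 1) ^ (v >> 63)) }
func zigZagDecode(u uint64) int64 { return int64(u>>1) ^ -int64(u&1) }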

Notes

  • For any block, the doc base of the n-th chunk can be restored with DocBase + AvgChunkDocs * n + DocBaseDeltas[n].
  • For any block, the start pointer of the n-th chunk can be restored with StartPointerBase + AvgChunkSize * n + StartPointerDeltas[n].
  • Once data is loaded into memory, you can look up the start pointer of any document by performing two binary searches: a first one based on the values of DocBase in order to find the right block, and then inside the block based on DocBaseDeltas (by reconstructing the doc bases for every chunk).
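
As a sketch of that reconstruction and of the in-block lookup (the struct and field names below are illustrative, not the package's internal representation; the outer binary search over blocks works the same way and is omitted):

// indexBlock mirrors the per-block values described above, after decoding.
type indexBlock struct {
	docBase          int     // first doc ID of the block
	avgChunkDocs     int     // average number of documents per chunk
	docBaseDeltas    []int   // ZigZag-decoded deltas, one per chunk
	startPointerBase int64   // first start pointer of the block
	avgChunkSize     int64   // average size of a compressed chunk
	startPtrDeltas   []int64 // ZigZag-decoded start-pointer deltas, one per chunk
}

// chunkDocBase restores the doc base of the n-th chunk of the block.
func (b *indexBlock) chunkDocBase(n int) int {
	return b.docBase + b.avgChunkDocs*n + b.docBaseDeltas[n]
}

// chunkStartPointer restores the file offset of the n-th chunk of the block.
func (b *indexBlock) chunkStartPointer(n int) int64 {
	return b.startPointerBase + b.avgChunkSize*int64(n) + b.startPtrDeltas[n]
}

// startPointerOf locates the chunk containing docID inside a block whose
// doc-base range is already known to cover docID.
func (b *indexBlock) startPointerOf(docID int) int64 {
	// Binary search for the last chunk whose doc base is <= docID.
	lo, hi := 0, len(b.docBaseDeltas)-1
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if b.chunkDocBase(mid) <= docID {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return b.chunkStartPointer(lo)
}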

func NewStoredFieldsIndexWriter

func NewStoredFieldsIndexWriter(indexOutput store.IndexOutput) (*StoredFieldsIndexWriter, error)

func (*StoredFieldsIndexWriter) Close

func (w *StoredFieldsIndexWriter) Close() error
