fse

package
v0.0.0-...-8e4d6a8
Published: Oct 19, 2024 License: Apache-2.0, BSD-3-Clause, MIT

README

Finite State Entropy

This package provides Finite State Entropy encoding and decoding.

Finite State Entropy (also referred to as tANS) encoding provides fast, near-optimal symbol encoding/decoding for byte blocks, as implemented in zstandard.

This can be used to compress input with many similar values into the smallest possible number of bytes. It does not perform any multi-byte dictionary coding like LZ coders, but it can be used as a secondary step for compressors (like Snappy) that do not do entropy encoding.

News

  • Feb 2018: First implementation released. Consider this beta software for now.

Usage

This package provides a low-level interface that allows you to compress single, independent blocks.

Each block is separate, and there are no built-in integrity checks. This means that the caller should keep track of block sizes, and also do checksums if needed.
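
For example, a caller might frame each block with its compressed length and a checksum of the original data before writing it out. A minimal sketch, assuming the package is imported as fse and that encoding/binary, hash/crc32 and io are imported; the framing layout below is chosen for illustration and is not defined by this package:

// writeBlock is a hypothetical helper that frames one block as
// [4-byte compressed length][4-byte CRC32 of the original data][compressed bytes].
func writeBlock(w io.Writer, block []byte, s *fse.Scratch) error {
	sum := crc32.ChecksumIEEE(block) // checksum of the uncompressed data
	comp, err := fse.Compress(block, s)
	if err != nil {
		return err // ErrIncompressible and ErrUseRLE need their own handling, see below
	}
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[:4], uint32(len(comp)))
	binary.LittleEndian.PutUint32(hdr[4:], sum)
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err = w.Write(comp)
	return err
}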

Compressing a block is done via the Compress function. You provide the input and receive the output and possibly an error.

These error values can be returned:

Error              Description
<nil>              Everything ok, output is returned
ErrIncompressible  Returned when input is judged to be too hard to compress
ErrUseRLE          Returned from the compressor when the input is a single byte value repeated
(error)            An internal error occurred.

As can be seen above, some of these errors are returned even under normal operation, so it is important to handle them.
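
A minimal sketch of handling these values; storeCompressed, storeRaw and storeRLE are hypothetical caller-side helpers, not part of this package:

out, err := fse.Compress(block, &s)
switch err {
case nil:
	// Compression succeeded; store the compressed block.
	storeCompressed(out)
case fse.ErrIncompressible:
	// Not worth compressing; store the block as-is.
	storeRaw(block)
case fse.ErrUseRLE:
	// The block is a single byte value repeated; store it run-length encoded.
	storeRLE(block[0], len(block))
default:
	// An internal error occurred.
	panic(err)
}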

To reduce allocations you can provide a Scratch object that can be re-used for successive calls. Both compression and decompression accept a Scratch object, and the same object can be used for both.

Be aware that when re-using a Scratch object, the output buffer is also re-used, so if you are still using the previous output you must set the Out field in the scratch to nil. The same buffer is used for compression and decompression output.
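
A minimal sketch; blockA and blockB are placeholders for two independent inputs:

var s fse.Scratch

a, err := fse.Compress(blockA, &s)
if err != nil {
	panic(err) // handle ErrIncompressible/ErrUseRLE as above in real code
}

// `a` aliases s.Out. Since we still need it, detach the shared buffer
// before re-using the Scratch; otherwise the next call would overwrite it.
s.Out = nil

b, err := fse.Compress(blockB, &s)
if err != nil {
	panic(err)
}
fmt.Println(len(a), len(b))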

Decompressing is done by calling the Decompress function. You must provide the output from the compression stage, at exactly the size you got back. If you receive an error, your input was likely corrupted.

It is important to note that a successful decoding does not mean your output matches your original input. There are no integrity checks, so relying on errors from the decompressor does not assure your data is valid.

For more detailed usage, see examples in the godoc documentation.

Performance

Many factors affect speed. Block sizes and the compressibility of the material are the primary factors.
All compression functions currently run only on the calling goroutine, so only one core will be used per block.
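
To use more cores, the caller can compress independent blocks concurrently. A minimal sketch, assuming blocks holds the input blocks, sync is imported, and each goroutine gets its own Scratch (a Scratch holds per-call state and should not be shared between concurrent calls):

results := make([][]byte, len(blocks))
var wg sync.WaitGroup
for i, block := range blocks {
	wg.Add(1)
	go func(i int, block []byte) {
		defer wg.Done()
		var s fse.Scratch // one Scratch per goroutine
		out, err := fse.Compress(block, &s)
		if err != nil {
			return // handle ErrIncompressible/ErrUseRLE as above in real code
		}
		results[i] = out
	}(i, block)
}
wg.Wait()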

The compressor is significantly faster if symbols are kept as small as possible. The highest byte value of the input is used to reduce some of the processing, so if all your input is above byte value 64, for instance, it may be beneficial to shift all your input values down by 64.
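
A minimal sketch, assuming every byte of in is known to be at least 64:

// Shift the input down so the highest symbol value the compressor sees is smaller.
shifted := make([]byte, len(in))
for i, v := range in {
	shifted[i] = v - 64
}
comp, err := fse.Compress(shifted, &s)
if err != nil {
	panic(err)
}
fmt.Println(len(in), "->", len(comp))
// After decompressing, add 64 back to every byte to restore the original values.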

With moderate block sizes around 64k, speeds are typically around 200MB/s per core for compression and around 300MB/s for decompression.

The same hardware typically does Huffman (deflate) encoding at 125MB/s and decompression at 100MB/s.

Plans

At some point, more internals will be exposed to facilitate more "expert" usage of the components.

A streaming interface is also likely to be implemented, likely compatible with the FSE stream format.

Contributing

Contributions are always welcome. Be aware that adding public functions will require good justification, and breaking changes will likely not be accepted. If in doubt, open an issue before writing the PR.

Documentation

Overview

Package fse provides Finite State Entropy encoding and decoding.

Finite State Entropy encoding provides a fast near-optimal symbol encoding/decoding for byte blocks as implemented in zstd.

See https://github.com/klauspost/compress/tree/master/fse for more information.

Constants

This section is empty.

Variables

var (
	// ErrIncompressible is returned when input is judged to be too hard to compress.
	ErrIncompressible = errors.New("input is not compressible")

	// ErrUseRLE is returned from the compressor when the input is a single byte value repeated.
	ErrUseRLE = errors.New("input is single value repeated")
)

Functions

func Compress

func Compress(in []byte, s *Scratch) ([]byte, error)

Compress the input bytes. Input must be < 2GB. Provide a Scratch buffer to avoid memory allocations. Note that the output is also kept in the scratch buffer. If the input is too hard to compress, ErrIncompressible is returned. If the input is a single byte value repeated, ErrUseRLE is returned.

Example
// Read data
data, err := os.ReadFile("../testdata/e.txt")
if err != nil {
	panic(err)
}

// Create re-usable scratch buffer.
var s Scratch
b, err := Compress(data, &s)
if err != nil {
	panic(err)
}
fmt.Printf("Compress: %d -> %d bytes (%.2f:1)\n", len(data), len(b), float64(len(data))/float64(len(b)))
Output:

Compress: 100003 -> 41564 bytes (2.41:1)

func Decompress

func Decompress(b []byte, s *Scratch) ([]byte, error)

Decompress a block of data. You can provide a scratch buffer to avoid allocations. If nil is provided, a temporary one will be allocated. It is possible, but by no means guaranteed, that corrupt data will return an error. It is up to the caller to verify the integrity of the returned data. Use a predefined Scratch to set the maximum acceptable output size.

Example
// Read data
data, err := os.ReadFile("../testdata/e.txt")
if err != nil {
	panic(err)
}

// Create re-usable scratch buffer.
var s Scratch
b, err := Compress(data, &s)
if err != nil {
	panic(err)
}

// We still use the compressed output (which lives in s.Out), so provide a fresh output buffer for decompression.
s.Out = make([]byte, 0, len(data))
d, err := Decompress(b, &s)
if err != nil {
	panic(err)
}
fmt.Printf("Input matches: %t\n", bytes.Equal(d, data))
Output:

Input matches: true

Types

type Scratch

type Scratch struct {

	// Out is output buffer.
	// If the scratch is re-used before the caller is done processing the output,
	// set this field to nil.
	// Otherwise the output buffer will be re-used for next Compression/Decompression step
	// and allocation will be avoided.
	Out []byte

	// DecompressLimit limits the maximum decoded size acceptable.
	// If > 0 decompression will stop when approximately this many bytes
	// have been decoded.
	// If 0, maximum size will be 2GB.
	DecompressLimit int

	// MaxSymbolValue will override the maximum symbol value of the next block.
	MaxSymbolValue uint8

	// TableLog will attempt to override the tablelog for the next block.
	TableLog uint8
	// contains filtered or unexported fields
}

Scratch provides temporary storage for compression and decompression.

func (*Scratch) Histogram

func (s *Scratch) Histogram() []uint32

Histogram allows the caller to populate the histogram and skip that step during compression. It also allows inspecting the histogram when compression is done. To indicate that you have populated the histogram, call HistogramFinished with the value of the highest populated symbol, as well as the number of entries in the most populated entry. These are accepted at face value. The returned slice will always be length 256.

func (*Scratch) HistogramFinished

func (s *Scratch) HistogramFinished(maxSymbol uint8, maxCount int)

HistogramFinished can be called to indicate that the histogram has been populated. maxSymbol is the index of the highest set symbol of the next data segment. maxCount is the number of entries in the most populated entry. These are accepted at face value.
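
A minimal sketch of the workflow described above; data is a placeholder for the input block:

var s Scratch

// Build the histogram ourselves so Compress can skip that step.
hist := s.Histogram() // always length 256
for _, b := range data {
	hist[b]++
}

// Find the highest populated symbol and the largest single count.
var maxSym uint8
maxCount := 0
for i, c := range hist {
	if c == 0 {
		continue
	}
	maxSym = uint8(i)
	if int(c) > maxCount {
		maxCount = int(c)
	}
}
s.HistogramFinished(maxSym, maxCount)

out, err := Compress(data, &s)
if err != nil {
	panic(err)
}
fmt.Println(len(data), "->", len(out))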
