bm

package
v0.0.0-...-bb1a4e7
Published: May 9, 2020 License: GPL-3.0 Imports: 4 Imported by: 0

README

Bentley–McIlroy data compression algorithm.

Reference implementation: https://github.com/cloudflare/bm

Implementation of Bentley-McIlroy's data compression scheme to encode compressed versions of strings, and compute deltas between source and target.

About Bentley-McIlroy

The Bentley-McIlroy compression scheme is an algorithm for compressing a string by finding long common substrings. The algorithm and its properties are described in greater detail in their 1999 paper. The technique, with a source dictionary and a target string, is used in Google's implementation of a VCDIFF encoder, open-vcdiff, as part of encoding deltas.

To give a brief summary, the algorithm works by fixing a window of block size b and then sliding over the string, storing the fingerprint of every b-th window. These stored fingerprints are then used to detect repetitions later on in the string.

The algorithm in pseudocode, as given in the paper, is:

initialize fp
for (i = b; i < n; i++)
  if (i % b == 0)
    store(fp, i)
  update fp to include a[i] and exclude a[i-b]
  checkformatch(fp, i)

In the algorithm above, checkformatch(fp, i) looks up the fingerprint fp in a hash table and then encodes a match if one is found.

checkformatch(fp, i) is the core piece of this algorithm, and "encodes a match" is not fully described in the paper. The rest of the algorithm simply describes moving through the string with a sliding window, looking at substrings and storing fingerprints whenever we cross a block boundary.
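
For concreteness, the loop can be sketched in Go using a simple polynomial rolling hash. This is a minimal illustration, not this library's internal code: the multiplier d, the hash itself, and the way matches are reported are assumptions, and a real implementation would verify each candidate match byte-by-byte, since equal fingerprints can be collisions.

package main

import "fmt"

// d is an assumed rolling-hash multiplier for this sketch.
const d = 257

// scan slides a window of b bytes over a, storing the fingerprint of every
// b-th window and reporting windows whose fingerprint matches a stored block.
func scan(a []byte, b int) {
	n := len(a)
	if n < b {
		return
	}
	stored := make(map[uint32]int) // fingerprint -> start index of stored block

	// pow = d^(b-1), the coefficient of the byte leaving the window.
	var pow uint32 = 1
	for k := 0; k < b-1; k++ {
		pow *= d
	}

	// initialize fp over the first window a[0:b]
	var fp uint32
	for k := 0; k < b; k++ {
		fp = fp*d + uint32(a[k])
	}

	for i := b; i < n; i++ {
		if i%b == 0 {
			stored[fp] = i - b // fp currently covers the block a[i-b:i]
		}
		// update fp to include a[i] and exclude a[i-b]
		fp = (fp-uint32(a[i-b])*pow)*d + uint32(a[i])
		// checkformatch: the window a[i-b+1:i+1] may repeat a stored block
		if start, ok := stored[fp]; ok {
			fmt.Printf("window %d..%d may match block at %d\n", i-b+1, i, start)
		}
	}
}

func main() {
	// the same string used in the worked example below
	scan([]byte("xabcdabcdy"), 2)
}

Run on "xabcdabcdy" with b = 2, this reports a single candidate: the window at 6..7 against the stored block at 2, which is the <2, 2> match discussed below.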

As described in the paper, suppose b = 100 and that the current block matches block 56 (i.e., bytes 5600 through to 5699). This current block could then be encoded as <5600,100>.

There are two similar improvements that can be made, both also described in the paper, to prevent "ababab" from compressing into "ab<0,2><0,2>". When we know that the current block matches block 56, we can extend the match as far back as possible, not exceeding b - 1 bytes. Similarly, we can extend the match as far forward as possible, without limitation.

The reason there is a limit of b-1 bytes when moving backwards is that if there were more to match beyond b-1 bytes, it would've been found in a previous iteration of the loop.

This library implementation moves matches forward, but does not move matches backwards.

To be more explicit about what extending the match means, consider

xabcdabcdy  (the string)
0123456789  (indices)

with a block size of b = 2. Moving left to right, the fingerprints of "xa", "ab", "bc", ..., are computed, but only "xa", "bc", "da", ... are stored. When "ab" is seen at 5..6, there is no corresponding entry in the hash table, so nothing is done, yet. On the next substring of length 2, "bc", at positions 6..7, there is a corresponding entry in the hash table, so there's a match, which we could encode as <2, 2>, say. However, we'd like to actually produce <1, 4>, which is more efficient. So starting with <2, 2>, we move the match back 1 character for both the "bc" at 6..7 and the "bc" at 2..3, then check if 1..3 matches 5..7, which it does. This is moving the match backwards.

For moving the match forwards, simply do the same thing. Check if 1..4 matches 5..8, which it does. 1..5 does not match 5..9, so we use <1, 4> and we're done.

The resulting string, with backward and forward extension, is xabcd<1, 4>y. With forward extension only (as in this library), it is xabcda<2, 3>y.
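
The extension step itself can be sketched in Go. The function extend below, its name, and its exact bounds checks are illustrative assumptions rather than this library's code (which, as noted above, extends matches forwards only).

// extend widens a tentative match of length b between a source block
// starting at src and a target block starting at tgt (src < tgt) in a.
// It moves the match backwards by at most b-1 bytes and then forwards as
// far as the data allows, returning the final source offset, target
// offset and length. Illustrative sketch only.
func extend(a []byte, src, tgt, b int) (srcOff, tgtOff, length int) {
	srcOff, tgtOff, length = src, tgt, b

	// move the match backwards, by at most b-1 bytes
	for length-b < b-1 && srcOff > 0 && tgtOff > 0 && a[srcOff-1] == a[tgtOff-1] {
		srcOff--
		tgtOff--
		length++
	}

	// move the match forwards, limited only by the end of the data
	for tgtOff+length < len(a) && a[srcOff+length] == a[tgtOff+length] {
		length++
	}
	return
}

On the example above, extend([]byte("xabcdabcdy"), 2, 6, 2) returns (1, 5, 4): the source offset 1 and length 4 of the <1, 4> copy.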

References

J. Bentley and D. McIlroy, "Data Compression Using Long Common Strings," Proceedings of the IEEE Data Compression Conference (DCC), 1999.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DeserializeDictionary

func DeserializeDictionary(o []byte, m map[uint32]uint32) error

DeserializeDictionary reads the H part of the Dictionary from a []byte previously created with SerializeDictionary.

Types

type Compressor

type Compressor struct {
	// contains filtered or unexported fields
}

A Compressor is a complete instance of the compressor.

func NewCompressor

func NewCompressor() *Compressor

NewCompressor creates a new compressor. The Compressor implements io.Writer, so calling Write() compresses and writes to the actual output. Note that you must call SetWriter (to set the output writer) and SetDictionary before doing any compression.
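
A minimal usage sketch, with error handling abbreviated. The import path github.com/cloudflare/bm is an assumption taken from the reference-implementation link above, and the *Dictionary is assumed to come from elsewhere (for example GetDictionary on an earlier Compressor, or DeserializeDictionary).

// compress runs one Write/Close cycle against dict and returns the
// compressed bytes. It follows the documented call order: SetWriter and
// SetDictionary before Write, then Close to perform the compression.
// Assumed imports: bytes, fmt, github.com/cloudflare/bm.
func compress(dict *bm.Dictionary, data []byte) ([]byte, error) {
	var out bytes.Buffer
	c := bm.NewCompressor()
	c.SetWriter(&out)     // output writer must be set before compressing
	c.SetDictionary(dict) // matches will reference the dictionary
	if _, err := c.Write(data); err != nil {
		return nil, err
	}
	if err := c.Close(); err != nil { // the actual compression happens here
		return nil, err
	}
	fmt.Printf("in=%d bytes, out=%d bytes, ratio=%d%%\n",
		c.InputSize(), c.CompressedSize(), c.Ratio())
	return out.Bytes(), nil
}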

func (*Compressor) Close

func (c *Compressor) Close() error

Close tells the compressor that all the data has been written. This does not close the underlying io.Writer. This is where the Bentley/McIlroy and Rabin/Karp algorithms are implemented. Reference those papers for a full explanation.

func (*Compressor) CompressedSize

func (c *Compressor) CompressedSize() int

CompressedSize returns the size in bytes of the last compressed output. It only makes sense after Close() has been called.

func (*Compressor) GetDictionary

func (c *Compressor) GetDictionary() *Dictionary

GetDictionary retrieves the dictionary structure for serialization.

func (*Compressor) InputSize

func (c *Compressor) InputSize() int

InputSize returns the size in bytes of the last uncompressed input. It only makes sense after Close() has been called.

func (*Compressor) Ratio

func (c *Compressor) Ratio() int

Ratio retrieves the compression ratio of the last compression performed. It only makes sense after Close() has been called. The returned value is the size of the output as an integer percentage of the input (output size * 100 / input size); for example, 100 bytes of input compressed to 20 bytes gives a ratio of 20. A return value of -1 indicates that there was no input.

func (*Compressor) SerializeDictionary

func (c *Compressor) SerializeDictionary() ([]byte, error)

SerializeDictionary turns H (the map part of the Dictionary) into a []byte for easy storage in memcached or elsewhere.

func (*Compressor) SetDictionary

func (c *Compressor) SetDictionary(dict *Dictionary)

SetDictionary sets a dictionary. When a dictionary has been loaded, matches are encoded as references into the dictionary (rather than into the compressed data itself).
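
A hedged sketch of persisting a dictionary's hash table and rebuilding it later with DeserializeDictionary. The workflow shown (store H separately from the raw dictionary bytes, reassemble a Dictionary, then hand it to SetDictionary on a fresh Compressor) is an assumption built only from the documented signatures; the import path is the same assumption as in the earlier sketch.

// saveAndRestore serializes the hash table of c's current dictionary and
// then rebuilds an equivalent Dictionary from the blob plus the original
// dictionary bytes, as a later process might do.
func saveAndRestore(c *bm.Compressor, dictBytes []byte) (*bm.Dictionary, error) {
	// Serialize H; the raw bytes in dictBytes must be stored by the caller.
	blob, err := c.SerializeDictionary()
	if err != nil {
		return nil, err
	}

	// Later: rebuild the Dictionary from the stored blob and raw bytes.
	restored := &bm.Dictionary{
		Dict: dictBytes,
		H:    make(map[uint32]uint32),
	}
	if err := bm.DeserializeDictionary(blob, restored.H); err != nil {
		return nil, err
	}
	// restored can now be passed to SetDictionary on a new Compressor.
	return restored, nil
}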

func (*Compressor) SetWriter

func (c *Compressor) SetWriter(w io.Writer)

SetWriter sets the writer to which the compressed output will be written. This must be called before compressing; otherwise an error will occur.

func (*Compressor) Write

func (c *Compressor) Write(p []byte) (int, error)

Write implements the io.Writer interface. Call Write repeatedly to supply the data to be compressed. When done it is necessary to call Close(), which is where the actual compression occurs.

type Dictionary

type Dictionary struct {
	Dict []byte // Bytes to compress against
	H    map[uint32]uint32
}

A Dictionary contains both the raw data being compressed against and the hash table built using the Rabin/Karp procedure.

type Expander

type Expander struct {
	// contains filtered or unexported fields
}

An Expander is the complete state of the expander returned by NewExpander.

func NewExpander

func NewExpander(r io.Reader, dict []byte) *Expander

NewExpander creates a new decompressor. Pass in an io.Reader that can be used to read the raw compressed data. The Expander implements io.Reader, so calling Read() decompresses data read from the underlying input.

func (*Expander) Expand

func (e *Expander) Expand(p []byte) (q []byte, err error)

Expand expands the compressed data into a buffer.
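
A decompression sketch, using the same assumed import path. The role of Expand's p argument is not spelled out above, so it is passed as nil here on the assumption that it is an optional buffer; the returned q is the expanded output.

// decompress reads the raw compressed data from r and expands it against
// the dictionary bytes dict. Illustrative sketch only; assumed imports:
// io, github.com/cloudflare/bm.
func decompress(r io.Reader, dict []byte) ([]byte, error) {
	e := bm.NewExpander(r, dict)
	return e.Expand(nil) // nil: assumed to mean "allocate a buffer for me"
}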
