bm

package
v0.0.0-...-bb1a4e7
Published: May 9, 2020 License: GPL-3.0 Imports: 4 Imported by: 0

README

Bentley–McIlroy data compression algorithm.

Reference implementation: https://github.com/cloudflare/bm

Implementation of Bentley-McIlroy's data compression scheme to encode compressed versions of strings, and compute deltas between source and target.

About Bentley-McIlroy

The Bentley-McIlroy compression scheme is an algorithm for compressing a string by finding long common substrings. The algorithm and its properties are described in greater detail in their 1999 paper. The technique, with a source dictionary and a target string, is used in Google's implementation of a VCDIFF encoder, open-vcdiff, as part of encoding deltas.

To give a brief summary, the algorithm works by fixing a window of block size b and then sliding over the string, storing the fingerprint of every b-th window. These stored fingerprints are then used to detect repetitions later on in the string.

The algorithm in pseudocode, as given in the paper, is:

initialize fp
for (i = b; i < n; i++)
  if (i % b == 0)
    store(fp, i)
  update fp to include a[i] and exclude a[i-b]
  checkformatch(fp, i)

In the algorithm above, checkformatch(fp, i) looks up the fingerprint fp in a hash table and then encodes a match if one is found.

checkformatch(fp, i) is the core piece of this algorithm, and "encodes a match" is not fully described in the paper. The rest of the algorithm simply describes moving through the string with a sliding window, looking at substrings and storing fingerprints whenever we cross a block boundary.
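
For concreteness, the loop can be sketched in Go using a simple polynomial rolling hash. This is a minimal illustration, not this library's internal code: the multiplier d, the hash itself, and the way matches are reported are assumptions, and a real implementation would verify each candidate match byte-by-byte, since equal fingerprints can be collisions.

package main

import "fmt"

// d is an assumed rolling-hash multiplier for this sketch.
const d = 257

// scan slides a window of b bytes over a, storing the fingerprint of every
// b-th window and reporting windows whose fingerprint matches a stored block.
func scan(a []byte, b int) {
	n := len(a)
	if n < b {
		return
	}
	stored := make(map[uint32]int) // fingerprint -> start index of stored block

	// pow = d^(b-1), the coefficient of the byte leaving the window.
	var pow uint32 = 1
	for k := 0; k < b-1; k++ {
		pow *= d
	}

	// initialize fp over the first window a[0:b]
	var fp uint32
	for k := 0; k < b; k++ {
		fp = fp*d + uint32(a[k])
	}

	for i := b; i < n; i++ {
		if i%b == 0 {
			stored[fp] = i - b // fp currently covers the block a[i-b:i]
		}
		// update fp to include a[i] and exclude a[i-b]
		fp = (fp-uint32(a[i-b])*pow)*d + uint32(a[i])
		// checkformatch: the window a[i-b+1:i+1] may repeat a stored block
		if start, ok := stored[fp]; ok {
			fmt.Printf("window %d..%d may match block at %d\n", i-b+1, i, start)
		}
	}
}

func main() {
	// the same string used in the worked example below
	scan([]byte("xabcdabcdy"), 2)
}

Run on "xabcdabcdy" with b = 2, this reports a single candidate: the window at 6..7 against the stored block at 2, which is the <2, 2> match discussed below.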

As described in the paper, suppose b = 100 and that the current block matches block 56 (i.e., bytes 5600 through to 5699). This current block could then be encoded as <5600,100>.

There are two similar improvements that can be made, both also described in the paper, to prevent "ababab" from compressing into "ab<0,2><0,2>". When we know that the current block matches block 56, we can extend the match as far back as possible, not exceeding b - 1 bytes. Similarly, we can extend the match as far forward as possible, without limitation.

The reason there is a limit of b-1 bytes when moving backwards is that if there were more to match beyond b-1 bytes, it would've been found in a previous iteration of the loop.

This library implementation moves matches forward, but does not move matches backwards.

To be more explicit about what extending the match means, consider

xabcdabcdy  (the string)
0123456789  (indices)

with a block size of b = 2. Moving left to right, the fingerprints of "xa", "ab", "bc", ..., are computed, but only "xa", "bc", "da", ... are stored. When "ab" is seen at 5..6, there is no corresponding entry in the hash table, so nothing is done, yet. On the next substring of length 2, "bc", at positions 6..7, there is a corresponding entry in the hash table, so there's a match, which we could encode as <2, 2>, say. However, we'd like to actually produce <1, 4>, which is more efficient. So starting with <2, 2>, we move the match back 1 character for both the "bc" at 6..7 and the "bc" at 2..3, then check if 1..3 matches 5..7, which it does. This is moving the match backwards.

For moving the match forwards, simply do the same thing. Check if 1..4 matches 5..8, which it does. 1..5 does not match 5..9, so we use <1, 4> and we're done.

The resulting string, with backward and forward extension, is xabcd<1, 4>y. With forward extension only (as in this library), it is xabcda<2, 3>y.
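
The extension step itself can be sketched in Go. The function extend below, its name, and its exact bounds checks are illustrative assumptions rather than this library's code (which, as noted above, extends matches forwards only).

// extend widens a tentative match of length b between a source block
// starting at src and a target block starting at tgt (src < tgt) in a.
// It moves the match backwards by at most b-1 bytes and then forwards as
// far as the data allows, returning the final source offset, target
// offset and length. Illustrative sketch only.
func extend(a []byte, src, tgt, b int) (srcOff, tgtOff, length int) {
	srcOff, tgtOff, length = src, tgt, b

	// move the match backwards, by at most b-1 bytes
	for length-b < b-1 && srcOff > 0 && tgtOff > 0 && a[srcOff-1] == a[tgtOff-1] {
		srcOff--
		tgtOff--
		length++
	}

	// move the match forwards, limited only by the end of the data
	for tgtOff+length < len(a) && a[srcOff+length] == a[tgtOff+length] {
		length++
	}
	return
}

On the example above, extend([]byte("xabcdabcdy"), 2, 6, 2) returns (1, 5, 4): the source offset 1 and length 4 of the <1, 4> copy.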

References

J. Bentley and D. McIlroy, "Data Compression Using Long Common Strings," Proceedings of the IEEE Data Compression Conference (DCC), 1999.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DeserializeDictionary

func DeserializeDictionary(o []byte, m map[uint32]uint32) error

DeserializeDictionary reads the H part of the Dictionary from a []byte previously created with SerializeDictionary.

Types

type Compressor

type Compressor struct {
	// contains filtered or unexported fields
}

A Compressor is a complete instance of the compressor.

func NewCompressor

func NewCompressor() *Compressor

NewCompressor creates a new compressor. The Compressor implements io.Writer, so calling Write() compresses and writes to the actual output. Note that you must call SetWriter (to set the output writer) and SetDictionary before doing any compression.
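
A minimal usage sketch, with error handling abbreviated. The import path github.com/cloudflare/bm is an assumption taken from the reference-implementation link above, and the *Dictionary is assumed to come from elsewhere (for example GetDictionary on an earlier Compressor, or DeserializeDictionary).

// compress runs one Write/Close cycle against dict and returns the
// compressed bytes. It follows the documented call order: SetWriter and
// SetDictionary before Write, then Close to perform the compression.
// Assumed imports: bytes, fmt, github.com/cloudflare/bm.
func compress(dict *bm.Dictionary, data []byte) ([]byte, error) {
	var out bytes.Buffer
	c := bm.NewCompressor()
	c.SetWriter(&out)     // output writer must be set before compressing
	c.SetDictionary(dict) // matches will reference the dictionary
	if _, err := c.Write(data); err != nil {
		return nil, err
	}
	if err := c.Close(); err != nil { // the actual compression happens here
		return nil, err
	}
	fmt.Printf("in=%d bytes, out=%d bytes, ratio=%d%%\n",
		c.InputSize(), c.CompressedSize(), c.Ratio())
	return out.Bytes(), nil
}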

func (*Compressor) Close

func (c *Compressor) Close() error

Close tells the compressor that all the data has been written. This does not close the underlying io.Writer. This is where the Bentley/McIlroy and Rabin/Karp algorithms are implemented. Reference those papers for a full explanation.

func (*Compressor) CompressedSize

func (c *Compressor) CompressedSize() int

CompressedSize returns the size in bytes of the last compressed output. It only makes sense after Close() has been called.

func (*Compressor) GetDictionary

func (c *Compressor) GetDictionary() *Dictionary

GetDictionary retrieves the dictionary structure for serialization.

func (*Compressor) InputSize

func (c *Compressor) InputSize() int

InputSize returns the size in bytes of the last uncompressed input. It only makes sense after Close() has been called.

func (*Compressor) Ratio

func (c *Compressor) Ratio() int

Ratio retrieves the compression ratio of the last compression performed. It only makes sense after Close() has been called. The returned value is the size of the output as an integer percentage of the input (output size * 100 / input size); for example, 100 bytes of input compressed to 20 bytes gives a ratio of 20. A return value of -1 indicates that there was no input.

func (*Compressor) SerializeDictionary

func (c *Compressor) SerializeDictionary() ([]byte, error)

SerializeDictionary turns H (the map part of the Dictionary) into a []byte for easy storage in memcached or elsewhere.

func (*Compressor) SetDictionary

func (c *Compressor) SetDictionary(dict *Dictionary)

SetDictionary sets a dictionary. When a dictionary has been loaded, matches are encoded as references into the dictionary (rather than into the compressed data itself).
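
A hedged sketch of persisting a dictionary's hash table and rebuilding it later with DeserializeDictionary. The workflow shown (store H separately from the raw dictionary bytes, reassemble a Dictionary, then hand it to SetDictionary on a fresh Compressor) is an assumption built only from the documented signatures; the import path is the same assumption as in the earlier sketch.

// saveAndRestore serializes the hash table of c's current dictionary and
// then rebuilds an equivalent Dictionary from the blob plus the original
// dictionary bytes, as a later process might do.
func saveAndRestore(c *bm.Compressor, dictBytes []byte) (*bm.Dictionary, error) {
	// Serialize H; the raw bytes in dictBytes must be stored by the caller.
	blob, err := c.SerializeDictionary()
	if err != nil {
		return nil, err
	}

	// Later: rebuild the Dictionary from the stored blob and raw bytes.
	restored := &bm.Dictionary{
		Dict: dictBytes,
		H:    make(map[uint32]uint32),
	}
	if err := bm.DeserializeDictionary(blob, restored.H); err != nil {
		return nil, err
	}
	// restored can now be passed to SetDictionary on a new Compressor.
	return restored, nil
}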

func (*Compressor) SetWriter

func (c *Compressor) SetWriter(w io.Writer)

SetWriter sets the writer to which the compressed output will be written. This must be called before compressing; otherwise an error will occur.

func (*Compressor) Write

func (c *Compressor) Write(p []byte) (int, error)

Write implements the io.Writer interface. Call Write repeatedly to supply the data to be compressed. When done it is necessary to call Close(), which is where the actual compression occurs.

type Dictionary

type Dictionary struct {
	Dict []byte // Bytes to compress against
	H    map[uint32]uint32
}

A Dictionary contains both the raw data being compressed against and the hash table built using the Rabin/Karp procedure.

type Expander

type Expander struct {
	// contains filtered or unexported fields
}

An Expander is the complete state of the expander returned by NewExpander.

func NewExpander

func NewExpander(r io.Reader, dict []byte) *Expander

NewExpander creates a new decompressor. Pass in an io.Reader that can be used to read the raw compressed data. The Expander implements io.Reader, so calling Read() decompresses data read from the underlying input.

func (*Expander) Expand

func (e *Expander) Expand(p []byte) (q []byte, err error)

Expand expands the compressed data into a buffer.
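
A decompression sketch, using the same assumed import path. The role of Expand's p argument is not spelled out above, so it is passed as nil here on the assumption that it is an optional buffer; the returned q is the expanded output.

// decompress reads the raw compressed data from r and expands it against
// the dictionary bytes dict. Illustrative sketch only; assumed imports:
// io, github.com/cloudflare/bm.
func decompress(r io.Reader, dict []byte) ([]byte, error) {
	e := bm.NewExpander(r, dict)
	return e.Expand(nil) // nil: assumed to mean "allocate a buffer for me"
}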
