sloppy

package
v1.0.5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 22, 2024 License: Apache-2.0 Imports: 1 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Sloppy

type Sloppy struct {
	// contains filtered or unexported fields
}

Sloppy is a logical variant of Snappy. Its purpose to provide a kind of estimate of how a given byte sequence *will be* compressed by Snappy. It is useful when a byte stream is fed into a rolling hash with the goal of achieving a given average chunk byte length *after compression*. Sloppy is logically similar to snappy, but prefers "copies" which are closer to the repeated byte sequence (snappy prefers to refer to the *first* instance of a repeated byte sequence). This is important for mitigating the likelihood that altering any byte in an input stream will cause chunk boundaries to be redrawn downstream.

The high-level approach is to maintain a logical mapping between four-byte sequences which have been observed in the stream so-far and the integer offset of observed sequence (the mapping is done with a "cheap" hash-function which permits false-positives because they can be trivial filtered out). In the non-matched state, for each new byte consumed, a uint32 is computed from the next 4 bytes and then a look-up is performed to check for a matching 4 bytes earlier in the stream. Snappy and sloppy behave roughly identical thus far.

When in the "matched state" (attempting to extend the current match), Snappy does not re-index new 4-byte sequences, but Sloppy does. The reason for this is that Sloppy would like match the most recent occurence as it moves forward.

Lastly, Sloppy adds two novel heuritics, both aimed at further mitigating the chance of chunk boundaries being redrawn because of byte value changes:

1) During the first 2 bytes of match, it *continues* to look for closer matches (effectively prefering a closer but shorter copy to a further but longer one). The reason for this is that when sequences repeat frequently in a byte stream, randomness provides for a good chance that a one or two byte prefix on a repeated sequence will match "far away". E.g.

"23hello my friend, 12hello my friend, 01hello my friend, 23hello my friend"

In the above sequence, sloppy would prefer to copy the final "hello my friend" 19 bytes backwards rather than "23hello my friend" quite a bit further.

2) Sloppy will only emit copies which are "worth it". I.e. The longer the reference back, the longer the length of the copy must be.

func New

func New(f func(b byte) bool) *Sloppy

New returns a new sloppy encoder which will encode to |f|. If |f| ever returns false, then encoding ends immediately. |f| is a callback because the primary use is that the "encoded" byte stream is fed byte-by-byte into a rolling hash function.

func (*Sloppy) Reset

func (sl *Sloppy) Reset()

func (*Sloppy) Update

func (sl *Sloppy) Update(src []byte)

Update continues the encoding of a given input stream. The caller is expected to call update after having (ONLY) appended bytes to |src|. When |Update| returns, sloppy will have emitted 0 or more literals or copies by calling the |sf.f|. Note that sloppy will ALWAYS buffer the final three bytes of input.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL