rabin

package
v1.0.3007 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 14, 2024 License: Apache-2.0 Imports: 4 Imported by: 0

Documentation

Overview

Package rabin implements Rabin hashing (fingerprinting).

A given Rabin hash function is defined by a polynomial over GF(2):

p(x) = ... + p₂x² + p₁x + p₀   where pₙ ∈ GF(2)

The message to be hashed is likewise interpreted as a polynomial over GF(2), where the coefficients are the bits of the message in left-to-right most-significant-bit-first order. Given a message polynomial m(x) and a hashing polynomial p(x), the Rabin hash is simply the coefficients of m(x) mod p(x).

Rabin hashing has the unusual property that it can efficiently compute a "rolling hash" of a stream of data, where the hash value reflects only the most recent w bytes of the stream, for some window size w. This property makes it ideal for "content-defined chunking", which sub-divides sequential data on boundaries that are robust to insertions and deletions.

The details of Rabin hashing are described in Rabin, Michael (1981). "Fingerprinting by Random Polynomials." Center for Research in Computing Technology, Harvard University. Tech Report TR-CSE-03-01.

Index

Constants

View Source
const Poly64 = 0xbfe6b8a5bf378d83

Poly64 is an 64-bit (degree 63) irreducible polynomial over GF(2).

This is a convenient polynomial to use for computing 64-bit Rabin hashes.

Variables

This section is empty.

Functions

This section is empty.

Types

type Chunker

type Chunker struct {
	// contains filtered or unexported fields
}

A Chunker performs content-defined chunking. It divides a sequence of bytes into chunks such that insertions and deletions in the sequence will only affect chunk boundaries near those modifications.

func NewChunker

func NewChunker(table *Table, r io.Reader, minBytes, avgBytes, maxBytes int) *Chunker

NewChunker returns a content-defined chunker for data read from r using the Rabin hash defined by table. The chunks produced by this Chunker will be at least minBytes and at most maxBytes large and will, on average, be avgBytes large.

The Chunker buffers data from the Reader internally, so the Reader need not buffer itself. The caller may seek the reader, but if it does, it must only seek to a known chunk boundary and it must call Reset on the Chunker.

If the Reader additionally implements Discarder, the Chunker will use this to skip over bytes more efficiently.

The hash function defined by table must have a non-zero window size.

minBytes must be >= the window size. This ensures that chunk boundary n+1 does not depend on data from before chunk boundary n.

avgBytes must be a power of two.

func (*Chunker) Next

func (c *Chunker) Next() (int, error)

Next returns the length in bytes of the next chunk. If there are no more chunks, it returns 0, io.EOF. If the underlying Reader returns some other error, it passes that error on to the caller.

func (*Chunker) Reset

func (c *Chunker) Reset()

Reset resets c and clears its internal buffer. The caller must ensure that the underlying Reader is at a chunk boundary when calling Reset.

This is useful if the caller has knowledge of where an already-chunked stream is being modified. It can start at the chunk boundary before the modified point and re-chunk the stream until a new chunk boundary lines up with a boundary in the previous version of the stream.

type Discarder

type Discarder interface {
	// Discard skips the next n bytes, returning the number of
	// bytes discarded.
	//
	// If Discard skips fewer than n bytes, it also returns an
	// error. Discard must not skip beyond the end of the file.
	Discard(n int) (discarded int, err error)
}

A Discarder supports discarding bytes from an input stream.

type Hash

type Hash struct {
	// contains filtered or unexported fields
}

Hash computes Rabin hashes (often called fingerprints).

Hash implements hash.Hash64.

func New

func New(table *Table) *Hash

New returns a new Rabin hash using the polynomial and window size represented by table.

func (*Hash) BlockSize

func (h *Hash) BlockSize() int

BlockSize returns the window size if a window is configured, and otherwise returns 1.

This satisfies the hash.Hash interface and indicates that Write is most efficient if writes are a multiple of the returned size.

func (*Hash) Reset

func (h *Hash) Reset()

Reset resets h to its initial state.

func (*Hash) Size

func (h *Hash) Size() int

Size returns the number of bytes Sum will append. This is the minimum number of bytes necessary to represent the hash.

func (*Hash) Sum

func (h *Hash) Sum(b []byte) []byte

Sum appends the least-significant byte first representation of the current hash to b and returns the resulting slice.

func (*Hash) Sum64

func (h *Hash) Sum64() uint64

Sum64 returns the hash of all bytes written to h.

func (*Hash) Write

func (h *Hash) Write(p []byte) (n int, err error)

Write adds p to the running hash h.

If h is windowed, this may also expire previously written bytes from the running hash so that h represents the hash of only the most recently written window bytes.

It always returns len(p), nil.

type Table

type Table struct {
	// contains filtered or unexported fields
}

Table is a set of pre-computed tables for computing Rabin fingerprints for a given polynomial and window size.

func NewTable

func NewTable(polynomial uint64, window int) *Table

NewTable returns a Table for constructing Rabin hashes using the polynomial

p(x) = ... + p₂x² + p₁x + p₀   where pₙ ∈ GF(2)

where pₙ = (polynomial >> n) & 1. This polynomial must be irreducible and must have degree >= 8. The number of bits in the resulting hash values will be the same as the number of bits in polynomial.

This package defines Poly64 as a convenient 64-bit irreducible polynomial that can be used with this function.

If window > 0, hashes constructed from this Table will be rolling hash over only the most recently written window bytes of data.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL