Documentation ¶
Overview ¶
Package block implements content-sensitive partitioning of a stream of byte data into blocks, using a rolling hash function.
The algorithm used to split data into blocks is based on the one from LBFS:
http://pdos.csail.mit.edu/lbfs/
As described in the SOSP 2001 paper "A Low-Bandwidth Network File System":
https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf
This package provides an implementation of the Rabin-Karp modular rolling hash algorithm; other algorithms can be plugged in by implementing the Hasher and Hash interfaces.
Example ¶
package main import ( "fmt" "log" "strings" "github.com/creachadair/ffs/block" ) func main() { // Note that these two strings are similar, but the second one has been // edited. The edit changes the block splits around the point of the edit, // but the sliding window allows them to resynchronize after the break. const input1 = `abcdefg-hijklmnop-qrstuv-wxyz-abcdefg-hijklmnop-qrstuv-wxyz-abcdefghijklmnopqrstuv` const input2 = `abcdefg-hijklmnop-qrstuv-wxyz-*-abcdefg-hijklmnop-qrstuv-wxyz-abcdefghijklmnopqrstuv` opts := &block.SplitConfig{ Min: 5, // no blocks shorter than this Size: 10, // desired mean block size Max: 20, // no blocks longer than this Hasher: block.RabinKarpHasher(23, 997, 13), } for _, v := range []string{input1, input2} { s := block.NewSplitter(strings.NewReader(v), opts) var i int if err := s.Split(func(data []byte) error { i++ fmt.Printf("%d. %s\n", i, string(data)) return nil }); err != nil { log.Fatal(err) } fmt.Println() } }
Output: 1. abcdefg-h 2. ijklmnop-qrstu 3. v-wxyz-abcdefg 4. -hijklmnop-qrstu 5. v-wxyz-abcdefghijklm 6. nopqrstuv 1. abcdefg-h 2. ijklmnop-qrstu 3. v-wxyz-*- 4. abcdefg-hi 5. jklmnop-qrstu 6. v-wxyz-abcdefghijklm 7. nopqrstuv
Index ¶
Examples ¶
Constants ¶
const ( // DefaultMin is the default minimum block size, in bytes. DefaultMin = 2048 // DefaultSize is the default target block size, in bytes. DefaultSize = 16384 // DefaultMax is the default maximum block size, in bytes. DefaultMax = 65536 )
These values are the defaults used if none are specified in the config.
Variables ¶
var DefaultHasher = RabinKarpHasher(1031, 2147483659, 48)
DefaultHasher is used by a Splitter if no hasher is set in its config.
Functions ¶
This section is empty.
Types ¶
type Hash ¶
type Hash interface { // Add a byte to the rolling hash, and return the updated value. Update(byte) uint64 }
A Hash implements a rolling hash.
type Hasher ¶
type Hasher interface { // Hash returns a fresh Hash instance using the settings from the Hasher. // Instances are independent and can be safely used concurrently. Hash() Hash }
A Hasher constructs rolling hash instances. Use the Hash method to obtain a fresh instance.
func RabinKarpHasher ¶
RabinKarpHasher returns a Rabin-Karp rolling hasher using the given base, modulus, and window size. The base and modulus must be coprime and the modulus should be prime (but note that the constructor does not check this).
type SplitConfig ¶
type SplitConfig struct { // The rolling hash to use. If nil, uses DefaultHasher. Hasher // Minimum block size, in bytes. The splitter will not split a block until // it is at least this size. Min int // Desired block size, in bytes. The splitter will attempt to generate // blocks of approximately this average size. Size int // Maximum block size, in bytes. The splitter will split any block that // exceeds this size, even if the rolling hash does not find a break. Max int }
A SplitConfig contains the settings to construct a splitter.
func (*SplitConfig) Hash ¶
func (c *SplitConfig) Hash() Hash
Hash implements the Hasher interface for a SplitConfig.
type Splitter ¶
type Splitter struct {
// contains filtered or unexported fields
}
A Splitter wraps an underlying io.Reader to split the data from the reader into blocks using a rolling hash.
func NewSplitter ¶
func NewSplitter(r io.Reader, c *SplitConfig) *Splitter
NewSplitter constructs a Splitter that reads its data from r and partitions it into blocks using the rolling hash from c. A nil *SplitConfig is ready for use with default sizes and hash settings.
func (*Splitter) Config ¶
func (s *Splitter) Config() *SplitConfig
Config returns the SplitConfig used to construct s, which may be nil.
func (*Splitter) Next ¶
Next returns the next available block, or an error. The slice returned is only valid until a subsequent call of Next. Returns nil, io.EOF when no further blocks are available.
func (*Splitter) Split ¶
Split splits blocks from s and passes each block in sequence to f, until there are no further blocks or until f returns an error. If f returns an error, processing stops and that error is returned to the caller of Split.
The slice passed to f is only valid while f is active; if f wishes to store a block for later use, it must be copied.