block

package
v0.4.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 8, 2024 License: Apache-2.0 Imports: 2 Imported by: 0

Documentation

Overview

Package block implements content-sensitive partitioning of a stream of byte data into blocks, using a rolling hash function.

The algorithm used to split data into blocks is based on the one from LBFS:

http://pdos.csail.mit.edu/lbfs/

As described in the SOSP 2001 paper "A Low-Bandwidth Network File System":

https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf

This package provides an implementation of the Rabin-Karp modular rolling hash algorithm; other algorithms can be plugged in by implementing the Hasher and Hash interfaces.

Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/creachadair/ffs/block"
)

func main() {
	// Note that these two strings are similar, but the second one has been
	// edited. The edit changes the block splits around the point of the edit,
	// but the sliding window allows them to resynchronize after the break.
	const input1 = `abcdefg-hijklmnop-qrstuv-wxyz-abcdefg-hijklmnop-qrstuv-wxyz-abcdefghijklmnopqrstuv`
	const input2 = `abcdefg-hijklmnop-qrstuv-wxyz-*-abcdefg-hijklmnop-qrstuv-wxyz-abcdefghijklmnopqrstuv`

	opts := &block.SplitConfig{
		Min:    5,  // no blocks shorter than this
		Size:   10, // desired mean block size
		Max:    20, // no blocks longer than this
		Hasher: block.RabinKarpHasher(23, 997, 13),
	}

	for _, v := range []string{input1, input2} {
		s := block.NewSplitter(strings.NewReader(v), opts)
		var i int
		if err := s.Split(func(data []byte) error {
			i++
			fmt.Printf("%d. %s\n", i, string(data))
			return nil
		}); err != nil {
			log.Fatal(err)
		}
		fmt.Println()
	}

}
Output:


1. abcdefg-h
2. ijklmnop-qrstu
3. v-wxyz-abcdefg
4. -hijklmnop-qrstu
5. v-wxyz-abcdefghijklm
6. nopqrstuv

1. abcdefg-h
2. ijklmnop-qrstu
3. v-wxyz-*-
4. abcdefg-hi
5. jklmnop-qrstu
6. v-wxyz-abcdefghijklm
7. nopqrstuv

Index

Examples

Constants

View Source
const (
	// DefaultMin is the default minimum block size, in bytes.
	DefaultMin = 2048

	// DefaultSize is the default target block size, in bytes.
	DefaultSize = 16384

	// DefaultMax is the default maximum block size, in bytes.
	DefaultMax = 65536
)

These values are the defaults used if none are specified in the config.

Variables

View Source
var DefaultHasher = RabinKarpHasher(1031, 2147483659, 48)

DefaultHasher is used by a Splitter if no hasher is set in its config.

Functions

This section is empty.

Types

type Hash

type Hash interface {
	// Add a byte to the rolling hash, and return the updated value.
	Update(byte) uint64
}

A Hash implements a rolling hash.

type Hasher

type Hasher interface {
	// Hash returns a fresh Hash instance using the settings from the Hasher.
	// Instances are independent and can be safely used concurrently.
	Hash() Hash
}

A Hasher constructs rolling hash instances. Use the Hash method to obtain a fresh instance.

func RabinKarpHasher

func RabinKarpHasher(base, modulus int64, windowSize int) Hasher

RabinKarpHasher returns a Rabin-Karp rolling hasher using the given base, modulus, and window size. The base and modulus must be coprime and the modulus should be prime (but note that the constructor does not check this).

type SplitConfig

type SplitConfig struct {
	// The rolling hash to use. If nil, uses DefaultHasher.
	Hasher

	// Minimum block size, in bytes. The splitter will not split a block until
	// it is at least this size.
	Min int

	// Desired block size, in bytes. The splitter will attempt to generate
	// blocks of approximately this average size.
	Size int

	// Maximum block size, in bytes. The splitter will split any block that
	// exceeds this size, even if the rolling hash does not find a break.
	Max int
}

A SplitConfig contains the settings to construct a splitter.

func (*SplitConfig) Hash

func (c *SplitConfig) Hash() Hash

Hash implements the Hasher interface for a SplitConfig.

type Splitter

type Splitter struct {
	// contains filtered or unexported fields
}

A Splitter wraps an underlying io.Reader to split the data from the reader into blocks using a rolling hash.

func NewSplitter

func NewSplitter(r io.Reader, c *SplitConfig) *Splitter

NewSplitter constructs a Splitter that reads its data from r and partitions it into blocks using the rolling hash from c. A nil *SplitConfig is ready for use with default sizes and hash settings.

func (*Splitter) Config

func (s *Splitter) Config() *SplitConfig

Config returns the SplitConfig used to construct s, which may be nil.

func (*Splitter) Next

func (s *Splitter) Next() ([]byte, error)

Next returns the next available block, or an error. The slice returned is only valid until a subsequent call of Next. Returns nil, io.EOF when no further blocks are available.

func (*Splitter) Split

func (s *Splitter) Split(f func(data []byte) error) error

Split splits blocks from s and passes each block in sequence to f, until there are no further blocks or until f returns an error. If f returns an error, processing stops and that error is returned to the caller of Split.

The slice passed to f is only valid while f is active; if f wishes to store a block for later use, it must be copied.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL