hashsplit

package module v2.0.0-...-1e80874
Published: Aug 4, 2021 License: MIT Imports: 4 Imported by: 0

README

Hashsplit - content-based splitting of byte streams

Hashsplitting is a way of dividing a byte stream into pieces based on the stream's content rather than on any predetermined chunk size. As the Splitter reads the stream it maintains a rolling checksum of the last several bytes. A chunk boundary occurs when the rolling checksum has enough trailing bits set to zero (where “enough” is a configurable setting that determines the average chunk size).

Hashsplitting has benefits when it comes to representing multiple, slightly different versions of the same data. Consider, for example, the problem of adding EXIF tags to a JPEG image file. The tags appear near the beginning of the file, and the bulk of the image data follows. If the file were divided into chunks at (say) 8-kilobyte boundaries, then adding EXIF data near the beginning would alter every following chunk (except in the lucky case where the size of the added data is an exact multiple of 8 kilobytes). With hashsplitting, only the chunks in the vicinity of the change are affected.

A sequence of hashsplit chunks can additionally be organized into a tree for even better compaction. Each chunk has a “level” L determined by the rolling checksum, and each node in the tree has a level N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N>0 collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.

Hashsplitting is used to dramatically reduce storage and bandwidth requirements in projects like rsync, bup, and perkeep. More information, and a proposed standard, can be found at github.com/hashsplit/hashsplit-spec.

Note: an earlier version of this package included a Splitter.Split method, which allowed a Splitter s to consume all of the input from an io.Reader r. This has been removed. The same behavior can be obtained simply by doing:

_, err := io.Copy(s, r)
if err != nil { ... }
if err = s.Close(); err != nil { ... }

Documentation

Overview

Package hashsplit implements content-based splitting of byte streams.

Variables

var ErrNotFound = errors.New("not found")

Functions

func Split

func Split(r io.Reader, f func([]byte, uint) error) error

Split hashsplits its input using a default Splitter and the given callback to process chunks. See NewSplitter for details about the callback.

Types

type Node

type Node interface {
	// Offset gives the position in the original byte stream that is the first byte represented by this node.
	Offset() uint64

	// Size gives the number of bytes in the original byte stream that this node represents.
	Size() uint64

	// NumChildren gives the number of subnodes of this node.
	// This is only for interior nodes of the tree (level 1 and higher).
	// For leaf nodes (level 0) this must return zero.
	NumChildren() int

	// Child returns the subnode with the given index from 0 through NumChildren()-1.
	Child(int) (Node, error)
}

Node is the abstract type of a node in a hashsplit tree. See TreeBuilder for details.

func Seek

func Seek(n Node, pos uint64) (Node, error)

Seek finds the level-0 node representing the given byte position (i.e., the one where Offset <= pos < Offset+Size).

type Splitter

type Splitter struct {
	// MinSize is the minimum chunk size.
	// Only the final chunk may be smaller than this.
	// This should always be >= 64,
	// which is the rolling checksum "window size."
	// If it's less than the size of the checksum window,
	// then the same window can span multiple chunks,
	// meaning a chunk boundary is not independent of the preceding chunk.
	// If you leave this set to zero,
	// 64 is what you'll get.
	// If you really mean "I want no minimum,"
	// set this to 1.
	MinSize int

	// SplitBits is the number of trailing zero bits in the rolling checksum required to produce a chunk.
	// The default (what you get if you leave it set to zero) is 13,
	// which means a chunk boundary occurs on average once every 8,192 bytes.
	//
	// (But thanks to math, that doesn't mean that 8,192 is the median chunk size.
	// The median chunk size is actually the logarithm, base (2^SplitBits-1)/(2^SplitBits), of 0.5.
	// That makes the median chunk size 5,678 when SplitBits==13.)
	SplitBits uint
	// contains filtered or unexported fields
}

Splitter hashsplits a byte sequence into chunks. It implements the io.WriteCloser interface. Create a new Splitter with NewSplitter.

func NewSplitter

func NewSplitter(f func([]byte, uint) error) *Splitter

NewSplitter produces a new Splitter with the given callback. The Splitter is an io.WriteCloser. As bytes are written to it, it finds chunk boundaries and calls the callback.

The callback receives the bytes of the chunk, and the chunk's "level," which is the number of extra trailing zeroes in the rolling checksum (in excess of Splitter.SplitBits).

Do not forget to call Close on the Splitter to flush any remaining chunk from its internal buffer.

func (*Splitter) Close

func (s *Splitter) Close() error

Close implements io.Closer. It is necessary to call Close to flush any buffered chunk remaining. Calling Close may result in a call to the callback in s. It is an error to call Write after a call to Close. Close is idempotent: it can safely be called multiple times without error (and without producing the final chunk multiple times).

func (*Splitter) Write

func (s *Splitter) Write(inp []byte) (int, error)

Write implements io.Writer. It may produce one or more calls to the callback in s, as chunks are discovered. Any error from the callback will cause Write to return early with that error.

type TreeBuilder

type TreeBuilder struct {
	// F is an optional function for transforming the TreeBuilder's node representation
	// (*TreeBuilderNode)
	// into any other type implementing the Node interface.
	// This is called on each node as it is completed and added to its parent as a new child.
	//
	// Callers may wish to perform this transformation when it is not necessary or desirable to keep the full input in memory
	// (i.e., the chunks in the leaf nodes),
	// such as when the input may be very large.
	//
	// F is guaranteed to be called exactly once on each node.
	//
	// If F is nil,
	// all nodes in the tree remain *TreeBuilderNode objects.
	//
	// If this callback returns an error,
	// the enclosing function -
	// Add or Root -
	// returns early with that error.
	// In that case the TreeBuilder is left in an inconsistent state
	// and no further calls to Add or Root are possible.
	F func(*TreeBuilderNode) (Node, error)
	// contains filtered or unexported fields
}

TreeBuilder assembles a sequence of chunks into a hashsplit tree.

A hashsplit tree provides another level of space-and-bandwidth savings over and above what Split gives you. Consider, again, the example of adding EXIF tags to a JPEG file. Although most chunks of the hashsplitted file will be the same before and after adding tags, the _list_ needed to reassemble those chunks into the original file will be very different: all the unaffected chunks must shift position to accommodate the new EXIF-containing chunks.

A hashsplit tree organizes that list into a tree instead, whose shape is determined by the content of the chunks, just as the chunk boundaries are. It has the property that only the tree nodes in the vicinity of the change will be affected. Most subtrees will remain the same.

Just as each chunk has a level L determined by the rolling checksum (see NewSplitter), so does each node in the tree have a level, N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N>0 collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.

Example
var r io.Reader // Represents the source of some data.

var tb hashsplit.TreeBuilder
err := hashsplit.Split(r, tb.Add)
if err != nil {
	panic(err)
}
// Get the root of the tree with tb.Root().

Example (FanOut)
var r io.Reader // Represents the source of some data.

var tb hashsplit.TreeBuilder
err := hashsplit.Split(r, func(bytes []byte, level uint) error {
	// Map level to a smaller range for wider fan-out
	// (more children per tree node).
	return tb.Add(bytes, level/4)
})
if err != nil {
	panic(err)
}
// Get the root of the tree with tb.Root().

Example (SaveAside)
var r io.Reader // Represents the source of some data.

// Represents any function that replaces a chunk with a compact representation of that chunk
// (like a hash or a lookup key).
var saveAside func([]byte) ([]byte, error)

tb := hashsplit.TreeBuilder{
	F: func(node *hashsplit.TreeBuilderNode) (hashsplit.Node, error) {
		for i, chunk := range node.Chunks {
			repr, err := saveAside(chunk)
			if err != nil {
				return nil, err
			}
			node.Chunks[i] = repr
		}
		return node, nil
	},
}
err := hashsplit.Split(r, tb.Add)
if err != nil {
	panic(err)
}
// Get the root of the tree with tb.Root().

func (*TreeBuilder) Add

func (tb *TreeBuilder) Add(bytes []byte, level uint) error

Add adds a new chunk to the TreeBuilder. It is typical to call this function in the callback of Split as each chunk is produced.

The level of a chunk is normally the level value passed to the Split callback. It results in the creation of a new node at the given level. However, this produces a tree with an average branching factor of 2. For wider fan-out (more children per node), the caller can reduce the value of level.

func (*TreeBuilder) Root

func (tb *TreeBuilder) Root() (Node, error)

Root produces the root of the tree after all nodes have been added with calls to Add. Root may be called only once. If the tree is empty, Root returns a nil Node. It is an error to call Add after a call to Root.

The return value of Root is the interface type Node. If tb.F is nil, the concrete type will be *TreeBuilderNode.

type TreeBuilderNode

type TreeBuilderNode struct {
	// Nodes is the list of subnodes.
	// This is empty for leaf nodes (level 0) and non-empty for interior nodes (level 1 and higher).
	Nodes []Node

	// Chunks is a list of chunks.
	// This is non-empty for leaf nodes (level 0) and empty for interior nodes (level 1 and higher).
	Chunks [][]byte
	// contains filtered or unexported fields
}

TreeBuilderNode is the concrete type implementing the Node interface that is used internally by TreeBuilder. Callers may transform this into any other node type during tree construction using the TreeBuilder.F callback.

An interior node ("level 1" and higher) contains one or more subnodes as children. A leaf node ("level 0") contains one or more byte slices, which are hashsplit chunks of the input. Exactly one of Nodes and Chunks is non-empty.

func (*TreeBuilderNode) Child

func (n *TreeBuilderNode) Child(i int) (Node, error)

Child implements Node.Child.

func (*TreeBuilderNode) NumChildren

func (n *TreeBuilderNode) NumChildren() int

NumChildren implements Node.NumChildren.

func (*TreeBuilderNode) Offset

func (n *TreeBuilderNode) Offset() uint64

Offset implements Node.Offset.

func (*TreeBuilderNode) Size

func (n *TreeBuilderNode) Size() uint64

Size implements Node.Size.
