hashsplit

Published: Aug 19, 2023 License: MIT Imports: 4 Imported by: 0

README

Hashsplit - content-based splitting of byte streams


Hashsplitting is a way of dividing a byte stream into pieces based on the stream's content rather than on any predetermined chunk size. As the Splitter reads the stream it maintains a rolling checksum of the last several bytes. A chunk boundary occurs when the rolling checksum has enough trailing zero bits (where “enough” is a configurable setting that determines the average chunk size).
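The boundary test can be sketched like this (a simplified, self-contained illustration; the real Splitter computes a rolling checksum over a 64-byte window, while this sketch just examines an arbitrary checksum value):

```go
package main

import (
	"fmt"
	"math/bits"
)

// boundary reports whether a checksum value ends a chunk,
// given the required number of trailing zero bits.
func boundary(sum uint32, splitBits uint) bool {
	return uint(bits.TrailingZeros32(sum)) >= splitBits
}

func main() {
	// With splitBits=13, a boundary needs at least 13 trailing zero bits.
	fmt.Println(boundary(0x4000, 13)) // 0x4000 == 1<<14, 14 trailing zeros: true
	fmt.Println(boundary(0x4001, 13)) // odd value, 0 trailing zeros: false
}
```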

Usage

You can split the input from r, an io.Reader, like this:

err := Split(r, f)

...where f is a func([]byte, uint) error that receives each consecutive chunk and its “level” (which can be thought of as how badly the splitter wanted to make a boundary at the end of the chunk). These chunks can be arranged in a “hashsplit tree” like this:

var tb TreeBuilder
err := Split(r, tb.Add)
if err != nil { ... }
root, err := tb.Root()

...and now root is the root of a tree whose leaves contain consecutive chunks of the input.

What is it all about?

Hashsplitting has benefits when it comes to representing multiple, slightly different versions of the same data. Consider, for example, the problem of adding EXIF tags to a JPEG image file. The tags appear near the beginning of the file, and the bulk of the image data follows. If the file were divided into chunks at (say) 8-kilobyte boundaries, then adding EXIF data near the beginning would alter every following chunk (except in the lucky case where the size of the added data is an exact multiple of 8kb). With hashsplitting, only the chunks in the vicinity of the change are affected.
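The shifting problem with fixed-size chunks can be seen in a toy comparison (a self-contained sketch using 4-byte chunks for brevity; `fixedChunks` and `sharedChunks` are hypothetical helpers, not part of this package):

```go
package main

import "fmt"

// fixedChunks splits data into fixed-size chunks of n bytes each.
func fixedChunks(data []byte, n int) []string {
	var out []string
	for len(data) > 0 {
		if len(data) < n {
			n = len(data)
		}
		out = append(out, string(data[:n]))
		data = data[n:]
	}
	return out
}

// sharedChunks counts chunks of a that also appear in b.
func sharedChunks(a, b []string) int {
	seen := make(map[string]bool)
	for _, c := range b {
		seen[c] = true
	}
	count := 0
	for _, c := range a {
		if seen[c] {
			count++
		}
	}
	return count
}

func main() {
	orig := []byte("aaaabbbbccccdddd")
	edited := append([]byte("X"), orig...) // insert one byte up front
	a := fixedChunks(orig, 4)
	b := fixedChunks(edited, 4)
	// Every fixed-size chunk after the insertion point shifts, so none match.
	fmt.Println(sharedChunks(a, b)) // prints 0
}
```

With content-defined boundaries, by contrast, the chunks after the insertion realign as soon as the rolling window clears the edited region.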

A sequence of hashsplit chunks can additionally be organized into a tree for even better compaction. Each chunk has a “level” L determined by the rolling checksum, and each node in the tree has a level N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.

Hashsplitting is used to dramatically reduce storage and bandwidth requirements in projects like rsync, bup, and perkeep. More information, and a proposed standard, can be found at github.com/hashsplit/hashsplit-spec.

Compatibility note

An earlier version of this package included a Splitter.Split method, which allowed a Splitter s to consume all of the input from an io.Reader r. This has been removed. The same behavior can be obtained simply by doing:

_, err := io.Copy(s, r)
if err != nil { ... }
err = s.Close()
if err != nil { ... }

Documentation

Index

Constants

This section is empty.

Variables

var ErrNotFound = errors.New("not found")

ErrNotFound is the error returned by Seek when the seek position lies outside the given node's range.

Functions

func Split

func Split(r io.Reader, f func([]byte, uint) error) error

Split hashsplits its input using a default Splitter and the given callback to process chunks. See NewSplitter for details about the callback.

Types

type Node

type Node interface {
	// Offset gives the position in the original byte stream that is the first byte represented by this node.
	Offset() uint64

	// Size gives the number of bytes in the original byte stream that this node represents.
	Size() uint64

	// NumChildren gives the number of subnodes of this node.
	// This is only for interior nodes of the tree (level 1 and higher).
	// For leaf nodes (level 0) this must return zero.
	NumChildren() int

	// Child returns the subnode with the given index from 0 through NumChildren()-1.
	Child(int) (Node, error)
}

Node is the abstract type of a node in a hashsplit tree. See TreeBuilder for details.

func Seek

func Seek(n Node, pos uint64) (Node, error)

Seek finds the level-0 node representing the given byte position (i.e., the one where Offset <= pos < Offset+Size).
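The descent Seek performs can be illustrated with a minimal stand-in tree type (the `node` struct and `seek` function here are hypothetical, not the package's implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// node is a minimal stand-in for the package's Node interface,
// used only to illustrate how a seek descends the tree.
type node struct {
	offset, size uint64
	children     []*node
}

var errNotFound = errors.New("not found")

// seek finds the leaf whose byte range contains pos,
// mirroring Seek's contract: Offset <= pos < Offset+Size.
func seek(n *node, pos uint64) (*node, error) {
	if pos < n.offset || pos >= n.offset+n.size {
		return nil, errNotFound
	}
	for len(n.children) > 0 {
		found := false
		for _, c := range n.children {
			// Children are contiguous and ordered, so the first child
			// whose range ends beyond pos is the one that contains it.
			if pos < c.offset+c.size {
				n, found = c, true
				break
			}
		}
		if !found {
			return nil, errNotFound
		}
	}
	return n, nil
}

func main() {
	leaf1 := &node{offset: 0, size: 10}
	leaf2 := &node{offset: 10, size: 5}
	root := &node{offset: 0, size: 15, children: []*node{leaf1, leaf2}}
	got, _ := seek(root, 12)
	fmt.Println(got.offset) // position 12 falls in leaf2, which starts at 10
}
```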

type Splitter

type Splitter struct {
	// MinSize is the minimum chunk size.
	// Only the final chunk may be smaller than this.
	// This should always be >= 64,
	// which is the rolling checksum "window size."
	// If it's less than the size of the checksum window,
	// then the same window can span multiple chunks,
	// meaning a chunk boundary is not independent of the preceding chunk.
	// If you leave this set to zero,
	// 64 is what you'll get.
	// If you really mean "I want no minimum,"
	// set this to 1.
	MinSize int

	// SplitBits is the number of trailing zero bits in the rolling checksum required to produce a chunk.
	// The default (what you get if you leave it set to zero) is 13,
	// which means a chunk boundary occurs on average once every 8,192 bytes.
	//
	// (But thanks to math, that doesn't mean that 8,192 is the median chunk size.
	// The median chunk size is actually the logarithm, base (2^SplitBits-1)/(2^SplitBits), of 0.5.
	// That makes the median chunk size 5,678 when SplitBits==13.)
	SplitBits uint

	// MinBytesOfPart is the minimum file part size.
	MinBytesOfPart int
	// contains filtered or unexported fields
}

Splitter hashsplits a byte sequence into chunks. It implements the io.WriteCloser interface. Create a new Splitter with NewSplitter.

Hashsplitting is a way of dividing a byte stream into pieces based on the stream's content rather than on any predetermined chunk size. As the Splitter reads the stream it maintains a rolling checksum of the last several bytes. A chunk boundary occurs when the rolling checksum has enough trailing bits set to zero (where "enough" is a configurable setting that determines the average chunk size).

Hashsplitting has benefits when it comes to representing multiple, slightly different versions of the same data. Consider, for example, the problem of adding EXIF tags to a JPEG image file. The tags appear near the beginning of the file, and the bulk of the image data follows. If the file were divided into chunks at (say) 8-kilobyte boundaries, then adding EXIF data near the beginning would alter every following chunk, except in the lucky case where the size of the added data is an exact multiple of 8kb. With hashsplitting, only the chunks in the vicinity of the change are affected.

Hashsplitting is used to dramatically reduce storage and bandwidth requirements in projects like rsync, bup, and perkeep.

func NewSplitter

func NewSplitter(f func([]byte, uint) error) *Splitter

NewSplitter produces a new Splitter with the given callback. The Splitter is an io.WriteCloser. As bytes are written to it, it finds chunk boundaries and calls the callback.

The callback receives the bytes of the chunk, and the chunk's "level," which is the number of extra trailing zeroes in the rolling checksum (in excess of Splitter.SplitBits).
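That level computation can be sketched as follows (a simplified illustration with a hypothetical `level` helper, not the package's internals):

```go
package main

import (
	"fmt"
	"math/bits"
)

// level computes a chunk's level as the number of trailing zero bits
// in the checksum in excess of splitBits.
func level(sum uint32, splitBits uint) uint {
	tz := uint(bits.TrailingZeros32(sum))
	if tz < splitBits {
		return 0 // not enough zero bits: no boundary here at all
	}
	return tz - splitBits
}

func main() {
	// 1<<15 has 15 trailing zeros; with splitBits=13 that's 2 extra.
	fmt.Println(level(1<<15, 13)) // prints 2
}
```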

Do not forget to call Close on the Splitter to flush any remaining chunk from its internal buffer.

func (*Splitter) Close

func (s *Splitter) Close() error

Close implements io.Closer. It is necessary to call Close to flush any buffered chunk remaining. Calling Close may result in a call to the callback in s. It is an error to call Write after a call to Close. Close is idempotent: it can safely be called multiple times without error (and without producing the final chunk multiple times).

func (*Splitter) Write

func (s *Splitter) Write(inp []byte) (int, error)

Write implements io.Writer. It may produce one or more calls to the callback in s, as chunks are discovered. Any error from the callback will cause Write to return early with that error.

type TreeBuilder

type TreeBuilder struct {
	// F is an optional function for transforming the TreeBuilder's node representation
	// (*TreeBuilderNode)
	// into any other type implementing the Node interface.
	// This is called on each node as it is completed and added to its parent as a new child.
	//
	// Callers may wish to perform this transformation when it is not necessary or desirable to keep the full input in memory
	// (i.e., the chunks in the leaf nodes),
	// such as when the input may be very large.
	//
	// F is guaranteed to be called exactly once on each node.
	//
	// If F is nil,
	// all nodes in the tree remain *TreeBuilderNode objects.
	//
	// If this callback returns an error,
	// the enclosing function -
	// Add or Root -
	// returns early with that error.
	// In that case the TreeBuilder is left in an inconsistent state
	// and no further calls to Add or Root are possible.
	F func(*TreeBuilderNode) (Node, error)
	// contains filtered or unexported fields
}

TreeBuilder assembles a sequence of chunks into a hashsplit tree.

A hashsplit tree provides another level of space-and-bandwidth savings over and above what Split gives you. Consider, again, the example of adding EXIF tags to a JPEG file. Although most chunks of the hashsplitted file will be the same before and after adding tags, the list needed to reassemble those chunks into the original file will be very different: all the unaffected chunks must shift position to accommodate the new EXIF-containing chunks.

A hashsplit tree organizes that list into a tree instead, whose shape is determined by the content of the chunks, just as the chunk boundaries are. It has the property that only the tree nodes in the vicinity of the change will be affected. Most subtrees will remain the same.

Just as each chunk has a level L determined by the rolling checksum (see NewSplitter), so does each node in the tree have a level, N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N>0 collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.
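The grouping rule can be simulated for one level of the tree (a self-contained sketch; `groupLevels` is hypothetical and reports only the sizes of the resulting nodes):

```go
package main

import "fmt"

// groupLevels simulates one level of hashsplit-tree construction:
// given the levels of a sequence of items, it returns the sizes of
// the level-n nodes, where each node collects items up to and
// including one whose level exceeds n.
func groupLevels(levels []uint, n uint) []int {
	var sizes []int
	count := 0
	for _, l := range levels {
		count++
		if l > n {
			sizes = append(sizes, count)
			count = 0
		}
	}
	if count > 0 {
		sizes = append(sizes, count) // trailing node with no closing high-level chunk
	}
	return sizes
}

func main() {
	// Chunks with levels 0,0,2,0,1,0,0 form three level-0 nodes:
	// the level-2 chunk closes the first, the level-1 chunk the second.
	fmt.Println(groupLevels([]uint{0, 0, 2, 0, 1, 0, 0}, 0)) // prints [3 2 2]
}
```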

func (*TreeBuilder) Add

func (tb *TreeBuilder) Add(bytes []byte, level uint) error

Add adds a new chunk to the TreeBuilder. It is typical to call this function in the callback of Split as each chunk is produced.

The level of a chunk is normally the level value passed to the Split callback. It results in the creation of a new node at the given level. However, this produces a tree with an average branching factor of 2. For wider fan-out (more children per node), the caller can reduce the value of level.

func (*TreeBuilder) Root

func (tb *TreeBuilder) Root() (Node, error)

Root produces the root of the tree after all nodes have been added with calls to Add. Root may only be called one time. If the tree is empty, Root returns a nil Node. It is an error to call Add after a call to Root.

The return value of Root is the interface type Node. If tb.F is nil, the concrete type will be *TreeBuilderNode.

type TreeBuilderNode

type TreeBuilderNode struct {
	// Nodes is the list of subnodes.
	// This is empty for leaf nodes (level 0) and non-empty for interior nodes (level 1 and higher).
	Nodes []Node

	// Chunks is a list of chunks.
	// This is non-empty for leaf nodes (level 0) and empty for interior nodes (level 1 and higher).
	Chunks [][]byte
	// contains filtered or unexported fields
}

TreeBuilderNode is the concrete type implementing the Node interface that is used internally by TreeBuilder. Callers may transform this into any other node type during tree construction using the TreeBuilder.F callback.

An interior node ("level 1" and higher) contains one or more subnodes as children. A leaf node ("level 0") contains one or more byte slices, which are hashsplit chunks of the input. Exactly one of Nodes and Chunks is non-empty.

func (*TreeBuilderNode) Child

func (n *TreeBuilderNode) Child(i int) (Node, error)

Child implements Node.Child.

func (*TreeBuilderNode) NumChildren

func (n *TreeBuilderNode) NumChildren() int

NumChildren implements Node.NumChildren, the number of child nodes.

func (*TreeBuilderNode) Offset

func (n *TreeBuilderNode) Offset() uint64

Offset implements Node.Offset, the position of the first byte of the underlying input represented by this node.

func (*TreeBuilderNode) Size

func (n *TreeBuilderNode) Size() uint64

Size implements Node.Size, the number of bytes of the underlying input represented by this node.
