hashsplit

package module v2.0.0-...-1e80874
Published: Aug 4, 2021 License: MIT Imports: 4 Imported by: 0

README

Hashsplit - content-based splitting of byte streams

Hashsplitting is a way of dividing a byte stream into pieces based on the stream's content rather than on any predetermined chunk size. As the Splitter reads the stream it maintains a rolling checksum of the last several bytes. A chunk boundary occurs when the rolling checksum has enough trailing bits set to zero (where “enough” is a configurable setting that determines the average chunk size).

Hashsplitting has benefits when it comes to representing multiple, slightly different versions of the same data. Consider, for example, the problem of adding EXIF tags to a JPEG image file. The tags appear near the beginning of the file, and the bulk of the image data follows. If the file were divided into chunks at (say) 8-kilobyte boundaries, then adding EXIF data near the beginning would alter every following chunk (except in the lucky case where the size of the added data is an exact multiple of 8 kilobytes). With hashsplitting, only the chunks in the vicinity of the change are affected.

A sequence of hashsplit chunks can additionally be organized into a tree for even better compaction. Each chunk has a “level” L determined by the rolling checksum, and each node in the tree has a level N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N>0 collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.

Hashsplitting is used to dramatically reduce storage and bandwidth requirements in projects like rsync, bup, and perkeep. More information, and a proposed standard, can be found at github.com/hashsplit/hashsplit-spec.

Note: an earlier version of this package included a Splitter.Split method, which allowed a Splitter s to consume all of the input from an io.Reader r. This has been removed. The same behavior can be obtained simply by doing:

_, err := io.Copy(s, r)
if err != nil { ... }
if err = s.Close(); err != nil { ... }

Documentation

Overview

Package hashsplit implements content-based splitting of byte streams.

Variables

var ErrNotFound = errors.New("not found")

Functions

func Split

func Split(r io.Reader, f func([]byte, uint) error) error

Split hashsplits its input using a default Splitter and the given callback to process chunks. See NewSplitter for details about the callback.

Types

type Node

type Node interface {
	// Offset gives the position in the original byte stream that is the first byte represented by this node.
	Offset() uint64

	// Size gives the number of bytes in the original byte stream that this node represents.
	Size() uint64

	// NumChildren gives the number of subnodes of this node.
	// This is only for interior nodes of the tree (level 1 and higher).
	// For leaf nodes (level 0) this must return zero.
	NumChildren() int

	// Child returns the subnode with the given index from 0 through NumChildren()-1.
	Child(int) (Node, error)
}

Node is the abstract type of a node in a hashsplit tree. See TreeBuilder for details.

func Seek

func Seek(n Node, pos uint64) (Node, error)

Seek finds the level-0 node representing the given byte position (i.e., the one where Offset <= pos < Offset+Size).

type Splitter

type Splitter struct {
	// MinSize is the minimum chunk size.
	// Only the final chunk may be smaller than this.
	// This should always be >= 64,
	// which is the rolling checksum "window size."
	// If it's less than the size of the checksum window,
	// then the same window can span multiple chunks,
	// meaning a chunk boundary is not independent of the preceding chunk.
	// If you leave this set to zero,
	// 64 is what you'll get.
	// If you really mean "I want no minimum,"
	// set this to 1.
	MinSize int

	// SplitBits is the number of trailing zero bits in the rolling checksum required to produce a chunk.
	// The default (what you get if you leave it set to zero) is 13,
	// which means a chunk boundary occurs on average once every 8,192 bytes.
	//
	// (But thanks to math, that doesn't mean that 8,192 is the median chunk size.
	// The median chunk size is actually the logarithm, base (2^SplitBits-1)/(2^SplitBits), of 0.5.
	// That makes the median chunk size 5,678 when SplitBits==13.)
	SplitBits uint
	// contains filtered or unexported fields
}

Splitter hashsplits a byte sequence into chunks. It implements the io.WriteCloser interface. Create a new Splitter with NewSplitter.

func NewSplitter

func NewSplitter(f func([]byte, uint) error) *Splitter

NewSplitter produces a new Splitter with the given callback. The Splitter is an io.WriteCloser. As bytes are written to it, it finds chunk boundaries and calls the callback.

The callback receives the bytes of the chunk, and the chunk's "level," which is the number of extra trailing zeroes in the rolling checksum (in excess of Splitter.SplitBits).

Do not forget to call Close on the Splitter to flush any remaining chunk from its internal buffer.

func (*Splitter) Close

func (s *Splitter) Close() error

Close implements io.Closer. It is necessary to call Close to flush any buffered chunk remaining. Calling Close may result in a call to the callback in s. It is an error to call Write after a call to Close. Close is idempotent: it can safely be called multiple times without error (and without producing the final chunk multiple times).

func (*Splitter) Write

func (s *Splitter) Write(inp []byte) (int, error)

Write implements io.Writer. It may produce one or more calls to the callback in s, as chunks are discovered. Any error from the callback will cause Write to return early with that error.

type TreeBuilder

type TreeBuilder struct {
	// F is an optional function for transforming the TreeBuilder's node representation
	// (*TreeBuilderNode)
	// into any other type implementing the Node interface.
	// This is called on each node as it is completed and added to its parent as a new child.
	//
	// Callers may wish to perform this transformation when it is not necessary or desirable to keep the full input in memory
	// (i.e., the chunks in the leaf nodes),
	// such as when the input may be very large.
	//
	// F is guaranteed to be called exactly once on each node.
	//
	// If F is nil,
	// all nodes in the tree remain *TreeBuilderNode objects.
	//
	// If this callback returns an error,
	// the enclosing function -
	// Add or Root -
	// returns early with that error.
	// In that case the TreeBuilder is left in an inconsistent state
	// and no further calls to Add or Root are possible.
	F func(*TreeBuilderNode) (Node, error)
	// contains filtered or unexported fields
}

TreeBuilder assembles a sequence of chunks into a hashsplit tree.

A hashsplit tree provides another level of space-and-bandwidth savings over and above what Split gives you. Consider, again, the example of adding EXIF tags to a JPEG file. Although most chunks of the hashsplitted file will be the same before and after adding tags, the _list_ needed to reassemble those chunks into the original file will be very different: all the unaffected chunks must shift position to accommodate the new EXIF-containing chunks.

A hashsplit tree organizes that list into a tree instead, whose shape is determined by the content of the chunks, just as the chunk boundaries are. It has the property that only the tree nodes in the vicinity of the change will be affected. Most subtrees will remain the same.

Just as each chunk has a level L determined by the rolling checksum (see NewSplitter), so does each node in the tree have a level, N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N>0 collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.

Example
var r io.Reader // Represents the source of some data.

var tb hashsplit.TreeBuilder
err := hashsplit.Split(r, tb.Add)
if err != nil {
	panic(err)
}
// Get the root of the tree with tb.Root().

Example (FanOut)
var r io.Reader // Represents the source of some data.

var tb hashsplit.TreeBuilder
err := hashsplit.Split(r, func(bytes []byte, level uint) error {
	// Map level to a smaller range for wider fan-out
	// (more children per tree node).
	return tb.Add(bytes, level/4)
})
if err != nil {
	panic(err)
}
// Get the root of the tree with tb.Root().

Example (SaveAside)
var r io.Reader // Represents the source of some data.

// Represents any function that replaces a chunk with a compact representation of that chunk
// (like a hash or a lookup key).
var saveAside func([]byte) ([]byte, error)

tb := hashsplit.TreeBuilder{
	F: func(node *hashsplit.TreeBuilderNode) (hashsplit.Node, error) {
		for i, chunk := range node.Chunks {
			repr, err := saveAside(chunk)
			if err != nil {
				return nil, err
			}
			node.Chunks[i] = repr
		}
		return node, nil
	},
}
err := hashsplit.Split(r, tb.Add)
if err != nil {
	panic(err)
}
// Get the root of the tree with tb.Root().

func (*TreeBuilder) Add

func (tb *TreeBuilder) Add(bytes []byte, level uint) error

Add adds a new chunk to the TreeBuilder. It is typical to call this function in the callback of Split as each chunk is produced.

The level of a chunk is normally the level value passed to the Split callback. It results in the creation of a new node at the given level. However, this produces a tree with an average branching factor of 2. For wider fan-out (more children per node), the caller can reduce the value of level.

func (*TreeBuilder) Root

func (tb *TreeBuilder) Root() (Node, error)

Root produces the root of the tree after all nodes have been added with calls to Add. Root may be called only once. If the tree is empty, Root returns a nil Node. It is an error to call Add after a call to Root.

The return value of Root is the interface type Node. If tb.F is nil, the concrete type will be *TreeBuilderNode.

type TreeBuilderNode

type TreeBuilderNode struct {
	// Nodes is the list of subnodes.
	// This is empty for leaf nodes (level 0) and non-empty for interior nodes (level 1 and higher).
	Nodes []Node

	// Chunks is a list of chunks.
	// This is non-empty for leaf nodes (level 0) and empty for interior nodes (level 1 and higher).
	Chunks [][]byte
	// contains filtered or unexported fields
}

TreeBuilderNode is the concrete type implementing the Node interface that is used internally by TreeBuilder. Callers may transform this into any other node type during tree construction using the TreeBuilder.F callback.

An interior node ("level 1" and higher) contains one or more subnodes as children. A leaf node ("level 0") contains one or more byte slices, which are hashsplit chunks of the input. Exactly one of Nodes and Chunks is non-empty.

func (*TreeBuilderNode) Child

func (n *TreeBuilderNode) Child(i int) (Node, error)

Child implements Node.Child.

func (*TreeBuilderNode) NumChildren

func (n *TreeBuilderNode) NumChildren() int

NumChildren implements Node.NumChildren.

func (*TreeBuilderNode) Offset

func (n *TreeBuilderNode) Offset() uint64

Offset implements Node.Offset.

func (*TreeBuilderNode) Size

func (n *TreeBuilderNode) Size() uint64

Size implements Node.Size.
