Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var ErrNotFound = errors.New("not found")
ErrNotFound is the error returned by Seek when the seek position lies outside the given node's range.
Functions ¶
Types ¶
type Node ¶
type Node interface { // Offset gives the position in the original byte stream that is the first byte represented by this node. Offset() uint64 // Size gives the number of bytes in the original byte stream that this node represents. Size() uint64 // NumChildren gives the number of subnodes of this node. // This is only for interior nodes of the tree (level 1 and higher). // For leaf nodes (level 0) this must return zero. NumChildren() int // Child returns the subnode with the given index from 0 through NumChildren()-1. Child(int) (Node, error) }
Node is the abstract type of a node in a hashsplit tree. See TreeBuilder for details.
type Splitter ¶
type Splitter struct { // MinSize is the minimum chunk size. // Only the final chunk may be smaller than this. // This should always be >= 64, // which is the rolling checksum "window size." // If it's less than the size of the checksum window, // then the same window can span multiple chunks, // meaning a chunk boundary is not independent of the preceding chunk. // If you leave this set to zero, // 64 is what you'll get. // If you really mean "I want no minimum," // set this to 1. MinSize int // SplitBits is the number of trailing zero bits in the rolling checksum required to produce a chunk. // The default (what you get if you leave it set to zero) is 13, // which means a chunk boundary occurs on average once every 8,192 bytes. // // (But thanks to math, that doesn't mean that 8,192 is the median chunk size. // The median chunk size is actually the logarithm, base (2^SplitBits-1)/(2^SplitBits), of 0.5. // That makes the median chunk size 5,678 when SplitBits==13.) SplitBits uint // MinBytesOfPart is the minimum file part size. MinBytesOfPart int // contains filtered or unexported fields }
Splitter hashsplits a byte sequence into chunks. It implements the io.WriteCloser interface. Create a new Splitter with NewSplitter.
Hashsplitting is a way of dividing a byte stream into pieces based on the stream's content rather than on any predetermined chunk size. As the Splitter reads the stream it maintains a rolling checksum of the last several bytes. A chunk boundary occurs when the rolling checksum has enough trailing bits set to zero (where "enough" is a configurable setting that determines the average chunk size).
Hashsplitting has benefits when it comes to representing multiple, slightly different versions of the same data. Consider, for example, the problem of adding EXIF tags to a JPEG image file. The tags appear near the beginning of the file, and the bulk of the image data follows. If the file were divided into chunks at (say) 8-kilobyte boundaries, then adding EXIF data near the beginning would alter every following chunk, except in the lucky case where the size of the added data is an exact multiple of 8kb. With hashsplitting, only the chunks in the vicinity of the change are affected.
Hashsplitting is used to dramatically reduce storage and bandwidth requirements in projects like rsync, bup, and perkeep.
func NewSplitter ¶
NewSplitter produces a new Splitter with the given callback. The Splitter is an io.WriteCloser. As bytes are written to it, it finds chunk boundaries and calls the callback.
The callback receives the bytes of the chunk, and the chunk's "level," which is the number of extra trailing zeroes in the rolling checksum (in excess of Splitter.SplitBits).
Do not forget to call Close on the Splitter to flush any remaining chunk from its internal buffer.
func (*Splitter) Close ¶
Close implements io.Closer. It is necessary to call Close to flush any buffered chunk remaining. Calling Close may result in a call to the callback in s. It is an error to call Write after a call to Close. Close is idempotent: it can safely be called multiple times without error (and without producing the final chunk multiple times).
type TreeBuilder ¶
type TreeBuilder struct { // F is an optional function for transforming the TreeBuilder's node representation // (*TreeBuilderNode) // into any other type implementing the Node interface. // This is called on each node as it is completed and added to its parent as a new child. // // Callers may wish to perform this transformation when it is not necessary or desirable to keep the full input in memory // (i.e., the chunks in the leaf nodes), // such as when the input may be very large. // // F is guaranteed to be called exactly once on each node. // // If F is nil, // all nodes in the tree remain *TreeBuilderNode objects. // // If this callback return an error, // the enclosing function - // Add or Root - // returns early with that error. // In that case the TreeBuilder is left in an inconsistent state // and no further calls to Add or Root are possible. F func(*TreeBuilderNode) (Node, error) // contains filtered or unexported fields }
TreeBuilder assembles a sequence of chunks into a hashsplit tree.
A hashsplit tree provides another level of space-and-bandwidth savings over and above what Split gives you. Consider, again, the example of adding EXIF tags to a JPEG file. Although most chunks of the hashsplitted file will be the same before and after adding tags, the _list_ needed to reassemble those chunks into the original file will be very different: all the unaffected chunks must shift position to accommodate the new EXIF-containing chunks.
A hashsplit tree organizes that list into a tree instead, whose shape is determined by the content of the chunks, just as the chunk boundaries are. It has the property that only the tree nodes in the vicinity of the change will be affected. Most subtrees will remain the same.
Just as each chunk has a level L determined by the rolling checksum (see NewSplitter), so does each node in the tree have a level, N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N>0 collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.
func (*TreeBuilder) Add ¶
func (tb *TreeBuilder) Add(bytes []byte, level uint) error
Add adds a new chunk to the TreeBuilder. It is typical to call this function in the callback of Split as each chunk is produced.
The level of a chunk is normally the level value passed to the Split callback. It results in the creation of a new node at the given level. However, this produces a tree with an average branching factor of 2. For wider fan-out (more children per node), the caller can reduce the value of level.
func (*TreeBuilder) Root ¶
func (tb *TreeBuilder) Root() (Node, error)
Root produces the root of the tree after all nodes have been added with calls to Add. Root may only be called one time. If the tree is empty, Root returns a nil Node. It is an error to call Add after a call to Root.
The return value of Root is the interface type Node. If tb.F is nil, the concrete type will be *TreeBuilderNode.
type TreeBuilderNode ¶
type TreeBuilderNode struct { // Nodes is the list of subnodes. // This is empty for leaf nodes (level 0) and non-empty for interior nodes (level 1 and higher). Nodes []Node // Chunks is a list of chunks. // This is non-empty for leaf nodes (level 0) and empty for interior nodes (level 1 and higher). Chunks [][]byte // contains filtered or unexported fields }
TreeBuilderNode is the concrete type implementing the Node interface that is used internally by TreeBuilder. Callers may transform this into any other node type during tree construction using the TreeBuilder.F callback.
A interior node ("level 1" and higher) contains one or more subnodes as children. A leaf node ("level 0") contains one or more byte slices, which are hashsplit chunks of the input. Exactly one of Nodes and Chunks is non-empty.
func (*TreeBuilderNode) Child ¶
func (n *TreeBuilderNode) Child(i int) (Node, error)
Child implements Node.Child.
func (*TreeBuilderNode) NumChildren ¶
func (n *TreeBuilderNode) NumChildren() int
NumChildren implements Node.NumChildren, the number of child nodes.
func (*TreeBuilderNode) Offset ¶
func (n *TreeBuilderNode) Offset() uint64
Offset implements Node.Offset, the position of the first byte of the underlying input represented by this node.
func (*TreeBuilderNode) Size ¶
func (n *TreeBuilderNode) Size() uint64
Size implements Node.Size, the number of bytes of the underlying input represented by this node.