Documentation
¶
Overview ¶
Implementation of the Merkle Search Tree (MST) data structure for atproto.
## Terminology
node: any node in the tree. nodes can contain multiple entries. they should never be entirely "empty", unless the entire tree is a single empty node, but they might only contain "child" pointers
entry: nodes contain multiple entries. these can include both key/CID pairs, or pointers to child nodes. entries are always lexically sorted, with "child" entries pointing to nodes containing (recursively) keys in the appropriate lexical range. there should never be multiple "child" entries adjacent in a single node (they should be merged instead)
tree: an overall tree of nodes
## Tricky Bits
When inserting:
- the inserted key might be on a "higher" layer than the current top of the tree, in which case new parent tree nodes need to be created - "parent" or "child" insertions might be multiple layers away from the starting node, with intermediate nodes created - inserting a "value" entry in a node might require "splitting" a child node, if the key on the current layer would have fallen within the lexical range of the child
When removing:
- deleting an entry from a node might result in a "merge" of two child nodes which are no longer "split" - removing a "value" entry from the top of the tree might make it a simple pointer down to a child. in this case the top of the tree should be "trimmed" (this might involve multiple layers of trimming)
When inverting operations:
- need additional "proof" blocks to invert deletions. basically need the proof blocks for any keys (at any layer) directly adjacent to the deleted block - if an entry is removed from the top of a partial tree and results in "trimming", and the child node is not available, the overall tree root CID might still be known
## Hacking
Be careful with go slices. Need to avoid creating multiple references (slices) of the same underlying array, which can lead to "mutation at a distance" in ways that are hard to debug.
Index ¶
- Constants
- Variables
- func CountPrefixLen(a, b []byte) int
- func DebugPrintTree(n *Node, depth int)
- func HeightForKey(key []byte) (height int)
- func IsValidKey(key []byte) bool
- type EntryData
- type Node
- type NodeData
- type NodeEntry
- type Tree
- func (t *Tree) Copy() Tree
- func (t *Tree) Get(key []byte) (*cid.Cid, error)
- func (t *Tree) Insert(key []byte, val cid.Cid) (*cid.Cid, error)
- func (t *Tree) IsEmpty() bool
- func (t *Tree) IsPartial() bool
- func (t *Tree) Remove(key []byte) (*cid.Cid, error)
- func (t *Tree) RootCID() (*cid.Cid, error)
- func (t *Tree) Verify() error
- func (t *Tree) WriteDiffBlocks(ctx context.Context, bs blockstore.Blockstore) (*cid.Cid, error)
- func (t *Tree) WriteToMap(m map[string]cid.Cid) error
Constants ¶
const (
// Maximum length, in bytes, of a key in the tree. Note that the atproto specifications imply a repo path maximum length, but don't say anything directly about MST key, other than they can not be empty (zero-length).
MAX_KEY_BYTES = 1024
)
Variables ¶
var ErrInvalidKey = errors.New("bytestring not a valid MST key")
var ErrInvalidTree = errors.New("invalid MST structure")
var ErrPartialTree = errors.New("MST is not complete")
Functions ¶
func CountPrefixLen ¶
Computes the common prefix length between two bytestrings.
Used when compacting node entry lists for encoding.
func DebugPrintTree ¶
This function is not very well implemented or correct. Should probably switch to Devin's `goat repo mst` code.
func HeightForKey ¶
Computes the MST "height" for a key (bytestring). Layers are counted from the "bottom" of the tree, starting with zero.
For atproto repository v3, uses SHA-256 as the hashing function and counts two bits at a time, for an MST "fanout" value of 16.
func IsValidKey ¶
Types ¶
type EntryData ¶
type EntryData struct { PrefixLen int64 `cborgen:"p"` // count of characters shared with previous path/key in tree KeySuffix []byte `cborgen:"k"` // remaining part of path/key (appended to "previous key") Value cid.Cid `cborgen:"v"` // CID pointer at this path/key Right *cid.Cid `cborgen:"t"` // [nullable] pointer to lower-level subtree to the "right" of this path/key entry }
CBOR serialization struct for a single entry within a `NodeData` entry list.
type Node ¶
type Node struct { // array of key/value pairs and pointers to child nodes. entry arrays must always be in correct/valid order at any point in time: sorted by 'key', and at most one 'pointer' entry between 'value' entries. Entries []NodeEntry // "height" or "layer" of MST tree this node is at (with zero at the "bottom" and root/top of tree the "highest") Height int // if true, the cached CID of this node is out of date Dirty bool // optionally, the last computed CID of this Node (when expressed as NodeData) CID *cid.Cid // if true, this is an empty/incomplete node which just represents the CID of the tree. only used as part of MST inversion Stub bool }
Represents a node in a Merkle Search Tree (MST). If this is the "root" or "top" of the tree, it effectively is the tree itself.
Trees may be "partial" if they contain references to child nodes by CID, but not pointers to `Node` representations.
type NodeData ¶
type NodeData struct { Left *cid.Cid `cborgen:"l"` // [nullable] pointer to lower-level subtree to the "left" of this path/key Entries []EntryData `cborgen:"e"` // ordered list of entries at this node }
CBOR serialization struct for a MST tree node. MST tree node as gets serialized to CBOR. Note that the CBOR fields are all single-character.
func NodeDataFromCBOR ¶
Parses CBOR bytes in to `NodeData` struct
func (*NodeData) Bytes ¶
Encodes a single `NodeData` struct as CBOR bytes. Does not recursively encode or update children.
type NodeEntry ¶
type NodeEntry struct { Key []byte Value *cid.Cid ChildCID *cid.Cid Child *Node // tracks whether anything about this entry has changed since `Node` CID was computed Dirty bool }
Represents an entry in an MST `Node`, which could either be a direct path/value entry, or a pointer do a child tree node. Note that these are *not* one-to-one with `EntryData`.
Either the Key and Value fields should be non-zero; or the Child and/or ChildCID field should be non-zero. If ChildCID is present, but Child is not, then this is part of a "partial" tree.
type Tree ¶
type Tree struct {
Root *Node
}
High-level API for an MST, as a decoded in-memory data structure.
This might be an entire tree (all child nodes in-memory), or might be a partial tree with some nodes as CID links. Operations on the tree do not persist to any backing storage automatically.
Errors when operating on the tree may leave the tree in a partially modified or invalid/corrupt state.
func LoadTreeFromMap ¶
Creates a new Tree by loading key/value pairs from a map.
func LoadTreeFromStore ¶
func LoadTreeFromStore(ctx context.Context, bs blockstore.Blockstore, root cid.Cid) (*Tree, error)
func NewEmptyTree ¶
func NewEmptyTree() Tree
func (*Tree) Get ¶
Reads the value (CID) corresponding to the key.
If key is not in the tree, returns nil, not an error.
key: key or path being inserted. must not be empty/nil
func (*Tree) Insert ¶
Adds a key/value to the tree, and returns any previously existing value (CID).
Caller can inspect the previous value to determine if the behavior was a "creation" (key didn't exist), an "update" (key existed with a different value), or no-op (key existed with current value).
key: key or path being inserted. must not be empty/nil val: CID value being inserted
func (*Tree) IsPartial ¶
Returns false if all nodes in the tree are available in-memory in decoded format; otherwise returns true. Does not consider record data, only MST nodes.
func (*Tree) Remove ¶
Removes key/value from the sub-tree provided. Return the previous CID value, if any. If key was not found, returns nil (which is not an error).
key: key or path being inserted. must not be empty/nil
func (*Tree) RootCID ¶
Returns the overall root-node CID for the MST.
If possible, lazily returned a known value. If necessary, recursively encodes tree nodes to compute CIDs.
NOTE: will mark the tree "clean" (clear any dirty flags).
func (*Tree) WriteDiffBlocks ¶
func (t *Tree) WriteDiffBlocks(ctx context.Context, bs blockstore.Blockstore) (*cid.Cid, error)
Walks the tree, encodes any "dirty" nodes as CBOR data, and writes that data as blocks to the provided blockstore. Returns root CID.
func (*Tree) WriteToMap ¶
Recursively walks the tree and writes key/value pairs to map `m`
The map (`m`) is mutated in place (by reference); the map must be initialized before calling.