Documentation ¶
Overview ¶
Package hamt provides a reference implementation of the IPLD HAMT used in the Filecoin blockchain. It includes some optional flexibility such that it may be used for other purposes outside of Filecoin.
HAMT is a "hash array mapped trie" https://en.wikipedia.org/wiki/Hash_array_mapped_trie. This implementation extends the standard form by including buckets for the key/value pairs at storage leaves and CHAMP mutation semantics https://michael.steindorfer.name/publications/oopsla15.pdf. The CHAMP invariant and mutation rules provide us with the ability to maintain canonical forms given any set of keys and their values, regardless of insertion order and intermediate data insertion and deletion. Therefore, for any given set of keys and their values, a HAMT using the same parameters and CHAMP semantics, the root node should always produce the same content identifier (CID).
Algorithm Overview ¶
The HAMT algorithm hashes incoming keys and uses incrementing subsections of that hash digest at each level of its tree structure to determine the placement of either the entry or a link to a child node of the tree. A `bitWidth` determines the number of bits of the hash to use for index calculation at each level of the tree such that the root node takes the first `bitWidth` bits of the hash to calculate an index and as we move lower in the tree, we move along the hash by `depth x bitWidth` bits. In this way, a sufficiently randomizing hash function will generate a hash that provides a new index at each level of the data structure. An index comprising `bitWidth` bits will generate index values of `[ 0, 2^bitWidth )`. So a `bitWidth` of 8 will generate indexes of 0 to 255 inclusive.
Each node in the tree can therefore hold up to `2^bitWidth` elements of data, which we store in an array. In the this HAMT and the IPLD HashMap we store entries in buckets. A `Set(key, value)` mutation where the index generated at the root node for the hash of key denotes an array index that does not yet contain an entry, we create a new bucket and insert the key / value pair entry. In this way, a single node can theoretically hold up to `2^bitWidth x bucketSize` entries, where `bucketSize` is the maximum number of elements a bucket is allowed to contain ("collisions"). In practice, indexes do not distribute with perfect randomness so this maximum is theoretical. Entries stored in the node's buckets are stored in key-sorted order.
Parameters ¶
This HAMT implementation:
• Fixes the `bucketSize` to 3.
• Defaults the `bitWidth` to 8, however within Filecoin it uses 5
• Defaults the hash algorithm to the 64-bit variant of Murmur3-x64
Further Reading ¶
The algorithm used here is identical to that of the IPLD HashMap algorithm specified at https://github.com/ipld/specs/blob/master/data-structures/hashmap.md. The specific parameters used by Filecoin and the DAG-CBOR block layout differ from the specification and are defined at https://github.com/ipld/specs/blob/master/data-structures/hashmap.md#Appendix-Filecoin-hamt-variant.
Index ¶
- Constants
- Variables
- type HashFunc
- type KV
- type Node
- func (n *Node) Copy() *Node
- func (n *Node) Delete(ctx context.Context, k string) (bool, error)
- func (n *Node) Find(ctx context.Context, k string, out cbg.CBORUnmarshaler) (bool, error)
- func (n *Node) FindRaw(ctx context.Context, k string) (bool, []byte, error)
- func (n *Node) Flush(ctx context.Context) error
- func (n *Node) ForEach(ctx context.Context, f func(k string, val *cbg.Deferred) error) error
- func (t *Node) MarshalCBOR(w io.Writer) error
- func (n *Node) Set(ctx context.Context, k string, v cbg.CBORMarshaler) error
- func (n *Node) SetIfAbsent(ctx context.Context, k string, v cbg.CBORMarshaler) (bool, error)
- func (n *Node) SetRaw(ctx context.Context, k string, raw []byte) error
- func (t *Node) UnmarshalCBOR(r io.Reader) error
- type Option
- type Pointer
Constants ¶
const ( // use OVERWRITE for modifyValue operations that overwrite existing values OVERWRITE = overwrite(true) // use NOVERWRITE for modifyValue operations that cannot overwrite existing values NOVERWRITE = overwrite(false) )
const ( // return MODIFIED when a key value mapping is overwritten MODIFIED = modified(true) // return UNMODIFIED when a no key value mappings are overwritten UNMODIFIED = modified(false) )
Variables ¶
var ErrMalformedHamt = fmt.Errorf("HAMT node was malformed")
ErrMalformedHamt is returned whenever a block intended as a HAMT node does not conform to the expected form that a block may take. This can occur during block-load where initial validation takes place or during traversal where certain conditions are expected to be met.
var ErrMaxDepth = fmt.Errorf("attempted to traverse HAMT beyond max-depth")
ErrMaxDepth is returned when the HAMT spans further than the hash function is capable of representing. This can occur when sufficient hash collisions (e.g. from a weak hash function and attacker-provided keys) extend leaf nodes beyond the number of bits that a hash can represent. Or this can occur on extremely large (likely impractical) HAMTs that are unable to be represented with the hash function used. Hash functions with larger byte output increase the maximum theoretical depth of a HAMT.
Functions ¶
This section is empty.
Types ¶
type KV ¶
KV represents leaf storage within a HAMT node. A Pointer may hold up to `bucketSize` KV elements, where each KV contains a key and value pair stored by the user.
Keys are represented as bytes.
The IPLD Schema representation of this data structure is as follows:
type KV struct { key Bytes value Any } representation tuple
type Node ¶
type Node struct { Bitfield *big.Int Pointers []*Pointer // contains filtered or unexported fields }
Node is a single point in the HAMT, encoded as an IPLD tuple in DAG-CBOR of shape:
[bytes, [Pointer...]]
where 'bytes' is the big.Int#Bytes() and the Pointers array is between 1 and `2^bitWidth`.
The Bitfield provides us with a mechanism to store a compacted array of Pointers. Each bit in the Bitfield represents an element in a sparse array where `1` indicates the element is present in the Pointers array and `0` indicates it is omitted. To look-up a specific index in the Pointers array you must first make a count of the number of `1`s (popcount) up to the element you are looking for. e.g. a Bitfield of `10010110000` shows that we have a 4 element Pointers array. Indexes `[1]` and `[2]` are not present, but index `[3]` is at the second position of our Pointers array.
The IPLD Schema representation of this data structure is as follows:
type Node struct { bitfield Bytes pointers [Pointer] } representation tuple
func LoadNode ¶
LoadNode loads a HAMT Node from the IpldStore and configures it according to any specified Option parameters. Where the parameters of this HAMT vary from the defaults (hash function and bitWidth), those variations _must_ be supplied here via Options otherwise the HAMT will not be readable.
Users should consider how their HAMT parameters are stored or specified along with their HAMT where the data is expected to have a long shelf-life as future users will need to know the parameters of a HAMT being loaded in order to decode it. Users should also NOT rely on the default parameters of this library to remain the defaults long-term and have strategies in place to manage variations.
func NewNode ¶
NewNode creates a new IPLD HAMT Node with the given IPLD store and any additional options (bitWidth and hash function).
This function creates a new HAMT that you can use directly and is also used internally to create child nodes.
func (*Node) Copy ¶
Copy a HAMT node and all of its contents. May be useful for mutation operations where the original needs to be preserved in memory.
This operation will also recursively clone any child nodes that are attached as cached nodes.
func (*Node) Delete ¶
Delete removes an entry from the HAMT structure.
Returns true if the key was found and deleted, false if the key was absent.
This operation will result in the modification of _at least_ one IPLD block via the IpldStore. Depending on the contents of the leaf node, this operation may result in a node collapse to shrink the HAMT into its canonical form for the remaining data. For an insufficiently random collection of keys at the relevant leaf nodes such a collapse may cascade to further nodes.
func (*Node) Find ¶
Find navigates through the HAMT structure to where key `k` should exist. If the key is not found, returns false. If the key is found, returns true, and if the `out` parameter has an UnmarshalCBOR(Reader) method, the value is decoded into it. The `out` parameter may be nil to test for existence without decoding.
Depending on the size of the HAMT, this method may load a large number of child nodes via the HAMT's IpldStore.
func (*Node) FindRaw ¶
FindRaw performs the same function as Find, but returns the raw bytes found at the key's location (which may or may not be DAG-CBOR, see also SetRaw).
func (*Node) Flush ¶
Flush has two effectis, it (partially!) persists data and resets dirty flag
Flush operates recursively, telling each "cache" child node to flush; Put'ing that "cache" node to the store; updating this node's Link to the CID resulting from the store Put; clearing the dirty flag of that pointer to flase and then returning. Flush doesn't operate unless there's a "cache" node.
"cache" nodes were previously storing either updated values, or, simply storing previously loaded data; these are disambiguated by the dirty flag which is true when the cache node's data has not been persisted
Notice that Flush _does not_ Put _this node_. To fully persist changes, the caller still needs to Put this node to the store themselves, and store the new resulting Link wherever they expect the updated HAMT to be seen.
func (*Node) ForEach ¶
ForEach recursively calls function f on each k / val pair found in the HAMT. This performs a full traversal of the graph and for large HAMTs can cause a large number of loads from the underlying store. The values are returned as raw bytes, not decoded.
func (*Node) Set ¶
Set key k to value v, where v is has a MarshalCBOR(bytes.Buffer) method to encode it.
To fully commit the change, it is necessary to Flush the root Node, and then additionally Put the root node to the store itself, and save the resulting CID wherever you expect the HAMT root to persist.
func (*Node) SetIfAbsent ¶
SetIfAbsent sets key k to value v only if k is not already set to some value. Returns true if the value mapped to k is changed by this operation false otherwise.
type Option ¶
type Option func(*config) error
Option is a function that configures a HAMT.
func UseHashFunction ¶
UseHashFunction allows you to set the hash function used for internal indexing by the HAMT.
Passing in the returned Option to NewNode will generate a new HAMT that uses the specified hash function.
The default hash function is murmur3-x64 but you should use a cryptographically secure function such as SHA2-256 if an attacker may be able to pick the keys in order to avoid potential hash collision (tree explosion) attacks.
func UseTreeBitWidth ¶
UseTreeBitWidth allows you to set a custom bitWidth of the HAMT in bits (from 1-8).
Passing in the returned Option to NewNode will generate a new HAMT that uses the specified bitWidth.
The default bitWidth is 8.
type Pointer ¶
Pointer is an element in a HAMT node's Pointers array, encoded as an IPLD tuple in DAG-CBOR of shape:
CID or [KV...]
i.e. it is represented as a "kinded union" where a Link is a pointer to a child node, while an array is a bucket of elements local to this node. A Pointer must represent exactly one of of these two states and cannot be both (or neither).
There are between 1 and 2^bitWidth of these Pointers in any HAMT node.
A Pointer contains either a KV bucket of up to `bucketSize` (3) values or a link (CID) to a child node. When a KV bucket overflows beyond `bucketSize`, the bucket is replaced with a link to a newly created HAMT node which will contain the `bucketSize+1` elements in its own Pointers array.
The IPLD Schema representation of this data structure is as follows:
type Pointer union { &Node link Bucket list } representation kinded type Bucket [KV]