hamt

package module
v2.0.0+incompatible Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 19, 2020 License: Apache-2.0 Imports: 2 Imported by: 0

README

V2 API

This package was transient only, now it has a functional mode as well. Transient is the classical style of modifying data structures in place. Functional is defined below as immutable & persistent.

This is a merger with github.com/lleo/go-hamt-functional, which it obsoletes. The motivation for this was because we were seeing slower performance for go-hamt-functional Get() operations even though we were certain that the old go-hamt and go-hamt-functional were using the same algorithm. This merger guarantees that the transient and functional Hamt implementations are using the exact same internal data structures. This is true even to the degree that we can recast a HamtTransient data structure to HamtFunctional and the code will switch from transient (modify in place) to functional (copy on write) behavior. Of course, this works the other way around as well (that is, we can cast a HamtFunctional to HamtTransient).

This package also obsoletes github.com/lleo/go-hamt-key because we pass a []byte slice to Get/Put/Del operations instead of a Key data structure. What happens is we use the []byte slice to build a Key data structure to be used internally. This results in a simpler API and no external dependency.

What is a HAMT?

HAMT stands for Hash Array Mapped Trie. That spells it out clearly right? Ok, not so much. A HAMT is an in-memory Key/Value data structure with some really cool features. The first feature is that we implement it as a tree with a reasonably large and fixed branching factor; thereby making it a pretty flat tree. The killer feature is that the tree has a maximum depth, resulting in a O(1) Search and Modify speeds. That is O(1) without a giant constant factor.

HAMT makes this happen by first hashing the []byte slice into either a 32bit or 64bit hash value. We then split this hash value up into a fixed number of parts. Each part now becomes the index into the Array we use for interior nodes of the tree. Lastly, we only use enough parts of the hash value to find a free node to store the value into; this is why it is called a Trie. So now we have a wide tree with a maximum depth where we only use enough of the parts of hash value to find a free spot to store the leaf; That is what makes a HAMT O(1) and fast.

In our implementation, we use the FNV hashing algorithm. It is fast and provides good randomness and that is all we need from it.

Also we choose to split the 32bit or 64bit hash value into 5bit values. 5bit values mean the tree will have a branching factor of 32 and a maximum depth of 6 for a 32bit hash value (hamt32) and 12 for a 64bit hash value (hamt64). You may be noticing that 5 does not go into 32 nor 64 cleanly. That is not a problem because we fold the extra 2 or 4 bits into the main hash value. Don't worry this is a legitimate thing to do. In the (very) rare case of a hash collision we use a special leaf value for both colliding key/value pairs.

go-hamt

We implement HAMT data structure based on either a 32 bit or 64 bit hash value, hamt32 and hamt64 respectively.

Further we can have the HAMT data structure behave in one of two modes, transient or functional. Transient means we modify the data structures in-place. Functional means persistent, which requires we use a copy-on-write strategy. In other words we copy each data structure and modify the copy, then given the parent now needs to be modified we follow this copy-on-write strategy up to a brand new HamtFunctional data structure. As you can imagine Transient is faster than Functional. Suprisingly, the transient behavior is not that much faster than the persistent behavior, because the HAMT data structure is flat and wide.

However, you cannot easily share transient datastructures between threads safely; you would need to implement a locking strategy. Where with functional data structures you are guaranteed safety across threads.

Functional (aka Immutable & Persistent)

The term Functional really stands for two properties immutable and persistent. Immutable means the datastructure is never modified after construction. Persistent means that when you modify the original immutable datastructure you do so by creating a new copy of the base datastructure which shares all its unmodified parts with the original datastructure.

Imagine a hypothetical balanced binary tree datastructure with four leaves, two interior nodes, and a root node. If you change the forth leaf node, than a new fourth tree node is created, as well as its parent interior node and a new root node. The new tree starting at the new root node is a persistent modification of the original tree as it shares the first interior node and its two leaves.

          (root tree node)     (root tree node')
            /          \         /          \
           /  +---------\-------+            \
          /  /           \                    \
    (tree node 1)    (tree node 2)        (tree node 2')
        /  \            /  \                /   \
       /    \          / +--\--------------+     \
      /      \        / /    \                    \
 (Leaf 1) (Leaf 2) (Leaf 3) (Leaf 4)             (Leaf 4')

Given this approach to changing a tree, a tree with a wide branching factor would be relatively shallow. So the path from root to any given leaf would be short and the amount of shared content would be substantial.

Documentation

Overview

Package hamt is just a trivial front door to the hamt32 and hamt64 packages which really contain the HAMT implementations. Those HAMT implementations are identical in every way but the size of the computed hash, called Hashval. Those are either uint32 or uint64 values for hamt32 and hamt64 respectively.

This package merely implements New(), New32() and New64() functions and the table option constants FixedTables, SparseTables, HybridTables, and the map TableOptionName (eg. hamt.TableOptionName[hamt.FixedTables] == "FixedTables").

Choices

There are several choices to make: Hashval hamt32 versus hamt64, FixedTables versus SparseTables versus HybridTables, and Functional versus Transient. Then there is a hidden choice; you can change the source code constant, NumIndexBits, to a value other than the current setting of 5.

The New() function makes all the recommended choices for you. That is it uses the 64 bit hashVal (aka hamt64), functional behavior, and hybrid tables.

Hashval hamt64 versus hamt32

Just use hamt64. I implemnted both before I really understood HAMT. I was conflating 32 bit hash values with 32 wide branching factor (that was just a feature of the other implmentations I was looking at).

While 32bit FNV hash values are still pretty random I have seen plenty of collisions in my tests.

I have never seen 64bit FNV hash values collide and in the current state of computing having 64bit CPUs as the norm. I recommend using hamt64. If you are on 32bit CPUs then maybe you could choose hamt32.

FixedTables versus SparseTables versus HybridTables

Tables is the name I use to for the interior (aka non-leaf) nodes of the hamt data structure. Those tables being the indexed arrays from the Hash-indexed Array Mapped Tries (HAMT) name.

This is the classic speed versus memory choice with a twist. The facts to consider are: The tree is indexed by essentially random values (the parts of the hash value of the key), so the tree is going to be "balanced" to a statistical likelihood. The inner branching nodes will be very densely populated, and the outer branching nodes will be very sparsely populated.

FixedTables are fastest to access and modify, because it is a simple matter of getting and setting preallocated fixed sized arrays. However, they will be wasting most of their allocated space most of the time.

SparseTables are slowest because we always have to calculate bit counts of bit arrays for get or set type operations. Further, inserting or removing values is a matter of manipulating slice values. On the other hand, given the way slice memory allocation works, they will usually waste less than half their allocated memory.

According to tests, HybridTables setting behaves precisely the way we want it to behave. For a test set of data with 3,149,824 KeyVal pairs, he distribution of tables comes in two groups: tables with 25-32 entries and tables with 1-11 entries. There are no tables outside those two groupings. The 25-32 entry tables are all fixed tables and the 1-11 entry tables are all sparse tables. Of the sparse tables %40.1 have 1 or 2 entries, %85.4 have 4 or less and %99.7 have 8 or less entries. Given sparse tables start at capacity of 2 and capacity grows by doubling, the sparse tables are efficiently packed. The conclusion from this test data is that HybridTables setting is an excellent fit for memory efficiency.

Benchmarks show that fixed table hamts are slower than hybrid table hamts for Put operations and comparable for Get & Del operations. I conclude that is due to the massive over use of memory in fixed tables. This conclusion is partly due to the fact that the Put operation differntial between fixed and hybrid table hamts is twice as large for functional versus transient hamt behavior.

Clearly HybridTables table option for my HAMT data structure is the best choice.

Transient versus Functional

The bottom line is that writing to transient behavior in a multiple threads almost guarantees problems unless you implement a locking solution (and that can be hard to do in a performant manner).

On the other hand, given that HamtFunctional behavior returns a new HamtFunctional data structure upon any modification, HamtFunctional data structures are inherently thread safe.

On your third hand, the copy-on-write strategy of HamtFunctional is inherently slower than modify-in-place strategy of HamtTransient. How much slower? For large hamt data structures (~3 million key/value pairs) the transient Put operation takes ~1000ns, where the functional Put op takes ~3200ns. Which really isn't that bad because they are within the same order of magnitude and it is already fast.

Using the transient behavior begs the question: why not use the Go builtin map? Of course, the reason is obvious if your key must be a slice; Go map keys can not be slices. The ability to implement a reasonably efficient functional behavior for HAMTs is the point of this library. The hamt transient speed is definitely is slower than Go's builtin map: 435 vs 130 ns/Get; 950 vs 235 ns/Put; 900 vs 175 ns/Del; 235 vs 23 ns/KeyVal iterate.

The point of this library is to implement the functional map-like behavior, so that is what I assume you will use it for. The transient behavior is useful for faster single threaded bulk operations then transform it back to the functional behavior.

NumIndexBits

Both hamt32 and hamt64 have a constant NumIndexBits which determines all the other constants defining the HAMT structures. For both hamt32 and hamt64, the NumIndexBits constant is set to 5. You can manually change the source code to set NumIndexBits to some uint other than 5. IndexBits is set to 5 because that is how other people do it.

NumIndexBits determines the branching factor (IndexLimit) and the depth (DepthLimit) of the HAMT data structure. Given IndexBits=5 IndexLimit=32, and DepthLimit=6 for hamt32 and DepthLimit=12 for hamt64.

Index

Constants

View Source
const (
	// HybridTables indicates the structure should use sparseTable
	// initially, then upgrad to fixedTable when appropriate.
	HybridTables = iota
	// FixedTables indicates the structure should use fixedTables ONLY.
	// This was intended to be for speed, as sparse tables use a software
	// bitCount function to access individual cells.
	FixedTables
	// SparseTables indicates the structure should use sparseTable's ONLY.
	// This was intended just save space, but also seems to be faster; CPU cache
	// locality maybe?
	SparseTables
)

Variables

View Source
var TableOptionName [3]string

TableOptionName is a lookup table to map the integer value of FixedTables, SparseTables, and HybridTables to a string representing that option.

var option = hamt32.FixedTables
hamt32.TableOptionName[option] == "FixedTables"

Functions

func New

func New() hamt64.Hamt

New() makes all the configuration choices for you. Specifically, it chooses functional behavior, 64bit hashes, and Hybrid tables. These are the recommended settings. See hamt64.Hamt for the API.

func New32

func New32(functional bool, opt int) hamt32.Hamt

New32 constructs a datastucture that implements the hamt32.Hamt interface.

When the functional argument is true it implements a hamt32.HamtFunctional data structure. When the functional argument is false it implements a HamtTransient data structure.

The tblOpt argument is the table option defined by the constants HybridTables, SparseTables, xor FixedTables.

func New64

func New64(functional bool, opt int) hamt64.Hamt

New64 constructs a datastucture that implements the hamt64.Hamt interface.

When the functional argument is true it implements a hamt64.HamtFunctional data structure. When the functional argument is false it implements a HamtTransient data structure.

The tblOpt argument is the table option defined by the constants HybridTables, SparseTables, xor FixedTables.

Types

This section is empty.

Directories

Path Synopsis
Package hamt64 defines interface to access a Hamt data structure based on 64bit hash values.
Package hamt64 defines interface to access a Hamt data structure based on 64bit hash values.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL