Documentation ¶
Overview ¶
Package compactindex implements an immutable hash table index format inspired by djb's constant database (cdb).
Design ¶
Compactindex is used to create secondary indexes over arbitrary flat files. Each index is a single, immutable flat file.
Index files consist of a space-optimized and query-optimized key-value-like table.
Instead of storing actual keys, the format stores FKS dynamic perfect hashes; and instead of storing values, it stores offsets into some file.
As a result, the database effectively supports only two operations, similar to cdb. (Note that the actual Go interface is a bit more flexible.)
func Create(kv map[[]byte]uint64) *Index
func (*Index) Lookup(key []byte) (value uint64, exist bool)
Buckets ¶
The set of items is split into buckets of approximately 10,000 records each. The number of buckets is unlimited.
The key-to-bucket assignment is determined by xxHash3 using uniform discrete hashing over the key space.
The index file header also records the number of buckets and the file offset of each bucket.
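The bucket assignment can be sketched as follows. This is an illustrative stand-in, not the package's code: `hash/fnv` replaces the format's xxHash3 so the example is self-contained, and `bucketOf` is a hypothetical name.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketOf sketches the key-to-bucket assignment: hash the key with a
// uniform hash function, then reduce modulo the bucket count.
// (The real format uses xxHash3; fnv stands in here.)
func bucketOf(key []byte, numBuckets uint32) uint32 {
	h := fnv.New64a()
	h.Write(key)
	return uint32(h.Sum64() % uint64(numBuckets))
}

func main() {
	for _, key := range []string{"alpha", "beta", "gamma"} {
		fmt.Println(key, "->", bucketOf([]byte(key), 4))
	}
}
```

Because the assignment is a pure function of the key, a reader can locate the right bucket without consulting anything but the header's bucket count.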
Tables ¶
Each bucket contains a table of entries, indexed by a collision-free hash function.
The hash function used in the entry table is xxHash. A 32-bit hash domain is prefixed to mine collision-free sets of hashes (FKS scheme). This hash domain is also recorded in the bucket header.
Each bucket entry is a constant-size record consisting of a 3-byte hash and an offset to the value. The size of the offset integer is the minimal byte-aligned integer width that can represent the target file size.
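The offset-width rule above can be sketched as a small helper. `intWidth` is a hypothetical name for illustration, not part of the package:

```go
package main

import "fmt"

// intWidth sketches the rule above: the minimal byte-aligned integer
// width (1..8 bytes) that can represent maxValue, e.g. the target
// file size. An entry then occupies 3 + intWidth(size) bytes.
func intWidth(maxValue uint64) uint8 {
	w := uint8(1)
	for w < 8 && maxValue >= 1<<(8*uint(w)) {
		w++
	}
	return w
}

func main() {
	fmt.Println(intWidth(200))     // 1: fits in one byte
	fmt.Println(intWidth(70_000))  // 3: 70000 >= 1<<16
	fmt.Println(intWidth(1 << 40)) // 6
}
```

Indexing a 70 kB target file would thus use 3-byte offsets, for a constant entry size of 6 bytes (3-byte hash + 3-byte offset).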
Querying ¶
The query interface (DB) is backend-agnostic, supporting any storage medium that provides random reads. To name a few: memory buffers, local files, arbitrary embedded buffers, HTTP range requests, Plan 9, etc.
The DB struct itself performs zero memory allocations and consequently does not cache. It is therefore recommended to provide an io.ReaderAt backed by a cache to improve performance.
Given a key, the query strategy is simple:
- Hash key to bucket using global hash function
- Retrieve bucket offset from bucket header table
- Hash key to entry using per-bucket hash function
- Search for entry in bucket (binary search)
The search strategy for locating entries in buckets can be adjusted to fit the latency/bandwidth profile of the underlying storage medium.
For example, the fastest lookup strategy in memory is a binary search retrieving double cache lines at a time. When doing range requests against high-latency remote storage (e.g. S3 buckets), it is typically faster to retrieve and scan through large parts of a bucket (multiple kilobytes) at once.
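The in-memory binary search from step 4 can be sketched over decoded entries. The `entry` struct and `searchSorted` below are illustrative names, not the package's types:

```go
package main

import (
	"fmt"
	"sort"
)

// entry mirrors a constant-size bucket record: a per-bucket hash
// (3 bytes on disk) plus a value offset.
type entry struct {
	hash  uint64
	value uint64
}

// searchSorted binary-searches a bucket's entries, which are sorted
// by hash, for the entry matching the query hash.
func searchSorted(entries []entry, hash uint64) (uint64, bool) {
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].hash >= hash
	})
	if i < len(entries) && entries[i].hash == hash {
		return entries[i].value, true
	}
	return 0, false
}

func main() {
	bucket := []entry{{hash: 2, value: 100}, {hash: 5, value: 200}, {hash: 9, value: 300}}
	v, ok := searchSorted(bucket, 5)
	fmt.Println(v, ok) // 200 true
}
```

Because the per-bucket hash function is collision-free, a matching hash identifies the key uniquely within the bucket; no key comparison is needed.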
Construction ¶
Constructing a compactindex requires upfront knowledge of the number of items and highest possible target offset (read: target file size).
The process requires scratch space of around 16 bytes per entry. During generation, data is offloaded to disk for memory efficiency.
The process works as follows:
- Determine number of buckets and offset integer width based on known input params (item count and target file size).
- Linear pass over input data, populating temporary files that contain the unsorted entries of each bucket.
- For each bucket, brute force a perfect hash function that defines a bijection between hash values and keys in the bucket.
- For each bucket, sort by hash values.
- Store to index.
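The brute-force mining in step 3 can be sketched as follows. This is an illustrative stand-in: `hash/fnv` replaces the format's prefixed xxHash (cf. EntryHash64), and `mineDomain`/`hashWithDomain` are hypothetical names:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashWithDomain hashes a key under a 32-bit domain prefix and
// truncates to the 3-byte hash stored in each entry.
// (The real format uses xxHash; fnv stands in here.)
func hashWithDomain(domain uint32, key []byte) uint64 {
	h := fnv.New64a()
	h.Write([]byte{byte(domain), byte(domain >> 8), byte(domain >> 16), byte(domain >> 24)})
	h.Write(key)
	return h.Sum64() & 0xFFFFFF
}

// mineDomain brute-forces a hash domain whose truncated hashes are
// collision-free over the bucket's keys (the FKS-style search).
func mineDomain(keys [][]byte) (uint32, bool) {
	for domain := uint32(0); domain < 1<<16; domain++ {
		seen := make(map[uint64]struct{}, len(keys))
		ok := true
		for _, k := range keys {
			h := hashWithDomain(domain, k)
			if _, dup := seen[h]; dup {
				ok = false
				break
			}
			seen[h] = struct{}{}
		}
		if ok {
			return domain, true
		}
	}
	return 0, false
}

func main() {
	keys := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	domain, ok := mineDomain(keys)
	fmt.Println(domain, ok)
}
```

With ~10,000 keys hashed into a 24-bit domain, collisions are likely for any single domain, which is why the search retries with fresh prefixes until a collision-free one is found.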
An alternative construction approach is available when the number of items or target file size is unknown. In this case, a set of keys is first serialized to a flat file.
Index ¶
Constants ¶
const Version = uint8(1)
Variables ¶
var ErrCollision = errors.New("hash collision")
var ErrNotFound = errors.New("not found")
ErrNotFound marks a missing entry.
var Magic = [8]byte{'r', 'd', 'c', 'e', 'c', 'i', 'd', 'x'}
Magic are the first eight bytes of an index.
Functions ¶
func EntryHash64 ¶
EntryHash64 is an xxHash-based hash function using an arbitrary prefix.
Types ¶
type Bucket ¶
type Bucket struct {
	BucketDescriptor
	Entries *io.SectionReader
}
Bucket is a database handle pointing to a subset of the index.
type BucketDescriptor ¶
type BucketDescriptor struct {
	BucketHeader
	Stride      uint8 // size of one entry in bucket
	OffsetWidth uint8 // width of offset field in bucket
}
type BucketHeader ¶
BucketHeader occurs at the beginning of each bucket.
func (*BucketHeader) Hash ¶
func (b *BucketHeader) Hash(key []byte) uint64
Hash returns the per-bucket hash of a key.
func (*BucketHeader) Load ¶
func (b *BucketHeader) Load(buf *[bucketHdrLen]byte)
func (*BucketHeader) Store ¶
func (b *BucketHeader) Store(buf *[bucketHdrLen]byte)
type Builder ¶
type Builder struct {
	Header
	// contains filtered or unexported fields
}
Builder creates new compactindex files.
func NewBuilder ¶
NewBuilder creates a new index builder.
If dir is an empty string, a random temporary directory is used.
numItems refers to the number of items in the index.
targetFileSize is the size of the file that index entries point to. Can be set to zero if unknown, which results in a less efficient (larger) index.
func (*Builder) Insert ¶
Insert writes a key-value mapping to the index.
Index generation will fail if the same key is inserted twice. The writer must not pass a value greater than targetFileSize.
type DB ¶
DB is a compactindex handle.
func Open ¶
Open returns a handle to access a compactindex.
The provided stream must start with the Magic byte sequence. Tip: Use io.NewSectionReader to create aligned substreams when dealing with a file that contains multiple indexes.
func (*DB) Lookup ¶
Lookup queries for a key in the index and returns the value (offset), if any.
Returns ErrNotFound if the key is unknown.
func (*DB) LookupBucket ¶
LookupBucket returns a handle to the bucket that might contain the given key.
type Entry ¶
Entry is a single element in a hash table.
func SearchSortedEntries ¶
SearchSortedEntries performs an in-memory binary search for a given hash.