Documentation ¶
Overview ¶
Package bbhash implements BBHash - a new algorithm for creating fast, minimal perfect hash functions, as described in https://arxiv.org/abs/1702.03154. This implementation builds the perfect hash table concurrently when the number of keys is larger than 'MinParallelKeys'. Additionally, BBHash instances can be marshaled to and unmarshaled from byte buffers. This package also implements a read-only constant database built on top of BBHash. The DB supports constant-time lookups of arbitrary keys from disk.
Index ¶
- Constants
- Variables
- type BBHash
- type DBReader
- type DBWriter
- func (w *DBWriter) Abort()
- func (w *DBWriter) AddCSVFile(fn string, comma, comment rune, kwfield, valfield int) (uint64, error)
- func (w *DBWriter) AddCSVStream(fd io.Reader, comma, comment rune, kwfield, valfield int) (uint64, error)
- func (w *DBWriter) AddKeyVals(keys [][]byte, vals [][]byte) (uint64, error)
- func (w *DBWriter) AddTextFile(fn string, delim string) (uint64, error)
- func (w *DBWriter) AddTextStream(fd io.Reader, delim string) (uint64, error)
- func (w *DBWriter) Freeze(g float64) error
- func (w *DBWriter) TotalKeys() int
Constants ¶
const Gamma float64 = 2.0
Gamma is an expansion factor for each of the bitvectors we build. Empirically, 2.0 is found to be a good balance between speed and space usage. See the paper for more details.
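The space cost Gamma implies can be sketched as follows, assuming (per the construction in the paper) that a level allocates roughly gamma bits per key it handles; the helper name is illustrative, not part of the package API:

```go
package main

import "fmt"

// bitVectorWords returns the number of uint64 words needed for a bitvector
// of roughly gamma*nkeys bits, rounded up to a whole word. This is a sketch
// of the per-level space implied by the Gamma expansion factor.
func bitVectorWords(gamma float64, nkeys uint64) uint64 {
	bits := uint64(gamma * float64(nkeys))
	return (bits + 63) / 64 // round up to 64-bit words
}

func main() {
	// With Gamma = 2.0, one million keys need about two million bits at level 0:
	fmt.Println(bitVectorWords(2.0, 1_000_000)) // 31250 words (~244 KiB)
}
```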
const MaxLevel uint = 200
MaxLevel is the maximum number of attempts (levels) at building a perfect hash function. Per the paper, each successive level exponentially reduces the probability of collision.
const MinParallelKeys int = 20000
MinParallelKeys is the minimum number of keys before the concurrent construction algorithm is used.
Variables ¶
var ErrFrozen = errors.New("DB already frozen")
ErrFrozen is returned when attempting to add new records to an already frozen DB. It is also returned when trying to freeze a DB that is already frozen.
var ErrMPHFail = errors.New("failed to build MPH; gamma possibly small")
ErrMPHFail is returned when the gamma value provided to Freeze() is too small to build a minimal perfect hash table.
var ErrNoKey = errors.New("No such key")
ErrNoKey is returned when a key cannot be found in the DB
Functions ¶
This section is empty.
Types ¶
type BBHash ¶
type BBHash struct {
	Bits []*bitVector
	// contains filtered or unexported fields
}
BBHash represents a computed minimal perfect hash for a given set of keys.
func New ¶
New creates a new minimal hash function to represent the keys in 'keys'. This constructor selects a faster concurrent algorithm when the number of keys is greater than 'MinParallelKeys'. Once construction is complete, callers can use Find() to obtain the unique mapping for each key in 'keys'.
func NewConcurrent ¶
NewConcurrent creates a new minimal hash function to represent the keys in 'keys'. This gives callers explicit control over when to use a concurrent algorithm vs. serial.
func NewSerial ¶
NewSerial creates a new minimal hash function to represent the keys in 'keys'. This constructor explicitly uses a single-threaded (non-concurrent) construction.
func UnmarshalBBHash ¶
UnmarshalBBHash reads a previously marshalled binary stream from 'r' and recreates the in-memory instance of BBHash.
func (*BBHash) Find ¶
Find returns a unique integer representing the minimal hash for key 'k'. The return value is meaningful ONLY for keys in the original key set (provided at the time of construction of the minimal hash). If the key is in the original key set, the return value lies in the range [1, number of keys].
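The Find contract can be illustrated with a precomputed map as a stand-in for the real minimal perfect hash (the real mapping is, of course, built by New; the indices landing in [1, len(keys)] is an assumption based on the construction):

```go
package main

import "fmt"

// toyFind mimics the Find contract: every key in the original set maps to a
// distinct index in [1, len(keys)], while a lookup of an unknown key returns
// a meaningless value (the zero value, in this stand-in).
func toyFind(table map[string]uint64, k string) uint64 {
	return table[k] // a missing key yields 0
}

func main() {
	keys := []string{"alpha", "beta", "gamma"}
	table := make(map[string]uint64, len(keys))
	for i, k := range keys {
		table[k] = uint64(i + 1) // distinct indices 1..len(keys)
	}
	for _, k := range keys {
		fmt.Println(k, toyFind(table, k))
	}
}
```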
func (*BBHash) MarshalBinary ¶
MarshalBinary encodes the hash into a binary form suitable for durable storage. A subsequent call to UnmarshalBinary() will reconstruct the BBHash instance.
func (*BBHash) MarshalBinarySize ¶
MarshalBinarySize returns the size of the marshaled bbhash (in bytes).
type DBReader ¶
type DBReader struct {
// contains filtered or unexported fields
}
DBReader represents the query interface for a previously constructed constant database (built using NewDBWriter()). The only meaningful operation on such a database is Lookup().
func NewDBReader ¶
NewDBReader reads a previously constructed database from file 'fn' and prepares it for querying. Records are opportunistically cached after being read from disk; up to 'cache' records are retained in memory (default 128).
func (*DBReader) Find ¶
Find looks up 'key' in the table and returns the corresponding value. It returns an error if the key is not found, if disk I/O fails, or if the record checksum verification fails.
type DBWriter ¶
type DBWriter struct {
// contains filtered or unexported fields
}
DBWriter represents an abstraction to construct a read-only constant database. This database uses BBHash as the underlying mechanism for constant-time lookups of keys; keys and values are represented as arbitrary byte sequences ([]byte). The DB metadata is protected by a strong checksum (SHA512-256), and each key/value record is protected by a distinct siphash-2-4 checksum. Records can be added to the DB via plain delimited text files or CSV files. Once all keys and values have been added, the DB is written to disk via the Freeze() function.
The DB has the following general structure:
    - 64-byte file header:
        magic   [4]byte  "BBHH"
        flags   uint32   for now, all zeros
        salt    uint64   random salt for hash functions
        nkeys   uint64   number of keys in the DB
        offtbl  uint64   file offset where the 'key/val' offsets start
    - Contiguous series of records; each record is a key/value pair:
        keylen  uint16   length of the key
        vallen  uint32   length of the value
        cksum   uint64   siphash checksum of key, value, offset
        key     []byte   keylen bytes of key
        val     []byte   vallen bytes of value
    - Possibly a gap until the next page-size (4096 byte) boundary
    - Offset table: nkeys file offsets. Entry 'i' is the perfect-hash index for some key 'k', and offset[i] is the offset in the DB where that key and its value can be found.
    - Marshaled BBHash bytes (BBHash.MarshalBinary())
    - 32 bytes of strong checksum (SHA512-256), computed over the file header, the offset table, and the marshaled bbhash.
func NewDBWriter ¶
NewDBWriter prepares file 'fn' to hold a constant DB built using BBHash minimal perfect hash function. Once written, the DB is "frozen" and readers will open it using NewDBReader() to do constant time lookups of key to value.
func (*DBWriter) Abort ¶
func (w *DBWriter) Abort()
Abort stops the construction of the perfect hash database.
func (*DBWriter) AddCSVFile ¶
func (w *DBWriter) AddCSVFile(fn string, comma, comment rune, kwfield, valfield int) (uint64, error)
AddCSVFile adds contents from CSV file 'fn'. If 'kwfield' and 'valfield' are non-negative, they select the fields holding the key and the value respectively; otherwise they default to fields 0 and 1. If 'comma' is non-zero, it is used as the field delimiter; otherwise the default ',' is used. If 'comment' is non-zero, lines beginning with that rune are discarded. Records where the 'kwfield' or 'valfield' can't be evaluated are discarded. Returns the number of records added.
func (*DBWriter) AddCSVStream ¶
func (w *DBWriter) AddCSVStream(fd io.Reader, comma, comment rune, kwfield, valfield int) (uint64, error)
AddCSVStream adds contents from the CSV stream 'fd'. If 'kwfield' and 'valfield' are non-negative, they select the fields holding the key and the value respectively; otherwise they default to fields 0 and 1. If 'comma' is non-zero, it is used as the field delimiter; otherwise the default ',' is used. If 'comment' is non-zero, lines beginning with that rune are discarded. Records where the 'kwfield' or 'valfield' can't be evaluated are discarded. Returns the number of records added.
func (*DBWriter) AddKeyVals ¶
AddKeyVals adds a series of matched key-value pairs to the DB. If 'keys' and 'vals' are of unequal length, only the smaller of the two lengths is used. Records with duplicate keys are discarded. Returns the number of records added.
func (*DBWriter) AddTextFile ¶
AddTextFile adds contents from text file 'fn' where key and value are separated by one of the characters in 'delim'. Duplicate keys, empty lines, and lines with no value are skipped. This function just opens the file and calls AddTextStream(). Returns the number of records added.
func (*DBWriter) AddTextStream ¶
AddTextStream adds contents from the text stream 'fd' where key and value are separated by one of the characters in 'delim'. Duplicate keys, empty lines, and lines with no value are skipped. Returns the number of records added.