Documentation ¶
Overview ¶
Package mph implements two different perfect hash functions for large data sets:
- Compress Hash Displace: http://cmph.sourceforge.net/papers/esa09.pdf
- BBHash: https://arxiv.org/abs/1702.03154
mph exposes a convenient way to serialize keys and values OR just keys into an on-disk single-file database. This serialized MPH DB is useful in situations where reads from such a "constant" DB are far more frequent than updates to it.
The primary user interface to this package is via the 'DBWriter' and 'DBReader' objects. Each object added to the MPH is a <key, value> pair. The key is identified by a uint64 value - most commonly obtained by hashing a user-specific object. The caller must ensure that they use a good hash function (e.g. siphash) that produces a random distribution of the keys. The 'DBWriter' constructs the on-disk database and the 'DBReader' queries it.
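As a hedged illustration of the key requirement above, the sketch below derives uint64 keys from user-specific strings. The documentation recommends a strong hash such as siphash; the standard library's hash/fnv is used here purely to keep the sketch dependency-free.

package main

import (
	"fmt"
	"hash/fnv"
)

// keyOf hashes a user-specific string into the uint64 key expected by DBWriter/DBReader.
func keyOf(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func main() {
	fmt.Printf("key = %#x\n", keyOf("user:1234"))
}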
Index ¶
- Constants
- Variables
- type DBReader
- func (rd *DBReader) Close()
- func (rd *DBReader) Desc() string
- func (rd *DBReader) DumpMeta(w io.Writer)
- func (rd *DBReader) Find(key uint64) ([]byte, error)
- func (rd *DBReader) IterFunc(fp func(k uint64, v []byte) error) error
- func (rd *DBReader) Len() int
- func (rd *DBReader) Lookup(key uint64) ([]byte, bool)
- type DBWriter
- type MPH
- type MPHBuilder
Constants ¶
const MinParallelKeys int = 20000
Minimum number of keys before bbhash switches to a concurrent construction algorithm
Variables ¶
var (
	// ErrMPHFail is returned when the gamma value provided to Freeze() is too small to
	// build a minimal perfect hash table.
	ErrMPHFail = errors.New("failed to build MPH")

	// ErrFrozen is returned when attempting to add new records to an already frozen DB.
	// It is also returned when trying to freeze a DB that's already frozen.
	ErrFrozen = errors.New("DB already frozen")

	// ErrValueTooLarge is returned if the value-length is larger than 2^32-1 bytes.
	ErrValueTooLarge = errors.New("value is larger than 2^32-1 bytes")

	// ErrExists is returned if a duplicate key is added to the DB.
	ErrExists = errors.New("key exists in DB")

	// ErrNoKey is returned when a key cannot be found in the DB.
	ErrNoKey = errors.New("No such key")

	// ErrTooSmall is returned when the header is too small for unmarshalling.
	ErrTooSmall = errors.New("not enough data to unmarshal")
)
Functions ¶
This section is empty.
Types ¶
type DBReader ¶
type DBReader struct {
// contains filtered or unexported fields
}
DBReader represents the query interface for a previously constructed constant database (built using NewDBWriter()). The only meaningful operation on such a database is Lookup().
func NewDBReader ¶
NewDBReader reads a previously constructed database in file 'fn' and prepares it for querying. Value records are opportunistically cached after reading from disk. We retain up to 'cache' number of records in memory (default 128).
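A minimal sketch of opening and querying a DB; the exact NewDBReader signature is not shown on this page, so the (filename, cache size) form and the *DBReader return value used below are assumptions.

rd, err := mph.NewDBReader("users.db", 128) // assumed signature: (fn string, cache int)
if err != nil {
	log.Fatal(err)
}
defer rd.Close()

// Lookup returns the value and true if the key is present.
if v, ok := rd.Lookup(0xdeadbeef); ok {
	fmt.Printf("found %d bytes\n", len(v))
}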
func (*DBReader) Find ¶
Find looks up 'key' in the table and returns the corresponding value. It returns an error if the key is not found, the disk I/O fails, or the record checksum verification fails.
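For example, assuming 'rd' is a *DBReader and that the error values listed above are returned (possibly wrapped) by Find, a caller can distinguish a missing key from I/O or checksum failures:

v, err := rd.Find(key)
switch {
case err == nil:
	fmt.Printf("value: %q\n", v)
case errors.Is(err, mph.ErrNoKey):
	fmt.Println("key not present in DB")
default:
	log.Fatal(err) // disk i/o or checksum failure
}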
func (*DBReader) IterFunc ¶ added in v0.2.0
IterFunc iterates through every record of the MPH db and calls 'fp' on each. If the called function returns a non-nil error, the iteration stops and the error is propagated to the caller.
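A short sketch, again assuming 'rd' is a *DBReader, that walks every record and stops on the first callback error:

err := rd.IterFunc(func(k uint64, v []byte) error {
	fmt.Printf("%#x: %d bytes\n", k, len(v))
	return nil // return a non-nil error to stop the iteration early
})
if err != nil {
	log.Fatal(err)
}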
type DBWriter ¶
type DBWriter struct {
// contains filtered or unexported fields
}
DBWriter represents an abstraction to construct a read-only MPH database. The underlying MPHF is either CHD or BBHash. Keys and values are represented as arbitrary byte sequences ([]byte). The values are stored sequentially in the DB, each protected by a siphash-2-4 checksum that guards the integrity of the data. We don't use a single SHA512-256 over the entire file because that would mean reading a potentially large file in DBReader(). By checksumming each record separately, we increase the overhead a bit - but speed up DBReader initialization for the common case: actual records are verified opportunistically as they are read.
The DB meta-data and MPH tables are protected by strong checksum (SHA512-256).
func NewChdDBWriter ¶
NewChdDBWriter prepares file 'fn' to hold a constant DB built using the CHD minimal perfect hash function. Once written, the DB is "frozen" and readers will open it using NewDBReader() to do constant time lookups of key to value.
func (*DBWriter) AddKeyVals ¶
AddKeyVals adds a series of matched key-value pairs to the db. If the keys and values are of unequal length, only the smaller of the two lengths is used. Records with duplicate keys are discarded. Returns the number of records added.
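The signature of AddKeyVals is not shown on this page; the sketch below assumes parallel slices of keys and values and a count-plus-error return, purely as an illustration of the description above.

keys := []uint64{0x01, 0x02, 0x03}
vals := [][]byte{[]byte("a"), []byte("b"), []byte("c")}

// 'wr' is a *DBWriter obtained from NewChdDBWriter; the slice-based signature is assumed.
n, err := wr.AddKeyVals(keys, vals)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("added %d records\n", n)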
type MPH ¶
type MPH interface {
	// Marshal the MPH into io.Writer 'w'; the writer is
	// guaranteed to start at a uint64 aligned boundary
	MarshalBinary(w io.Writer) (int, error)

	// Find the key and return a 0 based index - a perfect hash index
	// Return true if we find the key, false otherwise
	Find(key uint64) (uint64, bool)

	// Dump metadata about the constructed MPH to io.Writer 'w'
	DumpMeta(w io.Writer)

	// Return number of entries in the MPH
	Len() int
}
type MPHBuilder ¶
type MPHBuilder interface {
	// Add a new key
	Add(key uint64) error

	// Freeze the DB
	Freeze() (MPH, error)
}
MPHBuilder is the common interface for constructing an MPH from a large number of keys.
func NewBBHashBuilder ¶
func NewBBHashBuilder(g float64) (MPHBuilder, error)
NewBBHashBuilder enables creation of a minimal perfect hash function via the BBHash algorithm. Once created, callers can add keys to it before freezing the MPH and generating a constant time lookup table. The parameter 'g' is "gamma" from the paper; the recommended value is >= 2.0. Larger values increase the constructed table size but decrease the probability of construction failure. Once the construction is frozen, callers can use Find() to find the unique mapping for each added key.
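A short sketch of the builder workflow using the interfaces documented above; the keys are arbitrary literals standing in for hashed user keys, and the package import is elided.

b, err := mph.NewBBHashBuilder(2.0) // gamma >= 2.0 as recommended above
if err != nil {
	log.Fatal(err)
}

keys := []uint64{0xdead, 0xbeef, 0xcafe}
for _, k := range keys {
	if err := b.Add(k); err != nil {
		log.Fatal(err)
	}
}

m, err := b.Freeze()
if err != nil {
	log.Fatal(err) // e.g. ErrMPHFail if gamma is too small
}

for _, k := range keys {
	if i, ok := m.Find(k); ok {
		fmt.Printf("key %#x -> slot %d of %d\n", k, i, m.Len())
	}
}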
func NewChdBuilder ¶
func NewChdBuilder(load float64) (MPHBuilder, error)
NewChdBuilder enables creation of a minimal perfect hash function via the Compress Hash Displace algorithm. Once created, callers can add keys to it before freezing the MPH and generating a constant time lookup table. Once the construction is frozen, callers can use Find() to find the unique mapping for each added key.
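The CHD builder follows the same Add/Freeze pattern as the BBHash sketch above. Because Find() returns a 0-based perfect-hash index, the result can be used directly as a slot into a caller-maintained array; the 'load' value below is chosen purely for illustration.

b, err := mph.NewChdBuilder(0.9) // illustrative load factor only
if err != nil {
	log.Fatal(err)
}
keys := []uint64{0xdead, 0xbeef, 0xcafe}
for _, k := range keys {
	if err := b.Add(k); err != nil {
		log.Fatal(err)
	}
}
m, err := b.Freeze()
if err != nil {
	log.Fatal(err)
}

// Use the perfect-hash index as a direct slot into a values array.
slots := make([][]byte, m.Len())
if i, ok := m.Find(keys[0]); ok {
	slots[i] = []byte("value for keys[0]")
}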