mph

package module
v0.2.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 6, 2024 License: GPL-2.0 Imports: 19 Imported by: 0

README

GoDoc Go Report Card

go-mph - Minimal Perfect Hash Functions with persistence

What is it?

A library to create, query and serialize/de-serialize minimal perfect hash function ("MPHF"). There are two implementations of MPH's for large data sets:

  1. CHD - inspired by this gist.

  2. BBHash. It is in part inspired by Damien Gryski's Boomphf

One can construct an on-disk constant-time lookup using go-mph and one of the MPHFs. Such a DB is useful in situations where the key/value pairs are NOT changed frequently; i.e., read-dominant workloads. The typical pattern in such situations is to build the constant-DB once for efficient retrieval and do lookups multiple times.

NOTE Minimal Perfect Hash functions take a fixed input and generate a mapping to lookup the items in constant time. In particular, they are NOT a replacement for a traditional hash-table; i.e., it may yield false-positives when queried using keys not present during construction. In concrete terms:

Let S = {k0, k1, ... kn} be your input key set.

If H: S -> {0, .. n} is a minimal perfect hash function, then H(kx) for kx NOT in S may yield an integer result (indicating that kx was successfully "looked up").

The way one deals with this is to compare the actual keys stored against that index. DBReader()'s Find() method demonstrates how this is done.

go-mph uses cryptographically strong checksum on the entire MPH DB metadata. Additionally, tt uses siphash-2-4 checksums on each individual key-val record. This siphash checksum is verified opportunistically when keys are looked up in the MPH DB. The DB reader uses an in-memory cache for speeding up lookups.

How do I use it?

Like any other golang library: go get github.com/opencoff/go-mph. The library exposes the following types:

  • DBWriter: Used to construct a constant database of key-value pairs - where the lookup of a given key is done in constant time using CHD or BBHash. This type can be created by one of two functions: NewChdDBWriter() or NewBBHashDBWriter().

    Once created, you add keys & values to it via the Add() method. After all the entries are added, you freeze the database by calling the Freeze() method.

    DBWriter optimizes the database if there are no values present - i.e., keys-only. This optimization significantly reduces the file-size.

  • DBReader: Used to read a pre-constructed perfect-hash database and use it for constant-time lookups. The DBReader class comes with its own key/val cache to reduce disk accesses. The number of cache entries is configurable.

    After initializing the DB, key lookups are done primarily with the Find() method. A convenience method Lookup() elides errors and only returns the value and a boolean.

First, lets run some tests and make sure mph is working fine:


  $ git clone https://github.com/opencoff/go-mph
  $ cd go-mph
  $ make test

Example Program

There is a working example of the DBWriter and DBReader APIs in the example/ sub directory. This example demonstrates the following functionality:

  • add one or more space delimited key/value files (first field is key, second field is value)
  • add one or more CSV files (first field is key, second field is value)
  • Write the resulting MPH DB to disk
  • Read the DB and verify its integrity
  • Dump the contents of the DB or the DB "meta data"

Now, lets build and run the example program:


  $ make
  $ ./mphdb -h
  $ ./mphdb -V make foo.db chd /usr/share/dict/words
  $ ./mphdb -V fsck foo.db
  $ ./mphdb -V dump -m foo.db
  $ ./mphdb -V dump -a foo.db

This example above stores the words in the system dictionary into a fast-lookup table using the CHD algorithm. mphdb -h shows you a helpful usage for what else you can do with the example program.

There is a helper python script to generate a very large text file of hostnames and IP addresses: genhosts.py. You can run it like so:


  $ python ./example/genhosts.py 192.168.0.0/16 > a.txt

The above example generates 65535 hostnames and corresponding IP addresses; each of the IP addresses is sequentially drawn from the given subnet.

NOTE If you use a "/8" subnet mask you will generate a lot of data (~430MB in size).

Once you have the input generated, you can feed it to the example program above to generate a MPH DB:


  $ ./mphdb make foo.db chd a.txt
  $ ./mphdb fsck foo.db

It is possible that "mphdb" fails to construct a DB after trying 1,000,000 times. In that case, try lowering the "load" factor (default is 0.85).


  $ ./mphdb make -l 0.75 foo.db chd a.txt

The example program in example/ has helper routines to add from a text or CSV delimited file: see example/text.go. In fact is is a more-or-less complete usage of the MPH library API.

Implementation Notes

  • bbhash.go: Main implementation of the BBHash algorithm. This file implements the MPHBuilder and MPH interfaces (defined in mph.go).

  • bbhash_marshal.go: Marshaling/Unmarshaling bbhash MPHF tables.

  • bitvector.go: thread-safe bitvector implementation including a simple rank algorithm.

  • chd.go: The main implementation of the CHD algorithm. This file implements the MPHBuilder and MPH interfaces (defined in mph.go).

  • chd_marshal.go: Marshaling/Unmarshaling CHD MPHF tables.

  • dbreader.go: Provides a constant-time lookup of a previously constructed MPH DB. DB reads use mmap(2) for reading the MPH metadata. For little-endian architectures, there is no data "parsing" of the lookup tables, offset tables etc. They are interpreted in-situ from the mmap'd data. To keep the code generic, every multi-byte int is converted to little-endian order before use. These conversion routines are in endian_XX.go.

  • dbwriter.go: Create a read-only, constant-time MPH lookup DB. It can store arbitrary byte stream "values" - each of which is identified by a unique uint64 key. The DB structure is optimized for reading on the most common architectures - little-endian: amd64, arm64 etc.

  • slices.go: Non-copying type conversion to/from byte-slices to uints of different widths.

  • utils.go: Random number utils and other bits

License

GPL v2.0

Documentation

Overview

Package mph implements two different perfect hash functions for large data sets:

  1. Compress Hash Displace: http://cmph.sourceforge.net/papers/esa09.pdf
  2. BBHash: https://arxiv.org/abs/1702.03154).

mph exposes a convenient way to serialize keys and values OR just keys into an on-disk single-file database. This serialized MPH DB is useful in situations where the reading from such a "constant" DB is much more frequent compared to updates to the DB.

The primary user interface for this package is via the 'DBWriter' and 'DBReader' objects. Each objected added to the MPH is a <key, value> pair. The key is identified by a uint64 value - most commonly obtained by hashing a user specific object. The caller must ensure that they use a good hash function (eg siphash) that produces a random distribution of the keys. The 'DBWriter'

Index

Constants

View Source
const MinParallelKeys int = 20000

Minimum number of keys before bbhash switches to a concurrent construction algorithm

Variables

View Source
var (
	// ErrMPHFail is returned when the gamma value provided to Freeze() is too small to
	// build a minimal perfect hash table.
	ErrMPHFail = errors.New("failed to build MPH")

	// ErrFrozen is returned when attempting to add new records to an already frozen DB
	// It is also returned when trying to freeze a DB that's already frozen.
	ErrFrozen = errors.New("DB already frozen")

	// ErrValueTooLarge is returned if the value-length is larger than 2^32-1 bytes
	ErrValueTooLarge = errors.New("value is larger than 2^32-1 bytes")

	// ErrExists is returned if a duplicate key is added to the DB
	ErrExists = errors.New("key exists in DB")

	// ErrNoKey is returned when a key cannot be found in the DB
	ErrNoKey = errors.New("No such key")

	// Header too small for unmarshalling
	ErrTooSmall = errors.New("not enough data to unmarshal")
)

Functions

This section is empty.

Types

type DBReader

type DBReader struct {
	// contains filtered or unexported fields
}

DBReader represents the query interface for a previously constructed constant database (built using NewDBWriter()). The only meaningful operation on such a database is Lookup().

func NewDBReader

func NewDBReader(fn string, cache int) (rd *DBReader, err error)

NewDBReader reads a previously construct database in file 'fn' and prepares it for querying. Value records are opportunistically cached after reading from disk. We retain upto 'cache' number of records in memory (default 128).

func (*DBReader) Close

func (rd *DBReader) Close()

Close closes the db

func (*DBReader) Desc added in v0.2.0

func (rd *DBReader) Desc() string

Desc provides a human description of the MPH db

func (*DBReader) DumpMeta

func (rd *DBReader) DumpMeta(w io.Writer)

Dump the metadata to io.Writer 'w'

func (*DBReader) Find

func (rd *DBReader) Find(key uint64) ([]byte, error)

Find looks up 'key' in the table and returns the corresponding value. It returns an error if the key is not found or the disk i/o failed or the record checksum failed.

func (*DBReader) IterFunc added in v0.2.0

func (rd *DBReader) IterFunc(fp func(k uint64, v []byte) error) error

IterFunc iterates through every record of the MPH db and calls 'fp' on each. If the called function returns non-nil, it stops the iteration and the error is propogated to the caller.

func (*DBReader) Len

func (rd *DBReader) Len() int

Len returns the size of the MPH key space; it is not exactly the total number of keys.

func (*DBReader) Lookup

func (rd *DBReader) Lookup(key uint64) ([]byte, bool)

Lookup looks up 'key' in the table and returns the corresponding value. If the key is not found, value is nil and returns false.

type DBWriter

type DBWriter struct {
	// contains filtered or unexported fields
}

DBWriter represents an abstraction to construct a read-only MPH database. The underlying MPHF is either CHD or BBHash. Keys and values are represented as arbitrary byte sequences ([]byte). The values are stored sequentially in the DB along with a checksum protecting the integrity of the data via siphash-2-4. We don't want to use SHA512-256 over the entire file - because it will mean reading a potentially large file in DBReader(). By using checksums separately per record, we increase the overhead a bit - but speeds up DBReader initialization for the common case: we will be verifying actual records opportunistically.

The DB meta-data and MPH tables are protected by strong checksum (SHA512-256).

func NewBBHashDBWriter

func NewBBHashDBWriter(fn string, g float64) (*DBWriter, error)

func NewChdDBWriter

func NewChdDBWriter(fn string, load float64) (*DBWriter, error)

NewDBWriter prepares file 'fn' to hold a constant DB built using CHD minimal perfect hash function. Once written, the DB is "frozen" and readers will open it using NewDBReader() to do constant time lookups of key to value.

func (*DBWriter) Abort

func (w *DBWriter) Abort() error

Abort a construction

func (*DBWriter) Add

func (w *DBWriter) Add(key uint64, val []byte) error

Adds adds a single key,value pair.

func (*DBWriter) AddKeyVals

func (w *DBWriter) AddKeyVals(keys []uint64, vals [][]byte) (int, error)

AddKeyVals adds a series of key-value matched pairs to the db. If they are of unequal length, only the smaller of the lengths are used. Records with duplicate keys are discarded. Returns number of records added.

func (*DBWriter) Filename

func (w *DBWriter) Filename() string

Return the filename of the underlying db

func (*DBWriter) Freeze

func (w *DBWriter) Freeze() (err error)

Freeze builds the minimal perfect hash, writes the DB and closes it.

func (*DBWriter) Len

func (w *DBWriter) Len() int

Len returns the total number of distinct keys in the DB

type MPH

type MPH interface {
	// Marshal the MPH into io.Writer 'w'; the writer is
	// guaranteed to start at a uint64 aligned boundary
	MarshalBinary(w io.Writer) (int, error)

	// Find the key and return a 0 based index - a perfect hash index
	// Return true if we find the key, false otherwise
	Find(key uint64) (uint64, bool)

	// Dump metadata about the constructed MPH to io.writer 'w'
	DumpMeta(w io.Writer)

	// Return number of entries in the MPH
	Len() int
}

type MPHBuilder

type MPHBuilder interface {
	// Add a new key
	Add(key uint64) error

	// Freeze the DB
	Freeze() (MPH, error)
}

MPHBuilder is the common interface for constructing a MPH from a large number of keys

func NewBBHashBuilder

func NewBBHashBuilder(g float64) (MPHBuilder, error)

NewBBHashBuilder enables creation of a minimal perfect hash function via the BBHash algorithm. Once created, callers can add keys to it before Freezing the MPH and generating a constant time lookup table. The parameter 'g' is "Gamma" from the paper; the recommended value is >= 2.0; larger values increase the constructed table size and also decreases probability of construction failure. Once the construction is frozen, callers can use "Find()" to find the unique mapping for each key in 'keys'.

func NewChdBuilder

func NewChdBuilder(load float64) (MPHBuilder, error)

NewChdBuilder enables creation of a minimal perfect hash function via the Compress Hash Displace algorithm. Once created, callers can add keys to it before Freezing the MPH and generating a constant time lookup table. Once the construction is frozen, callers can use "Find()" to find the unique mapping for each key in 'keys'.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL