blobloom

package module

v0.5.0 Latest Latest Go to latest Published: Sep 5, 2020 License: Apache-2.0 Imports: 4 Imported by: 4

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/greatroar/blobloom

Links

Open Source Insights

README ¶

Blobloom

A blocked Bloom filter package for Go (golang) with no runtime dependencies.

Blocked Bloom filters are a cache-efficient variant of Bloom filters, the well-known approximate set data structure. To quote Daniel Lemire, they have unbeatable speed. See the directory benchmarks/ to determine exactly how fast Blobloom is compared to other packages.

Usage

Construct a Bloom filter:

f := blobloom.NewOptimized(blobloom.Config{
	Capacity: nkeys, // Expected number of keys.
	FPRate:   1e-4,  // One in 10000 false positives is acceptable.
})

Add a key:

// import "github.com/cespare/xxhash/v2"
h := xxhash.Sum64String(key)
f.Add(h)

Test for presence of a key:

h := xxhash.Sum64String(key)
if f.Has(h) {
	// Key is probably in f.
} else {
	// Key is certainly not present in f.
}

See the package documentation for further usage information and examples.

Hash functions

Blobloom does not provide hash functions. Instead, it requires client code to represent each key as a single 64-bit hash value, leaving it to the user to pick the right hash function for a particular problem. Here are some general suggestions:

If you use Bloom filters to speed up access to a key-value store, you might want to look at xxh3 or xxhash.
If your keys are cryptographic hashes, consider using the first 8 bytes of those hashes.
If you use Bloom filters to make probabilistic decisions, a randomized hash function such as siphash or maphash may prevent the same false positives occurring every time.

When evaluating a hash function, or designing a custom one, be aware that Blobloom uses the fastrange reduction on the lower 32 bits of the hash to select a block, and enhanced double hashing on the upper and lower 32-bit halves, followed by modulo 2⁹, to derive bit indices. In particular, this means that casting a 32-bit hash to uint64 causes the Bloom filter to perform suboptimally. (These are details of the current implementation, not API guarantees.)

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Documentation ¶

Rendered for

Overview ¶

Package blobloom implements blocked Bloom filters.

Blocked Bloom filters are an approximate set data structure: if a key has been added to a filter, a lookup of that key returns true, but if the key has not been added, there is a non-zero probability that the lookup still returns true (a false positive). False negatives are impossible: if the lookup for a key returns false, that key has not been added.

In this package, keys are represented exclusively as hashes. Client code is responsible for supplying a 64-bit hash value.

Compared to standard Bloom filters, blocked Bloom filters use the CPU cache more efficiently. A blocked Bloom filter is an array of ordinary Bloom filters of fixed size BlockBits (the blocks). The lower half of the hash selects the block to use.

To achieve the same false positive rate (FPR) as a standard Bloom filter, a blocked Bloom filter requires more memory. For an FPR of at most 2e-6 (two in a million), it uses ~20% more memory. At 1e-10, the space required is double that of standard Bloom filter.

For more details, see the 2010 paper by Putze, Sanders and Singler, https://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf.

Example (Fnv) ¶

package main

import (
	"fmt"
	"hash/fnv"
	"io"

	"github.com/greatroar/blobloom"
)

func main() {
	// This example uses the hash/fnv package from the standard Go library.

	f := blobloom.New(10000, 5)
	h := fnv.New64()

	messages := []string{
		"Hello!",
		"Welcome!",
		"Mind your step!",
		"Have fun!",
		"Goodbye!",
	}

	for _, msg := range messages {
		h.Reset()
		io.WriteString(h, msg)
		f.Add(h.Sum64())
	}

	for _, msg := range messages {
		h.Reset()
		io.WriteString(h, msg)
		if f.Has(h.Sum64()) {
			fmt.Println(msg)
		} else {
			panic("Bloom filter didn't get the message")
		}
	}

}

Output:

Hello!
Welcome!
Mind your step!
Have fun!
Goodbye!

Example (Sha224) ¶

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"

	"github.com/greatroar/blobloom"
)

func main() {
	// If you have items addressed by a cryptographic hash,
	// you can use a prefix of it as the hash value for a Bloom filter.
	//
	// If the cryptohashes denote objects from an untrusted source,
	// the Bloom filter can be tricked into giving false positives for
	// chosen objects, because it only uses a small part of the hash
	// that can easily be broken (by a birthday attack). If that can
	// cause problems in your application, first run SipHash on the
	// full cryptohash to get the hash value for the Bloom filter:
	//
	//	import "github.com/dchest/siphash"
	//	h := siphash.Hash(secret1, secret2, key[:])

	// A list of files, identified by their SHA-224.
	files := []string{
		"\x85\x52\xd8\xb7\xa7\xdc\x54\x76\xcb\x9e\x25\xde\xe6\x9a\x80\x91\x29\x07\x64\xb7\xf2\xa6\x4f\xe6\xe7\x8e\x95\x68",
		"\xa0\xad\x8f\x63\x90\x72\x74\x7b\xc3\x43\x09\x45\x94\x0e\x7c\x73\xb8\x34\x93\xf1\x77\x90\x0f\xd2\x7d\x09\x65\x94",
		"\x7b\xd3\xdb\x48\x1e\x7b\x05\x2c\x88\x18\x68\xcc\x13\xc3\x04\x34\x43\x2d\x7b\x49\x24\x74\x70\x33\xd2\xe8\x6e\x73",
	}

	// first64 extracts the first 64 bits of a key as a uint64.
	// The choice of big vs. little-endian is arbitrary.
	first64 := func(key []byte) uint64 {
		return binary.BigEndian.Uint64(key[:8])
	}

	f := blobloom.NewOptimized(blobloom.Config{Capacity: 600, FPRate: .002})

	for _, filehash := range files {
		f.Add(first64([]byte(filehash)))
	}

	for _, s := range []string{"Hello, world!", "Goodbye"} {
		h := sha256.Sum224([]byte(s))
		found := f.Has(first64(h[:]))
		if found {
			fmt.Printf("Found: %v\n", s)
		}
	}

}

Output:

Found: Hello, world!

Index ¶

Constants
func FPRate(nkeys, nbits uint64, nhashes int) float64
func Optimize(cfg Config) (nbits uint64, nhashes int)
type Config
type Filter
- func New(nbits uint64, nhashes int) *Filter
- func NewOptimized(cfg Config) *Filter

Constants ¶

View Source

const BlockBits = 512

BlockBits is the number of bits per block and the minimum number of bits in a Filter.

The value of this constant is chosen to match the L1 cache line size of popular architectures (386, amd64, arm64).

View Source

const MaxBits = BlockBits << 32 // 256GiB.

MaxBits is the maximum number of bits supported by a Filter.

Variables ¶

This section is empty.

Functions ¶

func FPRate ¶

func FPRate(nkeys, nbits uint64, nhashes int) float64

FPRate computes an estimate of the false positive rate of a Bloom filter after nkeys distinct keys have been added.

func Optimize ¶

func Optimize(cfg Config) (nbits uint64, nhashes int)

Optimize returns numbers of keys and hash functions that achieve the desired false positive described by cfg.

The estimated number of bits is imprecise for false positives rates below ca. 1e-15.

Example ¶

package main

import (
	"fmt"

	"github.com/greatroar/blobloom"
)

func main() {
	cfg := blobloom.Config{
		// We want to insert a billion keys and get a false positive rate of
		// one in a million, but we only have 2GiB (= 2^31 bytes) to spare.
		Capacity: 1e9,
		FPRate:   1e-6,
		MaxBits:  8 * 1 << 31,
	}
	nbits, nhashes := blobloom.Optimize(cfg)
	fpr := blobloom.FPRate(cfg.Capacity, nbits, nhashes)

	// How big will the filter be and what FP rate will we achieve?
	fmt.Printf("size = %dMiB\nfpr = %.3f\n", nbits/(8<<20), fpr)

}

Output:

size = 2048MiB
fpr = 0.001

Types ¶

type Config ¶

type Config struct {
	// Capacity is the expected number of distinct keys to be added.
	// More keys can always be added, but the false positive rate can be
	// expected to drop below FPRate is the number exceeds the Capacity.
	Capacity uint64

	// Desired lower bound on the false positive rate when the Bloom filter
	// has been filled to capacity.
	FPRate float64

	// Maximum size of the Bloom filter in bits. Zero means no limit.
	MaxBits uint64
	// contains filtered or unexported fields
}

A Config holds parameters for Optimize or NewOptimized.

type Filter ¶

type Filter struct {
	// contains filtered or unexported fields
}

A Filter is a blocked Bloom filter.

func New ¶

func New(nbits uint64, nhashes int) *Filter

New constructs a Bloom filter with given numbers of bits and hash functions.

The number of bits should be at least BlockBits; smaller values are silently increased.

The number of hashes reflects the number of hashes synthesized from the single hash passed in by the client. It is silently increased to two if a lower value is given.

func NewOptimized ¶

func NewOptimized(cfg Config) *Filter

NewOptimized is shorthand for New(Optimize(cfg)).

func (*Filter) Add ¶

func (f *Filter) Add(h uint64)

Add insert a key with hash value h into f.

The upper and lower half of h are treated as two independent hashes. These are used to derive further values using the enhanced double hashing construction of Dillinger and Manolios, https://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf.

func (*Filter) AddAtomic ¶

func (f *Filter) AddAtomic(h uint64)

AddAtomic atomically inserts a key with hash value h into f.

This is a synchronized version of Add. Multiple goroutines may call AddAtomic concurrently, but no goroutine may call any other method on f concurrently with this method.

func (*Filter) Cardinality ¶ added in v0.2.0

func (f *Filter) Cardinality() float64

Cardinality estimates the number of distinct keys added to f.

The return value is the maximum likelihood estimate of Papapetrou, Siberski and Nejdl, summed over the blocks (https://www.win.tue.nl/~opapapetrou/papers/Bloomfilters-DAPD.pdf).

The estimate is most reliable when f is filled to roughly its capacity. It gets worse as f gets more densely filled. When one of the blocks is entirely filled, the estimate becomes +Inf.

Example (Infinity) ¶

package main

import (
	"fmt"
	"math"

	"github.com/greatroar/blobloom"
)

func main() {
	// To handle the case of Cardinality returning +Inf, track the number of
	// calls to Add and compute the minimum.
	//
	// The reason blobloom does not do this itself is that it would cause
	// contention in AddAtomic.

	// Construct a Bloom filter with too many hash functions, to force +Inf.
	f := blobloom.New(512, 999)
	var numAdd int

	add := func(h uint64) {
		f.Add(h)
		numAdd++
	}

	add(1)
	add(2)
	add(3)

	estimate := f.Cardinality()
	fmt.Printf("blobloom's estimate:    %.2f\n", estimate)
	fmt.Printf("number of calls to Add: %d\n", numAdd)
	estimate = math.Min(estimate, float64(numAdd))
	fmt.Printf("combined estimate:      %.2f\n", estimate)

}

Output:

blobloom's estimate:    +Inf
number of calls to Add: 3
combined estimate:      3.00

func (*Filter) Clear ¶

func (f *Filter) Clear()

Clear resets f to its empty state.

func (*Filter) FPRate ¶

func (f *Filter) FPRate(nkeys uint64) float64

FPRate computes an estimate of f's false positive rate after nkeys distinct keys have been added.

func (*Filter) Has ¶

func (f *Filter) Has(h uint64) bool

Has reports whether a key with hash value h has been added. It may return a false positive.

func (*Filter) Intersect ¶ added in v0.4.0

func (f *Filter) Intersect(g *Filter)

Intersect sets f to the intersection of f and g.

Intersect panics when f and g do not have the same number of bits and hash functions. Both Filters must be using the same hash function(s), but Intersect cannot check this.

Since Bloom filters may return false positives, Has may return true for a key that was not in both f and g.

After Intersect, the estimates from Cardinality and FPRate should be considered unreliable.

func (*Filter) NumBits ¶

func (f *Filter) NumBits() uint64

NumBits returns the number of bits of f.

func (*Filter) Union ¶

func (f *Filter) Union(g *Filter)

Union sets f to the union of f and g.

Union panics when f and g do not have the same number of bits and hash functions. Both Filters must be using the same hash function(s), but Union cannot check this.

Example ¶

package main

import (
	"hash/fnv"
	"io"

	"github.com/greatroar/blobloom"
)

const nworkers = 4

func getKeys(keys chan<- string) {
	keys <- "hello"
	keys <- "goodbye"
	close(keys)
}

func hash(key string) uint64 {
	h := fnv.New64()
	io.WriteString(h, key)
	return h.Sum64()
}

func main() {
	// Union can be used to fill a Bloom filter using multiple goroutines.
	//
	// Each goroutine allocates a filter, so the memory use increases by
	// a factor nworkers-1 compared to a sequential version or a version
	// using AddAtomic.

	keys := make(chan string, nworkers)
	filters := make(chan *blobloom.Filter, nworkers)

	go getKeys(keys)

	for i := 0; i < nworkers; i++ {
		go func() {
			f := blobloom.New(1<<20, 6)
			for key := range keys {
				f.Add(hash(key))
			}

			filters <- f
		}()
	}

	f := <-filters
	for i := 1; i < nworkers; i++ {
		f.Union(<-filters)
	}

}

Output:

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
benchmarks module
examples
spellcheck This package implements a toy interactive spell checker.	This package implements a toy interactive spell checker.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL