bf

package module
v1.3.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 3, 2024 License: MIT Imports: 6 Imported by: 0

README

Go Bloom Filter

codecov Codacy Badge Go Report Card Go Reference

A Bloom Filter written in GoLang with flexible options and no dependencies.

Usage

Initialize a Bloom Filter WithAccuracy configuration

package main

import "github.com/toniphan21/go-bf"

func main() {
	var errorRate = 0.001
	var numberOfItems uint32 = 10_000_000
	filter := bf.Must(bf.WithAccuracy(errorRate, numberOfItems))

	filter.Add([]byte("anything"))

	if !filter.Exists([]byte("not found")) {
		println("If a bloom filter returns false, it is 100% correct.")
	}

	if filter.Exists([]byte("anything")) {
		println("If a bloom filter returns true it MAYBE correct. Remember to check false positive cases.")
	}
}

Initialize a Bloom Filter WithCapacity configuration

package main

import "github.com/toniphan21/go-bf"

func main() {
	var capacityInBits uint32 = 65_536
	var numberOfHashFunctions byte = 5
	filter := bf.Must(bf.WithCapacity(capacityInBits, numberOfHashFunctions))

	filter.Add([]byte("anything"))

	if !filter.Exists([]byte("not found")) {
		println("If a bloom filter returns false, it is 100% correct.")
	}

	if filter.Exists([]byte("anything")) {
		println("If a bloom filter returns true it MAYBE correct. Remember to check false positive cases.")
	}
}

APIs

Constructors
  • New(Config, ...Options) (BloomFilter, error) initialize new instance
  • Must(Config, ...Options) BloomFilter initialize new instance, panic if encounter any error
BloomFilter interface

The BloomFilter interface has 8 main methods:

Method Description
Add([]byte) Add an item into the filter
Exists([]byte) bool Check existence of an item in the filter
Count() int Get number of items added into the filter. Return -1 if not sure (for example after using Intersect() or Union()
Clone() (BloomFilter, error) Create new BloomFilter instance with the same storage, hasher and data
Intersect(BloomFilter) error Intersect with given filter. They must use the same Storage and Hash. Only Storage's data of current filter is affected, given filter's data is not
Union(BloomFilter) error Union with given filter. They must use the same Storage and Hash. Only Storage's data of current filter is affected, given filter's data is not
Storage() Storage Get filter's Storage
Hasher() Hasher Get filter's Hash
Options

There are 4 option functions could be used from the second param of bf.New(Config, ...OptionFunc):

Signature Description
WithSHA() default Use splitted SHA hashing strategy (more uniform hash)
WithFNV() Use splitted FNV hashing strategy (better performance)
WithHasher(f HasherFactory) Customize Hashing strategy with a HasherFactory
WithStorage(f StorageFactory) Customize Storage strategy with a StorageFactory

Implementation Details

Error rate, number of hash functions calculation

This is just a summary of how to calculate estimated error rate of a Bloom Filter, for proof please check academic paper or wikipedia page.

Given:

  • n estimated number of items in a Bloom Filter
  • e false positive error rate
  • k number of hash functions needed
  • m bits of memory

If you have n and e (WithAccuracy config):

\huge k \approx -log_2 e
\huge m \approx -1.44 n log_2 e

WithAccuracy has a builtin .Info() returns calculated m, k and rounded values.

package main

import (
	"fmt"
	"github.com/toniphan21/go-bf"
)

func main() {
	var errorRate = 0.001
	var numberOfItems uint32 = 10_000_000
	config := bf.WithAccuracy(errorRate, numberOfItems)
	fmt.Println(config.Info())
}

/** OUTPUT:

Config WithAccuracy()
  - Requested error rate: 0.10000%
  - Expected number of items: 10000000
  - Bits per item: 14.351
  - Number of hash functions: 10
  - Size in bits of each has function: 28
  - Storage capacity: 143507294 bits = 17938412 bytes = 17517.98KB = 17.11MB
  - Estimated error rate: 0.10130%

*/

If you have m and k (WithCapacity config), estimated error rate when n items are added into a Bloom Filter:

\huge \epsilon \approx (1 - e^-\frac{k n}{m} )^k

WithCapacity configuration has a builtin .Info() to display the estimated error rate when config by m and k.

package main

import (
	"fmt"
	"github.com/toniphan21/go-bf"
)

func main() {
	var capacityInBits uint32 = 65_536
	var numberOfHashFunctions byte = 5
	config := bf.WithCapacity(capacityInBits, numberOfHashFunctions)
	fmt.Println(config.Info())
}

/** OUTPUT:

Config WithCapacity()
  - Storage capacity: 65536 bits = 8192 bytes = 8.00KB = 0.01MB
  - Number of hash functions: 5
  - Size in bits of each has function: 16
  - Estimated error rate by n - number of added items:
      n=   100; estimated error rate: 0.00000%
      n=   200; estimated error rate: 0.00000%
      n=   500; estimated error rate: 0.00001%
      n=  1000; estimated error rate: 0.00021%
      n=  2000; estimated error rate: 0.00568%
      n=  5000; estimated error rate: 0.32083%
      n= 10000; estimated error rate: 4.33023%
      n= 20000; estimated error rate: 29.35056%
      n= 50000; estimated error rate: 89.45317%
      n=100000; estimated error rate: 99.75726%
      n=200000; estimated error rate: 99.99988%
      n=500000; estimated error rate: 100.00000%

 */

If you spot something wrong with the calculation, no worries - you can write your own config.

Hashing strategy

This library has builtin 2 hashers with the same strategy:

  • From the config we could know: keySize minimum key size (in bits) and keyCount number of hash function needed.
  • When BloomFilter calls Hasher.Hash() it passes count - number of keys array needed
  • Use SHA-256 or FNV-128 to generate hash bytes from the input. If count * keySize * keyCount > 256 when use SHA (or 128 when use FNV), will do the hash multiple times with a byte prefixed
  • Pick count * keyCount * keySize bits from the hash bytes in previous step, discard all remaining bits.

Example 1: count = 1, keySize = 25, keyCount = 10, use SHA-256:

  • Because 1*25*10 = 250 bits, we only need to hash 1 time
  • hash(256 bits) = sha_hash(input)
  • pick key 0 = bit 0-24
  • pick key 1 = bit 25-49
  • ...
  • pick key 9 = bit 225-249
  • bit 250-255 is discarded
  • return [1][10]uint32{ {key0, key1...key9} }

Example 2: count = 1, keySize = 25, keyCount = 10, use FNV-128:

  • Because 1*25*10 > 128, we hash input 2 times
  • hash(256 bits) = fnv_128(input) + fnv_128(byte(0) + input)
  • pick key 0 = bit 0-24
  • pick key 1 = bit 25-49
  • ...
  • pick key 9 = bit 225-249
  • bit 250-255 is discarded
  • return [1][10]uint32{ {key0, key1...key9} }

Example 3: count = 2, keySize = 25, keyCount = 10, use FNV-128:

  • Because 2*25*10 = 500 > 128 bits, we need to hash 4 times
  • hash(512 bits) = fnv_128(input) + fnv_128(byte(0) + input) + fnv_128(byte(1) + input)+ fnv_128(byte(2) + input)
  • pick key 0 = bit 0-24
  • pick key 1 = bit 25-49
  • ...
  • pick key 9 = bit 225-249
  • ...
  • pick key 19 = bit 475-499
  • bit 500-512 is discarded
  • return [2][10]uint32{ {key0, key1...key9}, {key10...key19} }

This strategy guarantees that the first n keys will always be the same independent from count passed via Hasher.Hash() function. A classic BloomFilter always use count=1 but the other variants may use more than 1 key collection.

Customization

Write your own hashing strategy

By default, an SHAHasher is used. You can customize the Hasher functions by implement Hasher and HasherFactory interface:

package main

import "github.com/toniphan21/go-bf"

type YourHasher struct {
  keyCount byte
  keySize  byte
}

func (y *YourHasher) Equals(other bf.Hasher) bool {
  o, ok := other.(*YourHasher)
  if !ok {
    return false
  }
  // check other params
  return y.keyCount == o.keyCount && y.keySize == o.keySize
}

func (y *YourHasher) Hash(input []byte, count int) [][]bf.Key {
  // return an array of a hash array for given bytes input.
  //   - length of the array is count
  //   - each of subarray will have length = keyCount - number of hash functions
  //   - each hash need to >= keySize - minimum size of a hash in bits
  // For example: given keyCount = 5, keySize = 16
  //   - count = 1 requires you to returns:
  //     [][]Key{
  //       { "key0: at least 16 bits long", "key1:...", "key2:...", "key3:...", "key4:..."},
  //     }
  //   - count = 2 requires you to returns:
  //     [][]Key{
  //       { "key0: at least 16 bits long", "key1:...", "key2:...", "key3:...", "key4:..."},
  //       { "key5: at least 16 bits long", "key6:...", "key7:...", "key7:...", "key8:..."},
  //     }
  return [][]bf.Key{}
}

type YourHasherFactory struct{}

func (y *YourHasherFactory) Make(numberOfHashFunctions, hashSizeInBits byte) bf.Hasher {
  return &YourHasher{
    keyCount: numberOfHashFunctions,
    keySize:  hashSizeInBits,
  }
}

func main() {
  config := bf.WithAccuracy(0.01, 1_000_000)
  filter := bf.Must(config, bf.WithHasher(&YourHasherFactory{}))

  filter.Add([]byte("anything"))
  // ...
}
Write your own custom storage

By default, all data are stored in memory, you can customize a storage by implement Storage and StorageFactory interface:

package main

import "github.com/toniphan21/go-bf"

type FileStorage struct {
  capacity uint32
}

func (f *FileStorage) Intersect(other bf.Storage) {
  // implement BatchIntersect to perform Intersect operator faster
}

func (f *FileStorage) Union(other bf.Storage) {
  // implement BatchUnion to perform Union operator faster
}

func (f *FileStorage) Equals(other bf.Storage) bool {
  o, ok := other.(*FileStorage)
  if !ok {
    return false
  }
  // check other params
  return o.capacity == f.capacity
}

func (f *FileStorage) Set(index uint32) {
  // set a bit in the given index to true
}

func (f *FileStorage) Clear(index uint32) {
  // clear a bit in the given index to true
}

func (f *FileStorage) Get(index uint32) bool {
  // return a boolean in the given index
  return false
}

func (f *FileStorage) Capacity() uint32 {
  // return the capacity of the storage in bits
  return f.capacity
}

type FileStorageFactory struct{}

func (f *FileStorageFactory) Make(capacity uint32) (bf.Storage, error) {
  return &FileStorage{capacity}, nil
}

func main() {
  config := bf.WithAccuracy(0.01, 1_000_000)
  filter := bf.Must(config, bf.WithStorage(&FileStorageFactory{}))

  filter.Add([]byte("anything"))
  // ...
}
Write your own config

If you don't like WithCapacity() or WithAccuracy() configuration, you can write your own:

package main

import "github.com/toniphan21/go-bf"

type YourConfig struct {
}

func (y *YourConfig) Info() string {
  return "info about your config"
}

func (y *YourConfig) NumberOfHashFunctions() byte {
  return 5
}

func (y *YourConfig) StorageCapacity() uint32 {
  return 1_000_000
}

func (y *YourConfig) KeySize() byte {
  return 32
}

func main() {
  config := &YourConfig{}
  filter, err := bf.New(config)
  if err != nil {
    panic("Something went wrong")
  }

  filter.Add([]byte("anything"))
  // ...
}

Benchmark

BenchmarkBloomFilter_WithSHA_Add
BenchmarkBloomFilter_WithSHA_Add-12       	 1027905	      1152 ns/op
BenchmarkBloomFilter_WithFNV_Add
BenchmarkBloomFilter_WithFNV_Add-12       	 2113863	       573.3 ns/op
BenchmarkBloomFilter_WithSHA_Exists
BenchmarkBloomFilter_WithSHA_Exists-12    	 1000000	      1132 ns/op
BenchmarkBloomFilter_WithFNV_Exists
BenchmarkBloomFilter_WithFNV_Exists-12    	 2175315	       549.9 ns/op
BenchmarkBloomFilter_Intersect
BenchmarkBloomFilter_Intersect/bench
BenchmarkBloomFilter_Intersect/bench-12   	   10000	    106422 ns/op
BenchmarkBloomFilter_Union
BenchmarkBloomFilter_Union/bench
BenchmarkBloomFilter_Union/bench-12       	   10000	    106651 ns/op
BenchmarkBloomFilter_Clone
BenchmarkBloomFilter_Clone/bench
BenchmarkBloomFilter_Clone/bench-12       	    6856	    168797 ns/op

Documentation

Index

Constants

View Source
const DefaultErrorRate = 0.0001
View Source
const DefaultNumberOfHasFunction = 5
View Source
const DefaultNumberOfItem = 1000000
View Source
const DefaultSizeInBits = 8192

Variables

View Source
var ErrHasherDifference = errors.New("hasher is not the same")
View Source
var ErrInvalidStorageCapacity = errors.New("invalid storage capacity")
View Source
var ErrNilBloomFilter = errors.New("implementation of BloomFilter is nil")
View Source
var ErrNilConfig = errors.New("implementation of Config is nil")
View Source
var ErrNilHasher = errors.New("implementation of Hasher is nil")
View Source
var ErrNilHasherFactory = errors.New("implementation of HasherFactory is nil")
View Source
var ErrNilOptionFunc = errors.New("implementation of OptionFunc is nil")
View Source
var ErrNilStorage = errors.New("implementation of Storage is nil")
View Source
var ErrNilStorageFactory = errors.New("implementation of StorageFactory is nil")
View Source
var ErrStorageDifference = errors.New("storage is not the same")

Functions

This section is empty.

Types

type BatchIntersect added in v1.2.0

type BatchIntersect interface {
	Intersect(other Storage)
}

type BatchUnion added in v1.2.0

type BatchUnion interface {
	Union(other Storage)
}

type BloomFilter

type BloomFilter interface {
	Add(item []byte)

	Exists(item []byte) bool

	Count() int

	Storage() Storage

	Hasher() Hasher

	Intersect(other BloomFilter) error

	Union(other BloomFilter) error

	Clone() (BloomFilter, error)
}

func Must added in v1.1.0

func Must(config Config, opts ...OptionFunc) BloomFilter

Must create new BloomFilter instance with Config could be the built-in WithAccuracy or WithCapacity configuration. Options including WithStorage, WithHasher or a built-in hash strategy WithSHA (default) and WithFNV.

func New

func New(config Config, opts ...OptionFunc) (BloomFilter, error)

New BloomFilter instance with Config could be the built-in WithAccuracy or WithCapacity configuration. Options including WithStorage, WithHasher or a built-in hash strategy WithSHA (default) and WithFNV.

type Config

type Config interface {
	Info() string
	NumberOfHashFunctions() byte
	StorageCapacity() uint32
	KeySize() byte
}

func WithAccuracy

func WithAccuracy(errorRate float64, numberOfItems uint32) Config

func WithCapacity

func WithCapacity(capacityInBits uint32, numberOfHashFunctions byte) Config

type Hasher added in v1.2.0

type Hasher interface {
	Hash(input []byte, count int) [][]Key

	Equals(other Hasher) bool
}

type HasherFactory added in v1.2.0

type HasherFactory interface {
	Make(numberOfHashFunctions, hashSizeInBits byte) Hasher
}

type Key added in v1.2.0

type Key uint32

type KeySplitter

type KeySplitter struct {
	Source   []byte
	Count    int
	KeyCount int
	KeySize  int
}

func (*KeySplitter) Split

func (ks *KeySplitter) Split() [][]Key

type Option

type Option struct {
	// contains filtered or unexported fields
}

type OptionFunc

type OptionFunc func(option *Option)

func WithFNV

func WithFNV() OptionFunc

func WithHasher added in v1.2.0

func WithHasher(hf HasherFactory) OptionFunc

func WithSHA

func WithSHA() OptionFunc

func WithStorage

func WithStorage(sf StorageFactory) OptionFunc

type Storage

type Storage interface {
	Set(index uint32)

	Clear(index uint32)

	Get(index uint32) bool

	Capacity() uint32

	Equals(other Storage) bool
}

type StorageFactory

type StorageFactory interface {
	Make(capacity uint32) (Storage, error)
}

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL