README ¶
Bloom filters
A Bloom filter is a representation of a set of n items, where the main requirement is to make membership queries; i.e., whether an item is a member of a set.
A Bloom filter has two parameters: m, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.
In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.
This implementation accepts keys for setting and testing as []byte
. Thus, to
add a string item, "Love"
:
n := uint(1000)
filter := bloom.New(20*n, 5) // load of 20, 5 keys
filter.Add([]byte("Love"))
Similarly, to test if "Love"
is in bloom:
if filter.Test([]byte("Love"))
For numeric data, I recommend that you look into the encoding/binary library. But, for example, to add a uint32
to the filter:
i := uint32(100)
n1 := make([]byte, 4)
binary.BigEndian.PutUint32(n1, i)
filter.Add(n1)
Finally, there is a method to estimate the false positive rate of a particular bloom filter for a set of size n:
if filter.EstimateFalsePositiveRate(1000) > 0.001
Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.
Discussion here: Bloom filter
Godoc documentation: https://godoc.org/github.com/willf/bloom
Installation
go get -u github.com/willf/bloom
Contributing
If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")
This project include a Makefile that allows you to test and build the project with simple commands. To see all available options:
make help
Running all tests
Before committing the code, please check if it passes all tests using (note: this will install some dependencies):
make deps
make qa
Documentation ¶
Overview ¶
Package bloom provides data structures and methods for creating Bloom filters.
A Bloom filter is a representation of a set of _n_ items, where the main requirement is to make membership queries; _i.e._, whether an item is a member of a set.
A Bloom filter has two parameters: _m_, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and _k_, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo _m_). Set membership is done by _testing_ whether the bits at each value of the hashing functions (again, modulo _m_) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose _k_ and _m_ correctly.
In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.
This implementation accepts keys for setting as testing as []byte. Thus, to add a string item, "Love":
uint n = 1000 filter := bloom.New(20*n, 5) // load of 20, 5 keys filter.Add([]byte("Love"))
Similarly, to test if "Love" is in bloom:
if filter.Test([]byte("Love"))
For numeric data, I recommend that you look into the binary/encoding library. But, for example, to add a uint32 to the filter:
i := uint32(100) n1 := make([]byte,4) binary.BigEndian.PutUint32(n1,i) f.Add(n1)
Finally, there is a method to estimate the false positive rate of a particular Bloom filter for a set of size _n_:
if filter.EstimateFalsePositiveRate(1000) > 0.001
Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.
Index ¶
- func EstimateParameters(n uint, p float64) (m uint, k uint)
- func Locations(data []byte, k uint) []uint64
- type BloomFilter
- func (f *BloomFilter) Add(data []byte) *BloomFilter
- func (f *BloomFilter) AddString(data string) *BloomFilter
- func (f *BloomFilter) Cap() uint
- func (f *BloomFilter) ClearAll() *BloomFilter
- func (f *BloomFilter) Copy() *BloomFilter
- func (f *BloomFilter) Equal(g *BloomFilter) bool
- func (f *BloomFilter) EstimateFalsePositiveRate(n uint) (fpRate float64)
- func (f *BloomFilter) GobDecode(data []byte) error
- func (f *BloomFilter) GobEncode() ([]byte, error)
- func (f *BloomFilter) K() uint
- func (f *BloomFilter) MarshalJSON() ([]byte, error)
- func (f *BloomFilter) Merge(g *BloomFilter) error
- func (f *BloomFilter) ReadFrom(stream io.Reader) (int64, error)
- func (f *BloomFilter) Test(data []byte) bool
- func (f *BloomFilter) TestAndAdd(data []byte) bool
- func (f *BloomFilter) TestAndAddString(data string) bool
- func (f *BloomFilter) TestLocations(locs []uint64) bool
- func (f *BloomFilter) TestString(data string) bool
- func (f *BloomFilter) UnmarshalJSON(data []byte) error
- func (f *BloomFilter) WriteTo(stream io.Writer) (int64, error)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func EstimateParameters ¶
EstimateParameters estimates requirements for m and k. Based on https://bitbucket.org/ww/bloom/src/829aa19d01d9/bloom.go used with permission.
Types ¶
type BloomFilter ¶
type BloomFilter struct {
// contains filtered or unexported fields
}
A BloomFilter is a representation of a set of _n_ items, where the main requirement is to make membership queries; _i.e._, whether an item is a member of a set.
func From ¶
func From(data []uint64, k uint) *BloomFilter
From creates a new Bloom filter with len(_data_) * 64 bits and _k_ hashing functions. The data slice is not going to be reset.
func New ¶
func New(m uint, k uint) *BloomFilter
New creates a new Bloom filter with _m_ bits and _k_ hashing functions We force _m_ and _k_ to be at least one to avoid panics.
func NewWithEstimates ¶
func NewWithEstimates(n uint, fp float64) *BloomFilter
NewWithEstimates creates a new Bloom filter for about n items with fp false positive rate
func (*BloomFilter) Add ¶
func (f *BloomFilter) Add(data []byte) *BloomFilter
Add data to the Bloom Filter. Returns the filter (allows chaining)
func (*BloomFilter) AddString ¶
func (f *BloomFilter) AddString(data string) *BloomFilter
AddString to the Bloom Filter. Returns the filter (allows chaining)
func (*BloomFilter) Cap ¶
func (f *BloomFilter) Cap() uint
Cap returns the capacity, _m_, of a Bloom filter
func (*BloomFilter) ClearAll ¶
func (f *BloomFilter) ClearAll() *BloomFilter
ClearAll clears all the data in a Bloom filter, removing all keys
func (*BloomFilter) Copy ¶
func (f *BloomFilter) Copy() *BloomFilter
Copy creates a copy of a Bloom filter.
func (*BloomFilter) Equal ¶
func (f *BloomFilter) Equal(g *BloomFilter) bool
Equal tests for the equality of two Bloom filters
func (*BloomFilter) EstimateFalsePositiveRate ¶
func (f *BloomFilter) EstimateFalsePositiveRate(n uint) (fpRate float64)
EstimateFalsePositiveRate returns, for a BloomFilter with a estimate of m bits and k hash functions, what the false positive rate will be while storing n entries; runs 100,000 tests. This is an empirical test using integers as keys. As a side-effect, it clears the BloomFilter.
func (*BloomFilter) GobDecode ¶
func (f *BloomFilter) GobDecode(data []byte) error
GobDecode implements gob.GobDecoder interface.
func (*BloomFilter) GobEncode ¶
func (f *BloomFilter) GobEncode() ([]byte, error)
GobEncode implements gob.GobEncoder interface.
func (*BloomFilter) K ¶
func (f *BloomFilter) K() uint
K returns the number of hash functions used in the BloomFilter
func (*BloomFilter) MarshalJSON ¶
func (f *BloomFilter) MarshalJSON() ([]byte, error)
MarshalJSON implements json.Marshaler interface.
func (*BloomFilter) Merge ¶
func (f *BloomFilter) Merge(g *BloomFilter) error
Merge the data from two Bloom Filters.
func (*BloomFilter) ReadFrom ¶
func (f *BloomFilter) ReadFrom(stream io.Reader) (int64, error)
ReadFrom reads a binary representation of the BloomFilter (such as might have been written by WriteTo()) from an i/o stream. It returns the number of bytes read.
func (*BloomFilter) Test ¶
func (f *BloomFilter) Test(data []byte) bool
Test returns true if the data is in the BloomFilter, false otherwise. If true, the result might be a false positive. If false, the data is definitely not in the set.
func (*BloomFilter) TestAndAdd ¶
func (f *BloomFilter) TestAndAdd(data []byte) bool
TestAndAdd is the equivalent to calling Test(data) then Add(data). Returns the result of Test.
func (*BloomFilter) TestAndAddString ¶
func (f *BloomFilter) TestAndAddString(data string) bool
TestAndAddString is the equivalent to calling Test(string) then Add(string). Returns the result of Test.
func (*BloomFilter) TestLocations ¶
func (f *BloomFilter) TestLocations(locs []uint64) bool
TestLocations returns true if all locations are set in the BloomFilter, false otherwise.
func (*BloomFilter) TestString ¶
func (f *BloomFilter) TestString(data string) bool
TestString returns true if the string is in the BloomFilter, false otherwise. If true, the result might be a false positive. If false, the data is definitely not in the set.
func (*BloomFilter) UnmarshalJSON ¶
func (f *BloomFilter) UnmarshalJSON(data []byte) error
UnmarshalJSON implements json.Unmarshaler interface.