bloom

package module

v3.7.0 Latest Latest Go to latest Published: Mar 16, 2024 License: BSD-2-Clause Imports: 9 Imported by: 155

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/bits-and-blooms/bloom

Links

Open Source Insights

README ¶

Bloom filters

This library is used by popular systems such as Milvus and beego.

A Bloom filter is a concise/compressed representation of a set, where the main requirement is to make membership queries; i.e., whether an item is a member of a set. A Bloom filter will always correctly report the presence of an element in the set when the element is indeed present. A Bloom filter can use much less storage than the original set, but it allows for some 'false positives': it may sometimes report that an element is in the set whereas it is not.

When you construct, you need to know how many elements you have (the desired capacity), and what is the desired false positive rate you are willing to tolerate. A common false-positive rate is 1%. The lower the false-positive rate, the more memory you are going to require. Similarly, the higher the capacity, the more memory you will use. You may construct the Bloom filter capable of receiving 1 million elements with a false-positive rate of 1% in the following manner.

    filter := bloom.NewWithEstimates(1000000, 0.01)

You should call NewWithEstimates conservatively: if you specify a number of elements that it is too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure: you must know ahead of time what your desired capacity is.

Our implementation accepts keys for setting and testing as []byte. Thus, to add a string item, "Love":

    filter.Add([]byte("Love"))

Similarly, to test if "Love" is in bloom:

    if filter.Test([]byte("Love"))

For numerical data, we recommend that you look into the encoding/binary library. But, for example, to add a uint32 to the filter:

    i := uint32(100)
    n1 := make([]byte, 4)
    binary.BigEndian.PutUint32(n1, i)
    filter.Add(n1)

Godoc documentation: https://pkg.go.dev/github.com/bits-and-blooms/bloom/v3

Installation

go get -u github.com/bits-and-blooms/bloom/v3

Verifying the False Positive Rate

Sometimes, the actual false positive rate may differ (slightly) from the theoretical false positive rate. We have a function to estimate the false positive rate of a Bloom filter with m bits and k hashing functions for a set of size n:

    if bloom.EstimateFalsePositiveRate(20*n, 5, n) > 0.001 ...

You can use it to validate the computed m, k parameters:

    m, k := bloom.EstimateParameters(n, fp)
    ActualfpRate := bloom.EstimateFalsePositiveRate(m, k, n)

or

    f := bloom.NewWithEstimates(n, fp)
    ActualfpRate := bloom.EstimateFalsePositiveRate(f.m, f.k, n)

You would expect ActualfpRate to be close to the desired false-positive rate fp in these cases.

The EstimateFalsePositiveRate function creates a temporary Bloom filter. It is also relatively expensive and only meant for validation.

Serialization

You can read and write the Bloom filters as follows:

	f := New(1000, 4)
	var buf bytes.Buffer
	bytesWritten, err := f.WriteTo(&buf)
	if err != nil {
		t.Fatal(err.Error())
	}
	var g BloomFilter
	bytesRead, err := g.ReadFrom(&buf)
	if err != nil {
		t.Fatal(err.Error())
	}
	if bytesRead != bytesWritten {
		t.Errorf("read unexpected number of bytes %d != %d", bytesRead, bytesWritten)
	}

Performance tip: When reading and writing to a file or a network connection, you may get better performance by wrapping your streams with bufio instances.

E.g.,

	f, err := os.Create("myfile")
	w := bufio.NewWriter(f)

	f, err := os.Open("myfile")
	r := bufio.NewReader(f)

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make deps
make qa

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

Documentation ¶

Overview ¶

Package bloom provides data structures and methods for creating Bloom filters.

A Bloom filter is a representation of a set of _n_ items, where the main requirement is to make membership queries; _i.e._, whether an item is a member of a set.

A Bloom filter has two parameters: _m_, a maximum size (typically a reasonably large multiple of the cardinality of the set to represent) and _k_, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo _m_). Set membership is done by _testing_ whether the bits at each value of the hashing functions (again, modulo _m_) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose _k_ and _m_ correctly.

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

This implementation accepts keys for setting as testing as []byte. Thus, to add a string item, "Love":

uint n = 1000
filter := bloom.New(20*n, 5) // load of 20, 5 keys
filter.Add([]byte("Love"))

Similarly, to test if "Love" is in bloom:

if filter.Test([]byte("Love"))

For numeric data, I recommend that you look into the binary/encoding library. But, for example, to add a uint32 to the filter:

i := uint32(100)
n1 := make([]byte,4)
binary.BigEndian.PutUint32(n1,i)
f.Add(n1)

Finally, there is a method to estimate the false positive rate of a Bloom filter with _m_ bits and _k_ hashing functions for a set of size _n_:

if bloom.EstimateFalsePositiveRate(20*n, 5, n) > 0.001 ...

You can use it to validate the computed m, k parameters:

m, k := bloom.EstimateParameters(n, fp)
ActualfpRate := bloom.EstimateFalsePositiveRate(m, k, n)

or

f := bloom.NewWithEstimates(n, fp)
ActualfpRate := bloom.EstimateFalsePositiveRate(f.m, f.k, n)

You would expect ActualfpRate to be close to the desired fp in these cases.

The EstimateFalsePositiveRate function creates a temporary Bloom filter. It is also relatively expensive and only meant for validation.

Index ¶

func EstimateFalsePositiveRate(m, k, n uint) (fpRate float64)
func EstimateParameters(n uint, p float64) (m uint, k uint)
func Locations(data []byte, k uint) []uint64
type BloomFilter

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func EstimateFalsePositiveRate ¶ added in v3.3.0

func EstimateFalsePositiveRate(m, k, n uint) (fpRate float64)

EstimateFalsePositiveRate returns, for a BloomFilter of m bits and k hash functions, an estimation of the false positive rate when

storing n entries. This is an empirical, relatively slow

test using integers as keys. This function is useful to validate the implementation.

func EstimateParameters ¶

func EstimateParameters(n uint, p float64) (m uint, k uint)

EstimateParameters estimates requirements for m and k. Based on https://bitbucket.org/ww/bloom/src/829aa19d01d9/bloom.go used with permission.

func Locations ¶

func Locations(data []byte, k uint) []uint64

Locations returns a list of hash locations representing a data item.

Types ¶

type BloomFilter ¶

type BloomFilter struct {
	// contains filtered or unexported fields
}

A BloomFilter is a representation of a set of _n_ items, where the main requirement is to make membership queries; _i.e._, whether an item is a member of a set.

func From ¶

func From(data []uint64, k uint) *BloomFilter

From creates a new Bloom filter with len(_data_) * 64 bits and _k_ hashing functions. The data slice is not going to be reset.

func FromWithM ¶ added in v3.2.0

func FromWithM(data []uint64, m, k uint) *BloomFilter

FromWithM creates a new Bloom filter with _m_ length, _k_ hashing functions. The data slice is not going to be reset.

func New ¶

func New(m uint, k uint) *BloomFilter

New creates a new Bloom filter with _m_ bits and _k_ hashing functions We force _m_ and _k_ to be at least one to avoid panics.

func NewWithEstimates ¶

func NewWithEstimates(n uint, fp float64) *BloomFilter

NewWithEstimates creates a new Bloom filter for about n items with fp false positive rate

func (*BloomFilter) Add ¶

func (f *BloomFilter) Add(data []byte) *BloomFilter

Add data to the Bloom Filter. Returns the filter (allows chaining)

func (*BloomFilter) AddString ¶

func (f *BloomFilter) AddString(data string) *BloomFilter

AddString to the Bloom Filter. Returns the filter (allows chaining)

func (*BloomFilter) ApproximatedSize ¶ added in v3.1.0

func (f *BloomFilter) ApproximatedSize() uint32

Approximating the number of items https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter

func (*BloomFilter) BitSet ¶ added in v3.1.0

func (f *BloomFilter) BitSet() *bitset.BitSet

BitSet returns the underlying bitset for this filter.

func (*BloomFilter) Cap ¶

func (f *BloomFilter) Cap() uint

Cap returns the capacity, _m_, of a Bloom filter

func (*BloomFilter) ClearAll ¶

func (f *BloomFilter) ClearAll() *BloomFilter

ClearAll clears all the data in a Bloom filter, removing all keys

func (*BloomFilter) Copy ¶

func (f *BloomFilter) Copy() *BloomFilter

Copy creates a copy of a Bloom filter.

func (*BloomFilter) Equal ¶

func (f *BloomFilter) Equal(g *BloomFilter) bool

Equal tests for the equality of two Bloom filters

func (*BloomFilter) GobDecode ¶

func (f *BloomFilter) GobDecode(data []byte) error

GobDecode implements gob.GobDecoder interface.

func (*BloomFilter) GobEncode ¶

func (f *BloomFilter) GobEncode() ([]byte, error)

GobEncode implements gob.GobEncoder interface.

func (*BloomFilter) K ¶

func (f *BloomFilter) K() uint

K returns the number of hash functions used in the BloomFilter

func (*BloomFilter) MarshalBinary ¶ added in v3.4.0

func (f *BloomFilter) MarshalBinary() ([]byte, error)

MarshalBinary implements binary.BinaryMarshaler interface.

func (BloomFilter) MarshalJSON ¶

func (f BloomFilter) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler interface.

func (*BloomFilter) Merge ¶

func (f *BloomFilter) Merge(g *BloomFilter) error

Merge the data from two Bloom Filters.

func (*BloomFilter) ReadFrom ¶

func (f *BloomFilter) ReadFrom(stream io.Reader) (int64, error)

ReadFrom reads a binary representation of the BloomFilter (such as might have been written by WriteTo()) from an i/o stream. It returns the number of bytes read.

Performance: if this function is used to read from a disk or network connection, it might be beneficial to wrap the stream in a bufio.Reader. E.g.,

f, err := os.Open("myfile")
r := bufio.NewReader(f)

func (*BloomFilter) Test ¶

func (f *BloomFilter) Test(data []byte) bool

Test returns true if the data is in the BloomFilter, false otherwise. If true, the result might be a false positive. If false, the data is definitely not in the set.

func (*BloomFilter) TestAndAdd ¶

func (f *BloomFilter) TestAndAdd(data []byte) bool

TestAndAdd is equivalent to calling Test(data) then Add(data). The filter is written to unconditionnally: even if the element is present, the corresponding bits are still set. See also TestOrAdd. Returns the result of Test.

func (*BloomFilter) TestAndAddString ¶

func (f *BloomFilter) TestAndAddString(data string) bool

TestAndAddString is the equivalent to calling Test(string) then Add(string). The filter is written to unconditionnally: even if the string is present, the corresponding bits are still set. See also TestOrAdd. Returns the result of Test.

func (*BloomFilter) TestLocations ¶

func (f *BloomFilter) TestLocations(locs []uint64) bool

TestLocations returns true if all locations are set in the BloomFilter, false otherwise.

func (*BloomFilter) TestOrAdd ¶ added in v3.1.0

func (f *BloomFilter) TestOrAdd(data []byte) bool

TestOrAdd is equivalent to calling Test(data) then if not present Add(data). If the element is already in the filter, then the filter is unchanged. Returns the result of Test.

func (*BloomFilter) TestOrAddString ¶ added in v3.1.0

func (f *BloomFilter) TestOrAddString(data string) bool

TestOrAddString is the equivalent to calling Test(string) then if not present Add(string). If the string is already in the filter, then the filter is unchanged. Returns the result of Test.

func (*BloomFilter) TestString ¶

func (f *BloomFilter) TestString(data string) bool

TestString returns true if the string is in the BloomFilter, false otherwise. If true, the result might be a false positive. If false, the data is definitely not in the set.

func (*BloomFilter) UnmarshalBinary ¶ added in v3.4.0

func (f *BloomFilter) UnmarshalBinary(data []byte) error

UnmarshalBinary implements binary.BinaryUnmarshaler interface.

func (*BloomFilter) UnmarshalJSON ¶

func (f *BloomFilter) UnmarshalJSON(data []byte) error

UnmarshalJSON implements json.Unmarshaler interface.

func (*BloomFilter) WriteTo ¶

func (f *BloomFilter) WriteTo(stream io.Writer) (int64, error)

WriteTo writes a binary representation of the BloomFilter to an i/o stream. It returns the number of bytes written.

Performance: if this function is used to write to a disk or network connection, it might be beneficial to wrap the stream in a bufio.Writer. E.g.,

      f, err := os.Create("myfile")
	       w := bufio.NewWriter(f)

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL