maybe

package module
v0.0.0-...-22165a2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 7, 2021 License: MIT Imports: 5 Imported by: 0

README

Maybe

A library containing a set of probabilstic data structures that I found interesting while researching the subject.

Contents

Bloom Filter

A bloom filter is a great data structure to do set membership queries on large number of elements with very little memory footprint.

An example usage:

package main

import (
	"fmt"
	"github.com/chermehdi/maybe"
)

type String string

func (i String) Bytes() []byte {
	return []byte(i)
}

func main() {
	bf, _ := maybe.NewBloomFilter(1024, 3)
	bf.Add(String("hello"))

	if bf.Has(String("hello")) {
		fmt.Println("Found Hello in bloom filter")
	}
}
Count min sketch

A count min sketch is a data structure to estimate the count of elements in a large set of events with a very little memory footprint.

An example usage:

package main

import (
	"fmt"
	"github.com/chermehdi/maybe"
)

type String string

func (i String) Bytes() []byte {
	return []byte(i)
}

func main() {
	sk := maybe.NewCountMinSketch(1024, 3)
	sk.Increment(String("hello"))
	sk.Add(String("hello1"), 2)

	fmt.Println(sk.Count(String("hello")))
}
HyperLogLog

A HyperLogLog is an excelement data structure for cardinality estimation, it allows you to estimate the size of a set with a very reasonable controllable accuracy, and with very little memory footprint.

an example usage:

package main

import (
	"fmt"
	"github.com/chermehdi/maybe"
)

type String string

func (i String) Bytes() []byte {
	return []byte(i)
}

func main() {
	hll, _ := maybe.NewHyperLogLog(10)
	hll.Add(String("a"))
	hll.Add(String("b"))
	hll.Add(String("c"))
	hll.Add(String("d"))
	fmt.Println(4 == hll.Cardinality())
}

Resources

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func MurMurHash

func MurMurHash(value AsBytes) uint64

Types

type AsBytes

type AsBytes interface {
	Bytes() []byte
}

AsBytes is an abstraction needed by the maybe library component to indicate anytype that can be transformed to a series of bytes, so that we can apply a hash function to it.

type BloomFilter

type BloomFilter struct {
	// contains filtered or unexported fields
}

BloomFilter is a probabilistic data structure to test weather a given value is a member of a given set (represented by this bloom filter instance).

The main advantage of this data structure is the low memory footprint, it can be used to do set membership on a very large multiset with very low memory footprint (in the order of `Kb`) as opposed to storing this in a set which can have a memory footprint in the order of `Mg` or `Gb`. Of course this comes with the additional tradeoff of certainty, if a bloom filter `Has` method returns `true` this means that the element is already a member of this set with a certainty of `x%` as opposed to the `100%` given by a set-like data structure. however if a `Has` call on a bloom filter returns `false` then the element is not in the set `100%`.

The certainty factor is determined by the size of the bitset in the internal implementation, and the number of hash functions you apply to the value, the corollary is this: the bigger the bitset and the more hash functions you have the more accurate the result of the bloom filter (or the less false positives you will get).

You can read the original paper for the math behind it.

func NewBloomFilter

func NewBloomFilter(size uint, numFunc uint) (*BloomFilter, error)

NewBloomFilter creates a new BloomFilter with a given size and given a number of hash functions. The numFunc will indicate the number of times the default hash function is going to be used to create additional hash functions. The default hash base hash function is the murmur3, any other hash function would've done the job however.

func NewBloomFilterWithHashes

func NewBloomFilterWithHashes(size uint, hashes []HashFunc) (*BloomFilter, error)

NewBloomFilterWithHashes creates a bloom filter with user defined hash functions

func (*BloomFilter) Add

func (bf *BloomFilter) Add(value AsBytes)

Add adds the value to the bloom filter.

The value should implement the AsBytes interface for the hash functions to work.

func (*BloomFilter) Has

func (bf *BloomFilter) Has(value AsBytes) bool

Has returns weather the value element exists in the bloom filter.

If the return value is true, than the value might have been added to the bloom filter otherwise the value has **definitely** never been added

type CountMinSketch

type CountMinSketch struct {
	// contains filtered or unexported fields
}

CountMinSketch is a probabilistic data structure that is used to count the number of occurrences of a given value in a stream of events

The data structure's bias is determined by the width and depth of the table, the bigger they are the better precision you get in point count queries. The we use a linear hash function based on the murmur3 algorithm in this implementation, the algorithm used is not highly important in the functioning of the data structure, that being said of course, using sha1 or some cryptographic hashing algorithm (usually slow) will impact the performance of adding and querying. A count min sketch is biased estimator data structure, it can over estimate (report more elements then the ones added) but it will never under estimate.

func NewCountMinSketch

func NewCountMinSketch(width, depth uint32) *CountMinSketch

NewCountMinSketch will create a count min sketch based of the width and depth given.

The values will determine the size of the underlying table, and the number of hash functions generated, bigger values will give you better estimates, but it will also impact your memory / cpu footprint.

func (*CountMinSketch) Add

func (cm *CountMinSketch) Add(value AsBytes, count uint64)

Add will increment the buckets corresponding to the value by count.

func (*CountMinSketch) Count

func (cm *CountMinSketch) Count(value AsBytes) uint64

Count will return an estimate of the number of values of type value that has been added to the sketch.

The count estimate is always greater than or equal the actual count.

func (*CountMinSketch) Increment

func (cm *CountMinSketch) Increment(value AsBytes)

Increment is equivalent to adding one element of type represented by value.

type HashFunc

type HashFunc func(AsBytes) uint64

HashFunc is any type that can transform an `AsBytes` value into a uint64

func MurmurFrom

func MurmurFrom(value uint64) HashFunc

MurmurFrom constructs a HashFunc from a murmur3 hash function and a multiplication factor This is a trick that can be used to create multiple hash function from the same one.

type HyperLogLog

type HyperLogLog struct {
	// contains filtered or unexported fields
}

HyperLogLog is data structure for estimating cardinality with a accuracy of ` 1 - 1.04 / sqrt(m)` with m defined as `2^b`, the reference is from this paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf This being said the accuracy of the data structure is proportional to the value of b.

func NewHyperLogLog

func NewHyperLogLog(b uint) (*HyperLogLog, error)

NewHyperLogLog creates a new HLL based on the the number of counters

func (*HyperLogLog) Add

func (hll *HyperLogLog) Add(value AsBytes)

Add adds the given observable (value) to the HLL instance

func (*HyperLogLog) Cardinality

func (hll *HyperLogLog) Cardinality() uint64

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL