Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ProbabilisticCounter ¶
type ProbabilisticCounter struct {
// contains filtered or unexported fields
}
ProbabilisticCounter implements an a linear-time counting algorithm, also known as "linear counting".
From: "A Linear-Time Probabilistic Counting Algorithm for Database Applications" (http://dblab.kaist.ac.kr/Publication/pdf/ACM90_TODS_v15n2.pdf)
"Linear counting is a two-step process. In step 1, the algorithm allocates a bit map (hash table) of size m in main memory. All entries are initialized to "0"s. The algorithm then scans the relation and applied a hash function to each data value in the column of interest. The hash function generates a bit map address and the algorithm sets this addressed bit to "1". In step 2, the algorithm first counts the number of empty bit map entries (equivalently, the number of "0" entries). It then estimates the column cardinality by dividing this count by the bit map size m (thus obtaining the fraction of empty bit map entries V_n) and plugging the result into the following equation:
n^ = -m * ln V_n (The symbol ^ denotes an estimator)"
Therefore, ProbabilisticCounter is able to compute an approximate distinct count for a sufficiently large number of string values, with an error rate of less than 1.25% on average. It does so while using a very low amount of memory, for instance tracking up to 1 million string values would consume 128 kilobytes.
ProbabilisticCounter *does not* provide a query-based API which would let callers verify whether a given string value has been counted before or not. There are other algorithms which are more efficient at providing that kind of funcionality, for instance "min-count log sketch" and "hyperloglog".
Lastly, ProbabilisticCounter provides serialization/deserialization in case it is needed to use and persist this data structure in a database.
func NewProbabilisticCounter ¶
func NewProbabilisticCounter(cardinality int) *ProbabilisticCounter
It creates a new ProbabilisticCounter given an *estimate* of the true cardinality of the set of values it should count. This estimate is needed to that an internal bit map can be initialized and sized properly.
func NewProbabilisticCounterFromBytes ¶
func NewProbabilisticCounterFromBytes(cardinality int, bytes []byte) (*ProbabilisticCounter, error)
It deserializes a new ProbabilisticCounter which has probably been serialized before with probabiliscCounter.Bytes(). cardinality: an *estimate* of the true cardinality of the set of values it should count so that an internal bit map can be initialized and sized properly.
func (*ProbabilisticCounter) Add ¶
func (pc *ProbabilisticCounter) Add(values ...string)
It adds one or more string values to this counter.
func (*ProbabilisticCounter) Bytes ¶
func (pc *ProbabilisticCounter) Bytes() ([]byte, error)
It serializes this ProbabilisticCounter into a slice of bytes, which can be used to persist this data structure.
func (*ProbabilisticCounter) Count ¶
func (pc *ProbabilisticCounter) Count() int
It provides an (approximate) distinct count of values counter so far, with an error rate less than 1.25% on average.