hll

package module
v0.0.0-...-c6eb27a Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 22, 2018 License: Apache-2.0 Imports: 5 Imported by: 2

README

HyperLogLog in golang. Docs. Build Status codecov

What

A go implementation of HypeLogLog data structure with a twist. See HyperLogLog in Practice paper by Stefan Heule, Marc Nunkesser, Alex Hall.

Ø-serialization

There is no need to serialize/deserialize hll. Everything is stored in a byte slice, which can be memory mapped, passed around over the network as is etc.

Differences from the paper:

  • sparse representation. this implementation does exact counting for small sets.
  • fixed memory usage (even for empty HLL). HLL of a given precision P uses fixed (8 + 3*2^(P-2), 8 byte header + 6 bits per register) size in bytes.
  • thresholds are tuned. different from Sub-Algorithm Threshold.

Why

I wanted an HLL implementation that is

  • simple
  • reasonably fast
  • (almost) non-allocating
  • exact when number of unique elements is small
  • memory-mapped file friendly
  • well tested (90+% coverage)

Usage

Get go-hll:

go get github.com/sasha-s/go-hll

Use it:

s, err := SizeByP(16)
if err != nil {
	log.Panicln(err)
}
h := make(HLL, s)
...
for _, x := range []string{"alpha", "beta"} {
	h.Add(siphash.Hash(2, 57, []byte(x)))
}
log.Println(h.EstimateCardinality())

Use good hash (otherwise accuracy would be poor). Some options:

Speed

Benchmark results on my MacBook Pro (Mid 2014).

Add-8            9.68ns ± 1%
Estimate-8       27.3µs ± 1%
Merge-8          38.0µs ± 1%

AddDense-8       6.73ns ± 3%
MergeDense-8     37.9µs ± 1%
EstimateDense-8  22.9µs ± 1%
Sort-8            108µs ± 1%
AddSparse-8      10.2ns ± 3%

Merge/Estimate etc. are done for P=14.

Other implementations (in no particular order)

Documentation

Index

Constants

This section is empty.

Variables

View Source
var Alloc = func(n int) []byte {
	return make([]byte, n)
}

Alloc allocates the memory blob. It is a variable, so one can change it to use, say, sync.Pool.

View Source
var Free = func(blob []byte) {
}

Free returns the blob back. It is a variable, so one can change it to use, say, sync.Pool.

Functions

func DenseSizeByError

func DenseSizeByError(errorRate float64) (int, error)

DenseSizeByError returns a byte size of a dense HLL for a given errorRate. The error must be between 0.0253% and 26% (inclusive).

func DenseSizeByP

func DenseSizeByP(p int) (int, error)

DenseSizeByP returns a byte size of a dense HLL for a given precision. Precision (p) must be between 4 and 25 (inclusive).

func ErrFromP

func ErrFromP(p int) float64

ErrFromP returns an expected error rate for a given p.

func SizeByError

func SizeByError(errorRate float64) (int, error)

SizeByError returns a byte size of an HLL for a given errorRate. The error must be between 0.0253% and 26% (inclusive).

func SizeByP

func SizeByP(p int) (int, error)

SizeByP returns a byte size of an HLL for a given precision. Precision (p) must be between 4 and 25 (inclusive).

Types

type Dense

type Dense []byte

Dense implements hyper-log log from http://research.google.com/pubs/pub40671.html With bias correction and linear counting, but not the HLL++ part (not sparse). HLL uses constant memory (0.75 * 2^p bytes.) All operations are non-allocating.

func (Dense) Add

func (h Dense) Add(hash uint64) bool

Add a hash to an HLL. Returns true if cardinality esimate changed.

func (Dense) Clear

func (h Dense) Clear()

Clear resets the HLL.

func (Dense) EstimateCardinality

func (h Dense) EstimateCardinality() uint64

EstimateCardinality returns a cardinality estimate.

func (Dense) IsValid

func (h Dense) IsValid() error

IsValid checks whether HLL size makes sense.

func (Dense) Merge

func (h Dense) Merge(g Dense) error

Merge another HLL (of the same precision) into this.

type HLL

type HLL []byte

HLL is a hybrid hyper-loglog: either sparse or dense, switching from sparse to dense when needed. Note, both sparse and dense representation take exactly same space. Dense representation performs no allocations, sparse might need some when switching to dense.

Sparse mode estimate is exact. HLL is byte buffer friendly (no need to serialize/deserialize).

Layout: First 8 bit header. Leftmost bit: dirty. Second left most: mode (1: dense, 0: sparse).

sparse:

Next 30 bits: number of elements (big endian), 32 bits unused, followed by elements, 8 bytes each (uint64 little endian).
Note, the header is a part of sparse HLL.

dense:

Next 62 bits: previous cardinality esimate (big endian). Valid if !dirty. Followed by dense HLL.

full: 8 byte header, hll[0]&(1<<6) != 0, followed by dense HLL.

All operations are in place. Add/Merge might allocate a temporary buffer when switching from sparse to dense representation. Note, this is not HLL++ - it uses a different sparse representation.

Creating an HLL:

s, err := SizeByP(p) // Or SizeByError.

if err != nil {
    log.Panicln(err)
}

h := make(HLL, s)

h is empty and sparse at this point.

func (HLL) Add

func (h HLL) Add(hash uint64)

Add a hash to an HLL. Might allocate a block (with Alloc) if HLL is sparse and it gets full. Make sure to use a good hash function.

func (HLL) EstimateCardinality

func (h HLL) EstimateCardinality() uint64

EstimateCardinality returns a cardinality estimate. Note, EstimateCardinality might (will) modify the HLL iff HLL is dirty.

func (HLL) IsSparse

func (h HLL) IsSparse() bool

IsSparse returns true iff the underlying HLL is sparse (and thus the cardinality estimate is exact).

func (HLL) IsValid

func (h HLL) IsValid() error

IsValid checks whether HLL size makes sense.

func (HLL) Merge

func (h HLL) Merge(g HLL) error

Merge another HLL (of the same precision) into this. Might allocate a block (with Alloc) if HLL is sparse and it gets full.

func (HLL) Reset

func (h HLL) Reset()

Reset the HLL.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL