Documentation ¶
Overview ¶
This is a Go implementation of the HyperLogLog++ algorithm from "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" by Heule, Nunkesser and Hall of Google. This is a cardinality estimation algorithm: given a stream of input elements, it will estimate the number of unique items in the stream. The estimation error can be controlled by choosing how much memory to use. HyperLogLog++ improves on the basic HyperLogLog algorithm by using less space, improving accuracy, and correcting bias.
This code is a translation of the pseudocode contained in Figures 6 and 7 of the Google paper. Not all algorithms are provided in the paper, but we've tried our best to be true to the authors' intent when writing the omitted algorithms. We're not trying to be creative, we're just implementing the algorithm described in the paper as directly as possible.
The HyperLogLog++ paper is available at http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf
Package hll is a generated protocol buffer package. It is generated from these files: hll.proto It has these top-level messages: HllPb
Example ¶
const ( p = 14 // Max memory usage is 0.75 * 2^p bytes pPrime = 25 // Setting this is a bit more complicated, Google recommends 25. numToInsert = 1000000 ) // You can use any good hash function, truncated to 8 bytes to fit in a uint64. hashU64 := func(s string) uint64 { sha1Hash := sha1.Sum([]byte(s)) return binary.LittleEndian.Uint64(sha1Hash[0:8]) } hll := NewHll(p, pPrime) // For this example, our inputs will just be strings, e.g. "1", "2" for i := 0; i < numToInsert; i++ { inputString := strconv.Itoa(i) // To use HLL, you hash your item, convert the hash to uint64, and pass it to Add(). hll.Add(hashU64(inputString)) } // Duplicates do not affect the cardinality. The following loop has no effect. for i := 0; i < 10000; i++ { hll.Add(hashU64("1")) } // We inserted 1M unique elements, the cardinality should be roughly 1M. fmt.Printf("%d\n", hll.Cardinality())
Output: 989546
Index ¶
- Variables
- type Hll
- func (h *Hll) Add(x uint64)
- func (h *Hll) Cardinality() uint64
- func (h *Hll) Combine(other *Hll)
- func (h *Hll) GetBSON() (interface{}, error)
- func (h *Hll) GobDecode(data []byte) error
- func (h *Hll) GobEncode() ([]byte, error)
- func (h *Hll) MarshalJSON() ([]byte, error)
- func (h *Hll) MarshalPb() ([]byte, error)
- func (h *Hll) SetBSON(raw bson.Raw) error
- func (h *Hll) UnmarshalJSON(buf []byte) error
- func (h *Hll) UnmarshalPb(buf []byte) error
- type HllPb
- func (m *HllPb) GetM() []byte
- func (m *HllPb) GetP() int32
- func (m *HllPb) GetPp() int32
- func (m *HllPb) GetS() *HllPbSparse
- func (m *HllPb) Marshal() (data []byte, err error)
- func (m *HllPb) MarshalTo(data []byte) (int, error)
- func (*HllPb) ProtoMessage()
- func (m *HllPb) Reset()
- func (m *HllPb) Size() (n int)
- func (m *HllPb) String() string
- func (m *HllPb) Unmarshal(data []byte) error
- type HllPbSparse
- func (m *HllPbSparse) GetBuf() []byte
- func (m *HllPbSparse) GetLastVal() uint64
- func (m *HllPbSparse) GetNumElements() uint64
- func (m *HllPbSparse) Marshal() (data []byte, err error)
- func (m *HllPbSparse) MarshalTo(data []byte) (int, error)
- func (*HllPbSparse) ProtoMessage()
- func (m *HllPbSparse) Reset()
- func (m *HllPbSparse) Size() (n int)
- func (m *HllPbSparse) String() string
- func (m *HllPbSparse) Unmarshal(data []byte) error
Examples ¶
Constants ¶
This section is empty.
Variables ¶
var (
ErrInvalidLengthHll = fmt.Errorf("proto: negative length found during unmarshaling")
)
Functions ¶
This section is empty.
Types ¶
type Hll ¶
type Hll struct {
// contains filtered or unexported fields
}
func NewHll ¶
Initialize a new hyper-log-log struct based on inputs p and p'. Google recommends that p be set to 14, and p' to equal either 20 or 25.
func (*Hll) Add ¶
Add takes a hash and updates the cardinality estimation data structures.
The input should be a hash of whatever type you're estimating of. For example, if you're estimating the cardinality of a stream of strings, you'd pass the hash of each string to this function.
func (*Hll) Cardinality ¶
Returns the estimated cardinality (the number of unique inputs seen so far).
func (*Hll) Combine ¶
Combine() merges two HyperLogLog++ calculations. This allows you to parallelize cardinality estimation: each thread can process a shard of the input, then the results can be merged later to give the cardinality of the entire data set (the union of the shards).
WARNING: The "other" parameter may be mutated during this call! It may be converted from a sparse to dense representation, which may affect its space usage and precision. This is a deliberate design decision that helps to minimize memory consumption.
The inputs must have the same p and pPrime or this function will panic. The Google paper doesn't give an algorithm for this operation, but its existence is implied, and the ability to do this combine operation is one of the main benefits of using a HyperLogLog-type algorithm in the first place.
func (*Hll) MarshalJSON ¶
func (*Hll) UnmarshalJSON ¶
func (*Hll) UnmarshalPb ¶
type HllPb ¶
type HllPb struct { P *int32 `protobuf:"varint,1,req,name=p" json:"p,omitempty"` Pp *int32 `protobuf:"varint,2,req,name=pp" json:"pp,omitempty"` M []byte `protobuf:"bytes,3,opt" json:"M,omitempty"` S *HllPbSparse `protobuf:"bytes,4,opt,name=s" json:"s,omitempty"` XXX_unrecognized []byte `json:"-"` }
func (*HllPb) GetS ¶
func (m *HllPb) GetS() *HllPbSparse
func (*HllPb) ProtoMessage ¶
func (*HllPb) ProtoMessage()
type HllPbSparse ¶
type HllPbSparse struct { Buf []byte `protobuf:"bytes,1,opt,name=buf" json:"buf,omitempty"` LastVal *uint64 `protobuf:"varint,2,req,name=lastVal" json:"lastVal,omitempty"` NumElements *uint64 `protobuf:"varint,3,req,name=numElements" json:"numElements,omitempty"` XXX_unrecognized []byte `json:"-"` }
func (*HllPbSparse) GetBuf ¶
func (m *HllPbSparse) GetBuf() []byte
func (*HllPbSparse) GetLastVal ¶
func (m *HllPbSparse) GetLastVal() uint64
func (*HllPbSparse) GetNumElements ¶
func (m *HllPbSparse) GetNumElements() uint64
func (*HllPbSparse) Marshal ¶
func (m *HllPbSparse) Marshal() (data []byte, err error)
func (*HllPbSparse) ProtoMessage ¶
func (*HllPbSparse) ProtoMessage()
func (*HllPbSparse) Reset ¶
func (m *HllPbSparse) Reset()
func (*HllPbSparse) Size ¶
func (m *HllPbSparse) Size() (n int)
func (*HllPbSparse) String ¶
func (m *HllPbSparse) String() string
func (*HllPbSparse) Unmarshal ¶
func (m *HllPbSparse) Unmarshal(data []byte) error