minhash

package module
v0.0.0-...-58d649f Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 13, 2016 License: MIT Imports: 6 Imported by: 1

README

Package minhash provides probabilistic data structures for computing MinHash signatures for streaming data.

The reader should also confer https://github.com/dgryski/go-minhash and https://github.com/tylertreat/BoomFilters. In fact, most of the implementation details of this package are based off the former.

MinHash signatures can be used to estimate the Jaccard index J(A, B) := |A & B| / |A || B| of two sets that are subsets of some finite, totally ordered set U. If s is a permutation of U chosen uniformly at random, then x := argmin s(A || B) is just a random element chosen uniformly from A || B. It's clear that P(x in A & B) = J(A, B). Since min s(A) = min s(B) if and only if x is in A & B, we have just shown that P(min s(A) = min S(B)) = J(A, B).

The central idea of minhash signatures is to repeatedly perform the above experiment with different permutations as a way to estimate the underlying probability J(A, B) = P(an element x in A || B is also in A & B).

A length k minhash signature S(A) is theoretically generated by randomly choosing k permutations si (i=1, ..., k) in the symmetric group of U (group of bijective endofunctions on U) and computing hi(A) := min si(A) for each permutation. We take S(A) := [h1(A), ..., hk(A)]. Since each permutation is a bijection, min si(A) = min si(B) if and only if argmin si(A) = argmin si(B) and so we could just as well use these argmins, which is sometimes how the signature S(A) is defined.

Specifying permutations for large U is not efficient, and so we often take a family of integer-valued hash functions that are minwise independent, in the sense that for most sets A, min h(A) ! = min g(A) for two distinct hash functions in the family. Frequently this family is parametrically generated.

For more information:

MinHashing

BottomK

This package works best when provided with a strong 64-bit hash function, such as CityHash, Spooky, Murmur3, or SipHash.

MinWise

The MinWise data structure computes the MinHash for a set by creating a parametric family of hash functions. It is initialized with two hash functions and an integer size parameter. The two hash functions h1 and h2 generate the family h1 + i*h2 for i=1, ..., n, where n is the size of the signature we wish to compute.

Usage

Let's explore an example of ingesting a stream of data, sketching a signature, and computing the similarity between signatures.

package main

import (
	"log"

	"github.com/dgryski/go-farm"
	"github.com/dgryski/go-spooky"
	"github.com/shawnohare/go-minhash"
)

func main() {
	// Some pre-existing sets to compute the signatures of.
	S1 := []string{"5", "1", "2", "3", "4"}
	S2 := []int{1, 2, 3, 4, 5}
	words := []string{"idempotent", "condensation", "is", "good"}

	// Specify two hash functions to initialize all our MinWise instances
	// with.  These two hash functions generate a parametric family
	// of hash functions of cardinality = size.
	h1 := spooky.Hash64
	h2 := farm.Hash64
	size := 3

	// Init a MinWise instance for each set above.
	wmw := minhash.NewMinWise(h1, h2, size) // handle words set
	mw1 := minhash.NewMinWise(h1, h2, size) // handle S1
	mw2 := minhash.NewMinWise(h1, h2, size) // handle S2

	// Ingest the words set one element at a time.
	for _, w := range words {
		wmw.Push(w)
	}

	// Repeat the above, but with string integer data S1 and integer data S2.

	// Ingest S1.
	for _, x := range S1 {
		mw1.PushStringInt(x) // Note the different push function.
	}

	// Ingest S2.
	for _, x := range S2 {
		mw2.Push(x) // we use Push for both integers and word strings.
	}

	// NOTE In the above we used the Push and PushStringInt methods.
	// If the data is already represented as a set of []byte elements,
	// then the PushBytes function is slightly more efficient.

	// Comparing signatures.
	var s float64
	s = mw1.Similarity(mw2)
	log.Println("Similarity between signatures of S1 and S2:", s)

	// Output signatures for potential storage.
	var wordsSig []uint64
	var sig1 []uint64
	var sig2 []uint64
	wordsSig = wmw.Signature()
	sig1 = mw1.Signature()
	sig2 = mw2.Signature()
	log.Println("Signature for words set:", wordsSig)

	// Computing similarity of signatures directly.
	// Suppose we store sig1 and sig2 above and retrieve them as []int.
	// We can directly compare the similarities as follows.
	usig1 := make([]uint64, len(sig1))
	usig2 := make([]uint64, len(sig2))
	// First, convert to []uint64
	for i, v := range sig1 {
		usig1[i] = uint64(v)
		usig2[i] = uint64(sig2[i])
	}
	// Calculate similarities using the now appropriately typed signatures.
	simFromSigs := minhash.MinWiseSimilarity(usig1, usig2)
	log.Println("Similarity calcuted directly from signatures: ", simFromSigs)

	// Construct a MinWise instance from a signature.
	// If we want to continue to stream elements into the set represented by
	// sig1, we can convert it into a MinWise instance via
	m := minhash.NewMinWiseFromSignature(h1, h2, usig1)
	// We can now stream in more elements an update the signature.
	for i := 20; i <= 50; i++ {
		m.Push(i)
	}
	// Calculate new similarity between updated S1 and S2
	newSim := m.Similarity(mw2)
	log.Println("New similarity between signatures for S1 and S2:", newSim)
}

Documentation

Overview

Package minhash provides probabilistic data structures for computing MinHash signatures for streaming data from a min-wise independent parametric family of hash functions.

The reader should also conf https://github.com/dgryski/go-minhash and https://github.com/tylertreat/BoomFilters. In fact, most of the implementation details of this package are based off the former.

MinHash signatures can be used to estimate the Jaccard index J(A, B) := |A & B| / |A || B| of two sets that are subsets of some finite, totally ordered set U. If s is a permutation of U chosen uniformly at random, then x := argmin s(A || B) is just a random element chosen uniformly from A || B. It's clear that P(x in A & B) = J(A, B). Since min s(A) = min s(B) if and only if x is in A & B, we have just shown that P(min s(A) = min S(B)) = J(A, B).

The central idea of minhash signatures is to repeatedly perform the above experiment with different permutations as a way to estimate the underlying probability J(A, B) = P(an element x in A || B is also in A & B).

A length k minhash signature S(A) is theoretically generated by randomly choosing k permutations si (i=1, ..., k) in the symmetric group of U (group of bijective endofunctions on U) and computing hi(A) := min si(A) for each permutation. We take S(A) := [h1(A), ..., hk(A)]. Since each permutation is a bijection, min si(A) = min si(B) if and only if argmin si(A) = argmin si(B) and so we could just as well use these argmins, which is sometimes how the signature S(A) is defined.

Specifying permutations for large U is not efficient, and so we often take a family of integer-valued hash functions that are minwise independent, in the sense that for most sets A, min h(A) ! = min g(A) for two distinct hash functions in the family. Frequently this family is parametrically generated.

For more information,

http://research.neustar.biz/2012/07/09/sketch-of-the-day-k-minimum-values/

MinHashing:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
https://en.wikipedia.org/wiki/MinHash

BottomK:
http://www.math.tau.ac.il/~haimk/papers/p225-cohen.pdf
http://cohenwang.org/edith/Papers/metrics394-cohen.pdf

http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer07distinct.pdf

This package works best when provided with a strong 64-bit hash function, such as CityHash, Spooky, Murmur3, or SipHash.

Index

Constants

View Source
const (
	MaxUint uint = ^uint(0)
	MaxInt  int  = int(MaxUint >> 1)
)

Variables

This section is empty.

Functions

func Cardinality

func Cardinality(m Interface) int

Cardinality estimates the cardinality of the set. Both the signature for the empty set and the zero signature have cardinality 0.

func IntersectionCardinality

func IntersectionCardinality(m1, m2 Interface) int

IntersectionCardinality estimates the cardinality of the intersection.

func IsEmpty

func IsEmpty(m Interface) bool

func LessCardinality

func LessCardinality(m1, m2 Interface) int

LessCardinality estimates the cardinality of the left set minus the right set. This operator is not symmetric.

func Similarity

func Similarity(x, y Interface) float64

MinHashSimilarity computes an estimate for the Jaccard similarity of two sets given their MinHash signatures.

func SymmetricDifferenceCardinality

func SymmetricDifferenceCardinality(m1, m2 Interface) int

SymmetricDifferenceCardinality estimates the difference between the cardinality of the union and intersection.

func UnionCardinality

func UnionCardinality(m1, m2 Interface) int

UnionCardinality estimates the cardinality of the union.

Types

type HashFunc

type HashFunc func([]byte) uint64

type Interface

type Interface interface {
	// Signature returns the signature itself.
	Signature() []uint64
}

Interface is an a probabilistic data structure used to compute a similarity preserving signature for a set. It ingests a stream of the set's elements and continuously updates the signature.

type MinHash

type MinHash struct {
	// contains filtered or unexported fields
}

MinHash is a data structure for generating a min-wise independent parametric family of hash functions of the form h1 + i*h2 for i=1, ..., k in order to to compute a MinHash signature. Each instance is tied to a single streamed set and hence single signature. As the instance ingests elements it keeps track of the minimum for the ith hash function.

func New

func New(h1, h2 HashFunc, size int) *MinHash

New constructs a new MinHash instance.

func NewFromSignature

func NewFromSignature(h1, h2 HashFunc, signature []uint64) *MinHash

NewFromSignature creates a MinHash initialzed with the input signature.

func NewMinHash

func NewMinHash(h1, h2 HashFunc, size int) *MinHash

NewMinHash is deprecated. Use New instead.

func NewMinHashFromSignature

func NewMinHashFromSignature(h1, h2 HashFunc, sig []uint64) *MinHash

NewMinHashFromSignature is deprecated. Use NewFromSignature instead.

func (*MinHash) Cardinality

func (m *MinHash) Cardinality() int

Cardinality estimates the cardinality of the set. Both the signature for the empty set and the zero signature have 0 cardinality.

func (*MinHash) Copy

func (m *MinHash) Copy() *MinHash

Copy returns a new MinHash instance with the same type and data.

func (*MinHash) IntersectionCardinality

func (m *MinHash) IntersectionCardinality(m2 Interface) int

IntersectionCardinality estimates the cardinality of the intersection.

func (*MinHash) IsEmpty

func (m *MinHash) IsEmpty() bool

IsEmpty reports whether the MinHash instance represents a signature of the empty set. Note that it's possible for a signature of a non-empty set to equal the signature for the empty set in rare circumstances (e.g., when the hash family is not min-wise independent).

func (*MinHash) LessCardinality

func (m *MinHash) LessCardinality(m2 Interface) int

LessCardinality estimates the cardinality of the left set minus the right set. This operator is not symmetric.

func (*MinHash) Merge

func (m *MinHash) Merge(m2 Interface) error

Merge combines the signatures of the second set, creating the signature of their union.

func (*MinHash) Push

func (m *MinHash) Push(x interface{})

Push deals with generic data by handling byte conversion. It first hashes the input with each function in the instance's family, and then compares these values to the set of current mins, updating them as necessary.

func (*MinHash) PushBytes

func (m *MinHash) PushBytes(b []byte)

PushBytes updates the set's signature. It hashes the input with each function in the family and compares these values with the current set of mins, replacing them as necessary.

func (*MinHash) PushString

func (m *MinHash) PushString(s string)

PushString casts the input as a []byte and pushes the element.

func (*MinHash) PushStringInt

func (m *MinHash) PushStringInt(s string)

PushStringInt first converts a string into a uint64 before pushing.

func (*MinHash) PushStrings

func (m *MinHash) PushStrings(ss ...string)

func (*MinHash) SetHashes

func (m *MinHash) SetHashes(h1 HashFunc, h2 HashFunc)

SetHashes sets the two hash functions the MinHash uses to generate a parametric family of hash functions from which the set signature is derived.

func (*MinHash) SetSignature

func (m *MinHash) SetSignature(signature []uint64)

SetSignature sets the instance's underlying signature. This can change the type of the MinHash if the input signature length differs from the length of the instance's signature.

func (*MinHash) Signature

func (m *MinHash) Signature() []uint64

Signature returns the underlying signature slice.

func (*MinHash) Similarity

func (m *MinHash) Similarity(m2 Interface) float64

Similarity computes the similarity of two signatures represented as MinHash instances. This estimates the Jaccard index of the two underlying sets.

func (*MinHash) SymmetricDifferenceCardinality

func (m *MinHash) SymmetricDifferenceCardinality(m2 Interface) int

SymmetricDifferenceCardinality estimates the difference between the cardinality of the union and intersection.

func (*MinHash) UnionCardinality

func (m *MinHash) UnionCardinality(m2 Interface) int

UnionCardinality estimates the cardinality of the union.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL