Documentation ¶
Overview ¶
simhash package implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.
simhash fingerprints have the property that similar documents will have a similar fingerprint. Therefore, the hamming distance between two fingerprints will be small if the documents are similar
Example (Output) ¶
for standalone test, change package to `main` and the next func def to, func main() {
//package main package main import ( "fmt" "github.com/go-dedup/simhash" ) // for standalone test, change package to `main` and the next func def to, // func main() { func main() { hashes := make([]uint64, len(docs)) sh := simhash.NewSimhash() for i, d := range docs { hashes[i] = sh.GetSimhash(sh.NewWordFeatureSet(d)) fmt.Printf("Simhash of '%s': %x\n", d, hashes[i]) } fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1])) fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2])) fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3])) } var docs = [][]byte{ []byte("this is a test phrase"), []byte("this is a test phrass"), []byte("these are test phrases"), []byte("foo bar"), }
Output: Simhash of 'this is a test phrase': 8c3a5f7e9ecb3f35 Simhash of 'this is a test phrass': 8c3a5f7e9ecb3f21 Simhash of 'these are test phrases': ddfdbf7fbfaffb1d Simhash of 'foo bar': d8dbe7186bad3db3 Comparison of `this is a test phrase` and `this is a test phrass`: 2 Comparison of `this is a test phrase` and `these are test phrases`: 22 Comparison of `this is a test phrase` and `foo bar`: 29
Index ¶
- func Compare(a uint64, b uint64) uint8
- func NewFeature(f []byte) feature
- func NewFeatureWithWeight(f []byte, weight int) feature
- type Feature
- type FeatureSet
- type Simhash
- type SimhashBase
- func (st *SimhashBase) Fingerprint(v Vector) uint64
- func (st *SimhashBase) GetSimhash(fs FeatureSet) uint64
- func (st *SimhashBase) NewWordFeatureSet(b []byte) *WordFeatureSet
- func (st *SimhashBase) Shingle(w int, b [][]byte) [][]byte
- func (st *SimhashBase) SimhashBytes(b [][]byte) uint64
- func (st *SimhashBase) Vectorize(features []Feature) Vector
- func (st *SimhashBase) VectorizeBytes(features [][]byte) Vector
- type Vector
- type WordFeatureSet
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Compare ¶
Compare calculates the Hamming distance between two 64-bit integers
Currently, this is calculated using the Kernighan method [1]. Other methods exist which may be more efficient and are worth exploring at some point
[1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
func NewFeature ¶
func NewFeature(f []byte) feature
Returns a new feature representing the given byte slice, using a weight of 1
func NewFeatureWithWeight ¶
Returns a new feature representing the given byte slice with the given weight
Types ¶
type Feature ¶
type Feature interface { // Sum returns the 64-bit sum of this feature Sum() uint64 // Weight returns the weight of this feature Weight() int }
Feature consists of a 64-bit hash and a weight
type FeatureSet ¶
type FeatureSet interface {
GetFeatures() []Feature
}
FeatureSet represents a set of features in a given document
type Simhash ¶
type Simhash interface { NewSimhash() *Simhash Vectorize(features []Feature) Vector VectorizeBytes(features [][]byte) Vector Fingerprint(v Vector) uint64 GetSimhash(fs FeatureSet) uint64 SimhashBytes(b [][]byte) uint64 NewWordFeatureSet(b []byte) *WordFeatureSet Shingle(w int, b [][]byte) [][]byte }
type SimhashBase ¶
type SimhashBase struct { }
func (*SimhashBase) Fingerprint ¶
func (st *SimhashBase) Fingerprint(v Vector) uint64
Fingerprint returns a 64-bit fingerprint of the given vector. The fingerprint f of a given 64-dimension vector v is defined as follows:
f[i] = 1 if v[i] >= 0 f[i] = 0 if v[i] < 0
func (*SimhashBase) GetSimhash ¶
func (st *SimhashBase) GetSimhash(fs FeatureSet) uint64
GetSimhash returns a 64-bit simhash of the given feature set
func (*SimhashBase) NewWordFeatureSet ¶
func (st *SimhashBase) NewWordFeatureSet(b []byte) *WordFeatureSet
func (*SimhashBase) Shingle ¶
func (st *SimhashBase) Shingle(w int, b [][]byte) [][]byte
Shingle returns the w-shingling of the given set of bytes. For example, if the given input was {"this", "is", "a", "test"}, this returns {"this is", "is a", "a test"}
func (*SimhashBase) SimhashBytes ¶
func (st *SimhashBase) SimhashBytes(b [][]byte) uint64
Returns a 64-bit simhash of the given bytes
func (*SimhashBase) Vectorize ¶
func (st *SimhashBase) Vectorize(features []Feature) Vector
Vectorize generates 64 dimension vectors given a set of features. Vectors are initialized to zero. The i-th element of the vector is then incremented by weight of the i-th feature if the i-th bit of the feature is set, and decremented by the weight of the i-th feature otherwise.
func (*SimhashBase) VectorizeBytes ¶
func (st *SimhashBase) VectorizeBytes(features [][]byte) Vector
VectorizeBytes generates 64 dimension vectors given a set of [][]byte, where each []byte is a feature with even weight.
Vectors are initialized to zero. The i-th element of the vector is then incremented by weight of the i-th feature if the i-th bit of the feature is set, and decremented by the weight of the i-th feature otherwise.
type WordFeatureSet ¶
type WordFeatureSet struct {
B []byte
}
WordFeatureSet is a feature set in which each word is a feature, all equal weight.
func (*WordFeatureSet) GetFeatures ¶
func (w *WordFeatureSet) GetFeatures() []Feature
Returns a []Feature representing each word in the byte slice
func (*WordFeatureSet) Normalize ¶
func (w *WordFeatureSet) Normalize()
Directories ¶
Path | Synopsis |
---|---|
sho -- SimHash Oracle, checks if a fingerprint is similar to existing ones.
|
sho -- SimHash Oracle, checks if a fingerprint is similar to existing ones. |
simhashCJK -- simhash language-specific handling for CJK.
|
simhashCJK -- simhash language-specific handling for CJK. |
simhashEng -- simhash language-specific handling for English.
|
simhashEng -- simhash language-specific handling for English. |
simhashUTF -- simhash language-specific handling for UTF.
|
simhashUTF -- simhash language-specific handling for UTF. |