Documentation ¶
Overview ¶
simhash package implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.
simhash fingerprints have the property that similar documents will have a similar fingerprint. Therefore, the hamming distance between two fingerprints will be small if the documents are similar
Index ¶
- func Compare(a uint64, b uint64) uint8
- func Fingerprint(v Vector) uint64
- func NewFeature(f []byte) feature
- func NewFeatureWithWeight(f []byte, weight int) feature
- func Shingle(w int, b [][]byte) [][]byte
- func Simhash(fs FeatureSet) uint64
- func SimhashBytes(b [][]byte) uint64
- type Feature
- type FeatureSet
- type UnicodeWordFeatureSet
- type Vector
- type WordFeatureSet
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Compare ¶
Compare calculates the Hamming distance between two 64-bit integers
Currently, this is calculated using the Kernighan method [1]. Other methods exist which may be more efficient and are worth exploring at some point
[1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
func Fingerprint ¶
Fingerprint returns a 64-bit fingerprint of the given vector. The fingerprint f of a given 64-dimension vector v is defined as follows:
f[i] = 1 if v[i] >= 0 f[i] = 0 if v[i] < 0
func NewFeature ¶
func NewFeature(f []byte) feature
Returns a new feature representing the given byte slice, using a weight of 1
func NewFeatureWithWeight ¶
Returns a new feature representing the given byte slice with the given weight
func Shingle ¶
Shingle returns the w-shingling of the given set of bytes. For example, if the given input was {"this", "is", "a", "test"}, this returns {"this is", "is a", "a test"}
func SimhashBytes ¶
Returns a 64-bit simhash of the given bytes
Types ¶
type Feature ¶
type Feature interface { // Sum returns the 64-bit sum of this feature Sum() uint64 // Weight returns the weight of this feature Weight() int }
Feature consists of a 64-bit hash and a weight
type FeatureSet ¶
type FeatureSet interface {
GetFeatures() []Feature
}
FeatureSet represents a set of features in a given document
type UnicodeWordFeatureSet ¶
type UnicodeWordFeatureSet struct {
// contains filtered or unexported fields
}
UnicodeWordFeatureSet is a feature set in which each word is a feature, all equal weight.
See: http://blog.golang.org/normalization See: https://groups.google.com/forum/#!topic/golang-nuts/YyH1f_qCZVc
func NewUnicodeWordFeatureSet ¶
func NewUnicodeWordFeatureSet(b []byte, f norm.Form) *UnicodeWordFeatureSet
func (*UnicodeWordFeatureSet) GetFeatures ¶
func (w *UnicodeWordFeatureSet) GetFeatures() []Feature
Returns a []Feature representing each word in the byte slice
type Vector ¶
type Vector [64]int
func Vectorize ¶
Vectorize generates 64 dimension vectors given a set of features. Vectors are initialized to zero. The i-th element of the vector is then incremented by weight of the i-th feature if the i-th bit of the feature is set, and decremented by the weight of the i-th feature otherwise.
func VectorizeBytes ¶
VectorizeBytes generates 64 dimension vectors given a set of [][]byte, where each []byte is a feature with even weight.
Vectors are initialized to zero. The i-th element of the vector is then incremented by weight of the i-th feature if the i-th bit of the feature is set, and decremented by the weight of the i-th feature otherwise.
type WordFeatureSet ¶
type WordFeatureSet struct {
// contains filtered or unexported fields
}
WordFeatureSet is a feature set in which each word is a feature, all equal weight.
func NewWordFeatureSet ¶
func NewWordFeatureSet(b []byte) *WordFeatureSet
func (*WordFeatureSet) GetFeatures ¶
func (w *WordFeatureSet) GetFeatures() []Feature
Returns a []Feature representing each word in the byte slice