Documentation ¶
Overview ¶
simhash package implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.
simhash fingerprints have the property that similar documents have similar fingerprints. Therefore, the Hamming distance between two fingerprints will be small if the documents are similar.
Example (Output) ¶
    //package main
    package main

    import (
        "fmt"

        "github.com/go-dedup/simhash"
    )

    // for standalone test, change package to `main` and the next func def to,
    // func main() {
    func main() {
        hashes := make([]uint64, len(docs))
        sh := simhash.NewSimhash()
        for i, d := range docs {
            hashes[i] = sh.GetSimhash(sh.NewWordFeatureSet(d))
            fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
        }

        fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
        fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
        fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))
    }

    var docs = [][]byte{
        []byte("this is a test phrase"),
        []byte("this is a test phrass"),
        []byte("these are test phrases"),
        []byte("foo bar"),
    }
Output:

    Simhash of 'this is a test phrase': 8c3a5f7e9ecb3f35
    Simhash of 'this is a test phrass': 8c3a5f7e9ecb3f21
    Simhash of 'these are test phrases': ddfdbf7fbfaffb1d
    Simhash of 'foo bar': d8dbe7186bad3db3
    Comparison of `this is a test phrase` and `this is a test phrass`: 2
    Comparison of `this is a test phrase` and `these are test phrases`: 22
    Comparison of `this is a test phrase` and `foo bar`: 29
Index ¶
- Variables
- func Compare(a uint64, b uint64) uint8
- func NewFeature(f []byte) feature
- func NewFeatureWithWeight(f []byte, weight int) feature
- type Feature
- type FeatureSet
- type Simhash
- type SimhashBase
- func (st *SimhashBase) BuildSimhash(doc string, doc2words text.Doc2Words) uint64
- func (st *SimhashBase) Fingerprint(v Vector) uint64
- func (st *SimhashBase) GetSimhash(fs FeatureSet) uint64
- func (st *SimhashBase) NewWordFeatureSet(b []byte) *WordFeatureSet
- func (st *SimhashBase) Shingle(w int, b [][]byte) [][]byte
- func (st *SimhashBase) SimhashBytes(b [][]byte) uint64
- func (st *SimhashBase) Vectorize(features []Feature) Vector
- func (st *SimhashBase) VectorizeBytes(features [][]byte) Vector
- type Vector
- type WordFeatureSet
Variables ¶
    var Doc2words = text.GetWordsFactory(text.Decorators(
        text.SplitCamelCase,
        text.ToLower,
        text.RemovePunctuation,
        text.Compact,
        text.Trim,
    ))
Functions ¶
func Compare ¶
func Compare(a uint64, b uint64) uint8
Compare calculates the Hamming distance between two 64-bit integers.
Currently, this is calculated using the Kernighan method [1]. Other methods exist that may be more efficient and are worth exploring at some point.
[1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
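The Kernighan method mentioned above can be sketched as follows. This is a minimal illustration, not the package's source: `hammingDistance` is a hypothetical name, and the input fingerprints are copied from the Overview example output.

```go
package main

import "fmt"

// hammingDistance is a hypothetical stand-in for Compare: it counts the bit
// positions in which a and b differ, using Kernighan's bit-clearing loop.
func hammingDistance(a, b uint64) uint8 {
	v := a ^ b // set bits mark the positions where a and b differ
	var c uint8
	for ; v != 0; c++ {
		v &= v - 1 // clear the lowest set bit
	}
	return c
}

func main() {
	// Fingerprints taken from the Overview example output.
	fmt.Println(hammingDistance(0x8c3a5f7e9ecb3f35, 0x8c3a5f7e9ecb3f21)) // prints 2
	fmt.Println(hammingDistance(0x8c3a5f7e9ecb3f35, 0xddfdbf7fbfaffb1d)) // prints 22
}
```

The loop runs once per differing bit, so it is cheap when fingerprints are close. One of the alternative methods is the standard library's `bits.OnesCount64(a ^ b)` from `math/bits`, which returns the same count.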
Example ¶
    //package main
    package main

    import (
        "fmt"

        "github.com/go-dedup/simhash"
    )

    func testit() {
        hashes := make([]uint64, len(doc2))
        sh := simhash.NewSimhash()
        for i, d := range doc2 {
            hashes[i] = sh.GetSimhash(sh.NewWordFeatureSet(d))
            fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
        }

        fmt.Printf("Comparison of `%s` and `%s`: %d\n", doc2[0], doc2[1], simhash.Compare(hashes[0], hashes[1]))
        fmt.Printf("Comparison of `%s` and `%s`: %d\n", doc2[0], doc2[2], simhash.Compare(hashes[0], hashes[2]))
        fmt.Printf("Comparison of `%s` and `%s`: %d\n", doc2[0], doc2[3], simhash.Compare(hashes[0], hashes[3]))
    }

    // for standalone test, change package to `main` and the next func def to,
    // func main() {
    func main() {
        doc2 = [][]byte{
            []byte("Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic"),
            []byte("2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic"),
            []byte("2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic"),
            []byte("2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic"),
        }
        testit()
        fmt.Println("================")
        doc2 = [][]byte{
            []byte("2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic"),
            []byte("2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic"),
            []byte("Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic"),
            []byte("2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic"),
        }
        testit()
    }

    var doc2 = [][]byte{}
Output:

    Simhash of 'Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic': 1832c51ee6eb2e3e
    Simhash of '2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic': 832df1ef4eb2e3e
    Simhash of '2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic': 8329706e4eb2f3d
    Simhash of '2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic': 8b2df0ea6eb2f3c
    Comparison of `Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic` and `2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic`: 6
    Comparison of `Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic` and `2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic`: 10
    Comparison of `Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic` and `2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic`: 9
    ================
    Simhash of '2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic': 8329706e4eb2f3d
    Simhash of '2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic': 8b2df0ea6eb2f3c
    Simhash of 'Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic': 1832c51ee6eb2e3e
    Simhash of '2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic': 832df1ef4eb2e3e
    Comparison of `2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic` and `2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic`: 7
    Comparison of `2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic` and `Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic`: 10
    Comparison of `2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic` and `2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic`: 8
func NewFeature ¶
func NewFeature(f []byte) feature
NewFeature returns a new feature representing the given byte slice, using a weight of 1.
func NewFeatureWithWeight ¶
func NewFeatureWithWeight(f []byte, weight int) feature
NewFeatureWithWeight returns a new feature representing the given byte slice with the given weight.
Types ¶
type Feature ¶
    type Feature interface {
        // Sum returns the 64-bit sum of this feature
        Sum() uint64

        // Weight returns the weight of this feature
        Weight() int
    }

Feature consists of a 64-bit hash and a weight.
func BuildFeatures ¶
BuildFeatures returns a []Feature representing each word in the given document.
Example ¶
    for _, d := range testDoc {
        fmt.Printf("%#v\n", BuildFeatures(string(d), Doc2words))
    }
Output: []simhash.Feature{simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7b9, weight:1}, simhash.feature{sum:0xd98001186c3a6c5d, weight:1}, simhash.feature{sum:0x7a37c1ae2e57fa88, weight:1}, simhash.feature{sum:0x8326407b4eb32ae, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3730, weight:1}, simhash.feature{sum:0xd8d9b1186bad4d2f, weight:1}, simhash.feature{sum:0x2c5b792934c8464e, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0x93104c7ea350e1e1, weight:1}, simhash.feature{sum:0x8329307b4eb82ae, weight:1}, simhash.feature{sum:0x14dfbd7eecce8288, weight:1}, simhash.feature{sum:0x8325507b4eb192b, weight:1}, simhash.feature{sum:0xd8cbcd186ba13ffc, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c18, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x214486cdc2d73f89, weight:1}, simhash.feature{sum:0x3d52262f868f65ad, weight:1}, simhash.feature{sum:0xd8d299186ba70599, weight:1}, simhash.feature{sum:0xd8adc6186b88367f, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x58bc5a1361284f0c, weight:1}, simhash.feature{sum:0xd8c8ad186b9ed323, weight:1}, simhash.feature{sum:0xd8a2cd186b7e3a1e, weight:1}, simhash.feature{sum:0x8325907b4eb2076, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0xf160267ed875749b, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8a8c7bb9849d48f6, weight:1}, simhash.feature{sum:0x34e6e73324cc4c1c, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, 
weight:1}, simhash.feature{sum:0xbc78285d51f8f350, weight:1}, simhash.feature{sum:0x8325907b4eb2076, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x2c5b792934c8464e, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x1c7c2e0d9eb9677d, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}} []simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c8a, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x3075dfaf5552d79e, weight:1}, simhash.feature{sum:0xc5c6ff7fe1f34c8a, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x3075dfaf5552d79e, weight:1}, simhash.feature{sum:0x192cc0ca1d77458, weight:1}, simhash.feature{sum:0x6f5db37e8ecc76fd, weight:1}, simhash.feature{sum:0xdbfd3fbe6190d762, weight:1}, simhash.feature{sum:0x54c9ed4b266da2a5, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0xd8d5c1186ba97fdd, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xda392f7af918887b, weight:1}, simhash.feature{sum:0x357ef82f825da4b8, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xb77e117eb8748afb, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x22b4b6630fb27c45, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x8fc1c6be36e055d6, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x2b94c0591a2848b9, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x8326707b4eb37b4, weight:1}, simhash.feature{sum:0x91a1dacc76ac782e, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3736, 
weight:1}, simhash.feature{sum:0x150fbd7eecf7a6ce, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xf160267ed875749b, weight:1}, simhash.feature{sum:0xd8adc6186b88367f, weight:1}, simhash.feature{sum:0xc4e8fa88937cb69, weight:1}, simhash.feature{sum:0x8325907b4eb207e, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0x27132a7ef75d598d, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7b6, weight:1}, simhash.feature{sum:0x873cad20b5b03ae4, weight:1}, simhash.feature{sum:0x8329607b4eb8787, weight:1}, simhash.feature{sum:0x2e047881b2f11bf2, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}} []simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c8f, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0xfd0c9853db565f2f, weight:1}, simhash.feature{sum:0xd7f4302f4de077d2, weight:1}, simhash.feature{sum:0xf2b15d4ce63f5477, weight:1}, simhash.feature{sum:0xd8bac5186b92beb0, weight:1}, simhash.feature{sum:0x27132a7ef75d598d, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x12d6ee02dfea32b8, weight:1}, simhash.feature{sum:0x6976a39422c2abd8, weight:1}, simhash.feature{sum:0x8325a07b4eb21ac, weight:1}, simhash.feature{sum:0xde2e60d07d4ebdb0, weight:1}, simhash.feature{sum:0xfff236c5f092af95, weight:1}, simhash.feature{sum:0xd8adc1186b882df7, weight:1}, simhash.feature{sum:0x656c734f40ac6679, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x5d60a51e6eb33462, weight:1}, simhash.feature{sum:0xe98a0708a4b03ab7, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x8fde7a602c8faa3a, weight:1}, simhash.feature{sum:0x8329707b4eb895d, weight:1}, simhash.feature{sum:0xb8019c1cc35ecab1, weight:1}, simhash.feature{sum:0x37cbb6f821eaff03, weight:1}, simhash.feature{sum:0x150fb27eecf79469, 
weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xa4fb0fc51c2e551b, weight:1}, simhash.feature{sum:0x8329707b4eb895c, weight:1}, simhash.feature{sum:0x246c8b28007c1970, weight:1}, simhash.feature{sum:0x3716c7ee2e72321, weight:1}, simhash.feature{sum:0xf8b0c02f5fe0b257, weight:1}, simhash.feature{sum:0xd89cb9186b79aca8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x4be610a6aef6c731, weight:1}, simhash.feature{sum:0x1cb6df7ef1041835, weight:1}, simhash.feature{sum:0xfcdeddf9a175b394, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}} []simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c89, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x1cbe3da5da62b610, weight:1}, simhash.feature{sum:0x7aa9362fa9816155, weight:1}, simhash.feature{sum:0xd89fad186b7bce35, weight:1}, simhash.feature{sum:0x12b142e963d1682d, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x38b39e054e1c1b67, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x9ab4937ea75b5c59, weight:1}, simhash.feature{sum:0x268587373f1f77b5, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x41ef33d8e01cb16c, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0xec8584acc12fcf27, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x7306888cb4e8ab75, weight:1}, simhash.feature{sum:0x6976a39422c2abd8, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x5a079a2f9797da68, weight:1}, simhash.feature{sum:0x9e8e79746e7ee735, 
weight:1}, simhash.feature{sum:0x3d52262f868f65ad, weight:1}, simhash.feature{sum:0x873cad20b5b03ae4, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xc5c6ff7fe1f34c89, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3730, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x16af988c443cff2b, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xd88a107ce8dad5a6, weight:1}, simhash.feature{sum:0x5bf18352ec4156d9, weight:1}, simhash.feature{sum:0x8c1a4d7e9fdb4a74, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0x8326107b4eb2dc7, weight:1}, simhash.feature{sum:0xd8cbc7186ba1352e, weight:1}, simhash.feature{sum:0xf90ceea98fba79f6, weight:1}, simhash.feature{sum:0xe8860067f74f9fbc, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}} []simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c8f, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0xfd0c9853db565f2f, weight:1}, simhash.feature{sum:0xd7f4302f4de077d2, weight:1}, simhash.feature{sum:0xf2b15d4ce63f5477, weight:1}, simhash.feature{sum:0xd8bac5186b92beb0, weight:1}, simhash.feature{sum:0x27132a7ef75d598d, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x12d6ee02dfea32b8, weight:1}, simhash.feature{sum:0x6976a39422c2abd8, weight:1}, simhash.feature{sum:0x8325a07b4eb21ac, weight:1}, simhash.feature{sum:0xde2e60d07d4ebdb0, weight:1}, simhash.feature{sum:0xfff236c5f092af95, weight:1}, simhash.feature{sum:0xd8adc1186b882df7, weight:1}, simhash.feature{sum:0x656c734f40ac6679, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x5d60a51e6eb33462, weight:1}, simhash.feature{sum:0xe98a0708a4b03ab7, weight:1}, simhash.feature{sum:0x150fb27eecf79469, 
weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x8fde7a602c8faa3a, weight:1}, simhash.feature{sum:0x8329707b4eb895d, weight:1}, simhash.feature{sum:0xb8019c1cc35ecab1, weight:1}, simhash.feature{sum:0x37cbb6f821eaff03, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xa4fb0fc51c2e551b, weight:1}, simhash.feature{sum:0x8329707b4eb895c, weight:1}, simhash.feature{sum:0x246c8b28007c1970, weight:1}, simhash.feature{sum:0x3716c7ee2e72321, weight:1}, simhash.feature{sum:0xf8b0c02f5fe0b257, weight:1}, simhash.feature{sum:0xd89cb9186b79aca8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x4be610a6aef6c731, weight:1}, simhash.feature{sum:0x1cb6df7ef1041835, weight:1}, simhash.feature{sum:0xfcdeddf9a175b394, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}} []simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c89, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x1cbe3da5da62b610, weight:1}, simhash.feature{sum:0x7aa9362fa9816155, weight:1}, simhash.feature{sum:0xd89fad186b7bce35, weight:1}, simhash.feature{sum:0x12b142e963d1682d, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x38b39e054e1c1b67, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x9ab4937ea75b5c59, weight:1}, simhash.feature{sum:0x268587373f1f77b5, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x41ef33d8e01cb16c, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0xec8584acc12fcf27, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x7306888cb4e8ab75, 
weight:1}, simhash.feature{sum:0x6976a39422c2abd8, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x5a079a2f9797da68, weight:1}, simhash.feature{sum:0x9e8e79746e7ee735, weight:1}, simhash.feature{sum:0x3d52262f868f65ad, weight:1}, simhash.feature{sum:0x873cad20b5b03ae4, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xc5c6ff7fe1f34c89, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3730, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x16af988c443cff2b, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xd88a107ce8dad5a6, weight:1}, simhash.feature{sum:0x5bf18352ec4156d9, weight:1}, simhash.feature{sum:0x8c1a4d7e9fdb4a74, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0x8326107b4eb2dc7, weight:1}, simhash.feature{sum:0xd8cbc7186ba1352e, weight:1}, simhash.feature{sum:0xf90ceea98fba79f6, weight:1}, simhash.feature{sum:0xe8860067f74f9fbc, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}} []simhash.Feature{simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7b9, weight:1}, simhash.feature{sum:0xd98001186c3a6c5d, weight:1}, simhash.feature{sum:0x7a37c1ae2e57fa88, weight:1}, simhash.feature{sum:0x8326407b4eb32ae, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3730, weight:1}, simhash.feature{sum:0xd8d9b1186bad4d2f, weight:1}, simhash.feature{sum:0x2c5b792934c8464e, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0x93104c7ea350e1e1, weight:1}, simhash.feature{sum:0x8329307b4eb82ae, 
weight:1}, simhash.feature{sum:0x14dfbd7eecce8288, weight:1}, simhash.feature{sum:0x8325507b4eb192b, weight:1}, simhash.feature{sum:0xd8cbcd186ba13ffc, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c18, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7be, weight:1}, simhash.feature{sum:0x214486cdc2d73f89, weight:1}, simhash.feature{sum:0x3d52262f868f65ad, weight:1}, simhash.feature{sum:0xd8d299186ba70599, weight:1}, simhash.feature{sum:0xd8adc6186b88367f, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x58bc5a1361284f0c, weight:1}, simhash.feature{sum:0xd8c8ad186b9ed323, weight:1}, simhash.feature{sum:0xd8a2cd186b7e3a1e, weight:1}, simhash.feature{sum:0x8325907b4eb2076, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0xf160267ed875749b, weight:1}, simhash.feature{sum:0x150fb27eecf79469, weight:1}, simhash.feature{sum:0x8a8c7bb9849d48f6, weight:1}, simhash.feature{sum:0x34e6e73324cc4c1c, weight:1}, simhash.feature{sum:0x8325407b4eb17fe, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0xbc78285d51f8f350, weight:1}, simhash.feature{sum:0x8325907b4eb2076, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x2c5b792934c8464e, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x1c7c2e0d9eb9677d, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}} []simhash.Feature{simhash.feature{sum:0xc5c6ff7fe1f34c8a, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, weight:1}, simhash.feature{sum:0x3075dfaf5552d79e, weight:1}, simhash.feature{sum:0xc5c6ff7fe1f34c8a, weight:1}, simhash.feature{sum:0x3787c7ee2ed5d4e, 
weight:1}, simhash.feature{sum:0x3075dfaf5552d79e, weight:1}, simhash.feature{sum:0x192cc0ca1d77458, weight:1}, simhash.feature{sum:0x6f5db37e8ecc76fd, weight:1}, simhash.feature{sum:0xdbfd3fbe6190d762, weight:1}, simhash.feature{sum:0x54c9ed4b266da2a5, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0xd8d5c1186ba97fdd, weight:1}, simhash.feature{sum:0x8325f07b4eb2a31, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0xda392f7af918887b, weight:1}, simhash.feature{sum:0x357ef82f825da4b8, weight:1}, simhash.feature{sum:0xd8dcc6186bafa6b8, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xb77e117eb8748afb, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x22b4b6630fb27c45, weight:1}, simhash.feature{sum:0x8c1a417e9fdb35c5, weight:1}, simhash.feature{sum:0x8fc1c6be36e055d6, weight:1}, simhash.feature{sum:0xd8c4c1186b9b0c0f, weight:1}, simhash.feature{sum:0x2b94c0591a2848b9, weight:1}, simhash.feature{sum:0x26feff7ef74c67b7, weight:1}, simhash.feature{sum:0x8325f07b4eb2a2c, weight:1}, simhash.feature{sum:0x8326707b4eb37b4, weight:1}, simhash.feature{sum:0x91a1dacc76ac782e, weight:1}, simhash.feature{sum:0xd8b0a7186b8a3736, weight:1}, simhash.feature{sum:0x150fbd7eecf7a6ce, weight:1}, simhash.feature{sum:0x8325f07b4eb2a36, weight:1}, simhash.feature{sum:0xf160267ed875749b, weight:1}, simhash.feature{sum:0xd8adc6186b88367f, weight:1}, simhash.feature{sum:0xc4e8fa88937cb69, weight:1}, simhash.feature{sum:0x8325907b4eb207e, weight:1}, simhash.feature{sum:0xd89cc2186b79bc7e, weight:1}, simhash.feature{sum:0x27132a7ef75d598d, weight:1}, simhash.feature{sum:0xaf63bd4c8601b7b6, weight:1}, simhash.feature{sum:0x873cad20b5b03ae4, weight:1}, simhash.feature{sum:0x8329607b4eb8787, weight:1}, simhash.feature{sum:0x2e047881b2f11bf2, weight:1}, simhash.feature{sum:0x3035a365e168961e, weight:1}}
type FeatureSet ¶
    type FeatureSet interface {
        GetFeatures() []Feature
    }

FeatureSet represents a set of features in a given document.
type Simhash ¶
    type Simhash interface {
        NewSimhash() *Simhash
        Vectorize(features []Feature) Vector
        VectorizeBytes(features [][]byte) Vector
        Fingerprint(v Vector) uint64
        BuildSimhash(doc string, doc2words text.Doc2Words) uint64
        GetSimhash(fs FeatureSet) uint64
        SimhashBytes(b [][]byte) uint64
        NewWordFeatureSet(b []byte) *WordFeatureSet
        Shingle(w int, b [][]byte) [][]byte
    }
type SimhashBase ¶
type SimhashBase struct { }
func (*SimhashBase) BuildSimhash ¶
func (st *SimhashBase) BuildSimhash(doc string, doc2words text.Doc2Words) uint64
BuildSimhash returns a 64-bit simhash of the given string.
func (*SimhashBase) Fingerprint ¶
func (st *SimhashBase) Fingerprint(v Vector) uint64
Fingerprint returns a 64-bit fingerprint of the given vector. The fingerprint f of a given 64-dimension vector v is defined as follows:
    f[i] = 1 if v[i] >= 0
    f[i] = 0 if v[i] < 0
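The definition above can be sketched in a few lines. This is an illustration under stated assumptions rather than the package's code: the `vector` type and the `fingerprint` name are hypothetical, and mapping element i to bit i (least significant bit first) is an assumption about bit order.

```go
package main

import "fmt"

// vector is a hypothetical 64-dimension vector type used only in this sketch.
type vector [64]int

// fingerprint sets bit i of the result when v[i] >= 0 and leaves it clear
// when v[i] < 0. Treating element 0 as the least significant bit is an
// assumption of this sketch.
func fingerprint(v vector) uint64 {
	var f uint64
	for i := 0; i < 64; i++ {
		if v[i] >= 0 {
			f |= uint64(1) << uint(i)
		}
	}
	return f
}

func main() {
	var v vector
	for i := range v {
		v[i] = -1 // every bit starts out clear
	}
	v[0], v[3] = 2, 0 // zero counts as non-negative, so bit 3 is set too
	fmt.Printf("%#x\n", fingerprint(v)) // prints 0x9
}
```

Note that a zero element yields a set bit, so an all-zero vector produces a fingerprint with all 64 bits set.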
func (*SimhashBase) GetSimhash ¶
func (st *SimhashBase) GetSimhash(fs FeatureSet) uint64
GetSimhash returns a 64-bit simhash of the given feature set.
func (*SimhashBase) NewWordFeatureSet ¶
func (st *SimhashBase) NewWordFeatureSet(b []byte) *WordFeatureSet
func (*SimhashBase) Shingle ¶
func (st *SimhashBase) Shingle(w int, b [][]byte) [][]byte
Shingle returns the w-shingling of the given set of bytes. For example, if the given input was {"this", "is", "a", "test"} and w was 2, this returns {"this is", "is a", "a test"}.
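A w-shingling like the one described above could be computed as in the sketch below. The `shingle` function is hypothetical; joining tokens with a single space follows the example in the text, while returning nil for w < 1 or for too few tokens is a choice made for this sketch and may differ from the package's behavior.

```go
package main

import (
	"bytes"
	"fmt"
)

// shingle joins every run of w consecutive tokens with a single space.
// Returning nil when w < 1 or when there are fewer than w tokens is a
// choice made for this sketch.
func shingle(w int, tokens [][]byte) [][]byte {
	if w < 1 || len(tokens) < w {
		return nil
	}
	out := make([][]byte, 0, len(tokens)-w+1)
	for i := 0; i+w <= len(tokens); i++ {
		out = append(out, bytes.Join(tokens[i:i+w], []byte(" ")))
	}
	return out
}

func main() {
	tokens := [][]byte{[]byte("this"), []byte("is"), []byte("a"), []byte("test")}
	for _, s := range shingle(2, tokens) {
		fmt.Printf("%q\n", s) // prints "this is", "is a", "a test" on separate lines
	}
}
```

Shingles of width 2 or more make the resulting features sensitive to word order, which plain single-word features are not.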
func (*SimhashBase) SimhashBytes ¶
func (st *SimhashBase) SimhashBytes(b [][]byte) uint64
SimhashBytes returns a 64-bit simhash of the given bytes.
func (*SimhashBase) Vectorize ¶
func (st *SimhashBase) Vectorize(features []Feature) Vector
Vectorize generates a 64-dimension vector from a set of features. The vector is initialized to zero. Then, for each feature, the i-th element of the vector is incremented by the feature's weight if the i-th bit of the feature's sum is set, and decremented by that weight otherwise.
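The accumulation step can be sketched as follows. The `weighted` struct and the `vectorize` function are hypothetical stand-ins for the package's Feature and Vectorize, and indexing bits from the least significant end is an assumption.

```go
package main

import "fmt"

// weighted is a hypothetical feature: a 64-bit hash plus an integer weight.
type weighted struct {
	sum    uint64
	weight int
}

// vectorize adds a feature's weight to element i when bit i of its hash is
// set and subtracts it otherwise, for every feature and every bit position.
func vectorize(features []weighted) [64]int {
	var v [64]int // zero-initialized
	for _, f := range features {
		for i := 0; i < 64; i++ {
			if (f.sum>>uint(i))&1 == 1 {
				v[i] += f.weight
			} else {
				v[i] -= f.weight
			}
		}
	}
	return v
}

func main() {
	v := vectorize([]weighted{
		{sum: 0x5, weight: 1}, // bits 0 and 2 set
		{sum: 0x1, weight: 3}, // bit 0 set
	})
	fmt.Println(v[0], v[1], v[2]) // prints 4 -4 -2
}
```

Each element ends up as a weighted vote over all features, which is why a heavier feature can outvote several lighter ones on a given bit.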
func (*SimhashBase) VectorizeBytes ¶
func (st *SimhashBase) VectorizeBytes(features [][]byte) Vector
VectorizeBytes generates a 64-dimension vector from a [][]byte, where each []byte is a feature and all features have equal weight.
The vector is initialized to zero. Then, for each feature, the i-th element of the vector is incremented by the feature's weight if the i-th bit of the feature's hash is set, and decremented by that weight otherwise.
type WordFeatureSet ¶
    type WordFeatureSet struct {
        B []byte
    }

WordFeatureSet is a feature set in which each word is a feature, all with equal weight.
func (*WordFeatureSet) GetFeatures ¶
func (w *WordFeatureSet) GetFeatures() []Feature
GetFeatures returns a []Feature representing each word in the byte slice.
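One way to turn a byte slice into per-word features is sketched below. Everything here is an assumption made for illustration: the `wordFeature` type and `wordFeatures` function are hypothetical, words are split on whitespace after lower-casing, and each word is hashed with 64-bit FNV-1 from `hash/fnv`. FNV-1 was picked because the sum 0xaf63bd4c8601b7be in the BuildFeatures output appears consistent with the FNV-1 hash of "a", but the package's actual tokenizer and hash may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/fnv"
)

// wordFeature is a hypothetical feature holding a word's hash and weight.
type wordFeature struct {
	sum    uint64
	weight int
}

// wordFeatures lower-cases the input, splits it on whitespace, and hashes
// each word with 64-bit FNV-1. Both the tokenizer and the hash function are
// assumptions of this sketch.
func wordFeatures(b []byte) []wordFeature {
	words := bytes.Fields(bytes.ToLower(b))
	features := make([]wordFeature, 0, len(words))
	for _, w := range words {
		h := fnv.New64()
		h.Write(w) // Write on an FNV hash never returns an error
		features = append(features, wordFeature{sum: h.Sum64(), weight: 1})
	}
	return features
}

func main() {
	for _, f := range wordFeatures([]byte("this is a test phrase")) {
		fmt.Printf("%016x weight=%d\n", f.sum, f.weight)
	}
}
```

Because every word gets weight 1, a word that appears several times contributes several identical features, so repetition still influences the final fingerprint.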
func (*WordFeatureSet) Normalize ¶
func (w *WordFeatureSet) Normalize()
Directories ¶
Path | Synopsis |
---|---|
sho | sho -- SimHash Oracle, checks if a fingerprint is similar to existing ones. |
simhashCJK | simhashCJK -- simhash language-specific handling for CJK. |
simhashEng | simhashEng -- simhash language-specific handling for English. |
simhashUTF | simhashUTF -- simhash language-specific handling for UTF. |