Documentation
Overview
Package simhashCJK provides simhash language-specific handling for CJK text.
The package shows how simhash's language-specific handling can be extended, even though CJK handling differs dramatically from the handling of Western Unicode text.
Example (Output)
// package main
package main

import (
	"fmt"

	"github.com/go-dedup/simhash"
	"github.com/go-dedup/simhash/simhashCJK"
)

// for standalone test, change package to `main` and the next func def to,
// func main() {
func main() {
	hashes := make([]uint64, len(docs))
	sh := simhashCJK.NewSimhash()
	for i, d := range docs {
		fs := sh.NewWordFeatureSet(d)
		// fmt.Printf("%#v\n", fs)
		// actual := fs.GetFeatures()
		// fmt.Printf("%#v\n", actual)
		hashes[i] = sh.GetSimhash(fs)
		fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))
}

var docs = [][]byte{
	[]byte("当山峰没有棱角的时候"),
	[]byte("当山谷没有棱角的时候"),
	[]byte("棱角的时候"),
	[]byte("你妈妈喊你回家吃饭哦,回家罗回家罗"),
}
Output:

Simhash of '当山峰没有棱角的时候': d7185f186a2eea5a
Simhash of '当山谷没有棱角的时候': d71a5f186a2eea5a
Simhash of '棱角的时候': d71a5f186a2ffa52
Simhash of '你妈妈喊你回家吃饭哦,回家罗回家罗': d71bf7186a32b9f0
Comparison of `当山峰没有棱角的时候` and `当山谷没有棱角的时候`: 1
Comparison of `当山峰没有棱角的时候` and `棱角的时候`: 4
Comparison of `当山峰没有棱角的时候` and `你妈妈喊你回家吃饭哦,回家罗回家罗`: 16
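The comparison values printed in the example are consistent with a Hamming distance, that is, the number of bit positions in which two 64-bit fingerprints differ. The following is a minimal standard-library sketch that recomputes those distances from the printed hashes. The helper name `hammingDistance` is an assumption made for illustration and is not part of the simhash API.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bit positions in which a and b differ.
// Hypothetical helper for illustration; it is not part of simhash.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	// Fingerprints copied from the example output.
	h := []uint64{
		0xd7185f186a2eea5a,
		0xd71a5f186a2eea5a,
		0xd71a5f186a2ffa52,
		0xd71bf7186a32b9f0,
	}
	// Compare the first fingerprint with each of the others.
	for i := 1; i < len(h); i++ {
		fmt.Printf("distance(0, %d) = %d\n", i, hammingDistance(h[0], h[i]))
	}
}
```

Run as written, this prints distances of 1, 4 and 16, the same values as the three Comparison lines in the example output.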
Types
type CJKWordFeatureSet
type CJKWordFeatureSet struct {
	simhash.WordFeatureSet
}
CJKWordFeatureSet is a feature set in which each word is a feature, all with equal weight.
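To make "each word is a feature, all with equal weight" concrete, here is a sketch of the generic simhash construction with unit weights: every word is hashed to 64 bits, each hash votes +1 or -1 on every bit position, and the fingerprint keeps the bits with a positive total. The names `hash64` and `fingerprint` and the choice of FNV-1a are assumptions made for this illustration. The source does not state which hash function or internals the package uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hash64 hashes a word to 64 bits with FNV-1a (an assumed choice).
func hash64(word string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(word)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// fingerprint is a hypothetical helper showing the generic simhash scheme
// with every word carrying weight 1.
func fingerprint(words []string) uint64 {
	var votes [64]int
	for _, w := range words {
		sum := hash64(w)
		for bit := 0; bit < 64; bit++ {
			if sum&(1<<uint(bit)) != 0 {
				votes[bit]++ // a set bit votes for 1
			} else {
				votes[bit]-- // a clear bit votes for 0
			}
		}
	}
	var out uint64
	for bit, v := range votes {
		if v > 0 {
			out |= 1 << uint(bit)
		}
	}
	return out
}

func main() {
	fmt.Printf("%016x\n", fingerprint([]string{"棱", "角", "的", "时", "候"}))
}
```

Because every vote has the same weight, the result does not depend on word order, and a single-word input yields that word's hash unchanged.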
func (*CJKWordFeatureSet) GetFeatures
func (w *CJKWordFeatureSet) GetFeatures() []simhash.Feature
GetFeatures returns a []simhash.Feature with one feature for each word in the byte slice.
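CJK text is usually written without spaces between words, so splitting on whitespace does not produce useful words. The sketch below shows one plausible tokenization, assuming each Han character is treated as its own word while runs of other letters and digits stay together. `splitCJK` is a hypothetical helper written for this illustration. The package's actual word-boundary rules are not described here and may differ.

```go
package main

import (
	"fmt"
	"unicode"
)

// splitCJK is a hypothetical helper, not part of simhashCJK. It emits each
// Han character as its own token and groups runs of other letters and digits.
func splitCJK(text string) []string {
	var tokens []string
	var run []rune
	// flush emits the pending non-Han run, if any.
	flush := func() {
		if len(run) > 0 {
			tokens = append(tokens, string(run))
			run = run[:0]
		}
	}
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Han, r):
			flush()
			tokens = append(tokens, string(r))
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			run = append(run, r)
		default:
			flush() // whitespace and punctuation end the current run
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Printf("%q\n", splitCJK("棱角的时候 simhash 2"))
}
```

With this scheme the five-character phrase becomes five tokens, followed by `simhash` and `2` as single tokens.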
type SimhashCJK
type SimhashCJK struct {
	simhash.SimhashBase
}
func (*SimhashCJK) NewCJKWordFeatureSet
func (st *SimhashCJK) NewCJKWordFeatureSet(b []byte) *CJKWordFeatureSet
func (*SimhashCJK) NewWordFeatureSet
func (st *SimhashCJK) NewWordFeatureSet(b []byte) *CJKWordFeatureSet
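Both types in this package embed a type from simhash and declare methods of their own. The sketch below illustrates that Go pattern in isolation: an embedding struct keeps the promoted methods of the embedded type and can replace one of them by declaring a method with the same name. `Base`, `CJK`, `Tokenize` and `Describe` are hypothetical names invented for this example and do not correspond to the simhash API.

```go
package main

import (
	"fmt"
	"strings"
)

// Base stands in for a generic base type (hypothetical, not simhash.SimhashBase).
type Base struct{}

// Tokenize splits on whitespace, which suits space-delimited text.
func (Base) Tokenize(s string) []string { return strings.Fields(s) }

// Describe is promoted unchanged to any type that embeds Base.
func (Base) Describe() string { return "base" }

// CJK embeds Base and declares its own Tokenize, which shadows the promoted one.
type CJK struct{ Base }

// Tokenize emits every rune as its own token (a simplification for the sketch).
func (CJK) Tokenize(s string) []string {
	tokens := make([]string, 0, len(s))
	for _, r := range s {
		tokens = append(tokens, string(r))
	}
	return tokens
}

func main() {
	var c CJK
	fmt.Println(len(c.Tokenize("山峰")))      // the CJK method is selected
	fmt.Println(len(c.Base.Tokenize("山峰"))) // the embedded method stays reachable
	fmt.Println(c.Describe())               // promoted from Base
}
```

The design choice is that callers keep one method name while the embedding type supplies the language-specific behavior, and the original behavior remains available through the embedded field.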