simhashCJK

package
v2.0.0-...-581a106 Latest
Warning

This package is not in the latest version of its module.

Published: Jul 3, 2017 License: MIT Imports: 1 Imported by: 0

Documentation

Overview

simhashCJK -- simhash language-specific handling for CJK.

This package is provided to showcase how easy it is to extend simhash's language-specific handling, even though CJK handling is dramatically different from Western Unicode handling.
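
To illustrate why CJK text needs its own handling, here is a minimal sketch that uses only the standard library. It compares whitespace splitting with per-character splitting on a Chinese sentence. The helper hanRunes is hypothetical and is not the tokenizer this package actually uses; it only shows that a whitespace-based splitter sees the whole sentence as a single word.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// hanRunes returns each Han character in s as its own token.
// Hypothetical helper for illustration, not the package's tokenizer.
func hanRunes(s string) []string {
	var out []string
	for _, r := range s {
		if unicode.Is(unicode.Han, r) {
			out = append(out, string(r))
		}
	}
	return out
}

func main() {
	text := "当山峰没有棱角的时候"
	// Chinese text has no spaces, so whitespace splitting yields one token.
	fmt.Println(len(strings.Fields(text))) // 1
	// Splitting per Han character yields one token per character.
	fmt.Println(len(hanRunes(text))) // 10
}
```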

Example (Output)

package main

import (
	"fmt"

	"github.com/go-dedup/simhash"
	"github.com/go-dedup/simhash/simhashCJK"
)

// main hashes every document in docs, prints each simhash, and then prints
// the distance between the first document and each of the others.
func main() {
	hashes := make([]uint64, len(docs))
	sh := simhashCJK.NewSimhash()
	for i, d := range docs {
		fs := sh.NewWordFeatureSet(d)
		// fmt.Printf("%#v\n", fs)
		// actual := fs.GetFeatures()
		// fmt.Printf("%#v\n", actual)
		hashes[i] = sh.GetSimhash(fs)
		fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[3], simhash.Compare(hashes[0], hashes[3]))

}

var docs = [][]byte{
	[]byte("当山峰没有棱角的时候"),
	[]byte("当山谷没有棱角的时候"),
	[]byte("棱角的时候"),
	[]byte("你妈妈喊你回家吃饭哦,回家罗回家罗"),
}

Output:

Simhash of '当山峰没有棱角的时候': d7185f186a2eea5a
Simhash of '当山谷没有棱角的时候': d71a5f186a2eea5a
Simhash of '棱角的时候': d71a5f186a2ffa52
Simhash of '你妈妈喊你回家吃饭哦,回家罗回家罗': d71bf7186a32b9f0
Comparison of `当山峰没有棱角的时候` and `当山谷没有棱角的时候`: 1
Comparison of `当山峰没有棱角的时候` and `棱角的时候`: 4
Comparison of `当山峰没有棱角的时候` and `你妈妈喊你回家吃饭哦,回家罗回家罗`: 16
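
The comparison values above match the number of differing bits (the Hamming distance) between the printed hashes. The sketch below recomputes them from the hex values in the output using only the standard library. The helper hamming is hypothetical; that simhash.Compare works this way is an assumption, though it is consistent with the figures shown.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming counts the bit positions in which a and b differ.
// Hypothetical helper, assumed to mirror what simhash.Compare reports.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	// Hash values copied from the example output above.
	h := []uint64{
		0xd7185f186a2eea5a,
		0xd71a5f186a2eea5a,
		0xd71a5f186a2ffa52,
		0xd71bf7186a32b9f0,
	}
	for _, other := range h[1:] {
		fmt.Println(hamming(h[0], other)) // 1, then 4, then 16
	}
}
```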

Types

type CJKWordFeatureSet

type CJKWordFeatureSet struct {
	simhash.WordFeatureSet
}

CJKWordFeatureSet is a feature set in which each word is a feature, and all features have equal weight.

func (*CJKWordFeatureSet) GetFeatures

func (w *CJKWordFeatureSet) GetFeatures() []simhash.Feature

GetFeatures returns a []Feature containing one feature for each word in the byte slice.
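
As a rough illustration of what "equal weight" means for the final hash, the following sketch folds a list of features into a 64-bit fingerprint using the general simhash voting scheme. The function fingerprint and the choice of FNV-1a as the per-feature hash are assumptions made for this example; the library's own hashing may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fingerprint folds equally weighted features into a 64-bit simhash.
// Hypothetical sketch: FNV-1a is an assumed per-feature hash.
func fingerprint(features []string) uint64 {
	var votes [64]int
	for _, f := range features {
		h := fnv.New64a()
		h.Write([]byte(f))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++ // a set bit votes +1 (weight 1)
			} else {
				votes[i]-- // a clear bit votes -1 (weight 1)
			}
		}
	}
	// Each output bit is set when the votes for it are positive.
	var out uint64
	for i, v := range votes {
		if v > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

func main() {
	a := fingerprint([]string{"棱", "角", "的", "时", "候"})
	b := fingerprint([]string{"候", "时", "的", "角", "棱"})
	// With equal weights, feature order does not affect the result.
	fmt.Println(a == b) // true
}
```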

type SimhashCJK

type SimhashCJK struct {
	simhash.SimhashBase
}

func NewSimhash

func NewSimhash() *SimhashCJK

NewSimhash makes a new SimhashCJK.

func (*SimhashCJK) NewCJKWordFeatureSet

func (st *SimhashCJK) NewCJKWordFeatureSet(b []byte) *CJKWordFeatureSet

func (*SimhashCJK) NewWordFeatureSet

func (st *SimhashCJK) NewWordFeatureSet(b []byte) *CJKWordFeatureSet
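
Both types above embed a type from the simhash package, which is the Go mechanism that lets a language-specific package replace one behavior and inherit the rest. The sketch below shows that pattern in isolation, using only the standard library. The types base and cjk and their methods are hypothetical stand-ins, not the library's real definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// base stands in for a general-purpose hasher (hypothetical type).
type base struct{}

// Tokenize splits on whitespace, a common default for Western text.
func (base) Tokenize(s string) []string { return strings.Fields(s) }

// Name is inherited unchanged by any type that embeds base.
func (base) Name() string { return "simhash" }

// cjk embeds base and overrides only the tokenizer (hypothetical type).
type cjk struct{ base }

// Tokenize emits one token per rune, shadowing base.Tokenize.
func (cjk) Tokenize(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, string(r))
	}
	return out
}

func main() {
	var c cjk
	text := "棱角的时候"
	// Name comes from base; Tokenize comes from cjk.
	fmt.Println(c.Name(), len(c.Tokenize(text)), len(c.base.Tokenize(text)))
	// prints: simhash 5 1
}
```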
