simhash

package module
v0.0.0-...-79f94a1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 7, 2015 License: MIT Imports: 4 Imported by: 28

README

simhash

simhash is a Go implementation of Charikar's simhash algorithm.

simhash is a hash with the useful property that similar documents produce similar hashes. Therefore, if two documents are similar, the Hamming-distance between the simhash of the documents will be small.

This package currently just implements the simhash algorithm. Future work will make use of this package to enable quickly identifying near-duplicate documents within a large collection of documents.

Installation

go get github.com/mfonda/simhash

Usage

Using simhash first requires tokenizing a document into a set of features (done through the FeatureSet interface). This package provides an implementation, WordFeatureSet, which breaks tokenizes the document into individual words. Better results are possible here, and future work will go towards this.

Example usage:

package main

import (
	"fmt"
	"github.com/mfonda/simhash"
)

func main() {
	var docs = [][]byte{
		[]byte("this is a test phrase"),
		[]byte("this is a test phrass"),
		[]byte("foo bar"),
	}

	hashes := make([]uint64, len(docs))
	for i, d := range docs {
		hashes[i] = simhash.Simhash(simhash.NewWordFeatureSet(d))
		fmt.Printf("Simhash of %s: %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
}

Output:

Simhash of this is a test phrase: 8c3a5f7e9ecb3f35
Simhash of this is a test phrass: 8c3a5f7e9ecb3f21
Simhash of foo bar: d8dbe7186bad3db3
Comparison of `this is a test phrase` and `this is a test phrass`: 2
Comparison of `this is a test phrase` and `foo bar`: 29

Documentation

Overview

simhash package implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.

simhash fingerprints have the property that similar documents will have a similar fingerprint. Therefore, the hamming distance between two fingerprints will be small if the documents are similar

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Compare

func Compare(a uint64, b uint64) uint8

Compare calculates the Hamming distance between two 64-bit integers

Currently, this is calculated using the Kernighan method [1]. Other methods exist which may be more efficient and are worth exploring at some point

[1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan

func Fingerprint

func Fingerprint(v Vector) uint64

Fingerprint returns a 64-bit fingerprint of the given vector. The fingerprint f of a given 64-dimension vector v is defined as follows:

f[i] = 1 if v[i] >= 0
f[i] = 0 if v[i] < 0

func NewFeature

func NewFeature(f []byte) feature

Returns a new feature representing the given byte slice, using a weight of 1

func NewFeatureWithWeight

func NewFeatureWithWeight(f []byte, weight int) feature

Returns a new feature representing the given byte slice with the given weight

func Shingle

func Shingle(w int, b [][]byte) [][]byte

Shingle returns the w-shingling of the given set of bytes. For example, if the given input was {"this", "is", "a", "test"}, this returns {"this is", "is a", "a test"}

func Simhash

func Simhash(fs FeatureSet) uint64

Returns a 64-bit simhash of the given feature set

func SimhashBytes

func SimhashBytes(b [][]byte) uint64

Returns a 64-bit simhash of the given bytes

Types

type Feature

type Feature interface {
	// Sum returns the 64-bit sum of this feature
	Sum() uint64

	// Weight returns the weight of this feature
	Weight() int
}

Feature consists of a 64-bit hash and a weight

type FeatureSet

type FeatureSet interface {
	GetFeatures() []Feature
}

FeatureSet represents a set of features in a given document

type UnicodeWordFeatureSet

type UnicodeWordFeatureSet struct {
	// contains filtered or unexported fields
}

UnicodeWordFeatureSet is a feature set in which each word is a feature, all equal weight.

See: http://blog.golang.org/normalization See: https://groups.google.com/forum/#!topic/golang-nuts/YyH1f_qCZVc

func NewUnicodeWordFeatureSet

func NewUnicodeWordFeatureSet(b []byte, f norm.Form) *UnicodeWordFeatureSet

func (*UnicodeWordFeatureSet) GetFeatures

func (w *UnicodeWordFeatureSet) GetFeatures() []Feature

Returns a []Feature representing each word in the byte slice

type Vector

type Vector [64]int

func Vectorize

func Vectorize(features []Feature) Vector

Vectorize generates 64 dimension vectors given a set of features. Vectors are initialized to zero. The i-th element of the vector is then incremented by weight of the i-th feature if the i-th bit of the feature is set, and decremented by the weight of the i-th feature otherwise.

func VectorizeBytes

func VectorizeBytes(features [][]byte) Vector

VectorizeBytes generates 64 dimension vectors given a set of [][]byte, where each []byte is a feature with even weight.

Vectors are initialized to zero. The i-th element of the vector is then incremented by weight of the i-th feature if the i-th bit of the feature is set, and decremented by the weight of the i-th feature otherwise.

type WordFeatureSet

type WordFeatureSet struct {
	// contains filtered or unexported fields
}

WordFeatureSet is a feature set in which each word is a feature, all equal weight.

func NewWordFeatureSet

func NewWordFeatureSet(b []byte) *WordFeatureSet

func (*WordFeatureSet) GetFeatures

func (w *WordFeatureSet) GetFeatures() []Feature

Returns a []Feature representing each word in the byte slice

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL