simhash

v1.0.0
Published: Dec 3, 2024 License: MIT Imports: 6 Imported by: 0

README

simhash

simhash is a lightweight Go package for generating Simhash tokens and calculating their similarity using Moses Charikar's Simhash algorithm. It is suited to text deduplication, plagiarism detection, near-duplicate content detection, and fingerprinting.

For detailed usage, see the documentation below.


Documentation

Install

To get started with simhash, install it using:

go get github.com/erfanmomeniii/simhash

Next, include it in your application:

import "github.com/erfanmomeniii/simhash"

Quick Start

The following example demonstrates how to generate Simhash tokens and calculate similarity:

package main

import (
	"fmt"
	"github.com/erfanmomeniii/simhash"
)

func main() {
	// Create a new Simhash instance
	s := simhash.NewSimhash()

	// Add features with weights
	s.AddFeature("example", 2)
	s.AddFeature("test", 5)

	// Generate a Simhash token
	token1 := s.GenerateToken()

	// Create another Simhash instance with similar features
	s2 := simhash.NewSimhash()
	s2.AddFeature("example", 2)
	s2.AddFeature("testcase", 5)

	// Generate another token
	token2 := s2.GenerateToken()

	// Compute similarity between the two tokens
	similarity := simhash.ComputeSimilarity(token1, token2)

	// %.2f prints the similarity with two decimals, matching the output below
	fmt.Printf("Token1: %s\nToken2: %s\nSimilarity: %.2f\n", token1, token2, similarity)
}

Output:

Token1: F9E6E6EF197C2B25
Token2: FDA981914657B7D1
Similarity: 43.75

Features

Add Feature

Add features with their weights to the Simhash generator:

s.AddFeature("example", 5)
s.AddFeature(12345, 10)

Generate Token

Generate a 64-bit Simhash token, returned as a 16-character hexadecimal string, from the added features:

token := s.GenerateToken()

Compute Similarity

Calculate the similarity between two Simhash tokens as a percentage from 0 to 100, derived from their normalized Hamming distance:

similarity := simhash.ComputeSimilarity(token1, token2)

Supported Feature Types

The AddFeature method accepts the following types:

  • Strings: e.g., "example"
  • Numbers: integer and floating-point values, e.g., 123 or 3.14
  • Byte slices: e.g., []byte("example")
  • Any other type: Converted using JSON serialization
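
The list above can be pictured as a type switch that reduces every accepted value to bytes before hashing. The following is only a sketch under assumptions: toBytes is a hypothetical helper, not part of the package, and the way numbers are rendered here (via fmt.Sprint) is a guess. It also suggests one plausible reason AddFeature returns an error: JSON serialization can fail for values such as channels.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toBytes is a hypothetical helper that mirrors the documented type rules.
// It is NOT the package's implementation.
func toBytes(value any) ([]byte, error) {
	switch v := value.(type) {
	case string:
		return []byte(v), nil
	case []byte:
		return v, nil
	case int, int64, uint64, float64:
		// Assumption: numbers are rendered in their default text form.
		return []byte(fmt.Sprint(v)), nil
	default:
		// Any other type falls back to JSON serialization, which can fail.
		return json.Marshal(v)
	}
}

func main() {
	inputs := []any{"example", 123, []byte("example"), map[string]int{"a": 1}}
	for _, in := range inputs {
		b, err := toBytes(in)
		fmt.Printf("%T -> %q (err: %v)\n", in, b, err)
	}

	// A channel cannot be serialized to JSON, so this reports an error.
	_, err := toBytes(make(chan int))
	fmt.Println("channel rejected:", err != nil)
}
```

Whatever conversion the package actually uses, checking the error returned by AddFeature is the safe way to detect an unsupported value.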

Contributing

Pull requests are welcome! For any changes, please open an issue first to discuss the proposed modification. Ensure tests are updated accordingly.

Documentation

Functions

func ComputeSimilarity

func ComputeSimilarity(token1, token2 string) float64

func HammingDistance

func HammingDistance(a, b uint64) int
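
Neither function carries a doc comment here, so the following sketch shows one way the two could relate, using only the standard library. The names hammingDistance and similarity are hypothetical stand-ins, and the formula (share of matching bits out of 64, times 100) is an assumption. It does reproduce the 43.75 reported in the Quick Start output for the two tokens shown there.

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
)

// hammingDistance counts the bit positions at which a and b differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarity parses two hexadecimal tokens and returns the percentage of
// matching bits. Assumed formula: (64 - distance) / 64 * 100.
// Unlike the package's ComputeSimilarity, this sketch also returns an error.
func similarity(token1, token2 string) (float64, error) {
	a, err := strconv.ParseUint(token1, 16, 64)
	if err != nil {
		return 0, fmt.Errorf("parse token1: %w", err)
	}
	b, err := strconv.ParseUint(token2, 16, 64)
	if err != nil {
		return 0, fmt.Errorf("parse token2: %w", err)
	}
	return float64(64-hammingDistance(a, b)) / 64 * 100, nil
}

func main() {
	// Tokens taken from the Quick Start output.
	const t1, t2 = "F9E6E6EF197C2B25", "FDA981914657B7D1"

	fmt.Println(hammingDistance(0xF9E6E6EF197C2B25, 0xFDA981914657B7D1)) // 36

	s, err := similarity(t1, t2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.2f\n", s) // 43.75
}
```

The two tokens differ in 36 of 64 bits, so 28 bits match and 28/64 × 100 = 43.75.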

Types

type Feature

type Feature struct {
	// contains filtered or unexported fields
}

type Simhash

type Simhash struct {
	// contains filtered or unexported fields
}

func NewSimhash

func NewSimhash() *Simhash

func (*Simhash) AddFeature

func (s *Simhash) AddFeature(value any, weight uint64) error

func (*Simhash) GenerateToken

func (s *Simhash) GenerateToken() string
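
To make the token generation step concrete, here is a minimal sketch of a Charikar-style Simhash over weighted features. It is not the package's code: the feature struct, hash64, fingerprint, and token are hypothetical names, weights are int64 here for simplicity, and FNV-1a is an assumed hash function. Tokens produced by this sketch will therefore most likely differ from those returned by GenerateToken.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// feature is a hypothetical (value, weight) pair; the package's own Feature
// type has unexported fields and may be laid out differently.
type feature struct {
	value  string
	weight int64
}

// hash64 maps a feature value to 64 bits using FNV-1a (an assumed choice).
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// fingerprint computes a 64-bit Simhash: each feature votes +weight on the
// bit positions set in its hash and -weight on the others; the final bit is
// 1 wherever the accumulated vote is positive.
func fingerprint(features []feature) uint64 {
	var votes [64]int64 // one signed counter per bit position
	for _, f := range features {
		h := hash64(f.value)
		for i := 0; i < 64; i++ {
			if h&(1<<uint(i)) != 0 {
				votes[i] += f.weight
			} else {
				votes[i] -= f.weight
			}
		}
	}
	var out uint64
	for i, v := range votes {
		if v > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

// token renders a fingerprint as 16 uppercase hexadecimal digits.
func token(fp uint64) string {
	return fmt.Sprintf("%016X", fp)
}

func main() {
	a := fingerprint([]feature{{"example", 2}, {"test", 5}})
	b := fingerprint([]feature{{"example", 2}, {"testcase", 5}})
	fmt.Println("Token1:", token(a))
	fmt.Println("Token2:", token(b))
}
```

The [64]int64 accumulator has the same shape as the exported Vector type, though whether the package uses Vector this way is not stated in the documentation.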

type Vector

type Vector [64]int64
