smetrics

package module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 16, 2020 License: MIT Imports: 3 Imported by: 2

README

Build Status

smetrics

smetrics is "string metrics".

Package smetrics provides a bunch of algorithms for calculating the distance between strings.

There are implementations for calculating the popular Levenshtein distance (aka Edit Distance or Wagner-Fischer), as well as the Jaro distance, the Jaro-Winkler distance, and more.

How to import

import "github.com/xrash/smetrics"

Documentation

Go to https://pkg.go.dev/github.com/xrash/smetrics for complete documentation.

Example

package main

import (
	"github.com/xrash/smetrics"
)

func main() {
	smetrics.WagnerFischer("POTATO", "POTATTO", 1, 1, 2)
	smetrics.WagnerFischer("MOUSE", "HOUSE", 2, 2, 4)

	smetrics.Ukkonen("POTATO", "POTATTO", 1, 1, 2)
	smetrics.Ukkonen("MOUSE", "HOUSE", 2, 2, 4)

	smetrics.Jaro("AL", "AL")
	smetrics.Jaro("MARTHA", "MARHTA")

	smetrics.JaroWinkler("AL", "AL", 0.7, 4)
	smetrics.JaroWinkler("MARTHA", "MARHTA", 0.7, 4)

	smetrics.Soundex("Euler")
	smetrics.Soundex("Ellery")

	smetrics.Hamming("aaa", "aaa")
	smetrics.Hamming("aaa", "aab")
}

Documentation

Overview

Package smetrics provides a bunch of algorithms for calculating the distance between strings.

There are implementations for calculating the popular Levenshtein distance (aka Edit Distance or Wagner-Fischer), as well as the Jaro distance, the Jaro-Winkler distance, and more.

For the Levenshtein distance, you can use the functions WagnerFischer() and Ukkonen(). Read the documentation on these functions.

For the Jaro and Jaro-Winkler algorithms, check the functions Jaro() and JaroWinkler(). Read the documentation on these functions.

For the Soundex algorithm, check the function Soundex().

For the Hamming distance algorithm, check the function Hamming().

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Hamming

func Hamming(a, b string) (int, error)

The Hamming distance is the minimum number of substitutions required to change string A into string B. Both strings must have the same size. If the strings have different sizes, the function returns an error.

func Jaro

func Jaro(a, b string) float64

The Jaro distance. The result is 1 for equal strings, and 0 for completely different strings.

func JaroWinkler

func JaroWinkler(a, b string, boostThreshold float64, prefixSize int) float64

The Jaro-Winkler distance. The result is 1 for equal strings, and 0 for completely different strings. It is commonly used on Record Linkage stuff, thus it tries to be accurate for common typos when writing real names such as person names and street names. Jaro-Winkler is a modification of the Jaro algorithm. It works by first running Jaro, then boosting the score of exact matches at the beginning of the strings. Because of that, it introduces two more parameters: the boostThreshold and the prefixSize. These are commonly set to 0.7 and 4, respectively.

func Soundex

func Soundex(s string) string

The Soundex encoding. It is a phonetic algorithm that considers how the words sound in English. Soundex maps a string to a 4-byte code consisting of the first letter of the original string and three numbers. Strings that sound similar should map to the same code.

func Ukkonen

func Ukkonen(a, b string, icost, dcost, scost int) int

The Ukkonen algorithm for calculating the Levenshtein distance. The algorithm is described in http://www.cs.helsinki.fi/u/ukkonen/InfCont85.PDF, or in docs/InfCont85.PDF. It runs on O(t . min(m, n)) where t is the actual distance between strings a and b. It needs O(min(t, m, n)) space. This function might be preferred over WagnerFischer() for *very* similar strings. But test it out yourself. The first two parameters are the two strings to be compared. The last three parameters are the insertion cost, the deletion cost and the substitution cost. These are normally defined as 1, 1 and 2 respectively.

func WagnerFischer

func WagnerFischer(a, b string, icost, dcost, scost int) int

The Wagner-Fischer algorithm for calculating the Levenshtein distance. The first two parameters are the two strings to be compared. The last three parameters are the insertion cost, the deletion cost and the substitution cost. These are normally defined as 1, 1 and 2 respectively.

Types

This section is empty.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL