Documentation ¶
Overview ¶
Package smetrics provides a bunch of algorithms for calculating the distance between strings.
There are implementations for calculating the popular Levenshtein distance (aka Edit Distance or Wagner-Fischer), as well as the Jaro distance, the Jaro-Winkler distance, and more.
For the Levenshtein distance, you can use the functions WagnerFischer() and Ukkonen(). Read the documentation on these functions.
For the Jaro and Jaro-Winkler algorithms, check the functions Jaro() and JaroWinkler(). Read the documentation on these functions.
For the Soundex algorithm, check the function Soundex().
For the Hamming distance algorithm, check the function Hamming().
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Hamming ¶
The Hamming distance is the minimum number of substitutions required to change string A into string B. Both strings must have the same size. If the strings have different sizes, the function returns an error.
func Jaro ¶
The Jaro distance. The result is 1 for equal strings, and 0 for completely different strings.
func JaroWinkler ¶
The Jaro-Winkler distance. The result is 1 for equal strings, and 0 for completely different strings. It is commonly used on Record Linkage stuff, thus it tries to be accurate for common typos when writing real names such as person names and street names. Jaro-Winkler is a modification of the Jaro algorithm. It works by first running Jaro, then boosting the score of exact matches at the beginning of the strings. Because of that, it introduces two more parameters: the boostThreshold and the prefixSize. These are commonly set to 0.7 and 4, respectively.
func Soundex ¶
The Soundex encoding. It is a phonetic algorithm that considers how the words sound in English. Soundex maps a string to a 4-byte code consisting of the first letter of the original string and three numbers. Strings that sound similar should map to the same code.
func Ukkonen ¶
The Ukkonen algorithm for calculating the Levenshtein distance. The algorithm is described in http://www.cs.helsinki.fi/u/ukkonen/InfCont85.PDF, or in docs/InfCont85.PDF. It runs on O(t . min(m, n)) where t is the actual distance between strings a and b. It needs O(min(t, m, n)) space. This function might be preferred over WagnerFischer() for *very* similar strings. But test it out yourself. The first two parameters are the two strings to be compared. The last three parameters are the insertion cost, the deletion cost and the substitution cost. These are normally defined as 1, 1 and 2 respectively.
func WagnerFischer ¶
The Wagner-Fischer algorithm for calculating the Levenshtein distance. The first two parameters are the two strings to be compared. The last three parameters are the insertion cost, the deletion cost and the substitution cost. These are normally defined as 1, 1 and 2 respectively.
Types ¶
This section is empty.