Documentation ¶
Overview ¶
Package levenshtein implements distance and similarity metrics for strings, based on the Levenshtein measure.
The Levenshtein `Distance` between two strings is the minimum total cost of edits that would convert the first string into the second. The allowed edit operations are insertions, deletions, and substitutions, all at character (one UTF-8 code point) level. Each operation has a default cost of 1, but each can be assigned its own cost equal to or greater than 0.
A `Distance` of 0 means the two strings are identical, and the higher the value the more different the strings. Since in practice we are interested in finding if the two strings are "close enough", it often does not make sense to continue the calculation once the result is mathematically guaranteed to exceed a desired threshold. Providing this value to the `Distance` function allows it to take a shortcut and return a lower bound instead of an exact cost when the threshold is exceeded.
The `Similarity` function calculates the distance, then converts it into a normalized metric within the range 0..1, with 1 meaning the strings are identical, and 0 that they have nothing in common. A minimum similarity threshold can be provided to speed up the calculation of the metric for strings that are far too dissimilar for the purpose at hand. All values under this threshold are rounded down to 0.
The `Match` function provides a similarity metric, with the same range and meaning as `Similarity`, but with a bonus for string pairs that share a common prefix and have a similarity above a "bonus threshold". It uses the same method as proposed by Winkler for the Jaro distance, and the reasoning behind it is that these string pairs are very likely spelling variations or errors, and they are more closely linked than the edit distance alone would suggest.
The underlying `Calculate` function is also exported, to allow the building of other derivative metrics, if needed.
Index ¶
- func Calculate(str1, str2 []rune, maxCost, insCost, subCost, delCost int) (dist, prefixLen, suffixLen int)
- func Distance(str1, str2 string, p *Params) int
- func Match(str1, str2 string, p *Params) float64
- func Similarity(str1, str2 string, p *Params) float64
- type Params
- func (p *Params) BonusPrefix(v int) *Params
- func (p *Params) BonusScale(v float64) *Params
- func (p *Params) BonusThreshold(v float64) *Params
- func (p *Params) Clone() *Params
- func (p *Params) DelCost(v int) *Params
- func (p *Params) InsCost(v int) *Params
- func (p *Params) MaxCost(v int) *Params
- func (p *Params) MinScore(v float64) *Params
- func (p *Params) SubCost(v int) *Params
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Calculate ¶
func Calculate(str1, str2 []rune, maxCost, insCost, subCost, delCost int) (dist, prefixLen, suffixLen int)
Calculate determines the Levenshtein distance between two strings, using the given costs for each edit operation. It returns the distance along with the lengths of the longest common prefix and suffix.
If maxCost is non-zero, the calculation stops as soon as the distance is determined to be greater than maxCost. Therefore, any return value higher than maxCost is a lower bound for the actual distance.
func Distance ¶
Distance returns the Levenshtein distance between str1 and str2, using the default or provided cost values. Pass nil for the third argument to use the default cost of 1 for all three operations, with no maximum.
func Match ¶
Match returns a similarity score adjusted by the same method as proposed by Winkler for the Jaro distance - giving a bonus to string pairs that share a common prefix, only if their similarity score is already over a threshold.
The score is in the range of 0..1, with 1 meaning the strings are identical, and 0 meaning they have nothing in common.
A nil third argument uses the default cost of 1 for all three operations, maximum length of common prefix to consider for bonus of 4, scaling factor of 0.1, and bonus threshold of 0.7.
If a non-zero MinScore value is provided in the parameters, scores lower than it will be returned as 0.
func Similarity ¶
Similarity returns a score in the range of 0..1 for how similar the two strings are. A score of 1 means the strings are identical, and 0 means they have nothing in common.
A nil third argument uses the default cost of 1 for all three operations.
If a non-zero MinScore value is provided in the parameters, scores lower than it will be returned as 0.
Types ¶
type Params ¶
type Params struct {
// contains filtered or unexported fields
}
Params represents a set of parameter values for the various formulas involved in the calculation of the Levenshtein string metrics.
func NewParams ¶
func NewParams() *Params
NewParams creates a new set of parameters and initializes it with the default values.
func (*Params) BonusPrefix ¶
BonusPrefix overrides the default value for the maximum length of common prefix to be considered for bonus by Match(). The new value must be zero or positive.
func (*Params) BonusScale ¶
BonusScale overrides the default value for the scaling factor used by Match() in calculating the bonus. The new value must be zero or positive. To guarantee that the similarity score remains in the interval 0..1, this scaling factor is not allowed to exceed 1 / BonusPrefix.
func (*Params) BonusThreshold ¶
BonusThreshold overrides the default value for the minimum similarity score for which Match() can assign a bonus. The new value must be zero or positive. Note that a threshold greater than 1 effectively makes Match() become the equivalent of Similarity().
func (*Params) Clone ¶
Clone returns a pointer to a copy of the receiver parameter set, or of a new default parameter set if the receiver is nil.
func (*Params) DelCost ¶
DelCost overrides the default value of 1 for the cost of deletion. The new value must be zero or positive.
func (*Params) InsCost ¶
InsCost overrides the default value of 1 for the cost of insertion. The new value must be zero or positive.
func (*Params) MaxCost ¶
MaxCost overrides the default value of 0 (meaning unlimited) for the maximum cost. The calculation of Distance() stops when the result is guaranteed to exceed this maximum, returning a lower-bound rather than exact value. The new value must be zero or positive.
func (*Params) MinScore ¶
MinScore overrides the default value of 0 for the minimum similarity score. Scores below this threshold are returned as 0 by Similarity() and Match(). The new value must be zero or positive. Note that a minimum greater than 1 can never be satisfied, resulting in a score of 0 for any pair of strings.