Documentation ¶
Overview ¶
Package stringclassifier finds the nearest match between a string and a set of known values. It uses the Levenshtein Distance (LD) algorithm to determine this. A match with a large LD is less likely to be correct than one with a small LD. A confidence percentage is returned, which indicates how confident the algorithm is that the match is correct. The higher the percentage, the greater the confidence that the match is correct.
Example Usage:
type Text struct { Name string Text string } func NewClassifier(knownTexts []Text) (*stringclassifier.Classifier, error) { sc := stringclassifier.New(stringclassifier.FlattenWhitespace) for _, known := range knownTexts { if err := sc.AddValue(known.Name, known.Text); err != nil { return nil, err } } return sc, nil } func IdentifyTexts(sc *stringclassifier.Classifier, unknownTexts []*Text) { for _, unknown := range unknownTexts { m := sc.NearestMatch(unknown.Text) log.Printf("The nearest match to %q is %q (confidence: %v)", unknown.Name, m.Name, m.Confidence) } }
Index ¶
Constants ¶
const ( // DefaultConfidenceThreshold is the minimum ratio threshold between // the matching range and the full source range that we're willing to // accept in order to say that the matching range will produce a // sufficiently good edit distance. I.e., if the matching range is // below this threshold we won't run the Levenshtein Distance algorithm // on it. DefaultConfidenceThreshold float64 = 0.80 )
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Classifier ¶
type Classifier struct { // MinDiffRatio defines the minimum ratio of the length difference // allowed to consider a known value a possible match. This is used as // a performance optimization to eliminate values that are unlikely to // be a match. // // For example, a value of 0.75 means that the shorter string must be // at least 75% the length of the longer string to consider it a // possible match. // // Setting this to 1.0 will require that strings are identical length. // Setting this to 0 will consider all known values as possible // matches. MinDiffRatio float64 // contains filtered or unexported fields }
A Classifier matches a string to a set of known values.
func New ¶
func New(threshold float64, funcs ...NormalizeFunc) *Classifier
New creates a new Classifier with the provided NormalizeFuncs. Each NormalizeFunc is applied in order to a string before comparison.
func (*Classifier) AddPrecomputedValue ¶
func (c *Classifier) AddPrecomputedValue(key, value string, set *searchset.SearchSet) error
AddPrecomputedValue adds a known value to be matched against. The value has already been normalized and the SearchSet object deserialized, so no processing is necessary.
func (*Classifier) AddValue ¶
func (c *Classifier) AddValue(key, value string) error
AddValue adds a known value to be matched against. If a value already exists for key, an error is returned.
func (*Classifier) MultipleMatch ¶
func (c *Classifier) MultipleMatch(s string) (matches Matches)
MultipleMatch tries to determine which known strings are found within an unknown string. This differs from "NearestMatch" in that it looks only at those areas within the unknown string that are likely to match. A list of potential matches are returned. It's up to the caller to determine which ones are acceptable.
func (*Classifier) NearestMatch ¶
func (c *Classifier) NearestMatch(s string) *Match
NearestMatch returns the name of the known value that most closely matches the unknown string and a confidence percentage is returned indicating how confident the classifier is in the result. A percentage of "1.0" indicates an exact match, while a percentage of "0.0" indicates a complete mismatch.
If the string is equidistant from multiple known values, it is undefined which will be returned.
type Match ¶
type Match struct { Name string // Name of knownValue that was matched Confidence float64 // Confidence percentage Offset int // The offset into the unknown string the match was made Extent int // The length from the offset into the unknown string }
Match identifies the result of matching a string against a knownValue.
type Matches ¶
type Matches []*Match
Matches is a list of Match-es. This is here mainly so that the list can be sorted.
type NormalizeFunc ¶
NormalizeFunc is a function that is used to normalize a string prior to comparison.
var FlattenWhitespace NormalizeFunc = func(s string) string {
return wsRegexp.ReplaceAllString(s, " ")
}
FlattenWhitespace will flatten contiguous blocks of whitespace down to a single space.
Directories ¶
Path | Synopsis |
---|---|
internal
|
|
pq
Package pq provides a priority queue.
|
Package pq provides a priority queue. |
Package searchset generates hashes for all substrings of a text.
|
Package searchset generates hashes for all substrings of a text. |
tokenizer
Package tokenizer converts a text into a stream of tokens.
|
Package tokenizer converts a text into a stream of tokens. |