stringclassifier

package
v0.0.0-...-e36dbc7 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 9, 2024 License: Apache-2.0, Apache-2.0 Imports: 9 Imported by: 0

README

StringClassifier

StringClassifier is a library to classify an unknown text against a set of known texts. The classifier uses the Levenshtein Distance algorithm to determine which of the known texts most closely matches the unknown text. The Levenshtein Distance is normalized into a "confidence percentage" between 1 and 0, where 1.0 indicates an exact match and 0.0 indicates a complete mismatch.

Types of matching

There are two kinds of matching algorithms the string classifier can perform:

  1. Nearest matching, and
  2. Multiple matching.
Normalization

To get the best match, normalizing functions can be applied to the texts. For example, flattening whitespaces removes a lot of inconsequential formatting differences that would otherwise lower the matching confidence percentage.

sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower)

The normalizating functions are run on all the known texts that are added to the classifier. They're also run on the unknown text before classification.

Nearest matching

A nearest match returns the name of the known text that most closely matches the full unknown text. This is most useful when the unknown text doesn't have extraneous text around it.

Example:

func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) {
  m := sc.NearestMatch(unknown)
  log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence)
}

Multiple matching

Multiple matching identifies all of the known texts which may exist in the unknown text. It can also detect a known text in an unknown text even if there's extraneous text around the unknown text. As with nearest matching, a confidence percentage for each match is given.

Example:

log.Printf("The text %q contains:", name)
for _, m := range sc.MultipleMatch(unknown, false) {
  log.Printf("  %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset)
}

Disclaimer

This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.

Documentation

Overview

Package stringclassifier finds the nearest match between a string and a set of known values. It uses the Levenshtein Distance (LD) algorithm to determine this. A match with a large LD is less likely to be correct than one with a small LD. A confidence percentage is returned, which indicates how confident the algorithm is that the match is correct. The higher the percentage, the greater the confidence that the match is correct.

Example Usage:

type Text struct {
  Name string
  Text string
}

func NewClassifier(knownTexts []Text) (*stringclassifier.Classifier, error) {
  sc := stringclassifier.New(stringclassifier.FlattenWhitespace)
  for _, known := range knownTexts {
    if err := sc.AddValue(known.Name, known.Text); err != nil {
      return nil, err
    }
  }
  return sc, nil
}

func IdentifyTexts(sc *stringclassifier.Classifier, unknownTexts []*Text) {
  for _, unknown := range unknownTexts {
    m := sc.NearestMatch(unknown.Text)
    log.Printf("The nearest match to %q is %q (confidence: %v)",
      unknown.Name, m.Name, m.Confidence)
  }
}

Index

Constants

View Source
const (
	// DefaultConfidenceThreshold is the minimum ratio threshold between
	// the matching range and the full source range that we're willing to
	// accept in order to say that the matching range will produce a
	// sufficiently good edit distance. I.e., if the matching range is
	// below this threshold we won't run the Levenshtein Distance algorithm
	// on it.
	DefaultConfidenceThreshold float64 = 0.80
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Classifier

type Classifier struct {

	// MinDiffRatio defines the minimum ratio of the length difference
	// allowed to consider a known value a possible match. This is used as
	// a performance optimization to eliminate values that are unlikely to
	// be a match.
	//
	// For example, a value of 0.75 means that the shorter string must be
	// at least 75% the length of the longer string to consider it a
	// possible match.
	//
	// Setting this to 1.0 will require that strings are identical length.
	// Setting this to 0 will consider all known values as possible
	// matches.
	MinDiffRatio float64
	// contains filtered or unexported fields
}

A Classifier matches a string to a set of known values.

func New

func New(threshold float64, funcs ...NormalizeFunc) *Classifier

New creates a new Classifier with the provided NormalizeFuncs. Each NormalizeFunc is applied in order to a string before comparison.

func (*Classifier) AddPrecomputedValue

func (c *Classifier) AddPrecomputedValue(key, value string, set *searchset.SearchSet) error

AddPrecomputedValue adds a known value to be matched against. The value has already been normalized and the SearchSet object deserialized, so no processing is necessary.

func (*Classifier) AddValue

func (c *Classifier) AddValue(key, value string) error

AddValue adds a known value to be matched against. If a value already exists for key, an error is returned.

func (*Classifier) MultipleMatch

func (c *Classifier) MultipleMatch(s string) (matches Matches)

MultipleMatch tries to determine which known strings are found within an unknown string. This differs from "NearestMatch" in that it looks only at those areas within the unknown string that are likely to match. A list of potential matches are returned. It's up to the caller to determine which ones are acceptable.

func (*Classifier) NearestMatch

func (c *Classifier) NearestMatch(s string) *Match

NearestMatch returns the name of the known value that most closely matches the unknown string and a confidence percentage is returned indicating how confident the classifier is in the result. A percentage of "1.0" indicates an exact match, while a percentage of "0.0" indicates a complete mismatch.

If the string is equidistant from multiple known values, it is undefined which will be returned.

type Match

type Match struct {
	Name       string  // Name of knownValue that was matched
	Confidence float64 // Confidence percentage
	Offset     int     // The offset into the unknown string the match was made
	Extent     int     // The length from the offset into the unknown string
}

Match identifies the result of matching a string against a knownValue.

type Matches

type Matches []*Match

Matches is a list of Match-es. This is here mainly so that the list can be sorted.

func (Matches) Len

func (m Matches) Len() int

func (Matches) Less

func (m Matches) Less(i, j int) bool

func (Matches) Names

func (m Matches) Names() []string

Names returns an unsorted slice of the names of the matched licenses.

func (Matches) Swap

func (m Matches) Swap(i, j int)

type NormalizeFunc

type NormalizeFunc func(string) string

NormalizeFunc is a function that is used to normalize a string prior to comparison.

var FlattenWhitespace NormalizeFunc = func(s string) string {
	return wsRegexp.ReplaceAllString(s, " ")
}

FlattenWhitespace will flatten contiguous blocks of whitespace down to a single space.

Directories

Path Synopsis
internal
pq
Package pq provides a priority queue.
Package pq provides a priority queue.
Package searchset generates hashes for all substrings of a text.
Package searchset generates hashes for all substrings of a text.
tokenizer
Package tokenizer converts a text into a stream of tokens.
Package tokenizer converts a text into a stream of tokens.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL