fuzzy

package module
v1.0.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 23, 2016 License: MIT Imports: 12 Imported by: 71

README

Fuzzy

Build Status

Fuzzy is a very fast spell checker and query suggester written in Golang.

Motivation:

  • Sajari uses very large queries (hundreds of words) but needs to respond sub-second to these queries where possible. Common spell check algorithms are quite slow or very resource intensive.
  • The aim was to achieve spell checks in sub 100usec per word (10,000 / second single core) with at least 60% accuracy and multi-language support.
  • Currently we see sub 40usec per word and ~70% accuracy for a Levenshtein distance of 2 chars on a 2012 macbook pro (english test set comes from Peter Norvig's article, see http://norvig.com/spell-correct.html).
  • A 500 word query can be spell checked in ~0.02 sec / cpu cores, which is good enough for us.

Notes:

  • It is currently executed as a single goroutine per lookup, so undoubtedly this could be much faster using multiple cores, but currently the speed is quite good.
  • Accuracy is hit slightly because several correct words don't appear at all in the training text (data/big.txt).
  • Fuzzy is a "Symmetric Delete Spelling Corrector", which relates to some blogs by Wolf Garbe at Faroo.com (see http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/)

Config:

  • Generally no config is required, but you can tweak the model for your application.
  • "threshold" is the trigger point when a word becomes popular enough to build lookup keys for it. Setting this to "1" means any instance of a given word makes it a legitimate spelling. This typically corrects the most errors, but can also cause false positives if incorrect spellings exist in the training data. It also causes a much larger index to be built. By default this is set to 4.
  • "depth" is the Levenshtein distance the model builds lookup keys for. For spelling correction, a setting of "2" is typically very good. At a distance of "3" the potential number of words is much, much larger, but adds little benefit to accuracy. For query prediction a larger number can be useful, but again is much more expensive. A depth of "1" and threshold of "1" for the 1st Norvig test set gives ~70% correction accuracy at ~5usec per check (e.g. ~200kHz), for many applications this will be good enough. At depths > 2, the false positives begin to hurt the accuracy.

Future improvements:

  • Make some of the expensive processes concurrent.
  • Add spelling checks for different languages. If you have misspellings in different languages please add them or send to us.
  • Allow the term-score map to be read from an external term set (e.g. integrating this currently may double up on keeping a term count).
  • Currently there is no method to delete lookup keys, so potentially this may cause bloating over time if the dictionary changes signficantly.
  • Add right to left deletion beyond Levenshtein config depth (e.g. don't process all deletes accept for query predictors).

Usage:

  • Below is some example code showing how to use the package.
  • An example showing how to train with a static set of words is contained in the fuzzy_test.go file, which uses the "big.text" file to create an english dictionary.
  • To integrate with your application (e.g. custom dictionary / word popularity), use the single word and multiword training functions shown in the example below. Each time you add a new instance of a given word, pass it to this function. The model will keep a count and
  • We haven't tested with other langauges, but this should work fine. Please let us know how you go? support@sajari.com
package main 

import(
	"github.com/sajari/fuzzy"
	"fmt"
)

func main() {
	model := fuzzy.NewModel()

	// For testing only, this is not advisable on production
	model.SetThreshold(1)

	// This expands the distance searched, but costs more resources (memory and time). 
	// For spell checking, "2" is typically enough, for query suggestions this can be higher
	model.SetDepth(5)

	// Train multiple words simultaneously by passing an array of strings to the "Train" function
	words := []string{"bob", "your", "uncle", "dynamite", "delicate", "biggest", "big", "bigger", "aunty", "you're"}
	model.Train(words)
	
	// Train word by word (typically triggered in your application once a given word is popular enough)
	model.TrainWord("single")

	// Check Spelling
	fmt.Println("\nSPELL CHECKS")
	fmt.Println("	Deletion test (yor) : ", model.SpellCheck("yor"))
	fmt.Println("	Swap test (uncel) : ", model.SpellCheck("uncel"))
	fmt.Println("	Replace test (dynemite) : ", model.SpellCheck("dynemite"))
	fmt.Println("	Insert test (dellicate) : ", model.SpellCheck("dellicate"))
	fmt.Println("	Two char test (dellicade) : ", model.SpellCheck("dellicade"))

	// Suggest completions
	fmt.Println("\nQUERY SUGGESTIONS")
	fmt.Println("	\"bigge\". Did you mean?: ", model.Suggestions("bigge", false))
	fmt.Println("	\"bo\". Did you mean?: ", model.Suggestions("bo", false))
	fmt.Println("	\"dyn\". Did you mean?: ", model.Suggestions("dyn", false))

	// Autocomplete suggestions
	suggested, _ := model.Autocomplete("bi")
	fmt.Printf("	\"bi\". Suggestions: %v", suggested)

}

Documentation

Overview

Eventually this should be removed. Currently it gives backwards compatability to old versions that did not store the query count, which is now used for autocomplete.

Index

Constants

View Source
const (
	SpellDepthDefault              = 2
	SpellThresholdDefault          = 5
	SuffDivergenceThresholdDefault = 100
)
View Source
const (
	MethodIsWord                   Method = 0
	MethodSuggestMapsToInput              = 1
	MethodInputDeleteMapsToDict           = 2
	MethodInputDeleteMapsToSuggest        = 3
)

Variables

This section is empty.

Functions

func Edits1

func Edits1(word string) []string

Edits1 creates a set of terms that are 1 char delete from the input term

func Levenshtein

func Levenshtein(a, b *string) int

Calculate the Levenshtein distance between two strings

func SampleEnglish

func SampleEnglish() []string

Types

type Autos

type Autos struct {
	Results []string
	Model   *Model
}

For sorting autocomplete suggestions to bias the most popular first

func (Autos) Len

func (a Autos) Len() int

func (Autos) Less

func (a Autos) Less(i, j int) bool

func (Autos) Swap

func (a Autos) Swap(i, j int)

type Counts

type Counts struct {
	Corpus int `json:"corpus"`
	Query  int `json:"query"`
}

type Method

type Method int

func (Method) String

func (m Method) String() string

type Model

type Model struct {
	Data                    map[string]*Counts  `json:"data"`
	Maxcount                int                 `json:"maxcount"`
	Suggest                 map[string][]string `json:"suggest"`
	Depth                   int                 `json:"depth"`
	Threshold               int                 `json:"threshold"`
	UseAutocomplete         bool                `json:"autocomplete"`
	SuffDivergence          int                 `json:"-"`
	SuffDivergenceThreshold int                 `json:"suff_threshold"`
	SuffixArr               *suffixarray.Index  `json:"-"`
	SuffixArrConcat         string              `json:"-"`
	sync.RWMutex
}

func FromReader

func FromReader(r io.Reader) (*Model, error)

FromReader loads a model from a Reader

func Load

func Load(filename string) (*Model, error)

Load a saved model from disk

func NewModel

func NewModel() *Model

Create and initialise a new model

func (*Model) Autocomplete

func (model *Model) Autocomplete(input string) ([]string, error)

For a given string, autocomplete using the suffix array model

func (*Model) CheckKnown

func (model *Model) CheckKnown(input string, correct string) bool

Test an input, if we get it wrong, look at why it is wrong. This function returns a bool indicating if the guess was correct as well as the term it is suggesting. Typically this function would be used for testing, not for production

func (*Model) EditsMulti

func (model *Model) EditsMulti(term string, depth int) []string

Edits at any depth for a given term. The depth of the model is used

func (*Model) Init

func (model *Model) Init() *Model

func (*Model) Potentials

func (model *Model) Potentials(input string, exhaustive bool) map[string]*Potential

Return the raw potential terms so they can be ranked externally to this package

func (*Model) Save

func (model *Model) Save(filename string) error

Save a spelling model to disk

func (*Model) SaveLight

func (model *Model) SaveLight(filename string) error

Save a spelling model to disk, but discard all entries less than the threshold number of occurences Much smaller and all that is used when generated as a once off, but not useful for incremental usage

func (*Model) SetCount

func (model *Model) SetCount(term string, count int, suggest bool)

Manually set the count of a word. Optionally trigger the creation of suggestion keys for the term. This function lets you build a model from an existing dictionary with word popularity counts without needing to run "TrainWord" repeatedly

func (*Model) SetDepth

func (model *Model) SetDepth(val int)

Change the default depth value of the model. This sets how many character differences are indexed. The default is 2.

func (*Model) SetDivergenceThreshold

func (model *Model) SetDivergenceThreshold(val int)

Optionally set the suffix array divergence threshold. This is the number of query training steps between rebuilds of the suffix array. A low number will be more accurate but will use resources and create more garbage.

func (*Model) SetThreshold

func (model *Model) SetThreshold(val int)

Change the default threshold of the model. This is how many times a term must be seen before suggestions are created for it

func (*Model) SetUseAutocomplete

func (model *Model) SetUseAutocomplete(val bool)

Optionally disabled suffixarray based autocomplete support

func (*Model) SpellCheck

func (model *Model) SpellCheck(input string) string

Return the most likely correction for the input term

func (*Model) SpellCheckSuggestions

func (model *Model) SpellCheckSuggestions(input string, n int) []string

Return the most likely corrections in order from best to worst

func (*Model) Suggestions

func (model *Model) Suggestions(input string, exhaustive bool) []string

For a given input string, suggests potential replacements

func (*Model) Train

func (model *Model) Train(terms []string)

Add an array of words to train the model in bulk

func (*Model) TrainQuery

func (model *Model) TrainQuery(term string)

Train using a search query term. This builds a second popularity index of terms used to search, as opposed to generally occurring in corpus text

func (*Model) TrainWord

func (model *Model) TrainWord(term string)

Train the model word by word. This is corpus training as opposed to query training. Word counts from this type of training are not likely to correlate with those of search queries

func (*Model) WriteTo

func (model *Model) WriteTo(w io.Writer) (int64, error)

WriteTo writes a model to a Writer

type OldModel

type OldModel struct {
	Data            map[string]int      `json:"data"`
	Maxcount        int                 `json:"maxcount"`
	Suggest         map[string][]string `json:"suggest"`
	Depth           int                 `json:"depth"`
	Threshold       int                 `json:"threshold"`
	UseAutocomplete bool                `json:"autocomplete"`
}

type Pair

type Pair struct {
	// contains filtered or unexported fields
}

type Potential

type Potential struct {
	Term   string // Potential term string
	Score  int    // Score
	Leven  int    // Levenstein distance from the suggestion to the input
	Method Method // How this potential was matched
}

func (*Potential) String

func (pot *Potential) String() string

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL