Documentation ¶
Overview ¶
Package words provides methods to estimate (word) emission probabilities.
The parameters in Hidden Markov Models (HMM) come in two forms: transition and emission probabilities. In a trigram HMM tagger, the transition probabilities are P(t3|t1,t2) and the emission probabilities P(w|t), where 'w' is a word and 't' a tag.
This package concerns itself with estimating emission probabilities. Generally, the emission probabilities are estimated as follows: (1) for words seen in the training data, the probability is the (smoothed) maximum likelihood estimate; (2) for words not seen in the training data, the probabilities are usually estimated based on inflectional properties.
The Lexicon type implements (1), while the SuffixHandler type is a possible implementation of (2) based on Brants, 2000. Both types implement the WordHandler interface.
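As a rough illustration of case (1), the unsmoothed maximum likelihood estimate is P(w|t) = count(w, t) / count(t). The sketch below is not this package's implementation: emissionProbs is a hypothetical helper, plain strings stand in for model.Tag, and smoothing is omitted.

```go
package main

import "fmt"

// emissionProbs is a hypothetical helper, not part of this package's API.
// It returns the unsmoothed maximum likelihood estimate
// P(w|t) = count(w, t) / count(t) for every tag seen with the word.
func emissionProbs(word string, wordTagFreqs map[string]map[string]int, tagFreqs map[string]int) map[string]float64 {
	probs := make(map[string]float64)
	// Ranging over a missing (nil) inner map is a no-op, so unseen
	// words simply yield an empty result.
	for tag, freq := range wordTagFreqs[word] {
		if total := tagFreqs[tag]; total > 0 {
			probs[tag] = float64(freq) / float64(total)
		}
	}
	return probs
}

func main() {
	wordTagFreqs := map[string]map[string]int{
		"walk": {"NN": 1, "VB": 3},
	}
	tagFreqs := map[string]int{"NN": 10, "VB": 6}

	probs := emissionProbs("walk", wordTagFreqs, tagFreqs)
	fmt.Println(probs["NN"], probs["VB"]) // 0.1 0.5
	fmt.Println(len(emissionProbs("unseen", wordTagFreqs, tagFreqs))) // 0
}
```

An unseen word yields an empty map here, which is the situation that case (2) is meant to cover.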
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Lexicon ¶
type Lexicon struct {
// contains filtered or unexported fields
}
Lexicon is an emission probability estimator for 'known words' (words seen in the training data).
func NewLexicon ¶
NewLexicon constructs a new Lexicon from word/tag frequencies and unigram frequencies.
func NewLexiconWithFallback ¶
func NewLexiconWithFallback(wtf map[string]map[model.Tag]int, uf map[model.Unigram]int, fallback WordHandler) Lexicon
NewLexiconWithFallback constructs a new Lexicon from word/tag frequencies, unigram frequencies, and a fallback. The fallback is used to estimate the emission probabilities when the word is not in the lexicon. For instance, this permits use of Lexicon with SuffixHandler to estimate the emission probability for any word.
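The fallback behaviour can be pictured with a small self-contained sketch. All names here are hypothetical stand-ins: wordHandler assumes that WordHandler exposes a single TagProbs method (as documented for SubstLexicon), and knownWords plays the role of Lexicon with string tags instead of model.Tag.

```go
package main

import "fmt"

// wordHandler mirrors the role of WordHandler. The single-method shape is an
// assumption based on the TagProbs method documented for SubstLexicon.
type wordHandler interface {
	TagProbs(word string) map[string]float64
}

// knownWords is a hypothetical stand-in for Lexicon.
type knownWords struct {
	probs    map[string]map[string]float64
	fallback wordHandler // may be nil
}

// TagProbs answers from the table when the word is known and otherwise
// delegates to the fallback, if one was configured.
func (k knownWords) TagProbs(word string) map[string]float64 {
	if p, ok := k.probs[word]; ok {
		return p
	}
	if k.fallback != nil {
		return k.fallback.TagProbs(word)
	}
	return nil
}

// constantHandler is a hypothetical fallback that returns one fixed
// distribution for every word.
type constantHandler map[string]float64

func (c constantHandler) TagProbs(string) map[string]float64 { return c }

func main() {
	lex := knownWords{
		probs:    map[string]map[string]float64{"walk": {"VB": 0.5}},
		fallback: constantHandler{"NN": 0.001},
	}
	fmt.Println(lex.TagProbs("walk"))   // known word: answered by the table
	fmt.Println(lex.TagProbs("zorple")) // unknown word: answered by the fallback
}
```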
type LookupSuffixHandler ¶
type LookupSuffixHandler struct {
// contains filtered or unexported fields
}
LookupSuffixHandler estimates the emission probabilities P(w|t) using word suffixes. In contrast to SuffixHandler, it uses map-based lookups. The initial construction of a LookupSuffixHandler takes a small amount of extra time. However, it is much faster during tagging.
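To give a feel for what a map-based suffix lookup might look like, here is a deliberately simplified sketch. It is not the package's algorithm: longestSuffixProbs and the table layout are hypothetical, tags are plain strings, and it merely returns the entry for the longest known suffix, whereas a Brants-style estimator smooths across suffix lengths.

```go
package main

import "fmt"

// longestSuffixProbs is a hypothetical helper. It looks up the longest
// suffix of word (at most maxLen runes) that is present in a precomputed
// table and returns the tag probabilities stored for it.
func longestSuffixProbs(word string, maxLen int, table map[string]map[string]float64) map[string]float64 {
	runes := []rune(word) // slice by runes so multi-byte letters stay intact
	if maxLen > len(runes) {
		maxLen = len(runes)
	}
	for n := maxLen; n > 0; n-- {
		suffix := string(runes[len(runes)-n:])
		if p, ok := table[suffix]; ok {
			return p
		}
	}
	return nil
}

func main() {
	table := map[string]map[string]float64{
		"ing": {"VBG": 0.8},
		"g":   {"NN": 0.5},
	}
	fmt.Println(longestSuffixProbs("tagging", 3, table)) // matches "ing"
	fmt.Println(longestSuffixProbs("tagging", 2, table)) // falls back to "g"
}
```

Each lookup costs at most maxLen map accesses, which is the kind of trade-off the description above suggests: more work up front to build the table, less work per token.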
func NewLookupSuffixHandler ¶
func NewLookupSuffixHandler(sh SuffixHandler) LookupSuffixHandler
NewLookupSuffixHandler constructs a LookupSuffixHandler from a SuffixHandler. The SuffixHandler is discarded after construction.
type SubstLexicon ¶
type SubstLexicon struct {
// contains filtered or unexported fields
}
func NewSubstLexicon ¶
func NewSubstLexicon(lexicon Lexicon, substitutions []Substitution) SubstLexicon
NewSubstLexicon constructs a new Lexicon with substitution rules from a lexicon. If the lexicon does not return results for a word, the substitutions are applied and another lookup is attempted.
func NewSubstLexiconWithFallback ¶
func NewSubstLexiconWithFallback(lexicon Lexicon, fallback WordHandler, substitutions []Substitution) SubstLexicon
NewSubstLexiconWithFallback constructs a new Lexicon with substitution rules from a lexicon and a fallback. If the lexicon does not return results for a word, the substitutions are applied and another lookup is attempted. If this fails as well, the fallback is used.
func (SubstLexicon) TagProbs ¶
func (l SubstLexicon) TagProbs(word string) map[model.Tag]float64
TagProbs returns P(w|t) for a particular word 'w'. Probabilities are only returned for tags with which the word occurred in the training data, except if the word did not occur in the training data and a fallback is used.
type Substitution ¶
type SuffixHandler ¶
type SuffixHandler struct {
// contains filtered or unexported fields
}
SuffixHandler is an emission probability estimator that uses word suffixes. It is normally used for words that were not seen in the training data.
Internally, this estimator uses four different distributions based on properties of the token: (1) tokens that start with an uppercase letter; (2) tokens that contain a dash (currently only '-'); (3) tokens that are recognized as cardinals; and (4) remaining tokens (typically lowercase words).
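The four-way split could be sketched like this. Everything in the block is hypothetical: the names, the precedence (the checks simply follow the order in which the classes are listed above), and the cardinal test, which here accepts digits with optional '.' or ',' separators. The package's actual rules may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// tokenClass names the four distributions (hypothetical type).
type tokenClass int

const (
	classUpper tokenClass = iota
	classDash
	classCardinal
	classOther
)

// isCardinal is an assumed definition: at least one digit, and nothing
// besides digits, '.' and ','.
func isCardinal(token string) bool {
	hasDigit := false
	for _, r := range token {
		switch {
		case unicode.IsDigit(r):
			hasDigit = true
		case r == '.' || r == ',':
			// separators are allowed
		default:
			return false
		}
	}
	return hasDigit
}

// classify picks a class; the order of the checks is an assumption.
func classify(token string) tokenClass {
	first, _ := utf8.DecodeRuneInString(token)
	switch {
	case unicode.IsUpper(first):
		return classUpper
	case strings.ContainsRune(token, '-'):
		return classDash
	case isCardinal(token):
		return classCardinal
	default:
		return classOther
	}
}

func main() {
	for _, tok := range []string{"Berlin", "well-known", "1,000", "walked"} {
		fmt.Println(tok, classify(tok))
	}
}
```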
func NewSuffixHandler ¶
func NewSuffixHandler(config SuffixHandlerConfig, m model.Model) SuffixHandler
NewSuffixHandler constructs a new SuffixHandler from the given configuration and model.
type SuffixHandlerConfig ¶
type SuffixHandlerConfig struct {
    MaxSuffixLen    int
    UpperMaxFreq    int
    LowerMaxFreq    int
    DashMaxFreq     int
    CardinalMaxFreq int
    MaxTags         int
}
SuffixHandlerConfig stores the configuration for a SuffixHandler. It allows specification of the maximum length of the suffix to be considered, the maximum frequency a token may have in order to be used as training data, and the maximum number of tags for which a SuffixHandler should return P(w|t).
Tweaking these parameters can have a profound effect on the quality of the estimator. For instance, the typical length of inflectional suffixes is highly language-dependent. Good values for the maximum frequencies for the various types of tokens depend on the size of the training corpus - the distribution of unknown words is typically closer to that of low-frequency words than high-frequency words.
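The effect of a maximum-frequency setting can be illustrated with a small sketch that keeps only low-frequency words as training material for an unknown-word estimator. The helper name and the inclusive threshold are assumptions, and the frequencies are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// lowFrequencyWords is a hypothetical helper. It returns, in sorted order,
// the words whose corpus frequency does not exceed maxFreq. Whether the
// real threshold is inclusive is an assumption.
func lowFrequencyWords(wordFreqs map[string]int, maxFreq int) []string {
	var selected []string
	for word, freq := range wordFreqs {
		if freq <= maxFreq {
			selected = append(selected, word)
		}
	}
	sort.Strings(selected) // map iteration order is random, so sort for stable output
	return selected
}

func main() {
	freqs := map[string]int{"the": 5000, "tagger": 3, "suffixes": 1}
	fmt.Println(lowFrequencyWords(freqs, 10)) // [suffixes tagger]
}
```

Raising the threshold admits more frequent words, which gives more training data but makes the sample less like the rare words an unknown-word estimator is meant to model.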
func DefaultSuffixHandlerConfig ¶
func DefaultSuffixHandlerConfig() SuffixHandlerConfig
DefaultSuffixHandlerConfig returns a SuffixHandlerConfig that works reasonably well on German and English with training corpora of approximately 50,000 to 100,000 sentences.