words

package

Version: v1.0.0 Published: Aug 29, 2018 License: BSD-3-Clause Imports: 6 Imported by: 0

Repository

github.com/danieldk/citar

Documentation ¶

Overview ¶

Package words provides methods to estimate (word) emission probabilities.

The parameters of a Hidden Markov Model (HMM) come in two forms: transition and emission probabilities. In a trigram HMM tagger, the transition probabilities are P(t3|t1,t2) and the emission probabilities are P(w|t), where 'w' is a word and 't' a tag.

This package concerns itself with estimating emission probabilities. Generally, the emission probabilities are estimated as follows: (1) for words seen in the training data, the probability is the (smoothed) maximum likelihood estimate; (2) for words not seen in the training data, the probabilities are usually estimated based on inflectional properties.

The Lexicon type implements (1), while the SuffixHandler type is a possible implementation of (2) based on Brants, 2000. Both types implement the WordHandler interface.
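
As a rough illustration of case (1), the unsmoothed maximum likelihood estimate is P(w|t) = count(w,t) / count(t). The sketch below is not the package's implementation: emissionMLE is a hypothetical name, tags are plain strings rather than model.Tag, and smoothing is omitted.

```go
package main

import "fmt"

// emissionMLE computes the unsmoothed maximum likelihood estimate
// P(w|t) = count(w, t) / count(t) for every tag seen with the word.
// Assumption: tags are plain strings, not the package's model.Tag.
func emissionMLE(wordTagFreqs map[string]map[string]int, tagFreqs map[string]int, word string) map[string]float64 {
	probs := make(map[string]float64)
	for tag, freq := range wordTagFreqs[word] {
		// Skip tags without a unigram count to avoid dividing by zero.
		if total := tagFreqs[tag]; total > 0 {
			probs[tag] = float64(freq) / float64(total)
		}
	}
	return probs
}

func main() {
	wordTagFreqs := map[string]map[string]int{
		"walk": {"NN": 1, "VB": 3},
	}
	tagFreqs := map[string]int{"NN": 4, "VB": 6}

	probs := emissionMLE(wordTagFreqs, tagFreqs, "walk")
	fmt.Println(probs["NN"], probs["VB"]) // 0.25 0.5
}
```

A word that never occurred in the training data yields an empty map here, which is the situation case (2) is meant to handle.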

Index ¶

type Lexicon
- func NewLexicon(wtf map[string]map[model.Tag]int, uf map[model.Unigram]int) Lexicon
- func NewLexiconWithFallback(wtf map[string]map[model.Tag]int, uf map[model.Unigram]int, ...) Lexicon
- func (l Lexicon) TagProbs(word string) map[model.Tag]float64
type LookupSuffixHandler
- func NewLookupSuffixHandler(sh SuffixHandler) LookupSuffixHandler
- func (h LookupSuffixHandler) TagProbs(word string) map[model.Tag]float64
type SubstLexicon
- func NewSubstLexicon(lexicon Lexicon, substitutions []Substitution) SubstLexicon
- func NewSubstLexiconWithFallback(lexicon Lexicon, fallback WordHandler, substitutions []Substitution) SubstLexicon
- func (l SubstLexicon) TagProbs(word string) map[model.Tag]float64
type Substitution
type SuffixHandler
- func NewSuffixHandler(config SuffixHandlerConfig, m model.Model) SuffixHandler
- func (h SuffixHandler) TagProbs(word string) map[model.Tag]float64
type SuffixHandlerConfig
- func DefaultSuffixHandlerConfig() SuffixHandlerConfig
type WordHandler

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Lexicon ¶

type Lexicon struct {
	// contains filtered or unexported fields
}

Lexicon is an emission probability estimator for 'known words' (words seen in the training data).

func NewLexicon ¶

func NewLexicon(wtf map[string]map[model.Tag]int, uf map[model.Unigram]int) Lexicon

NewLexicon constructs a new Lexicon from word/tag frequencies and unigram frequencies.

func NewLexiconWithFallback ¶

func NewLexiconWithFallback(wtf map[string]map[model.Tag]int, uf map[model.Unigram]int, fallback WordHandler) Lexicon

NewLexiconWithFallback constructs a new Lexicon from word/tag frequencies, unigram frequencies, and a fallback. The fallback is used to estimate the emission probabilities when the word is not in the lexicon. For instance, this permits combining a Lexicon with a SuffixHandler to estimate the emission probability for any word.
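
The fallback behaviour can be pictured with a small stand-alone sketch. Everything in it is hypothetical: wordHandler mirrors the shape of WordHandler but uses string tags, knownWords stands in for a lexicon, and uniformGuess is a placeholder for an unknown-word estimator such as SuffixHandler.

```go
package main

import "fmt"

// wordHandler mirrors the shape of the WordHandler interface, but uses
// string tags instead of model.Tag (an assumption for this sketch).
type wordHandler interface {
	TagProbs(word string) map[string]float64
}

// knownWords is a hypothetical lexicon holding precomputed probabilities.
type knownWords struct {
	probs    map[string]map[string]float64
	fallback wordHandler // may be nil
}

// TagProbs returns the stored probabilities for known words and defers
// to the fallback, if any, for all other words.
func (k knownWords) TagProbs(word string) map[string]float64 {
	if p, ok := k.probs[word]; ok {
		return p
	}
	if k.fallback != nil {
		return k.fallback.TagProbs(word)
	}
	return nil
}

// uniformGuess is a placeholder unknown-word estimator that assigns the
// same value to every candidate tag.
type uniformGuess struct{ tags []string }

func (u uniformGuess) TagProbs(string) map[string]float64 {
	out := make(map[string]float64, len(u.tags))
	for _, t := range u.tags {
		out[t] = 1 / float64(len(u.tags))
	}
	return out
}

func main() {
	lex := knownWords{
		probs:    map[string]map[string]float64{"the": {"DT": 0.5}},
		fallback: uniformGuess{tags: []string{"NN", "VB"}},
	}
	fmt.Println(lex.TagProbs("the"))    // known word: lexicon entry
	fmt.Println(lex.TagProbs("blorft")) // unknown word: fallback estimate
}
```

Because both the lexicon and the fallback satisfy the same interface, a caller can treat the combination as a single handler.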

func (Lexicon) TagProbs ¶

func (l Lexicon) TagProbs(word string) map[model.Tag]float64

TagProbs returns P(w|t) for a particular word 'w'. Probabilities are only returned for tags with which the word occurred in the training data, except if the word did not occur in the training data and a fallback is used.

type LookupSuffixHandler ¶

type LookupSuffixHandler struct {
	// contains filtered or unexported fields
}

LookupSuffixHandler estimates the emission probabilities P(w|t) using word suffixes. In contrast to SuffixHandler, it uses map-based lookups. The initial construction of a LookupSuffixHandler takes a small amount of extra time. However, it is much faster during tagging.

func NewLookupSuffixHandler ¶

func NewLookupSuffixHandler(sh SuffixHandler) LookupSuffixHandler

NewLookupSuffixHandler constructs a LookupSuffixHandler from a SuffixHandler. The SuffixHandler is discarded after construction.

func (LookupSuffixHandler) TagProbs ¶

func (h LookupSuffixHandler) TagProbs(word string) map[model.Tag]float64

TagProbs estimates P(w|t) for a particular word 'w'.

type SubstLexicon ¶

type SubstLexicon struct {
	// contains filtered or unexported fields
}

func NewSubstLexicon ¶

func NewSubstLexicon(lexicon Lexicon, substitutions []Substitution) SubstLexicon

NewSubstLexicon constructs a new SubstLexicon with substitution rules from a lexicon. If the lexicon does not return results for a word, the substitutions are applied and another lookup is attempted.

func NewSubstLexiconWithFallback ¶

func NewSubstLexiconWithFallback(lexicon Lexicon, fallback WordHandler, substitutions []Substitution) SubstLexicon

NewSubstLexiconWithFallback constructs a new SubstLexicon with substitution rules from a lexicon and a fallback. If the lexicon does not return results for a word, the substitutions are applied and another lookup is attempted. If this fails as well, the fallback is used.

func (SubstLexicon) TagProbs ¶

func (l SubstLexicon) TagProbs(word string) map[model.Tag]float64

TagProbs returns P(w|t) for a particular word 'w'. Probabilities are only returned for tags with which the word occurred in the training data, except if the word did not occur in the training data and a fallback is used.

type Substitution ¶

type Substitution struct {
	Pattern     *regexp.Regexp
	Replacement string
}

type SuffixHandler ¶

type SuffixHandler struct {
	// contains filtered or unexported fields
}

SuffixHandler is an emission probability estimator that uses word suffixes. It is normally used for words that were not seen in the training data.

Internally, this estimator uses four different distributions based on properties of the token: (1) tokens that start with an uppercase letter; (2) tokens that contain a dash (currently only '-'); (3) tokens that are recognized as cardinals; and (4) remaining tokens (typically lowercase words).
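
A minimal sketch of how a token could be assigned to one of these four groups follows. The names (tokenClass, classify, isCardinal) are hypothetical, and two details are assumptions because the documentation does not state them: the checks run in the order listed above, and a cardinal is taken to be a token made up entirely of digits.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// tokenClass identifies which of the four distributions a token uses.
type tokenClass int

const (
	classUpper    tokenClass = iota // starts with an uppercase letter
	classDash                       // contains '-'
	classCardinal                   // recognized as a cardinal
	classOther                      // everything else
)

// isCardinal reports whether the token consists solely of digits.
// Assumption: the real cardinal test may accept more forms.
func isCardinal(token string) bool {
	if token == "" {
		return false
	}
	for _, r := range token {
		if !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

// classify picks a class, checking the properties in the listed order
// (the precedence is an assumption).
func classify(token string) tokenClass {
	first, _ := utf8.DecodeRuneInString(token)
	switch {
	case unicode.IsUpper(first):
		return classUpper
	case strings.Contains(token, "-"):
		return classDash
	case isCardinal(token):
		return classCardinal
	default:
		return classOther
	}
}

func main() {
	fmt.Println(classify("Berlin"), classify("well-known"), classify("42"), classify("house")) // 0 1 2 3
}
```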

func NewSuffixHandler ¶

func NewSuffixHandler(config SuffixHandlerConfig, m model.Model) SuffixHandler

NewSuffixHandler constructs a new SuffixHandler from the given configuration and model.

func (SuffixHandler) TagProbs ¶

func (h SuffixHandler) TagProbs(word string) map[model.Tag]float64

TagProbs estimates P(w|t) for a particular word 'w'.

type SuffixHandlerConfig ¶

type SuffixHandlerConfig struct {
	MaxSuffixLen    int
	UpperMaxFreq    int
	LowerMaxFreq    int
	DashMaxFreq     int
	CardinalMaxFreq int
	MaxTags         int
}

SuffixHandlerConfig stores the configuration for a SuffixHandler. It specifies the maximum length of the suffix to be considered, the maximum frequency a token may have to be used as training data (per token type), and the maximum number of tags that a SuffixHandler should return P(w|t) for.

Tweaking these parameters can have a profound effect on the quality of the estimator. For instance, the typical length of inflectional suffixes is highly language-dependent. Good values for the maximum frequencies for the various types of tokens depend on the size of the training corpus, since the distribution of unknown words is typically closer to that of low-frequency words than to that of high-frequency words.
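
To make the role of the maximum-frequency settings concrete, the sketch below keeps only words whose corpus frequency does not exceed a cutoff, so that only rare words would inform the suffix statistics. lowFrequencyWords is a hypothetical helper, and treating the cutoff as inclusive is an assumption; the source does not say how the package applies these limits.

```go
package main

import (
	"fmt"
	"sort"
)

// lowFrequencyWords returns, in sorted order, the words whose frequency
// is at most maxFreq. Assumption: the cutoff is inclusive.
func lowFrequencyWords(freqs map[string]int, maxFreq int) []string {
	var kept []string
	for word, freq := range freqs {
		if freq <= maxFreq {
			kept = append(kept, word)
		}
	}
	sort.Strings(kept) // map iteration order is random, so sort the result
	return kept
}

func main() {
	freqs := map[string]int{"the": 5000, "gnarled": 2, "zither": 1}
	fmt.Println(lowFrequencyWords(freqs, 10)) // [gnarled zither]
}
```

With a larger corpus, even rare words accumulate higher counts, which is one reason a suitable cutoff depends on corpus size.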

func DefaultSuffixHandlerConfig ¶

func DefaultSuffixHandlerConfig() SuffixHandlerConfig

DefaultSuffixHandlerConfig returns a SuffixHandlerConfig that works reasonably well on German and English with approximately 50,000 to 100,000 sentences.

type WordHandler ¶

type WordHandler interface {
	TagProbs(word string) map[model.Tag]float64
}

A WordHandler returns or estimates the emission probabilities P(w|t) for a given word.
