Documentation ¶
Overview ¶
Package token breaks a text into tokens. It repairs names broken by new lines by concatenating their pieces together. Each token carries a set of features, which the heuristic and Bayesian approaches use to find names.
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func SetIndices ¶
func SetIndices(ts []Token, d *dict.Dictionary)
SetIndices takes a slice of tokens that correspond to a name candidate. It analyses the tokens and sets Token.Indices according to how feasibly the input tokens form a scientific name, checking for a possible species, rank, and infraspecies.
func UpperIndex ¶ added in v0.8.4
UpperIndex takes the index of a token and the length of the tokens slice and returns the upper index of what could be a slice of a name. We expect that most names will fit into 5 words. Other cases would require more thorough algorithms that we can run later as plugins.
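The windowing idea described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name `upperIndex`, the constant `maxNameWords`, and the choice of an exclusive upper bound are assumptions made for the example.

```go
package main

import "fmt"

// maxNameWords is the assumed cap on the number of words in a name candidate.
const maxNameWords = 5

// upperIndex is a hypothetical stand-in for the documented function. It
// returns an exclusive upper bound for a candidate window that starts at
// index i in a slice of length l, never reaching past the end of the slice.
func upperIndex(i, l int) int {
	upper := i + maxNameWords
	if upper > l {
		upper = l
	}
	return upper
}

func main() {
	words := []string{"The", "moss", "Bryum", "argenteum", "var.", "lanatum", "grows", "here"}
	for i := range words {
		// Each window holds at most maxNameWords words.
		fmt.Println(i, words[i:upperIndex(i, len(words))])
	}
}
```

Capping the window keeps the number of candidate slices linear in the length of the text, at the cost of missing names longer than the cap.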
Types ¶
type Decision ¶
type Decision int
Decision defines possible kinds of name candidates.
const (
	NotName Decision = iota
	Uninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)
Possible Decisions
func (Decision) Cardinality ¶
Cardinality returns the number of elements in the canonical form of a scientific name: 1 for a uninomial, 2 for a binomial, and 3 for a trinomial.
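A mapping of this kind might look like the sketch below. The free function `cardinality` is hypothetical. Grouping `PossibleBinomial` and the `Bayes*` decisions with their heuristic counterparts, and returning 0 for `NotName`, are assumptions that the description above does not confirm.

```go
package main

import "fmt"

// Decision mirrors the documented constants so that the sketch compiles alone.
type Decision int

const (
	NotName Decision = iota
	Uninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)

// cardinality is a hypothetical helper, not the package's method. It maps a
// decision to the number of elements in the canonical form of a name.
func cardinality(d Decision) int {
	switch d {
	case Uninomial, BayesUninomial:
		return 1
	case Binomial, PossibleBinomial, BayesBinomial: // assumed grouping
		return 2
	case Trinomial, BayesTrinomial:
		return 3
	default:
		return 0 // assumed value for NotName
	}
}

func main() {
	for _, d := range []Decision{NotName, Uninomial, PossibleBinomial, BayesTrinomial} {
		fmt.Println(int(d), cardinality(d))
	}
}
```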
type Features ¶
type Features struct {
	// Candidate to be a start of a uninomial or binomial.
	NameStartCandidate bool
	// The name looks like a possible genus name.
	PotentialBinomialGenus bool
	// The token has necessary qualities to be a start of a binomial.
	StartsWithLetter bool
	// The token has necessary quality to be a species part of trinomial.
	EndsWithLetter bool
	// Capitalized feature of the first alphabetic character.
	Capitalized bool
	// CapitalizedSpecies -- the first alphabetic character of the species
	// is capitalized.
	CapitalizedSpecies bool
	// HasDash -- information if '-' character is part of the word
	HasDash bool
	// ParensStart feature: token starts with parentheses.
	ParensStart bool
	// ParensEnd feature: token ends with parentheses.
	ParensEnd bool
	// ParensEndSpecies feature: species token ends with parentheses.
	ParensEndSpecies bool
	// Abbr feature: token ends with a period.
	Abbr bool
	// RankLike is true if token is a known infraspecific rank
	RankLike bool
	// UninomialDict defines which Genera or Uninomials dictionary (if any)
	// contained the token.
	UninomialDict dict.DictionaryType
	// SpeciesDict defines which Species dictionary (if any) contained the token.
	SpeciesDict dict.DictionaryType
}
Features keeps the properties of a token as a possible candidate for a name part.
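To make the surface-level flags concrete, the sketch below derives a few of them from a raw word. The type `surfaceFeatures` and the function `surfaceFeaturesOf` are hypothetical. The package may compute these fields differently, and the dictionary-based fields are left out because they depend on `dict.Dictionary`.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// surfaceFeatures is a hypothetical subset of the documented Features fields.
type surfaceFeatures struct {
	Capitalized bool
	HasDash     bool
	ParensStart bool
	ParensEnd   bool
	Abbr        bool
}

// surfaceFeaturesOf derives the flags from the raw word alone.
func surfaceFeaturesOf(word string) surfaceFeatures {
	var f surfaceFeatures
	// Capitalized looks at the first alphabetic rune and skips punctuation.
	for _, r := range word {
		if unicode.IsLetter(r) {
			f.Capitalized = unicode.IsUpper(r)
			break
		}
	}
	f.HasDash = strings.ContainsRune(word, '-')
	f.ParensStart = strings.HasPrefix(word, "(")
	f.ParensEnd = strings.HasSuffix(word, ")")
	f.Abbr = strings.HasSuffix(word, ".")
	return f
}

func main() {
	for _, w := range []string{"(Aus)", "var.", "novae-zelandiae"} {
		fmt.Printf("%-16s %+v\n", w, surfaceFeaturesOf(w))
	}
}
```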
type NLP ¶
type NLP struct {
	// Odds are posterior odds.
	Odds float64
	// OddsDetails are elements from which Odds are calculated.
	OddsDetails
	// LabelFreq is used to calculate prior odds of names appearing in a
	// document
	LabelFreq bayes.LabelFreq
}
NLP collects data received from Bayes' algorithm
type OddsDetails ¶
type OddsDetails map[string]map[bayes.FeatureName]map[bayes.FeatureValue]float64
OddsDetails are elements from which Odds are calculated
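The relation between such a nested map and posterior odds can be illustrated with the standard Bayes odds form: posterior odds equal prior odds multiplied by the likelihood ratio of each observed feature. The sketch below is illustrative only. The local types stand in for `bayes.FeatureName` and `bayes.FeatureValue`, and the assumption that the innermost values are likelihood ratios, along with the label and feature names used, is not confirmed by the package documentation.

```go
package main

import "fmt"

// Local stand-ins for the bayes package types (assumed to behave like strings).
type (
	featureName  string
	featureValue string
	// oddsDetails has the same shape as the documented OddsDetails:
	// label -> feature name -> feature value -> likelihood ratio (assumed).
	oddsDetails map[string]map[featureName]map[featureValue]float64
)

// posteriorOdds multiplies the prior odds by every likelihood ratio that is
// recorded for the given label.
func posteriorOdds(priorOdds float64, label string, od oddsDetails) float64 {
	odds := priorOdds
	for _, values := range od[label] {
		for _, ratio := range values {
			odds *= ratio
		}
	}
	return odds
}

func main() {
	// Hypothetical label, feature names, and ratios.
	od := oddsDetails{
		"Name": {
			"uniDict": {"inGenusDict": 8.0},
			"abbr":    {"false": 0.5},
		},
	}
	fmt.Println(posteriorOdds(0.25, "Name", od))
}
```

Keeping the individual ratios, rather than only their product, makes it possible to report which features pushed a candidate toward or away from being a name.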
func NewOddsDetails ¶
func NewOddsDetails(l bayes.Likelihoods) OddsDetails
type Token ¶
type Token struct {
	// Raw is a verbatim presentation of a token as it appears in a text.
	Raw []rune
	// Cleaned is a presentation of a token after normalization.
	Cleaned string
	// Start is the index of the first rune of a token. The first rune
	// does not have to be alpha-numeric.
	Start int
	// End is the index of the last rune of a token. The last rune does not
	// have to be alpha-numeric.
	End int
	// Decision tags the first token of a possible name with a classification
	// decision.
	Decision
	// Indices of semantic elements of a possible name.
	Indices
	// NLP data
	NLP
	// Features is a collection of features associated with the token
	Features
}
Token represents a word separated by spaces in a text. Words split by new lines are concatenated.
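The rune-based offsets can be illustrated with a small whitespace splitter. This is a sketch under stated assumptions: the type `span` and the function `splitWords` are hypothetical, `End` is treated as inclusive following the field comment above, and the concatenation of words split by new lines is omitted.

```go
package main

import (
	"fmt"
	"unicode"
)

// span is a hypothetical reduction of Token to its positional fields.
type span struct {
	Raw   []rune
	Start int // index of the first rune of the word
	End   int // index of the last rune of the word (assumed inclusive)
}

// splitWords splits text on whitespace and records rune offsets, so that
// multi-byte characters count as one position each.
func splitWords(text string) []span {
	rs := []rune(text)
	var res []span
	start := -1 // -1 means that no word is currently open
	for i, r := range rs {
		if unicode.IsSpace(r) {
			if start >= 0 {
				res = append(res, span{Raw: rs[start:i], Start: start, End: i - 1})
				start = -1
			}
			continue
		}
		if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		res = append(res, span{Raw: rs[start:], Start: start, End: len(rs) - 1})
	}
	return res
}

func main() {
	for _, s := range splitWords("Über Pinus  mugo") {
		fmt.Println(string(s.Raw), s.Start, s.End)
	}
}
```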
func (*Token) Clean ¶
func (t *Token) Clean()
Clean converts the verbatim (Raw) string of a token into a normalized, cleaned-up version.
func (*Token) InParentheses ¶
InParentheses returns true if the token is surrounded by parentheses.
func (*Token) SetRank ¶
func (t *Token) SetRank(d *dict.Dictionary)
func (*Token) SetSpeciesDict ¶
func (t *Token) SetSpeciesDict(d *dict.Dictionary)
func (*Token) SetUninomialDict ¶
func (t *Token) SetUninomialDict(d *dict.Dictionary)