Documentation ¶
Overview ¶
Package ngrams calculates monograms (1-grams), bigrams (2-grams), trigrams (3-grams), quadgrams (4-grams) and quintgrams (5-grams) for either letters or words, given a corpus.
Index ¶
- func ParseLetterTokens(ctx context.Context, input io.Reader, language alphabet.Language, ...) error
- func ParseWordTokens(ctx context.Context, input io.Reader, language alphabet.Language, ...) error
- type Frequency
- type FrequencyProcessor
- func (p *FrequencyProcessor) FrequencyTable() *FrequencyTable
- func (p *FrequencyProcessor) LoadFrequenciesFromFile(path string) error
- func (p *FrequencyProcessor) ProcessFiles(ctx context.Context, paths []string) error
- func (p *FrequencyProcessor) Save(path string) error
- func (p *FrequencyProcessor) SetProgressReporter(reporter processor.ProgressReporter)
- type FrequencyTable
- func (ft *FrequencyTable) Add(token string, count int)
- func (ft *FrequencyTable) Entries() []Frequency
- func (ft *FrequencyTable) EntriesSortedByCount() []Frequency
- func (ft *FrequencyTable) Get(token string) (Frequency, bool)
- func (ft *FrequencyTable) Len() int
- func (ft *FrequencyTable) ParseLetterTokens(ctx context.Context, input io.Reader, language alphabet.Language, ...) error
- func (ft *FrequencyTable) ParseWordTokens(ctx context.Context, input io.Reader, language alphabet.Language, ...) error
- func (ft *FrequencyTable) Save(w io.Writer) error
- func (ft *FrequencyTable) Tokens() []string
- func (ft *FrequencyTable) Update()
- type ProcessorMode
- type RecvTokenFunc
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
Types ¶
type FrequencyProcessor ¶
type FrequencyProcessor struct {
// contains filtered or unexported fields
}
FrequencyProcessor is used to parse letter or word ngrams from input sources.
func NewFrequencyProcessor ¶
func NewFrequencyProcessor(mode ProcessorMode, language alphabet.Language, tokenSize int) *FrequencyProcessor
NewFrequencyProcessor creates a new FrequencyProcessor with a new frequency table. It does not report progress unless a reporter is set with SetProgressReporter.
func (*FrequencyProcessor) FrequencyTable ¶
func (p *FrequencyProcessor) FrequencyTable() *FrequencyTable
FrequencyTable returns the frequency table.
func (*FrequencyProcessor) LoadFrequenciesFromFile ¶
func (p *FrequencyProcessor) LoadFrequenciesFromFile(path string) error
LoadFrequenciesFromFile replaces the current frequency table by parsing frequencies from the given file path.
func (*FrequencyProcessor) ProcessFiles ¶
func (p *FrequencyProcessor) ProcessFiles(ctx context.Context, paths []string) error
ProcessFiles updates the frequency table by parsing letter or word ngrams from the given input paths.
func (*FrequencyProcessor) Save ¶
func (p *FrequencyProcessor) Save(path string) error
Save writes the frequency table to the given file path.
func (*FrequencyProcessor) SetProgressReporter ¶
func (p *FrequencyProcessor) SetProgressReporter(reporter processor.ProgressReporter)
SetProgressReporter sets the progress reporter to use.
type FrequencyTable ¶
type FrequencyTable struct {
// contains filtered or unexported fields
}
func LoadFrequencies ¶
func LoadFrequencies(r io.Reader) (*FrequencyTable, error)
LoadFrequencies parses a frequency table from an io.Reader.
Expected CSV format in UTF-8: token,count,percentage. Lines starting with a # are ignored.
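The format described above can be read with a few lines of standard-library Go. The following is a minimal sketch, not the package's LoadFrequencies: `parseLine` and `parseFrequencies` are hypothetical helpers, skipping empty lines is an assumption, and the percentage column is validated only by field count and otherwise ignored. It splits naively on commas, so it assumes tokens never contain a comma.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// parseLine is a hypothetical helper that parses one "token,count,percentage"
// line. ok is false for lines that are skipped: comments starting with '#'
// and, as an assumption, empty lines.
func parseLine(line string) (token string, count int, ok bool, err error) {
	if line == "" || strings.HasPrefix(line, "#") {
		return "", 0, false, nil
	}
	fields := strings.Split(line, ",")
	if len(fields) != 3 {
		return "", 0, false, fmt.Errorf("expected 3 fields, got %d in %q", len(fields), line)
	}
	count, err = strconv.Atoi(fields[1])
	if err != nil {
		return "", 0, false, fmt.Errorf("invalid count in %q: %w", line, err)
	}
	return fields[0], count, true, nil
}

// parseFrequencies is a hypothetical stand-in for a loader: it reads every
// line from r and accumulates the counts per token.
func parseFrequencies(r io.Reader) (map[string]int, error) {
	counts := make(map[string]int)
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		token, count, ok, err := parseLine(scanner.Text())
		if err != nil {
			return nil, err
		}
		if ok {
			counts[token] += count
		}
	}
	return counts, scanner.Err()
}

func main() {
	// Sample data is made up for illustration.
	input := "# token,count,percentage\nth,3,60.0\nhe,2,40.0\n"
	counts, err := parseFrequencies(strings.NewReader(input))
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["th"], counts["he"])
}
```

Ignoring the percentage column on load is a design choice of this sketch: the value can be recomputed from the counts, which avoids trusting stale percentages.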
func LoadFrequenciesFromFile ¶
func LoadFrequenciesFromFile(path string) (*FrequencyTable, error)
LoadFrequenciesFromFile loads a frequency table from a UTF-8 encoded file at the given path. See LoadFrequencies for more details.
func NewFrequencyTable ¶
func NewFrequencyTable() *FrequencyTable
NewFrequencyTable creates a new FrequencyTable.
func (*FrequencyTable) Add ¶
func (ft *FrequencyTable) Add(token string, count int)
Add adds a token with the given frequency count. If the token has already been added, its count is incremented by the given count.
func (*FrequencyTable) Entries ¶
func (ft *FrequencyTable) Entries() []Frequency
Entries returns the token frequencies in the table. NOTE: The order is not guaranteed, since the underlying data structure is a map.
func (*FrequencyTable) EntriesSortedByCount ¶
func (ft *FrequencyTable) EntriesSortedByCount() []Frequency
EntriesSortedByCount returns the token frequencies in the table sorted by count in descending order, from the most frequent token to the least frequent.
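Sorting map-backed counts in descending order can be sketched as follows. This is an illustration only: `entry` is a hypothetical stand-in for the package's Frequency type (whose fields are not shown here), and breaking ties alphabetically is an assumption made so that the output is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical stand-in for the package's Frequency type.
type entry struct {
	Token string
	Count int
}

// sortedByCount copies the map into a slice and sorts it by count,
// highest first. Ties are broken by token (an assumption of this sketch).
func sortedByCount(counts map[string]int) []entry {
	entries := make([]entry, 0, len(counts))
	for token, count := range counts {
		entries = append(entries, entry{Token: token, Count: count})
	}
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].Count != entries[j].Count {
			return entries[i].Count > entries[j].Count
		}
		return entries[i].Token < entries[j].Token
	})
	return entries
}

func main() {
	for _, e := range sortedByCount(map[string]int{"he": 2, "th": 3, "en": 1}) {
		fmt.Println(e.Token, e.Count)
	}
}
```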
func (*FrequencyTable) Get ¶
func (ft *FrequencyTable) Get(token string) (Frequency, bool)
Get returns the frequency information for the given token. The returned bool indicates whether the token exists in the table.
func (*FrequencyTable) Len ¶
func (ft *FrequencyTable) Len() int
Len returns the number of Frequency entries in the table.
func (*FrequencyTable) ParseLetterTokens ¶
func (ft *FrequencyTable) ParseLetterTokens(ctx context.Context, input io.Reader, language alphabet.Language, tokenSize int) error
ParseLetterTokens parses ngrams for letter combinations of the given tokenSize and language from the io.Reader and updates the frequency table.
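To make the idea of letter ngrams concrete, here is a minimal sliding-window sketch. It is not the package's implementation: `letterNgrams` is a hypothetical helper, `unicode.IsLetter` stands in for the language's alphabet, and lowercasing the input and restarting the window at every non-letter are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// letterNgrams is a hypothetical helper that counts letter ngrams of size n.
// The window is reset whenever a non-letter is seen, so ngrams never span
// word boundaries (an assumption of this sketch).
func letterNgrams(text string, n int) map[string]int {
	counts := make(map[string]int)
	if n <= 0 {
		return counts
	}
	window := make([]rune, 0, n)
	for _, r := range strings.ToLower(text) {
		if !unicode.IsLetter(r) {
			window = window[:0]
			continue
		}
		if len(window) == n {
			// Drop the oldest letter to make room for the new one.
			copy(window, window[1:])
			window = window[:n-1]
		}
		window = append(window, r)
		if len(window) == n {
			counts[string(window)]++
		}
	}
	return counts
}

func main() {
	counts := letterNgrams("The then", 2)
	fmt.Println(counts["th"], counts["he"], counts["en"])
}
```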
func (*FrequencyTable) ParseWordTokens ¶
func (ft *FrequencyTable) ParseWordTokens(ctx context.Context, input io.Reader, language alphabet.Language, tokenSize int) error
ParseWordTokens parses ngrams for word combinations of the given tokenSize and language from the io.Reader and updates the frequency table.
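Word ngrams follow the same pattern at the word level. The sketch below is illustrative rather than the package's implementation: `wordNgrams` is a hypothetical helper, words are taken to be runs of letters (per `unicode.IsLetter`), the text is lowercased, and the words of an ngram are joined with a single space. All three are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// wordNgrams is a hypothetical helper that counts ngrams of n consecutive
// words. Words are runs of letters; everything else is a separator.
func wordNgrams(text string, n int) map[string]int {
	counts := make(map[string]int)
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	for i := 0; n > 0 && i+n <= len(words); i++ {
		counts[strings.Join(words[i:i+n], " ")]++
	}
	return counts
}

func main() {
	counts := wordNgrams("The cat and the cat.", 2)
	fmt.Println(counts["the cat"], counts["cat and"], counts["and the"])
}
```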
func (*FrequencyTable) Save ¶
func (ft *FrequencyTable) Save(w io.Writer) error
Save writes the frequency table to the io.Writer in the same CSV format used by the Load functions.
func (*FrequencyTable) Tokens ¶
func (ft *FrequencyTable) Tokens() []string
Tokens returns the unique tokens present in the table. NOTE: The order is not guaranteed, since the underlying data structure is a map.
func (*FrequencyTable) Update ¶
func (ft *FrequencyTable) Update()
Update will calculate and update the token frequencies.
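How the frequencies are derived is not specified here. One plausible reading, given the percentage column in the CSV format, is that each token's frequency is its share of the total count. The sketch below assumes exactly that: `frequency` and `computeFrequencies` are hypothetical names, and expressing the share as a percentage from 0 to 100 is an assumption.

```go
package main

import "fmt"

// frequency is a hypothetical stand-in for the package's Frequency type.
type frequency struct {
	Token      string
	Count      int
	Percentage float64
}

// computeFrequencies derives each token's share of the total count,
// expressed as a percentage (0 to 100). An empty table yields no entries.
func computeFrequencies(counts map[string]int) map[string]frequency {
	total := 0
	for _, c := range counts {
		total += c
	}
	result := make(map[string]frequency, len(counts))
	for token, c := range counts {
		pct := 0.0
		if total > 0 {
			pct = 100 * float64(c) / float64(total)
		}
		result[token] = frequency{Token: token, Count: c, Percentage: pct}
	}
	return result
}

func main() {
	freqs := computeFrequencies(map[string]int{"th": 3, "he": 1})
	fmt.Println(freqs["th"].Percentage, freqs["he"].Percentage)
}
```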
type ProcessorMode ¶
type ProcessorMode bool
ProcessorMode specifies whether the processor works on letter or word ngrams.
const (
	ProcessLetters ProcessorMode = false
	ProcessWords   ProcessorMode = true
)
type RecvTokenFunc ¶
RecvTokenFunc is called when a new token has been parsed from the input stream. If the parser encounters an error, it passes the error to this function; assume that parsing then stops and that no more tokens will be produced. If this function returns an error, the parser stops parsing.
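The callback contract described above can be sketched as follows. The declaration of RecvTokenFunc is not shown in this documentation, so the signature `func(token string, err error) error` is purely an assumption, as are the names `recvTokenFunc`, `errStop`, `scanWords` and `scanString`. The sketch splits on whitespace only and is not the package's parser.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// recvTokenFunc uses a hypothetical signature; the real RecvTokenFunc
// declaration is not shown in this documentation.
type recvTokenFunc func(token string, err error) error

// errStop is a hypothetical sentinel a callback can return to stop parsing.
var errStop = errors.New("stop parsing")

// scanWords calls recv for every whitespace-separated token in r.
// It stops as soon as recv returns an error. A read error is forwarded
// to recv once, after which no more tokens are produced.
func scanWords(r io.Reader, recv recvTokenFunc) error {
	scanner := bufio.NewScanner(r)
	scanner.Split(bufio.ScanWords)
	for scanner.Scan() {
		if err := recv(scanner.Text(), nil); err != nil {
			return err
		}
	}
	if err := scanner.Err(); err != nil {
		_ = recv("", err)
		return err
	}
	return nil
}

// scanString is a convenience wrapper for in-memory input.
func scanString(s string, recv recvTokenFunc) error {
	return scanWords(strings.NewReader(s), recv)
}

func main() {
	var tokens []string
	err := scanString("one two three four", func(token string, err error) error {
		if err != nil {
			return err // the parser reported an error; nothing more will arrive
		}
		tokens = append(tokens, token)
		if len(tokens) == 2 {
			return errStop // ask the parser to stop early
		}
		return nil
	})
	fmt.Println(tokens, err == errStop)
}
```

Returning a sentinel error from the callback lets the caller distinguish a deliberate early stop from a genuine failure.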