ngrams

package
v0.0.0-...-8e01fea
Warning

This package is not in the latest version of its module.
Published: Apr 18, 2024 License: MIT Imports: 15 Imported by: 0

Documentation

Overview

Package ngrams calculates monograms (1-grams), bigrams (2-grams), trigrams (3-grams), quadgrams (4-grams) and quintgrams (5-grams) for either letters or words in a given corpus.

Constants

This section is empty.

Variables

This section is empty.

Functions

func ParseLetterTokens

func ParseLetterTokens(ctx context.Context, input io.Reader, language alphabet.Language,
	tokenSize int, recv RecvTokenFunc) error

ParseLetterTokens is used to parse ngrams for letter combinations of the given tokenSize and language from the io.Reader.
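
To illustrate what parsing letter ngrams involves, here is a minimal, self-contained sketch of a sliding window over letters. It does not call this package: letterNgrams, collectLetterNgrams and recvFunc are hypothetical names, and the choices to lowercase letters and to reset the window at every non-letter rune are assumptions rather than documented behavior. The sketch also ignores the alphabet.Language parameter.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"unicode"
)

// recvFunc is a local stand-in with the same shape as RecvTokenFunc.
type recvFunc func(token string, err error) error

// letterNgrams emits every run of n consecutive letters found in input.
// Assumption: letters are lowercased and any non-letter rune resets the
// window, so ngrams never span word boundaries.
func letterNgrams(input io.Reader, n int, recv recvFunc) error {
	if n < 1 {
		return fmt.Errorf("invalid token size %d", n)
	}
	r := bufio.NewReader(input)
	window := make([]rune, 0, n)
	for {
		ch, _, err := r.ReadRune()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			// Tell the receiver about the read error, then stop.
			_ = recv("", err)
			return err
		}
		if !unicode.IsLetter(ch) {
			window = window[:0]
			continue
		}
		window = append(window, unicode.ToLower(ch))
		if len(window) > n {
			window = window[1:] // drop the oldest letter
		}
		if len(window) == n {
			if err := recv(string(window), nil); err != nil {
				return err
			}
		}
	}
}

// collectLetterNgrams gathers all ngrams of a string into a slice.
func collectLetterNgrams(text string, n int) ([]string, error) {
	var tokens []string
	err := letterNgrams(strings.NewReader(text), n, func(tok string, err error) error {
		if err != nil {
			return err
		}
		tokens = append(tokens, tok)
		return nil
	})
	return tokens, err
}

func main() {
	tokens, err := collectLetterNgrams("Go gopher", 2)
	fmt.Printf("%q %v\n", tokens, err)
}
```

Streaming the tokens through a callback, instead of returning a slice, keeps memory use flat on large corpora; the collecting helper is only there for the demonstration.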

func ParseWordTokens

func ParseWordTokens(ctx context.Context, input io.Reader, language alphabet.Language,
	tokenSize int, recv RecvTokenFunc) error

ParseWordTokens is used to parse ngrams for word combinations of the given tokenSize and language from the io.Reader.
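
The word variant can be sketched the same way, with a window of words instead of letters. Again this is a standalone illustration, not the package's implementation: wordNgrams and collectWordNgrams are hypothetical names, and splitting on whitespace, lowercasing, and trimming non-letter runes from both ends of each word are assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"unicode"
)

// wordNgrams emits every run of n consecutive words, joined by one space.
// Assumption: words are whitespace-separated, lowercased, and stripped of
// leading and trailing non-letter runes; the real parser may differ.
func wordNgrams(input io.Reader, n int, recv func(token string, err error) error) error {
	if n < 1 {
		return fmt.Errorf("invalid token size %d", n)
	}
	sc := bufio.NewScanner(input)
	sc.Split(bufio.ScanWords)
	window := make([]string, 0, n)
	for sc.Scan() {
		word := strings.TrimFunc(strings.ToLower(sc.Text()), func(r rune) bool {
			return !unicode.IsLetter(r)
		})
		if word == "" {
			continue // the field held only punctuation or digits
		}
		window = append(window, word)
		if len(window) > n {
			window = window[1:] // drop the oldest word
		}
		if len(window) == n {
			if err := recv(strings.Join(window, " "), nil); err != nil {
				return err
			}
		}
	}
	if err := sc.Err(); err != nil {
		// Tell the receiver about the read error, then stop.
		_ = recv("", err)
		return err
	}
	return nil
}

// collectWordNgrams gathers all word ngrams of a string into a slice.
func collectWordNgrams(text string, n int) ([]string, error) {
	var tokens []string
	err := wordNgrams(strings.NewReader(text), n, func(tok string, err error) error {
		if err != nil {
			return err
		}
		tokens = append(tokens, tok)
		return nil
	})
	return tokens, err
}

func main() {
	tokens, err := collectWordNgrams("The quick brown fox.", 2)
	fmt.Printf("%q %v\n", tokens, err)
}
```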

Types

type Frequency

type Frequency struct {
	Token      string
	Count      int
	Percentage float32
}

type FrequencyProcessor

type FrequencyProcessor struct {
	// contains filtered or unexported fields
}

FrequencyProcessor is used to parse letter or word ngrams from input sources.

func NewFrequencyProcessor

func NewFrequencyProcessor(mode ProcessorMode, language alphabet.Language, tokenSize int) *FrequencyProcessor

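NewFrequencyProcessor creates a new FrequencyProcessor for the given mode, language and tokenSize. The new processor does not report progress; use SetProgressReporter to set a reporter.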

func (*FrequencyProcessor) FrequencyTable

func (p *FrequencyProcessor) FrequencyTable() *FrequencyTable

FrequencyTable returns the frequency table.

func (*FrequencyProcessor) LoadFrequenciesFromFile

func (p *FrequencyProcessor) LoadFrequenciesFromFile(path string) error

LoadFrequenciesFromFile replaces the current frequency table by parsing frequencies from the given file path.

func (*FrequencyProcessor) ProcessFiles

func (p *FrequencyProcessor) ProcessFiles(ctx context.Context, paths []string) error

ProcessFiles updates the frequency table by parsing letter or word ngrams from the given input paths.

func (*FrequencyProcessor) Save

func (p *FrequencyProcessor) Save(path string) error

Save writes the frequency table to the given file path.

func (*FrequencyProcessor) SetProgressReporter

func (p *FrequencyProcessor) SetProgressReporter(reporter processor.ProgressReporter)

SetProgressReporter sets the progress reporter to use.

type FrequencyTable

type FrequencyTable struct {
	// contains filtered or unexported fields
}

func LoadFrequencies

func LoadFrequencies(r io.Reader) (*FrequencyTable, error)

LoadFrequencies parses a frequency table from an io.Reader.

Expected CSV format in UTF-8:

token,count,percentage

Lines starting with # are ignored.
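
A reader for this format could look roughly like the sketch below, built on the standard encoding/csv package, whose Reader skips lines beginning with its Comment rune. This is not the package's own parser: entry, parseFrequencies and parseFrequenciesString are hypothetical names, the sample data is made up, and the sketch assumes there is no header row.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// entry is a local stand-in for the package's Frequency type.
type entry struct {
	Token      string
	Count      int
	Percentage float32
}

// parseFrequencies reads token,count,percentage records and skips lines
// that start with '#'. Assumption: there is no header row.
func parseFrequencies(r io.Reader) ([]entry, error) {
	cr := csv.NewReader(r)
	cr.Comment = '#'       // lines starting with '#' are ignored
	cr.FieldsPerRecord = 3 // reject records with a different field count
	var entries []entry
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return entries, nil
		}
		if err != nil {
			return nil, err
		}
		count, err := strconv.Atoi(rec[1])
		if err != nil {
			return nil, fmt.Errorf("invalid count %q: %w", rec[1], err)
		}
		pct, err := strconv.ParseFloat(rec[2], 32)
		if err != nil {
			return nil, fmt.Errorf("invalid percentage %q: %w", rec[2], err)
		}
		entries = append(entries, entry{Token: rec[0], Count: count, Percentage: float32(pct)})
	}
}

// parseFrequenciesString is a convenience wrapper for in-memory data.
func parseFrequenciesString(s string) ([]entry, error) {
	return parseFrequencies(strings.NewReader(s))
}

func main() {
	// Made-up sample data; the scale of the percentage column is a guess.
	const data = "# letter bigrams\nth,3,60\nhe,2,40\n"
	entries, err := parseFrequenciesString(data)
	fmt.Printf("%+v %v\n", entries, err)
}
```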

func LoadFrequenciesFromFile

func LoadFrequenciesFromFile(path string) (*FrequencyTable, error)

LoadFrequenciesFromFile parses a frequency table from the UTF-8 encoded file at the given path. See LoadFrequencies for more details.

func NewFrequencyTable

func NewFrequencyTable() *FrequencyTable

NewFrequencyTable creates a new FrequencyTable.

func (*FrequencyTable) Add

func (ft *FrequencyTable) Add(token string, count int)

Add adds a token with the given frequency count. If the token has already been added then its count will be incremented.

func (*FrequencyTable) Entries

func (ft *FrequencyTable) Entries() []Frequency

Entries returns the token frequencies in the table. NOTE: The order is not guaranteed, since the underlying data structure is a map.

func (*FrequencyTable) EntriesSortedByCount

func (ft *FrequencyTable) EntriesSortedByCount() []Frequency

EntriesSortedByCount returns the token frequencies in the table sorted by the count (descending) going from the token that appears the most to the least (highest to lowest frequency).

func (*FrequencyTable) Get

func (ft *FrequencyTable) Get(token string) (Frequency, bool)

Get returns the frequency information for the given token. The returned bool indicates whether the token exists in the table.

func (*FrequencyTable) Len

func (ft *FrequencyTable) Len() int

Len returns the number of Frequency entries in the table.

func (*FrequencyTable) ParseLetterTokens

func (ft *FrequencyTable) ParseLetterTokens(ctx context.Context, input io.Reader, language alphabet.Language,
	tokenSize int) error

ParseLetterTokens is used to parse ngrams for letter combinations of the given tokenSize and language from the io.Reader and then update the frequency table.

func (*FrequencyTable) ParseWordTokens

func (ft *FrequencyTable) ParseWordTokens(ctx context.Context, input io.Reader, language alphabet.Language,
	tokenSize int) error

ParseWordTokens is used to parse ngrams for word combinations of the given tokenSize and language from the io.Reader and then update the frequency table.

func (*FrequencyTable) Save

func (ft *FrequencyTable) Save(w io.Writer) error

Save writes the frequency table to the io.Writer in the same CSV format used by the Load functions.

func (*FrequencyTable) Tokens

func (ft *FrequencyTable) Tokens() []string

Tokens returns the unique tokens present in the table. NOTE: The order is not guaranteed, since the underlying data structure is a map.

func (*FrequencyTable) Update

func (ft *FrequencyTable) Update()

Update recalculates the token frequencies.
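
The documentation does not say how the frequencies are derived. One plausible reading, shown here purely as an assumption, is that each entry's Percentage is its share of the total count, on a 0 to 100 scale. The names frequency and updatePercentages are hypothetical.

```go
package main

import "fmt"

// frequency is a local stand-in for the package's Frequency type.
type frequency struct {
	Token      string
	Count      int
	Percentage float32
}

// updatePercentages sets each Percentage to the entry's share of the total
// count. Assumption: the scale is 0 to 100; the package may use another.
func updatePercentages(entries []frequency) {
	total := 0
	for _, e := range entries {
		total += e.Count
	}
	for i := range entries {
		if total == 0 {
			entries[i].Percentage = 0 // avoid dividing by zero
			continue
		}
		entries[i].Percentage = float32(entries[i].Count) / float32(total) * 100
	}
}

func main() {
	entries := []frequency{{Token: "th", Count: 3}, {Token: "he", Count: 1}}
	updatePercentages(entries)
	fmt.Printf("%+v\n", entries)
}
```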

type ProcessorMode

type ProcessorMode bool

ProcessorMode specifies whether the processor works on letter or word ngrams.

const (
	ProcessLetters ProcessorMode = false
	ProcessWords   ProcessorMode = true
)

type RecvTokenFunc

type RecvTokenFunc func(token string, err error) error

RecvTokenFunc is called each time a new token has been parsed from the input stream. If the parser encounters an error, it passes the error to this function; assume that parsing then stops and that no more tokens will be produced. If this function returns an error, the parser stops the parsing process.
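
The callback contract can be exercised without the package. In the sketch below, emit is a hypothetical stand-in for a parser that honors the contract, and firstN uses a sentinel error to stop early after n tokens. That the real parser returns the callback's error unchanged to its caller is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// recvTokenFunc is a local copy of the RecvTokenFunc signature.
type recvTokenFunc func(token string, err error) error

// errLimit is a sentinel the callback returns to request an early stop.
var errLimit = errors.New("token limit reached")

// emit is a hypothetical producer: it hands each token to recv and stops
// as soon as recv returns an error, passing that error back to the caller.
func emit(tokens []string, recv recvTokenFunc) error {
	for _, tok := range tokens {
		if err := recv(tok, nil); err != nil {
			return err
		}
	}
	return nil
}

// firstN collects at most n tokens and then asks the producer to stop.
func firstN(tokens []string, n int) ([]string, error) {
	var got []string
	err := emit(tokens, func(tok string, err error) error {
		if err != nil {
			return err // a producer error ends the run
		}
		if len(got) == n {
			return errLimit
		}
		got = append(got, tok)
		return nil
	})
	if errors.Is(err, errLimit) {
		err = nil // stopping early was intentional, not a failure
	}
	return got, err
}

func main() {
	got, err := firstN([]string{"th", "he", "er"}, 2)
	fmt.Printf("%q %v\n", got, err)
}
```

Using a sentinel error lets the caller tell a deliberate stop apart from a genuine parsing failure.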
