ngrams

package

Version: v0.0.0-...-8e01fea | Published: Apr 18, 2024 | License: MIT | Imports: 15 | Imported by: 0

Repository

github.com/andrejacobs/go-analyse

Documentation ¶

Overview ¶

Package ngrams calculates monograms (1-grams), bigrams (2-grams), trigrams (3-grams), quadgrams (4-grams) and quintgrams (5-grams) for either letters or words in a given corpus.

Index ¶

func ParseLetterTokens(ctx context.Context, input io.Reader, language alphabet.Language, ...) error
func ParseWordTokens(ctx context.Context, input io.Reader, language alphabet.Language, ...) error
type Frequency
type FrequencyProcessor
- func NewFrequencyProcessor(mode ProcessorMode, language alphabet.Language, tokenSize int) *FrequencyProcessor
type FrequencyTable
type ProcessorMode
type RecvTokenFunc

Functions ¶

func ParseLetterTokens ¶

func ParseLetterTokens(ctx context.Context, input io.Reader, language alphabet.Language,
	tokenSize int, recv RecvTokenFunc) error

ParseLetterTokens is used to parse ngrams for letter combinations of the given tokenSize and language from the io.Reader.
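
To make the idea of letter ngrams concrete, here is a minimal standalone sketch that slides a window of n runes over a single word. It does not call this package: letterNgrams is a hypothetical helper, and how the real parser treats case, whitespace, punctuation and the language's alphabet is not described here, so treat the behaviour below as an assumption.

```go
package main

import "fmt"

// letterNgrams is a hypothetical helper (not part of this package) that
// returns every run of n consecutive runes in word.
func letterNgrams(word string, n int) []string {
	runes := []rune(word) // work on runes so multi-byte UTF-8 letters stay intact
	if n <= 0 || len(runes) < n {
		return nil
	}
	out := make([]string, 0, len(runes)-n+1)
	for i := 0; i+n <= len(runes); i++ {
		out = append(out, string(runes[i:i+n]))
	}
	return out
}

func main() {
	fmt.Println(letterNgrams("hello", 2)) // prints [he el ll lo]
}
```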

func ParseWordTokens ¶

func ParseWordTokens(ctx context.Context, input io.Reader, language alphabet.Language,
	tokenSize int, recv RecvTokenFunc) error

ParseWordTokens is used to parse ngrams for word combinations of the given tokenSize and language from the io.Reader.
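
The word variant can be pictured the same way, with words instead of letters as the unit. The sketch below is standalone and assumes words are separated by whitespace and joined with a single space; wordNgrams is a hypothetical helper and the package's own word splitting rules may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// wordNgrams is a hypothetical helper (not part of this package) that joins
// each run of n consecutive whitespace-separated words with a single space.
func wordNgrams(text string, n int) []string {
	words := strings.Fields(text)
	if n <= 0 || len(words) < n {
		return nil
	}
	out := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		out = append(out, strings.Join(words[i:i+n], " "))
	}
	return out
}

func main() {
	// Prints one bigram per line: "the quick", "quick brown", "brown fox".
	for _, g := range wordNgrams("the quick brown fox", 2) {
		fmt.Println(g)
	}
}
```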

Types ¶

type Frequency ¶

type Frequency struct {
	Token      string
	Count      int
	Percentage float32
}

type FrequencyProcessor ¶

type FrequencyProcessor struct {
	// contains filtered or unexported fields
}

FrequencyProcessor is used to parse letter or word ngrams from input sources.

func NewFrequencyProcessor ¶

func NewFrequencyProcessor(mode ProcessorMode, language alphabet.Language, tokenSize int) *FrequencyProcessor

NewFrequencyProcessor creates a new FrequencyProcessor that does not report progress.

func (*FrequencyProcessor) FrequencyTable ¶

func (p *FrequencyProcessor) FrequencyTable() *FrequencyTable

FrequencyTable returns the frequency table.

func (*FrequencyProcessor) LoadFrequenciesFromFile ¶

func (p *FrequencyProcessor) LoadFrequenciesFromFile(path string) error

LoadFrequenciesFromFile replaces the current frequency table by parsing frequencies from the given file path.

func (*FrequencyProcessor) ProcessFiles ¶

func (p *FrequencyProcessor) ProcessFiles(ctx context.Context, paths []string) error

ProcessFiles updates the frequency table by parsing letter or word ngrams from the given input paths.

func (*FrequencyProcessor) Save ¶

func (p *FrequencyProcessor) Save(path string) error

Save writes the frequency table to the given file path.

func (*FrequencyProcessor) SetProgressReporter ¶

func (p *FrequencyProcessor) SetProgressReporter(reporter processor.ProgressReporter)

SetProgressReporter sets the progress reporter to use.

type FrequencyTable ¶

type FrequencyTable struct {
	// contains filtered or unexported fields
}

func LoadFrequencies ¶

func LoadFrequencies(r io.Reader) (*FrequencyTable, error)

LoadFrequencies parses a frequency table from an io.Reader.

Expected format: UTF-8 encoded CSV with the columns token,count,percentage. Lines starting with # are ignored.
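
As an illustration of that layout, the following standalone sketch reads the three columns with the standard library's encoding/csv package. It is not this package's loader: the row type and parseFrequencyCSV are hypothetical names, the sample values are made up, and it assumes there is no header row.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// row mirrors the three documented columns; it is a stand-in for Frequency.
type row struct {
	Token      string
	Count      int
	Percentage float32
}

// parseFrequencyCSV is a hypothetical reader for "token,count,percentage" data.
func parseFrequencyCSV(data string) ([]row, error) {
	cr := csv.NewReader(strings.NewReader(data))
	cr.Comment = '#'       // skip lines that start with '#'
	cr.FieldsPerRecord = 3 // token, count, percentage

	var rows []row
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		count, err := strconv.Atoi(rec[1])
		if err != nil {
			return nil, fmt.Errorf("invalid count %q: %w", rec[1], err)
		}
		pct, err := strconv.ParseFloat(rec[2], 32)
		if err != nil {
			return nil, fmt.Errorf("invalid percentage %q: %w", rec[2], err)
		}
		rows = append(rows, row{Token: rec[0], Count: count, Percentage: float32(pct)})
	}
}

func main() {
	input := "# made-up sample values\nth,3,0.75\nhe,1,0.25\n"
	rows, err := parseFrequencyCSV(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(rows), rows[0].Token, rows[0].Count) // prints 2 th 3
}
```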

func LoadFrequenciesFromFile ¶

func LoadFrequenciesFromFile(path string) (*FrequencyTable, error)

LoadFrequenciesFromFile parses a frequency table from the UTF-8 encoded file at the given path. See LoadFrequencies for more details.

func NewFrequencyTable ¶

func NewFrequencyTable() *FrequencyTable

NewFrequencyTable creates a new FrequencyTable.

func (*FrequencyTable) Add ¶

func (ft *FrequencyTable) Add(token string, count int)

Add a token with the given frequency count. If the token has already been added then its count will be incremented.

func (*FrequencyTable) Entries ¶

func (ft *FrequencyTable) Entries() []Frequency

Entries returns the token frequencies in the table. NOTE: The order is not guaranteed because the underlying data structure is a map.

func (*FrequencyTable) EntriesSortedByCount ¶

func (ft *FrequencyTable) EntriesSortedByCount() []Frequency

EntriesSortedByCount returns the token frequencies in the table sorted by the count (descending) going from the token that appears the most to the least (highest to lowest frequency).
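
A descending sort of this kind can be sketched with sort.Slice from the standard library. The code below is standalone and does not use this package; entry and sortedByCount are hypothetical names, and breaking ties alphabetically by token is an assumption made here only to keep the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a stand-in for Frequency with only the fields needed for sorting.
type entry struct {
	Token string
	Count int
}

// sortedByCount is a hypothetical helper that copies a token->count map into
// a slice ordered from the highest count to the lowest.
func sortedByCount(counts map[string]int) []entry {
	entries := make([]entry, 0, len(counts))
	for tok, n := range counts {
		entries = append(entries, entry{Token: tok, Count: n})
	}
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].Count != entries[j].Count {
			return entries[i].Count > entries[j].Count // higher counts first
		}
		return entries[i].Token < entries[j].Token // assumed tie-breaker
	})
	return entries
}

func main() {
	fmt.Println(sortedByCount(map[string]int{"he": 2, "th": 5, "an": 2}))
	// prints [{th 5} {an 2} {he 2}]
}
```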

func (*FrequencyTable) Get ¶

func (ft *FrequencyTable) Get(token string) (Frequency, bool)

Get returns the frequency information for the given token, along with a bool indicating whether the token exists in the table.

func (*FrequencyTable) Len ¶

func (ft *FrequencyTable) Len() int

Len returns the number of Frequency entries in the table.

func (*FrequencyTable) ParseLetterTokens ¶

func (ft *FrequencyTable) ParseLetterTokens(ctx context.Context, input io.Reader, language alphabet.Language,
	tokenSize int) error

ParseLetterTokens is used to parse ngrams for letter combinations of the given tokenSize and language from the io.Reader and then update the frequency table.

func (*FrequencyTable) ParseWordTokens ¶

func (ft *FrequencyTable) ParseWordTokens(ctx context.Context, input io.Reader, language alphabet.Language,
	tokenSize int) error

ParseWordTokens is used to parse ngrams for word combinations of the given tokenSize and language from the io.Reader and then update the frequency table.

func (*FrequencyTable) Save ¶

func (ft *FrequencyTable) Save(w io.Writer) error

Save the frequency table to the io.Writer in the same CSV format used by the Load functions.

func (*FrequencyTable) Tokens ¶

func (ft *FrequencyTable) Tokens() []string

Tokens returns the unique tokens present in the table. NOTE: The order is not guaranteed because the underlying data structure is a map.

func (*FrequencyTable) Update ¶

func (ft *FrequencyTable) Update()

Update will calculate and update the token frequencies.
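
One plausible reading of "frequency" here is each token's share of the total count. The standalone sketch below computes that share as a fraction between 0 and 1; shares is a hypothetical helper, and whether the package stores a fraction or a value scaled to 100 in Frequency.Percentage is not stated in this documentation.

```go
package main

import "fmt"

// shares is a hypothetical helper that returns each token's count divided by
// the sum of all counts. An empty or all-zero input yields an empty map.
func shares(counts map[string]int) map[string]float32 {
	total := 0
	for _, n := range counts {
		total += n
	}
	out := make(map[string]float32, len(counts))
	if total == 0 {
		return out // avoid dividing by zero
	}
	for tok, n := range counts {
		out[tok] = float32(n) / float32(total)
	}
	return out
}

func main() {
	s := shares(map[string]int{"th": 3, "he": 1})
	fmt.Println(s["th"], s["he"]) // prints 0.75 0.25
}
```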

type ProcessorMode ¶

type ProcessorMode bool

ProcessorMode specifies whether the processor works on letter or word ngrams.

const (
	ProcessLetters ProcessorMode = false
	ProcessWords   ProcessorMode = true
)

type RecvTokenFunc ¶

type RecvTokenFunc func(token string, err error) error

RecvTokenFunc is called each time a new token has been parsed from the input stream. The parser passes any error it encounters to this function, in which case you should assume that parsing stops and no more tokens will be produced. If this function returns an error, the parser stops parsing.
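
The callback contract can be sketched without the package itself. In the standalone example below, recvTokenFunc copies the documented signature, emit is a hypothetical stand-in for a parser, and errStop is a hypothetical sentinel; that the real parse functions hand the callback's error back to their caller is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// recvTokenFunc mirrors the documented RecvTokenFunc signature.
type recvTokenFunc func(token string, err error) error

// errStop is a hypothetical sentinel the callback uses to request an early stop.
var errStop = errors.New("stop parsing")

// emit is a stand-in for a parser: it hands each token to recv and stops at
// the first error that recv returns.
func emit(tokens []string, recv recvTokenFunc) error {
	for _, tok := range tokens {
		if err := recv(tok, nil); err != nil {
			return err
		}
	}
	return nil
}

// firstN collects at most n tokens and then asks the producer to stop.
func firstN(tokens []string, n int) ([]string, error) {
	var got []string
	err := emit(tokens, func(token string, err error) error {
		if err != nil {
			return err // the producer reported a failure; pass it along
		}
		if len(got) == n {
			return errStop
		}
		got = append(got, token)
		return nil
	})
	if errors.Is(err, errStop) {
		err = nil // stopping early was intentional, not a failure
	}
	return got, err
}

func main() {
	got, err := firstN([]string{"th", "he", "er", "re"}, 2)
	fmt.Println(got, err) // prints [th he] <nil>
}
```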
