corpus

package
v0.0.0-...-491e816
Warning

This package is not in the latest version of its module.

Published: Sep 18, 2020 | License: MIT | Imports: 18 | Imported by: 10

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CombineInts

func CombineInts(ints []int) int

CombineInts takes an int slice and tries to combine it into one integer. It works by taking advantage of a property of English: any number greater than 1000 is written as a repeated pattern, e.g.

one hundred and fifty thousand two hundred and two

Here there are two repeated patterns: (one hundred and fifty) and (two hundred and two).

This allows the values to be combined repeatedly, by addition or multiplication, until only one integer is left.
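The combining step can be illustrated with a small standalone sketch. It is not the package's implementation: `combineInts` is a hypothetical helper that assumes hundreds multiply the running group, and that values of 1000 or more close the group and add it to the total.

```go
package main

import "fmt"

// combineInts is a hypothetical stand-in for CombineInts, written only to
// illustrate the add-or-multiply idea. The real function may differ.
func combineInts(ints []int) int {
	total, current := 0, 0
	for _, n := range ints {
		switch {
		case n == 100:
			// "hundred" multiplies the group built so far.
			if current == 0 {
				current = 1 // a bare "hundred" is read as one hundred
			}
			current *= 100
		case n >= 1000:
			// "thousand", "million", ... scale the group and close it.
			if current == 0 {
				current = 1
			}
			total += current * n
			current = 0
		default:
			// Units and tens are added to the current group.
			current += n
		}
	}
	return total + current
}

func main() {
	// one hundred (and) fifty thousand two hundred (and) two
	fmt.Println(combineInts([]int{1, 100, 50, 1000, 2, 100, 2})) // 150202
}
```

In this sketch each group of up to three digits is built first and then scaled, which mirrors the repeated pattern described above.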

func CosineSimilarity

func CosineSimilarity(a, b []string) float64

CosineSimilarity measures the cosine similarity between two slices of strings.
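For orientation, here is a minimal standalone sketch of cosine similarity over two token slices. It assumes each slice is treated as a bag of words with raw term counts as vector components; whether the package weights terms this way is not stated here, and `cosineSimilarity` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity is an illustrative bag-of-words version, not the
// package's own code. Each distinct token is one vector dimension.
func cosineSimilarity(a, b []string) float64 {
	countsA := make(map[string]float64)
	countsB := make(map[string]float64)
	for _, t := range a {
		countsA[t]++
	}
	for _, t := range b {
		countsB[t]++
	}

	var dot, normA, normB float64
	for t, ca := range countsA {
		dot += ca * countsB[t] // missing tokens contribute 0
		normA += ca * ca
	}
	for _, cb := range countsB {
		normB += cb * cb
	}
	if normA == 0 || normB == 0 {
		return 0 // by convention here, an empty input is similar to nothing
	}
	return dot / math.Sqrt(normA*normB)
}

func main() {
	a := []string{"the", "cat", "sat"}
	b := []string{"the", "cat", "ran"}
	fmt.Printf("%.3f\n", cosineSimilarity(a, b))
}
```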

func DamerauLevenshtein

func DamerauLevenshtein(s1 string, s2 string) (distance int)

DamerauLevenshtein calculates the Damerau-Levenshtein distance between two strings. See more at https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
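The sketch below shows one common way to compute such a distance. Note that it implements the restricted variant (optimal string alignment), in which no substring is edited more than once. It is a standalone illustration with the hypothetical name `osaDistance`, and it is not claimed to match the package's algorithm.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// osaDistance computes the optimal string alignment distance: insertions,
// deletions, substitutions and transpositions of adjacent runes each cost 1.
func osaDistance(s1, s2 string) int {
	a, b := []rune(s1), []rune(s2)
	d := make([][]int, len(a)+1)
	for i := range d {
		d[i] = make([]int, len(b)+1)
		d[i][0] = i // delete i runes
	}
	for j := 0; j <= len(b); j++ {
		d[0][j] = j // insert j runes
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			d[i][j] = min3(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
			// Adjacent transposition, e.g. "ca" -> "ac".
			if i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
				if t := d[i-2][j-2] + 1; t < d[i][j] {
					d[i][j] = t
				}
			}
		}
	}
	return d[len(a)][len(b)]
}

func main() {
	fmt.Println(osaDistance("ca", "ac"))          // 1
	fmt.Println(osaDistance("kitten", "sitting")) // 3
}
```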

func LongestCommonPrefix

func LongestCommonPrefix(strs ...string) string

LongestCommonPrefix takes any number of strings and finds their longest common prefix.
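A minimal standalone sketch of the idea follows. The helper name `longestCommonPrefix` and the choice to shrink the candidate one rune at a time are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// longestCommonPrefix starts with the first string as the candidate and
// trims it from the right until every other string begins with it.
func longestCommonPrefix(strs ...string) string {
	if len(strs) == 0 {
		return ""
	}
	prefix := strs[0]
	for _, s := range strs[1:] {
		for !strings.HasPrefix(s, prefix) {
			// Drop the last rune so the result stays valid UTF-8.
			_, size := utf8.DecodeLastRuneInString(prefix)
			prefix = prefix[:len(prefix)-size]
		}
	}
	return prefix
}

func main() {
	fmt.Println(longestCommonPrefix("flower", "flow", "flight")) // fl
}
```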

func Pluralize

func Pluralize(word string) string

Pluralize pluralizes words based on known rules.

func Singularize

func Singularize(word string) string

Singularize singularizes words based on known rules.

func StrsToInts

func StrsToInts(strs []string) (retVal []int, err error)

StrsToInts converts a string slice into an int slice, with the help of NumberWords. The function assumes all helper words like "and" have been stripped.

"One hundred and five" -> []string{"one", "hundred", "five"}

This is a very primitive method: it does not handle phrasings such as "a hundred" or "a couple of hundred".
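The lookup described above might look roughly like the following standalone sketch. The `numberWords` table here is a small hypothetical subset defined locally; it is not the package's NumberWords, and `strsToInts` is an illustrative name.

```go
package main

import (
	"fmt"
	"strings"
)

// numberWords is a tiny, hypothetical lookup table for this sketch only.
var numberWords = map[string]int{
	"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
	"six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
	"twenty": 20, "fifty": 50, "hundred": 100, "thousand": 1000,
}

// strsToInts maps each word to its value and fails on the first word it
// does not know, such as an unstripped "and".
func strsToInts(strs []string) ([]int, error) {
	out := make([]int, 0, len(strs))
	for _, s := range strs {
		n, ok := numberWords[strings.ToLower(s)]
		if !ok {
			return nil, fmt.Errorf("unknown number word %q", s)
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	ints, err := strsToInts([]string{"one", "hundred", "five"})
	fmt.Println(ints, err) // [1 100 5] <nil>
}
```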

func ToDict

func ToDict(c *Corpus) map[string]int

ToDict returns a marshalable dict. It returns a copy of the ID mapping.

func ToDictWithFreq

func ToDictWithFreq(c *Corpus) map[string]struct{ ID, Freq int }

ToDictWithFreq returns a simple marshalable type. Conceptually it's a JSON object with the words as the keys. The values are a pair - ID and Freq.
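To make the JSON shape concrete, the sketch below marshals a value of the same map type with the standard library. The sample words and counts are made up, and `marshalDict` is a hypothetical helper; only the map type comes from the signature above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalDict encodes the word -> {ID, Freq} mapping as a JSON object.
// encoding/json sorts map keys, so the output order is stable.
func marshalDict(d map[string]struct{ ID, Freq int }) (string, error) {
	b, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Illustrative data only.
	d := map[string]struct{ ID, Freq int }{
		"hello": {ID: 0, Freq: 2},
		"world": {ID: 1, Freq: 1},
	}
	s, err := marshalDict(d)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(s) // {"hello":{"ID":0,"Freq":2},"world":{"ID":1,"Freq":1}}
}
```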

func ViterbiSplit

func ViterbiSplit(input string, c *Corpus) []string

ViterbiSplit uses the Viterbi algorithm to split the input string into words, given a corpus.
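As a rough illustration of Viterbi-style word splitting, the standalone sketch below picks the segmentation with the highest total log-probability under a unigram model. It takes a plain frequency map instead of a *Corpus, and its scoring, its penalty for unknown runes, and the name `viterbiSplit` are all assumptions of this sketch rather than the package's behaviour.

```go
package main

import (
	"fmt"
	"math"
	"unicode/utf8"
)

// viterbiSplit segments input using dynamic programming: best[i] is the
// best log-probability of any segmentation of the first i runes, and
// back[i] is where the last word of that segmentation starts.
func viterbiSplit(input string, freq map[string]int) []string {
	total, maxLen := 0, 1
	for w, f := range freq {
		total += f
		if n := utf8.RuneCountInString(w); n > maxLen {
			maxLen = n
		}
	}
	// Assumed penalty for a single rune that is not a known word.
	unknown := math.Log(1 / float64(10*(total+1)))

	r := []rune(input)
	best := make([]float64, len(r)+1)
	back := make([]int, len(r)+1)
	for i := 1; i <= len(r); i++ {
		// Fallback: treat the rune ending at i as an unknown word.
		best[i], back[i] = best[i-1]+unknown, i-1
		for j := i - 1; j >= 0 && i-j <= maxLen; j-- {
			f, ok := freq[string(r[j:i])]
			if !ok || f <= 0 {
				continue
			}
			if s := best[j] + math.Log(float64(f)/float64(total)); s > best[i] {
				best[i], back[i] = s, j
			}
		}
	}

	// Walk the back pointers from the end to recover the words in order.
	var words []string
	for i := len(r); i > 0; i = back[i] {
		words = append([]string{string(r[back[i]:i])}, words...)
	}
	return words
}

func main() {
	freq := map[string]int{"the": 50, "cat": 10, "sat": 10, "at": 5, "he": 5}
	fmt.Println(viterbiSplit("thecatsat", freq)) // [the cat sat]
}
```

Limiting candidate words to the longest known word length keeps the inner loop short, which is presumably one use for a value like the one returned by MaxWordLength.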

Types

type ConsOpt

type ConsOpt func(c *Corpus) error

ConsOpt is a construction option for manual creation of a Corpus

func FromDict

func FromDict(d map[string]int) ConsOpt

FromDict is a construction option to take a map[string]int where the int represents the word ID. This is useful for constructing corpuses from foreign sources where the ID mappings are important

func FromDictWithFreq

func FromDictWithFreq(d map[string]struct{ ID, Freq int }) ConsOpt

FromDictWithFreq is like FromDict, but also has a frequency.

func WithOrderedWords

func WithOrderedWords(a []string) ConsOpt

WithOrderedWords creates a Corpus with the given word order

func WithSize

func WithSize(size int) ConsOpt

WithSize preallocates the Corpus's internal data structures for the given size.

func WithWords

func WithWords(a []string) ConsOpt

WithWords creates a corpus from a word list. The list may contain repeated words.

type Corpus

type Corpus struct {
	// contains filtered or unexported fields
}

Corpus is a data structure holding the relevant metadata and information for a corpus of text. It serves as a vocabulary with IDs for lookup. This is very useful because neural networks rely on the IDs rather than the text itself.

func Construct

func Construct(opts ...ConsOpt) (*Corpus, error)

Construct creates a Corpus given the construction options. This allows for more flexibility
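ConsOpt and Construct follow the functional options pattern. The standalone sketch below reproduces that pattern with a hypothetical miniature type: `vocab`, `option`, `withWords` and `construct` are stand-ins invented for illustration, and their behaviour (for example rejecting empty words) is an assumption, not the package's.

```go
package main

import (
	"errors"
	"fmt"
)

// vocab is a hypothetical, minimal stand-in for Corpus.
type vocab struct {
	words []string
	ids   map[string]int
}

// option has the same shape as ConsOpt: it configures the value being
// built and may report an error.
type option func(v *vocab) error

// withWords adds each distinct word in order of first appearance.
func withWords(ws []string) option {
	return func(v *vocab) error {
		for _, w := range ws {
			if w == "" {
				return errors.New("empty word") // assumed rule for this sketch
			}
			if _, ok := v.ids[w]; !ok {
				v.ids[w] = len(v.words)
				v.words = append(v.words, w)
			}
		}
		return nil
	}
}

// construct applies the options in order and stops at the first error.
func construct(opts ...option) (*vocab, error) {
	v := &vocab{ids: make(map[string]int)}
	for _, opt := range opts {
		if err := opt(v); err != nil {
			return nil, err
		}
	}
	return v, nil
}

func main() {
	v, err := construct(withWords([]string{"hello", "world", "hello"}))
	if err != nil {
		fmt.Println("construct failed:", err)
		return
	}
	fmt.Println(len(v.words), v.ids["world"]) // 2 1
}
```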

func GenerateCorpus

func GenerateCorpus(sentenceTags []treebank.SentenceTag) *Corpus

GenerateCorpus creates a Corpus given a set of SentenceTag from a training set.

func New

func New() *Corpus

New creates a new *Corpus

func (*Corpus) Add

func (c *Corpus) Add(word string) int

Add adds a word to the corpus and returns its ID. If a word was previously in the corpus, it merely updates the frequency count and returns the ID
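The add-or-increment behaviour described above can be sketched with a small standalone type. Here `vocab` is a hypothetical stand-in for Corpus, and IDs in the sketch start at 0; the real Corpus may assign IDs differently.

```go
package main

import "fmt"

// vocab is a hypothetical miniature vocabulary, not the package's Corpus.
type vocab struct {
	ids   map[string]int // word -> ID
	words []string       // ID -> word
	freqs []int          // ID -> number of times the word was added
	total int            // number of add calls, repeats included
}

func newVocab() *vocab {
	return &vocab{ids: make(map[string]int)}
}

// add returns the existing ID and bumps the count for a known word,
// or assigns the next free ID to a new word.
func (v *vocab) add(word string) int {
	v.total++
	if id, ok := v.ids[word]; ok {
		v.freqs[id]++
		return id
	}
	id := len(v.words)
	v.ids[word] = id
	v.words = append(v.words, word)
	v.freqs = append(v.freqs, 1)
	return id
}

// wordFreq returns 0 for words that were never added.
func (v *vocab) wordFreq(word string) int {
	if id, ok := v.ids[word]; ok {
		return v.freqs[id]
	}
	return 0
}

func main() {
	v := newVocab()
	fmt.Println(v.add("hello"), v.add("world"), v.add("hello")) // 0 1 0
	fmt.Println(v.wordFreq("hello"), v.wordFreq("nope"), v.total) // 2 0 3
}
```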

func (*Corpus) GobDecode

func (c *Corpus) GobDecode(buf []byte) error

GobDecode implements GobDecoder for *Corpus

func (*Corpus) GobEncode

func (c *Corpus) GobEncode() ([]byte, error)

GobEncode implements GobEncoder for *Corpus

func (*Corpus) IDFreq

func (c *Corpus) IDFreq(id int) int

IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.

func (*Corpus) Id

func (c *Corpus) Id(word string) (int, bool)

Id returns the ID of a word and whether or not it was found in the corpus.

func (*Corpus) LoadOneGram

func (c *Corpus) LoadOneGram(r io.Reader) error

LoadOneGram loads a 1_gram.txt file, a tab-separated file that lists the frequency counts of words. Example:

the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709
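Reading a file in this format might look roughly like the standalone sketch below. It is not LoadOneGram itself: `parseOneGram` and `parseOneGramString` are hypothetical helpers that return a plain map, and the choices to skip blank lines and reject malformed ones are assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// parseOneGram reads "word<TAB>count" lines into a frequency map.
// Counts are parsed as int64 because the sample values exceed 32 bits.
func parseOneGram(r io.Reader) (map[string]int64, error) {
	freqs := make(map[string]int64)
	sc := bufio.NewScanner(r)
	for lineNo := 1; sc.Scan(); lineNo++ {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // assumed: blank lines are ignored
		}
		fields := strings.SplitN(line, "\t", 2)
		if len(fields) != 2 {
			return nil, fmt.Errorf("line %d: expected word<TAB>count", lineNo)
		}
		n, err := strconv.ParseInt(strings.TrimSpace(fields[1]), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("line %d: %w", lineNo, err)
		}
		freqs[fields[0]] += n
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return freqs, nil
}

// parseOneGramString is a convenience wrapper for in-memory data.
func parseOneGramString(data string) (map[string]int64, error) {
	return parseOneGram(strings.NewReader(data))
}

func main() {
	freqs, err := parseOneGramString("the\t23135851162\nof\t13151942776\n")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(freqs["the"], freqs["of"]) // 23135851162 13151942776
}
```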

func (*Corpus) MaxWordLength

func (c *Corpus) MaxWordLength() int

MaxWordLength returns the length of the longest known word in the corpus.

func (*Corpus) Merge

func (c *Corpus) Merge(other *Corpus)

Merge combines two corpuses. The receiver is the one that is mutated.

func (*Corpus) Replace

func (c *Corpus) Replace(a, with string) error

Replace replaces the content of a word. The old reference remains.

For example, after c.Replace("foo", "bar"), c.Id("foo") still returns an ID, and that ID is the same as the one returned by c.Id("bar").

func (*Corpus) ReplaceWord

func (c *Corpus) ReplaceWord(id int, with string) error

ReplaceWord replaces the word associated with the given ID. The old reference remains.

func (*Corpus) Size

func (c *Corpus) Size() int

Size returns the size of the corpus.

func (*Corpus) TotalFreq

func (c *Corpus) TotalFreq() int

TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.

func (*Corpus) Word

func (c *Corpus) Word(id int) (string, bool)

Word returns the word given the ID, and whether or not it was found in the corpus

func (*Corpus) WordFreq

func (c *Corpus) WordFreq(word string) int

WordFreq returns the frequency of the word. If the word wasn't in the corpus, it returns 0.

func (*Corpus) WordProb

func (c *Corpus) WordProb(word string) (float64, bool)

WordProb returns the probability of a word appearing in the corpus.

type LDAModel

type LDAModel struct {
	// params
	Alpha tensor.Tensor // is a Row
	Eta   tensor.Tensor // is a Col

	// parameters needed for working
	Topics      int
	ChunkSize   int
	Terms       int
	UpdateEvery int
	EvalEvery   int

	// consts
	Iterations     int
	GammaThreshold float64

	MinimumProb float64

	// track current progress
	Updates int

	// type
	Dtype tensor.Dtype
}

LDAModel ... TODO https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
