Documentation ¶
Index ¶
- func CombineInts(ints []int) int
- func CosineSimilarity(a, b []string) float64
- func DamerauLevenshtein(s1 string, s2 string) (distance int)
- func LongestCommonPrefix(strs ...string) string
- func Pluralize(word string) string
- func Singularize(word string) string
- func StrsToInts(strs []string) (retVal []int, err error)
- func ViterbiSplit(input string, c *Corpus) []string
- type ConsOpt
- type Corpus
- func (c *Corpus) Add(word string) int
- func (c *Corpus) GobDecode(buf []byte) error
- func (c *Corpus) GobEncode() ([]byte, error)
- func (c *Corpus) IDFreq(id int) int
- func (c *Corpus) Id(word string) (int, bool)
- func (c *Corpus) LoadOneGram(r io.Reader) error
- func (c *Corpus) MaxWordLength() int
- func (c *Corpus) Merge(other *Corpus)
- func (c *Corpus) Size() int
- func (c *Corpus) TotalFreq() int
- func (c *Corpus) Word(id int) (string, bool)
- func (c *Corpus) WordFreq(word string) int
- func (c *Corpus) WordProb(word string) (float64, bool)
- type LDAModel
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CombineInts ¶
CombineInts takes an int slice and tries to combine it into one integer. It works by taking advantage of English: anything larger than 1000 has a repeated pattern, e.g.
one hundred and fifty thousand two hundred and two
has two repeated patterns: (one hundred and fifty) and (two hundred and two).
This allows us to repeatedly combine the values by addition or multiplication until only one integer is left.
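The combining step might look roughly like the sketch below. This is an illustration of the idea only, not the package's implementation; it assumes an input such as []int{1, 100, 50, 1000, 2, 100, 2} for the example above, as produced by a NumberWords-style mapping.

// combineInts is a hypothetical sketch: accumulate a group, multiply it out
// when a large unit such as 1000 is reached, and add the groups together.
func combineInts(ints []int) int {
	total, current := 0, 0
	for _, n := range ints {
		switch {
		case n == 100:
			if current == 0 {
				current = 1
			}
			current *= n
		case n >= 1000:
			if current == 0 {
				current = 1
			}
			total += current * n
			current = 0
		default:
			current += n
		}
	}
	return total + current
}

// combineInts([]int{1, 100, 50, 1000, 2, 100, 2}) == 150202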
func CosineSimilarity ¶
CosineSimilarity measures the cosine similarity of two strings, each given as a slice of tokens.
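For reference, the metric itself can be sketched as follows: build term-frequency vectors for the two token slices and take their normalised dot product. This is a sketch of the idea only, not necessarily how the package computes it; it uses the standard library's "math" package.

// cosineSim sketches cosine similarity over two token slices.
func cosineSim(a, b []string) float64 {
	freqA, freqB := map[string]float64{}, map[string]float64{}
	for _, w := range a {
		freqA[w]++
	}
	for _, w := range b {
		freqB[w]++
	}
	var dot, normA, normB float64
	for w, fa := range freqA {
		dot += fa * freqB[w]
		normA += fa * fa
	}
	for _, fb := range freqB {
		normB += fb * fb
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}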
func DamerauLevenshtein ¶
DamerauLevenshtein calculates the Damerau-Levenshtein distance between two strings. See more at https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
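For reference, the commonly used optimal string alignment variant of the distance can be sketched as below. This illustrates the algorithm; the package's own implementation may differ in detail, including which variant it computes.

// damerauLevenshtein sketches the optimal string alignment variant:
// Levenshtein edits (insert, delete, substitute) plus adjacent transpositions.
func damerauLevenshtein(s1, s2 string) int {
	r1, r2 := []rune(s1), []rune(s2)
	d := make([][]int, len(r1)+1)
	for i := range d {
		d[i] = make([]int, len(r2)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(r2); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(r1); i++ {
		for j := 1; j <= len(r2); j++ {
			cost := 1
			if r1[i-1] == r2[j-1] {
				cost = 0
			}
			best := d[i-1][j] + 1 // deletion
			if ins := d[i][j-1] + 1; ins < best {
				best = ins // insertion
			}
			if sub := d[i-1][j-1] + cost; sub < best {
				best = sub // substitution or match
			}
			if i > 1 && j > 1 && r1[i-1] == r2[j-2] && r1[i-2] == r2[j-1] {
				if tr := d[i-2][j-2] + 1; tr < best {
					best = tr // adjacent transposition
				}
			}
			d[i][j] = best
		}
	}
	return d[len(r1)][len(r2)]
}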
func LongestCommonPrefix ¶
LongestCommonPrefix takes a list of strings and finds their longest common prefix.
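A minimal sketch of the idea, not necessarily the package's implementation (uses the standard library's "strings" package):

// longestCommonPrefix shrinks a candidate prefix until every string starts with it.
func longestCommonPrefix(strs ...string) string {
	if len(strs) == 0 {
		return ""
	}
	prefix := strs[0]
	for _, s := range strs[1:] {
		for !strings.HasPrefix(s, prefix) {
			prefix = prefix[:len(prefix)-1]
			if prefix == "" {
				return ""
			}
		}
	}
	return prefix
}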
func Singularize ¶
Singularize returns the singular form of a word, based on a known set of rules.
func StrsToInts ¶
StrsToInts converts a string slice into an int slice, with the help of NumberWords. The function assumes all helper words like "and" have been stripped.
"One hundred and five" -> []string{"one", "hundred", "five"}
This is a very primitive method and doesn't take into account other constructions like "a hundred" or "a couple of hundred".
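A hedged usage sketch of chaining StrsToInts and CombineInts; the intermediate values shown are assumptions based on the descriptions above, and error handling is abbreviated.

words := []string{"one", "hundred", "five"} // "One hundred and five" with "and" stripped
ints, err := StrsToInts(words)              // presumably []int{1, 100, 5}
if err != nil {
	log.Fatal(err)
}
fmt.Println(CombineInts(ints)) // expected: 105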
func ViterbiSplit ¶
ViterbiSplit uses the Viterbi algorithm to split a string into words, given a corpus.
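A hedged usage sketch, assuming c is a *Corpus that has already been populated with word frequencies (for example via LoadOneGram); the exact output depends on the corpus.

words := ViterbiSplit("thequickbrownfox", c)
fmt.Println(words) // likely [the quick brown fox], given a suitable corpus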
Types ¶
type Corpus ¶
type Corpus struct {
// contains filtered or unexported fields
}
Corpus is a data structure holding the relevant metadata and information for a corpus of text. It serves as a vocabulary, mapping each word to an ID for lookup. This is very useful because neural networks rely on the IDs rather than on the text itself.
func Construct ¶
Construct creates a Corpus given the construction options (ConsOpt). This allows for more flexibility in how the Corpus is built.
func GenerateCorpus ¶
func GenerateCorpus(sentenceTags []treebank.SentenceTag) *Corpus
GenerateCorpus creates a Corpus given a set of treebank.SentenceTags from a training set.
func (*Corpus) Add ¶
Add adds a word to the corpus and returns its ID. If the word was already in the corpus, it merely updates the frequency count and returns the existing ID.
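A hedged usage sketch of the round trip described above, based on the method signatures in the index; construction of c is elided (see Construct and GenerateCorpus).

id := c.Add("hello") // stores the word and returns its ID
_ = c.Add("hello")   // already present: only the frequency count is updated
if got, ok := c.Id("hello"); ok {
	fmt.Println(got == id) // true
}
if w, ok := c.Word(id); ok {
	fmt.Println(w) // "hello"
}
fmt.Println(c.WordFreq("hello")) // expected: 2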
func (*Corpus) IDFreq ¶
IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.
func (*Corpus) LoadOneGram ¶
LoadOneGram loads a 1_gram.txt file, a tab-separated file listing the frequency counts of words. Example:
the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709
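A hedged usage sketch; any io.Reader works, a file is used here only for illustration and error handling is abbreviated.

f, err := os.Open("1_gram.txt")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
if err := c.LoadOneGram(f); err != nil {
	log.Fatal(err)
}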
func (*Corpus) MaxWordLength ¶
MaxWordLength returns the length of the longest known word in the corpus.
func (*Corpus) TotalFreq ¶
TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.
func (*Corpus) Word ¶
Word returns the word associated with the given ID, and whether or not it was found in the corpus.
type LDAModel ¶
type LDAModel struct {
	// params
	Alpha tensor.Tensor   // is a Row
	Eta   tensor.Tensor   // is a Col
	Kappa gorgonia.Scalar // Decay
	Tau0  gorgonia.Scalar // offset

	// parameters needed for working
	Topics      int
	ChunkSize   int
	Terms       int
	UpdateEvery int
	EvalEvery   int

	// consts
	Iterations     int
	GammaThreshold float64
	MinimumProb    float64

	// track current progress
	Updates int

	// type
	Dtype tensor.Dtype
}
LDAModel ... TODO https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation