Documentation ¶
Index ¶
- Variables
- func ToDict(c *Corpus) map[string]int
- func ToDictWithFreq(c *Corpus) map[string]struct{ ... }
- func ViterbiSplit(input string, c *Corpus) []string
- type ConsOpt
- type Corpus
- func (c *Corpus) Add(word string) int
- func (c *Corpus) GobDecode(buf []byte) error
- func (c *Corpus) GobEncode() ([]byte, error)
- func (c *Corpus) IDFreq(id int) int
- func (c *Corpus) Id(word string) (int, bool)
- func (c *Corpus) LoadOneGram(r io.Reader) error
- func (c *Corpus) MaxWordLength() int
- func (c *Corpus) Merge(other *Corpus)
- func (c *Corpus) Replace(a, with string) error
- func (c *Corpus) ReplaceWord(id int, with string) error
- func (c *Corpus) Size() int
- func (c *Corpus) TotalFreq() int
- func (c *Corpus) Word(id int) (string, bool)
- func (c *Corpus) WordFreq(word string) int
- func (c *Corpus) WordProb(word string) (float64, bool)
Constants ¶
This section is empty.
Variables ¶
var NumberWords = map[string]int{
"zero": 0,
"one": 1,
"two": 2,
"three": 3,
"four": 4,
"five": 5,
"six": 6,
"seven": 7,
"eight": 8,
"nine": 9,
"ten": 10,
"eleven": 11,
"twelve": 12,
"thirteen": 13,
"fourteen": 14,
"fifteen": 15,
"sixteen": 16,
"nineteen": 19,
"seventeen": 17,
"eighteen": 18,
"twenty": 20,
"thirty": 30,
"forty": 40,
"fifty": 50,
"sixty": 60,
"seventy": 70,
"eighty": 80,
"ninety": 90,
"hundred": 100,
"thousand": 1000,
"million": 1000000,
"billion": 1000000000,
"trillion": 1000000000000,
"quadrillion": 1000000000000000,
}
NumberWords was generated with the following Python code:
numberWords = {}
simple = '''zero one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty'''.split()
for i, word in zip(xrange(0, 20+1), simple):
    numberWords[word] = i

tense = '''thirty forty fifty sixty seventy eighty ninety hundred'''.split()
for i, word in zip(xrange(30, 100+1, 10), tense):
    numberWords[word] = i

larges = '''thousand million billion trillion quadrillion quintillion sextillion septillion'''.split()
for i, word in zip(xrange(3, 24+1, 3), larges):
    numberWords[word] = 10**i
Functions ¶
func ToDictWithFreq ¶
ToDictWithFreq returns a simple marshalable type. Conceptually it's a JSON object with the words as the keys. The values are a pair - ID and Freq.
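As a rough illustration of the shape described above, the sketch below marshals a word-keyed map whose values carry an ID and a frequency. The `entry` type and the `marshalDict` helper are hypothetical names chosen for this example; the field names follow the description (ID and Freq) and are not taken from the package source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry is a hypothetical stand-in for the value type returned by
// ToDictWithFreq. The field names are an assumption based on the
// description "ID and Freq".
type entry struct {
	ID   int
	Freq int
}

// marshalDict encodes a word -> (ID, Freq) map as a JSON object whose keys
// are the words. encoding/json writes map keys in sorted order.
func marshalDict(d map[string]entry) (string, error) {
	b, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	d := map[string]entry{
		"hello": {ID: 0, Freq: 3},
		"world": {ID: 1, Freq: 1},
	}
	s, err := marshalDict(d)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```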
func ViterbiSplit ¶
ViterbiSplit uses the Viterbi algorithm to split an input string into words, given a corpus.
Types ¶
type ConsOpt ¶
ConsOpt is a construction option for manual creation of a Corpus
func FromDict ¶
FromDict is a construction option that takes a map[string]int, where the int is the word ID. This is useful for constructing corpuses from foreign sources where the ID mappings are important.
func FromDictWithFreq ¶
FromDictWithFreq is like FromDict, but each word also carries a frequency.
func WithOrderedWords ¶
WithOrderedWords creates a Corpus with the given word order.
type Corpus ¶
type Corpus struct {
// contains filtered or unexported fields
}
Corpus is a data structure holding the relevant metadata and information for a corpus of text. It serves as a vocabulary with IDs for lookup. This is very useful because neural networks rely on the IDs rather than on the text itself.
func Construct ¶
Construct creates a Corpus from the given construction options. This allows for more flexibility.
func FromTextCorpus ¶
func FromTextCorpus(r io.Reader, tokenizer func(a string) []string, normalizer func(a string) string) (*Corpus, error)
FromTextCorpus is a utility function that reads a text file from r and returns a Corpus.
func (*Corpus) Add ¶
Add adds a word to the corpus and returns its ID. If the word was already in the corpus, Add merely updates the frequency count and returns the existing ID.
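The behaviour described for Add can be sketched with a small stand-alone vocabulary. The `vocab` type, its fields, and the `add` method are hypothetical names for this example only. Starting IDs at 0 is also an assumption, since the real Corpus may reserve some IDs.

```go
package main

import "fmt"

// vocab is a hypothetical, minimal vocabulary: it maps words to IDs and
// tracks how often each word has been added. Not the package's Corpus.
type vocab struct {
	ids   map[string]int // word -> ID
	words []string       // ID -> word
	freqs []int          // ID -> frequency
	total int            // total words seen, repeats included
}

func newVocab() *vocab { return &vocab{ids: make(map[string]int)} }

// add returns the ID of word. A new word gets the next free ID; a known
// word only has its frequency bumped and keeps its existing ID.
func (v *vocab) add(word string) int {
	v.total++
	if id, ok := v.ids[word]; ok {
		v.freqs[id]++
		return id
	}
	id := len(v.words) // assumption: IDs are assigned sequentially from 0
	v.ids[word] = id
	v.words = append(v.words, word)
	v.freqs = append(v.freqs, 1)
	return id
}

func main() {
	v := newVocab()
	for _, w := range []string{"a", "rose", "is", "a", "rose"} {
		fmt.Println(w, v.add(w))
	}
	fmt.Println("distinct:", len(v.words), "total:", v.total)
}
```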
func (*Corpus) IDFreq ¶
IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.
func (*Corpus) LoadOneGram ¶
LoadOneGram loads a 1_gram.txt file, a tab-separated file that lists the frequency counts of words. Example:
the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709
func (*Corpus) MaxWordLength ¶
MaxWordLength returns the length of the longest known word in the corpus.
func (*Corpus) Replace ¶
Replace replaces the content of a word. The old reference remains.
For example, after c.Replace("foo", "bar"), c.Id("foo") still returns an ID, and it is the same ID that c.Id("bar") returns.
func (*Corpus) ReplaceWord ¶
ReplaceWord replaces the word associated with the given ID. The old reference remains.
func (*Corpus) TotalFreq ¶
TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.
func (*Corpus) Word ¶
Word returns the word for the given ID, and whether or not it was found in the corpus.