Documentation ¶
Overview ¶
Package text holds models which make text classification easy. They are regular models, but take strings as arguments so you can feed in documents rather than large, hand-constructed word vectors. Although models might represent the words as these vectors, the munging of a document is hidden from the user.
The simplest model, although surprisingly effective, is Naive Bayes. If you want to read more about the specific model, check out the docs for the NaiveBayes struct/model.
The following example is an online Naive Bayes model used for sentiment analysis.
Example Online Naive Bayes Text Classifier (multiclass):
// create the channel of data and errors
stream := make(chan base.TextDatapoint, 100)
errors := make(chan error)

// make a new NaiveBayes model with
// 2 classes expected (classes in
// datapoints will now expect {0,1}.
// in general, given n as the classes
// variable, the model will expect
// datapoint classes in {0,...,n-1})
//
// Note that the model is filtering
// the text to omit anything except
// words and numbers (and spaces
// obviously)
model := NewNaiveBayes(stream, 2, base.OnlyWordsAndNumbers)

go model.OnlineLearn(errors)

stream <- base.TextDatapoint{
    X: "I love the city",
    Y: 1,
}

stream <- base.TextDatapoint{
    X: "I hate Los Angeles",
    Y: 0,
}

stream <- base.TextDatapoint{
    X: "My mother is not a nice lady",
    Y: 0,
}

close(stream)

for {
    err, more := <-errors
    if more {
        fmt.Printf("Error passed: %v\n", err)
    } else {
        // training is done!
        break
    }
}

// now you can predict like normal
class := model.Predict("My mother is in Los Angeles") // 0
Index ¶
- type Frequencies
- type Frequency
- type NaiveBayes
- func (b *NaiveBayes) OnlineLearn(errors chan<- error)
- func (b *NaiveBayes) PersistToFile(path string) error
- func (b *NaiveBayes) Predict(sentence string) uint8
- func (b *NaiveBayes) Probability(sentence string) (uint8, float64)
- func (b *NaiveBayes) Restore(data []byte) error
- func (b *NaiveBayes) RestoreFromFile(path string) error
- func (b *NaiveBayes) RestoreWithFuncs(data io.Reader, sanitizer func(rune) bool, tokenizer Tokenizer) error
- func (b *NaiveBayes) String() string
- func (b *NaiveBayes) UpdateSanitize(sanitize func(rune) bool)
- func (b *NaiveBayes) UpdateStream(stream chan base.TextDatapoint)
- func (b *NaiveBayes) UpdateTokenizer(tokenizer Tokenizer)
- type SimpleTokenizer
- type TFIDF
- type Tokenizer
- type Word
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Frequencies ¶
type Frequencies []Frequency
Frequencies is a slice of word frequencies (stored as a separate type so it can be sorted)
func TermFrequencies ¶
func TermFrequencies(document []string) Frequencies
TermFrequencies gives the term frequency of every word in a document, and is more efficient than computing each word's frequency individually
func (Frequencies) Less ¶
func (f Frequencies) Less(i, j int) bool
Less reports whether the ith element of a frequency list is less than the jth element, comparing their TFIDF values
func (Frequencies) Swap ¶
func (f Frequencies) Swap(i, j int)
Swap swaps two indexed values in a frequency slice
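For example, Less and Swap let a Frequencies slice plug into the standard sort package. A minimal sketch, assuming Frequencies also carries the Len method that sort.Interface requires (Len is not listed in the index above):

// sort term frequencies from least to most important
freqs := TermFrequencies([]string{"i", "love", "the", "city", "the", "sea"})
sort.Sort(freqs)

for _, f := range freqs {
    fmt.Printf("%s\t%.4f\n", f.Word, f.Frequency)
}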
type Frequency ¶
type Frequency struct {
    Word      string  `json:"word"`
    Frequency float64 `json:"frequency,omitempty"`
    TFIDF     float64 `json:"tfidf_score,omitempty"`
}
Frequency holds word frequency information so you don't have to hold a map[string]float64 and can sort the frequencies instead
type NaiveBayes ¶
type NaiveBayes struct {
    // Words holds a map of words
    // to their corresponding Word
    // structure
    Words concurrentMap `json:"words"`

    // Count holds the number of times
    // class i was seen as Count[i]
    Count []uint64 `json:"count"`

    // Probabilities holds the probability
    // that class Y is class i as
    // Probabilities[i]
    Probabilities []float64 `json:"probabilities"`

    // DocumentCount holds the number of
    // documents that have been seen
    DocumentCount uint64 `json:"document_count"`

    // DictCount holds the size of the
    // NaiveBayes model's vocabulary
    DictCount uint64 `json:"vocabulary_size"`

    // Tokenizer is used by the model
    // to split the input into tokens
    Tokenizer Tokenizer `json:"tokenizer"`

    // Output is the io.Writer used for logging
    // and printing. Defaults to os.Stdout.
    Output io.Writer `json:"-"`
    // contains filtered or unexported fields
}
NaiveBayes is a general classification model that calculates the probability that a datapoint is part of a class by using Bayes Rule:
P(y|x) = P(x|y)*P(y)/P(x)
The unique part of this model is that it assumes words are unrelated to each other. For example, the probability of seeing the word 'penis' in spam emails if you've already seen 'viagra' might be different than if you hadn't seen it. The model ignores this fact because computing the full Bayesian model would take much longer, and the cost would grow significantly with each word you see.
https://en.wikipedia.org/wiki/Naive_Bayes_classifier http://cs229.stanford.edu/notes/cs229-notes2.pdf
Based on Bayes Rule, we can easily calculate the numerator: P(x|y) is just the number of times x is seen when class = y, and P(y) is just the number of times y = class divided by the total number of training examples/words. The denominator is also easy to calculate, but if you recognize that it's just a constant, because it's just the probability of seeing a certain document given the dataset, we can make the following transformation to be able to classify without as much computation:
Class(x) = argmax_c{P(y = c) * ∏P(x|y = c)}
And we can use logarithmic transformations to make this calculation more practical computationally (multiplying many probabilities on [0,1] will always result in a very small number, which could easily underflow the float value):
Class(x) = argmax_c{log(P(y = c)) + Σ log(P(x|y = c))}
Much better. That's our model!
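To make that last equation concrete, here is a small self-contained sketch (hypothetical numbers, not the package's internal code) of picking a class in log space:

// hypothetical trained values: priors[c] = P(y = c),
// likelihood[c][word] = P(word | y = c)
priors := []float64{0.5, 0.5}
likelihood := []map[string]float64{
    {"hate": 0.10, "love": 0.01},
    {"hate": 0.01, "love": 0.10},
}

doc := []string{"love", "love", "hate"}

// Class(x) = argmax_c{log(P(y = c)) + Σ log(P(x|y = c))}
best, bestScore := 0, math.Inf(-1)
for c := range priors {
    score := math.Log(priors[c])
    for _, w := range doc {
        if p, ok := likelihood[c][w]; ok {
            score += math.Log(p)
        }
    }
    if score > bestScore {
        best, bestScore = c, score
    }
}
fmt.Println("predicted class:", best) // 1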
func NewNaiveBayes ¶
func NewNaiveBayes(stream <-chan base.TextDatapoint, classes uint8, sanitize func(rune) bool) *NaiveBayes
NewNaiveBayes returns a NaiveBayes model with the given number of classes instantiated, ready to learn off the given data stream. The sanitization function is set to the given function; it must comply with the transform.RemoveFunc interface
func (*NaiveBayes) OnlineLearn ¶
func (b *NaiveBayes) OnlineLearn(errors chan<- error)
OnlineLearn lets the NaiveBayes model learn from the datastream, waiting for new data to come into the stream from a separate goroutine
func (*NaiveBayes) PersistToFile ¶
func (b *NaiveBayes) PersistToFile(path string) error
PersistToFile takes in an absolute filepath and saves the parameter vector θ to the file, which can be restored later. The function will also accept paths relative to the current directory
The data is stored as JSON because it's one of the most efficient storage methods (you only need one extra comma per feature, plus two brackets, total!) and it's extensible.
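A minimal sketch of a save/restore round trip using PersistToFile and RestoreFromFile (the path is illustrative; the restored model gets a nil stream because it is only used for prediction here):

// save the trained model
if err := model.PersistToFile("/tmp/naivebayes.json"); err != nil {
    fmt.Printf("could not persist model: %v\n", err)
}

// later, restore into a fresh model
restored := NewNaiveBayes(nil, 2, base.OnlyWordsAndNumbers)
if err := restored.RestoreFromFile("/tmp/naivebayes.json"); err != nil {
    fmt.Printf("could not restore model: %v\n", err)
}

class := restored.Predict("I love the city") // 1, given the training data above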
func (*NaiveBayes) Predict ¶
func (b *NaiveBayes) Predict(sentence string) uint8
Predict takes in a document and returns the class estimated for that document, based on the training data passed so far.
func (*NaiveBayes) Probability ¶
func (b *NaiveBayes) Probability(sentence string) (uint8, float64)
Probability takes in a small document and returns the estimated class of the document based on the model, as well as the probability that the document is part of that class

NOTE: you should only use this for small documents because, as discussed in the docs for the model, the probability will oftentimes underflow: you are multiplying together many probabilities which range on [0,1]. As such, the returned float could be NaN, and the predicted class could always be 0.

Basically, use Predict to be robust for larger documents. Use Probability only on relatively small documents (a MAX of maybe a dozen words - basically just sentences and phrases).
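For instance, with the trained model from the Overview example:

// Predict is robust for documents of any length
class := model.Predict("I love the sunshine in this city")

// Probability also returns the estimated probability,
// but should only be used on short inputs
class, p := model.Probability("I love the city")
fmt.Printf("class %d with probability %.4f\n", class, p)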
func (*NaiveBayes) Restore ¶
func (b *NaiveBayes) Restore(data []byte) error
Restore takes the bytes of a NaiveBayes model and restores a model from them. It defaults the sanitizer to base.OnlyWordsAndNumbers and the tokenizer to a SimpleTokenizer that splits on spaces.
This would be useful if training a model and saving it into a project using go-bindata (look it up) so you don't have to persist a large file and deal with paths on a production system. This option is included in text models vs. others because the text models usually have much larger storage requirements.
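A sketch of that workflow; the Asset function and asset name below are go-bindata conventions, not part of this package:

// Asset is generated by go-bindata from your saved model file
data, err := Asset("models/naivebayes.json")
if err != nil {
    panic(err)
}

model := NewNaiveBayes(nil, 2, base.OnlyWordsAndNumbers)
if err := model.Restore(data); err != nil {
    panic(err)
}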
func (*NaiveBayes) RestoreFromFile ¶
func (b *NaiveBayes) RestoreFromFile(path string) error
RestoreFromFile takes in a path to a saved parameter vector θ and assigns it to the model it's operating on. The only parameters not in the vector are the sanitization and tokenization functions, which default to base.OnlyWordsAndNumbers and SimpleTokenizer{SplitOn: " "}

The path must be an absolute path or a path from the current directory

This would be useful for persisting a model between runs.
func (*NaiveBayes) RestoreWithFuncs ¶
func (b *NaiveBayes) RestoreWithFuncs(data io.Reader, sanitizer func(rune) bool, tokenizer Tokenizer) error
RestoreWithFuncs takes raw JSON data of a model and restores a model from it. The tokenizer and sanitizer passed in will be assigned to the restored model.
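For example, restoring from any io.Reader while supplying the funcs explicitly (the file path is illustrative):

f, err := os.Open("/tmp/naivebayes.json")
if err != nil {
    panic(err)
}
defer f.Close()

model := NewNaiveBayes(nil, 2, base.OnlyWordsAndNumbers)
err = model.RestoreWithFuncs(f, base.OnlyWordsAndNumbers, &SimpleTokenizer{SplitOn: " "})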
func (*NaiveBayes) String ¶
func (b *NaiveBayes) String() string
String implements the fmt.Stringer interface for clean printing of the model.
func (*NaiveBayes) UpdateSanitize ¶
func (b *NaiveBayes) UpdateSanitize(sanitize func(rune) bool)
UpdateSanitize updates the NaiveBayes model's text sanitization transformation function
func (*NaiveBayes) UpdateStream ¶
func (b *NaiveBayes) UpdateStream(stream chan base.TextDatapoint)
UpdateStream updates the NaiveBayes model's text datastream
func (*NaiveBayes) UpdateTokenizer ¶
func (b *NaiveBayes) UpdateTokenizer(tokenizer Tokenizer)
UpdateTokenizer updates NaiveBayes model's tokenizer function. The default implementation will convert the input to lower case and split on the space character.
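For example, to tokenize on commas instead of the default space:

model.UpdateTokenizer(&SimpleTokenizer{SplitOn: ","})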
type SimpleTokenizer ¶
type SimpleTokenizer struct {
SplitOn string
}
SimpleTokenizer splits sentences into tokens delimited by its SplitOn string – space, for example
func (*SimpleTokenizer) Tokenize ¶
func (t *SimpleTokenizer) Tokenize(sentence string) []string
Tokenize splits input sentences into a lowercase slice of strings, using the tokenizer's SplitOn string as the delimiter.
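For example:

t := &SimpleTokenizer{SplitOn: " "}
fmt.Println(t.Tokenize("I Love The City")) // [i love the city]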
type TFIDF ¶
type TFIDF NaiveBayes
TFIDF is a Term Frequency - Inverse Document Frequency model that is created from a trained NaiveBayes model (they are very similar, so you can just train a NaiveBayes model and convert it into a TFIDF model)

This is not necessarily a probabilistic model, and it doesn't give classifications. It can be used to determine the 'importance' of a word in a document, though, which is useful in, say, keyword tagging.
Term frequency is basically just adjusted frequency of a word within a document/sentence: termFrequency(word, doc) = 0.5 + 0.5 * word.Count / max{ w.Count | w ∈ doc }
Inverse document frequency is basically how rarely the term is mentioned across all of your documents: invDocumentFrequency(word, Docs) = log( len(Docs) ) - log( 1 + |{ d ∈ Docs | word ∈ d }| )
TFIDF is the multiplication of those two functions, giving you a score that is larger when the word is more important and smaller when the word is less important
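A tiny worked example of those two formulas (the counts and corpus size are made up for illustration):

// term counts in the document "the cat sat on the mat"
counts := map[string]float64{"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
maxCount := 2.0

// termFrequency("cat", doc) = 0.5 + 0.5*1/2 = 0.75
tf := 0.5 + 0.5*counts["cat"]/maxCount

// with 10 documents, 2 of which contain "cat":
// idf = log(10) - log(1 + 2) ≈ 1.2040
idf := math.Log(10) - math.Log(1+2)

fmt.Printf("tfidf(cat) = %.4f\n", tf*idf) // ≈ 0.9030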
func (*TFIDF) InverseDocumentFrequency ¶
func (t *TFIDF) InverseDocumentFrequency(word string) float64

InverseDocumentFrequency returns the 'uniqueness' of a word within the corpus defined within a trained NaiveBayes model.
Look at the TFIDF docs to see more about how this is calculated
func (*TFIDF) MostImportantWords ¶
func (t *TFIDF) MostImportantWords(sentence string, n int) Frequencies
MostImportantWords runs TFIDF on a whole document, returning the n most important words in the document. If n is greater than the number of words then all words will be returned.
The returned keyword slice is sorted by importance
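For example, converting the trained NaiveBayes model from the Overview example (the conversion is a plain type conversion, since TFIDF is defined as a NaiveBayes):

// convert the trained model and pull keywords
tfidf := TFIDF(*model)

// 3 most important words, sorted by importance
for _, w := range tfidf.MostImportantWords("I love the sunny city by the sea", 3) {
    fmt.Printf("%s\t%.4f\n", w.Word, w.TFIDF)
}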
type Word ¶
type Word struct {
    // Count holds the number of times
    // the word was seen in each class
    // (i in Count[i] is the given class)
    Count []uint64

    // Seen holds the number of times
    // the word has been seen. This
    // is the same as
    //     foldl (+) 0 Count
    // in Haskell syntax, but is included
    // so you wouldn't have to calculate
    // this every time you wanted to
    // recalc the probabilities (foldl
    // is the same as reduce, basically.)
    Seen uint64

    // DocsSeen is the same as Seen but
    // a word is only counted once even
    // if it's in a document multiple times
    DocsSeen uint64 `json:"-"`
}
Word holds the structural information needed to calculate the probability of a word occurring within each class