Documentation ¶
Overview ¶
Package bayesian implements a naive Bayesian classifier.
Jake Brukhman <jbrukh@gmail.com>
BAYESIAN CLASSIFICATION REFRESHER: suppose you have a set of classes (e.g. categories) C := {C_1, ..., C_n}, and a document D consisting of words D := {W_1, ..., W_k}. We wish to ascertain the probability that the document belongs to some class C_j given some set of training data associating documents and classes.
By Bayes' Theorem, we have that
    P(C_j|D) = P(D|C_j)*P(C_j)/P(D).
The LHS is the probability that the document belongs to class C_j given the document itself (by which is meant, in practice, the word frequencies occurring in this document), and our program will calculate this probability for each j and return the most likely class for this document.
P(C_j) is referred to as the "prior" probability, or the probability that a document belongs to C_j in general, without seeing the document first. P(D|C_j) is the probability of seeing such a document, given that it belongs to C_j. Here, by assuming that words appear independently in documents (this being the "naive" assumption), we can estimate
    P(D|C_j) ~= P(W_1|C_j)*...*P(W_k|C_j)
where P(W_i|C_j) is the probability of seeing the given word in a document of the given class. Finally, P(D) can be seen as merely a scaling factor and is not strictly relevant to classification, unless you want to normalize the resulting scores and actually see probabilities. In this case, note that
    P(D) = SUM_j(P(D|C_j)*P(C_j))
One practical issue with performing these calculations is the possibility of float64 underflow when calculating P(D|C_j), as individual word probabilities can be arbitrarily small, and a document can have an arbitrarily large number of them. A typical method for dealing with this case is to transform the probability to the log domain and perform additions instead of multiplications:
    log P(C_j|D) ~ log(P(C_j)) + SUM_i(log P(W_i|C_j))
where i = 1, ..., k.
Note that by doing this, we are discarding the scaling factor P(D) and our scores are no longer probabilities; however, the monotonic relationship of the scores is preserved by the log function.
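The underflow problem described above can be reproduced with a few lines of standard-library Go. The sketch below is illustrative only and does not use this package: productScore, logScore, and repeat are hypothetical helper names, and the word probabilities are made up.

```go
package main

import (
	"fmt"
	"math"
)

// productScore multiplies the prior by every word probability,
// mirroring P(C_j) * P(W_1|C_j) * ... * P(W_k|C_j).
func productScore(prior float64, wordProbs []float64) float64 {
	score := prior
	for _, p := range wordProbs {
		score *= p
	}
	return score
}

// logScore computes log(P(C_j)) + SUM_i(log P(W_i|C_j)).
func logScore(prior float64, wordProbs []float64) float64 {
	score := math.Log(prior)
	for _, p := range wordProbs {
		score += math.Log(p)
	}
	return score
}

// repeat returns a slice holding n copies of p.
func repeat(p float64, n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = p
	}
	return out
}

func main() {
	// Hypothetical document: 200 words, each with probability 0.001.
	probs := repeat(1e-3, 200)
	fmt.Println(productScore(0.5, probs)) // the product underflows to 0
	fmt.Println(logScore(0.5, probs))     // the log score stays finite
}
```

The true product here is far below the smallest positive float64, so the multiplication collapses to zero, while the sum of logs remains an ordinary negative number that can still be compared across classes.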
Index ¶
- Variables
- type Class
- type Classifier
- func (c *Classifier) ConvertTermsFreqToTfIdf()
- func (c *Classifier) IsTfIdf() bool
- func (c *Classifier) Learn(document []string, which Class)
- func (c *Classifier) Learned() int
- func (c *Classifier) LogScores(document []string) (scores []float64, inx int, strict bool)
- func (c *Classifier) Observe(word string, count int, which Class)
- func (c *Classifier) ProbScores(doc []string) (scores []float64, inx int, strict bool)
- func (c *Classifier) ReadClassFromFile(class Class, location string) (err error)
- func (c *Classifier) SafeProbScores(doc []string) (scores []float64, inx int, strict bool, err error)
- func (c *Classifier) Seen() int
- func (c *Classifier) WordCount() (result []int)
- func (c *Classifier) WordFrequencies(words []string) (freqMatrix [][]float64)
- func (c *Classifier) WordsByClass(class Class) (freqMap map[string]float64)
- func (c *Classifier) WriteClassToFile(name Class, rootPath string) (err error)
- func (c *Classifier) WriteClassesToFile(rootPath string) (err error)
- func (c *Classifier) WriteTo(w io.Writer) (err error)
- func (c *Classifier) WriteToFile(name string) (err error)
Variables ¶
var ErrUnderflow = errors.New("possible underflow detected")
ErrUnderflow is returned when an underflow is detected.
Types ¶
type Class ¶
type Class string
Class defines a class that the classifier will filter: C = {C_1, ..., C_n}. You should define your classes as a set of constants, for example as follows:
const (
	Good Class = "Good"
	Bad  Class = "Bad"
)
Class values should be unique.
type Classifier ¶
type Classifier struct {
	Classes []Class

	DidConvertTfIdf bool // we can't classify a TF-IDF classifier if we haven't yet
	// contains filtered or unexported fields
}
Classifier implements the Naive Bayesian Classifier.
func NewClassifier ¶
func NewClassifier(classes ...Class) (c *Classifier)
NewClassifier returns a new classifier. The classes provided should be at least 2 in number and unique, or this method will panic.
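Because the constructor panics on bad input, callers that build the class list dynamically may want to validate it first. The following is a minimal sketch of such a check; checkClasses is a hypothetical helper that is not part of this package, and Class is redeclared locally so the example stands alone.

```go
package main

import (
	"errors"
	"fmt"
)

// Class mirrors the package's string-based class type.
type Class string

const (
	Good Class = "Good"
	Bad  Class = "Bad"
)

// checkClasses is a hypothetical pre-flight check: it reports an error
// when fewer than two classes are given or when a class is repeated.
func checkClasses(classes ...Class) error {
	if len(classes) < 2 {
		return errors.New("provide at least two classes")
	}
	seen := make(map[Class]bool, len(classes))
	for _, c := range classes {
		if seen[c] {
			return fmt.Errorf("duplicate class %q", c)
		}
		seen[c] = true
	}
	return nil
}

func main() {
	fmt.Println(checkClasses(Good, Bad))  // <nil>
	fmt.Println(checkClasses(Good, Good)) // duplicate class "Good"
	fmt.Println(checkClasses(Good))       // provide at least two classes
}
```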
func NewClassifierFromFile ¶
func NewClassifierFromFile(name string) (c *Classifier, err error)
NewClassifierFromFile loads an existing classifier from file. The classifier was previously saved with a call to c.WriteToFile(string).
func NewClassifierFromReader ¶
func NewClassifierFromReader(r io.Reader) (c *Classifier, err error)
NewClassifierFromReader deserializes a gob-encoded classifier from the given reader.
func NewClassifierTfIdf ¶
func NewClassifierTfIdf(classes ...Class) (c *Classifier)
NewClassifierTfIdf returns a new TF-IDF classifier. The classes provided should be at least 2 in number and unique, or this function will panic.
func (*Classifier) ConvertTermsFreqToTfIdf ¶
func (c *Classifier) ConvertTermsFreqToTfIdf()
ConvertTermsFreqToTfIdf converts the term-frequency (TF) samples of every class to TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Call it after all classes have been learned and the totals are known.
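For readers unfamiliar with TF-IDF, here is a sketch of the common textbook weighting, tf * ln(N/df). This is an assumption for illustration: the exact formula used by this package may differ, and tfIdf with its parameters is a hypothetical function. It shows why the conversion has to wait until learning is finished: the inverse document frequency depends on corpus-wide totals.

```go
package main

import (
	"fmt"
	"math"
)

// tfIdf returns the textbook TF-IDF weight of a term:
//   tf  = termCount / docLen     (how prominent the term is in one document)
//   idf = ln(nDocs / docFreq)    (how rare the term is across all documents)
// All four arguments are assumed to be positive.
func tfIdf(termCount, docLen, nDocs, docFreq int) float64 {
	tf := float64(termCount) / float64(docLen)
	idf := math.Log(float64(nDocs) / float64(docFreq))
	return tf * idf
}

func main() {
	// A term that appears in every document carries no weight.
	fmt.Println(tfIdf(3, 10, 100, 100)) // 0
	// The same term count in a corpus where the term is rare is weighted up.
	fmt.Println(tfIdf(3, 10, 100, 1) > 0) // true
}
```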
func (*Classifier) IsTfIdf ¶
func (c *Classifier) IsTfIdf() bool
IsTfIdf reports whether this is a TF-IDF classifier.
func (*Classifier) Learn ¶
func (c *Classifier) Learn(document []string, which Class)
Learn will accept new training documents for supervised learning.
func (*Classifier) Learned ¶
func (c *Classifier) Learned() int
Learned returns the number of documents ever learned in the lifetime of this classifier.
func (*Classifier) LogScores ¶
func (c *Classifier) LogScores(document []string) (scores []float64, inx int, strict bool)
LogScores produces "log-likelihood"-like scores that can be used to classify documents into classes.
A higher score means that, according to the classifier, the given document is more likely to belong to the given class. This is true even when scores returned are negative, which they will be (since we are taking logs of probabilities).
The index j of the score corresponds to the class given by c.Classes[j].
Additionally returned are "inx" and "strict" values. The inx value is the index of the maximum score in the array. If more than one of the scores holds the maximum value, then strict is false.
Unlike c.ProbScores(), this function is not prone to floating point underflow and is relatively safe to use.
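The meaning of the inx and strict return values can be illustrated with a small standalone function. This is a sketch of the idea, not the package's code; maxIndex is a hypothetical name.

```go
package main

import "fmt"

// maxIndex returns the index of the largest score and whether that
// maximum is strict, i.e. held by exactly one element. When several
// elements tie for the maximum, the first of them is returned.
// For an empty slice it returns (0, true).
func maxIndex(scores []float64) (inx int, strict bool) {
	strict = true
	for i := 1; i < len(scores); i++ {
		switch {
		case scores[i] > scores[inx]:
			inx, strict = i, true
		case scores[i] == scores[inx]:
			strict = false
		}
	}
	return inx, strict
}

func main() {
	fmt.Println(maxIndex([]float64{-12.4, -9.1, -15.0})) // 1 true
	fmt.Println(maxIndex([]float64{-9.1, -9.1, -15.0}))  // 0 false
}
```

A caller would typically treat a false strict value as "no clear winner" and decide for itself how to break the tie.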
func (*Classifier) Observe ¶
func (c *Classifier) Observe(word string, count int, which Class)
Observe should be used when word frequencies have already been learned externally (e.g., with Hadoop).
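The relationship between learning whole documents and observing precomputed counts can be pictured with a toy model. The sketch below is an assumption about the general shape of such training state, not a description of this package's internals; toyCounts and its methods are hypothetical.

```go
package main

import "fmt"

// Class mirrors the package's string-based class type.
type Class string

// toyCounts is a hypothetical stand-in for training state:
// per-class word counts plus per-class totals.
type toyCounts struct {
	freqs map[Class]map[string]int
	total map[Class]int
}

func newToyCounts() *toyCounts {
	return &toyCounts{
		freqs: make(map[Class]map[string]int),
		total: make(map[Class]int),
	}
}

// observe records that word was seen count times in class which.
func (tc *toyCounts) observe(word string, count int, which Class) {
	if tc.freqs[which] == nil {
		tc.freqs[which] = make(map[string]int)
	}
	tc.freqs[which][word] += count
	tc.total[which] += count
}

// learn records every word of a tokenized document, one occurrence each.
func (tc *toyCounts) learn(document []string, which Class) {
	for _, w := range document {
		tc.observe(w, 1, which)
	}
}

func main() {
	tc := newToyCounts()
	tc.learn([]string{"tall", "rich", "tall"}, "Good")
	tc.observe("poor", 5, "Bad") // counts computed elsewhere
	fmt.Println(tc.freqs["Good"]["tall"], tc.total["Good"]) // 2 3
	fmt.Println(tc.freqs["Bad"]["poor"], tc.total["Bad"])   // 5 5
}
```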
func (*Classifier) ProbScores ¶
func (c *Classifier) ProbScores(doc []string) (scores []float64, inx int, strict bool)
ProbScores works the same as LogScores, but delivers actual probabilities as discussed above. Note that float64 underflow is possible if the word list contains too many words that have probabilities very close to 0.
Notes on underflow: underflow tends to occur when the document contains a large number of words that the classifier has never seen before. Depending on the application, this may or may not be a concern. Consider using SafeProbScores() instead.
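If normalized probabilities are needed for long documents, one alternative (not something this documentation says the package does) is to start from log scores and normalize them with the log-sum-exp trick. The sketch below assumes you already hold one log score per class; normalize is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// normalize converts log scores into probabilities that sum to 1.
// Subtracting the largest log score before exponentiating keeps the
// largest term at exp(0) = 1, so the sum can never underflow to zero.
func normalize(logScores []float64) []float64 {
	top := math.Inf(-1)
	for _, s := range logScores {
		if s > top {
			top = s
		}
	}
	probs := make([]float64, len(logScores))
	sum := 0.0
	for i, s := range logScores {
		probs[i] = math.Exp(s - top)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	// Scores this negative would become 0 if exponentiated directly.
	fmt.Println(normalize([]float64{-1000, -1000})) // [0.5 0.5]
}
```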
func (*Classifier) ReadClassFromFile ¶
func (c *Classifier) ReadClassFromFile(class Class, location string) (err error)
ReadClassFromFile loads existing class data from a file.
func (*Classifier) SafeProbScores ¶
func (c *Classifier) SafeProbScores(doc []string) (scores []float64, inx int, strict bool, err error)
SafeProbScores works the same as ProbScores, but is able to detect underflow in those cases where underflow results in the reverse classification. If an underflow is detected, this method returns ErrUnderflow, allowing the user to deal with it as necessary. Note that underflow, under certain rare circumstances, may still result in incorrect probabilities being returned, but this method guarantees that all error-less invocations are properly classified.
Underflow detection is more costly because it also has to make additional log score calculations.
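One plausible way to detect a reversed classification is to compute both the product scores and the log scores and check that they agree on the winner. The sketch below illustrates that idea under stated assumptions: equal priors, made-up word probabilities, and hypothetical names (errUnderflow, argmax, safeWinner). It is not taken from the package source.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var errUnderflow = errors.New("possible underflow detected")

// argmax returns the index of the first maximum in xs (0 if xs is empty).
func argmax(xs []float64) int {
	best := 0
	for i, x := range xs {
		if x > xs[best] {
			best = i
		}
	}
	return best
}

// safeWinner receives, for each class, the probabilities of the
// document's words in that class. It returns the winning class index
// by product score, plus an error when the log-domain winner differs.
func safeWinner(wordProbs [][]float64) (int, error) {
	prod := make([]float64, len(wordProbs))
	logs := make([]float64, len(wordProbs))
	for j, probs := range wordProbs {
		prod[j] = 1
		for _, p := range probs {
			prod[j] *= p
			logs[j] += math.Log(p)
		}
	}
	inx := argmax(prod)
	if inx != argmax(logs) {
		return inx, errUnderflow
	}
	return inx, nil
}

func main() {
	fmt.Println(safeWinner([][]float64{{0.1, 0.2}, {0.3, 0.4}})) // 1 <nil>
}
```

When both products underflow to zero and the first class happens to be the true winner, this sketch reports no error, which matches the caveat above that returned probabilities can still be wrong even when the classification is right.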
func (*Classifier) Seen ¶
func (c *Classifier) Seen() int
Seen returns the number of documents ever classified in the lifetime of this classifier.
func (*Classifier) WordCount ¶
func (c *Classifier) WordCount() (result []int)
WordCount returns the number of words counted for each class in the lifetime of the classifier.
func (*Classifier) WordFrequencies ¶
func (c *Classifier) WordFrequencies(words []string) (freqMatrix [][]float64)
WordFrequencies returns a matrix of the frequencies currently stored in the classifier for the given input words, with one row per class. In other words, if you obtain the frequencies
freqs := c.WordFrequencies(/* [j]string */)
then the expression freqs[i][j] represents the frequency of the j-th word within the i-th class.
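The row and column layout of the matrix can be illustrated with a standalone sketch. It assumes plain relative frequencies (count divided by the class total) and gives unseen words a frequency of 0; the package may handle unseen words differently. The wordFrequencies function and its counts parameter are hypothetical.

```go
package main

import "fmt"

// wordFrequencies builds a matrix in which freqs[i][j] is the relative
// frequency of words[j] among all words counted for class i.
func wordFrequencies(counts []map[string]int, words []string) [][]float64 {
	freqs := make([][]float64, len(counts))
	for i, classCounts := range counts {
		total := 0
		for _, n := range classCounts {
			total += n
		}
		freqs[i] = make([]float64, len(words))
		for j, w := range words {
			if total > 0 {
				freqs[i][j] = float64(classCounts[w]) / float64(total)
			}
		}
	}
	return freqs
}

func main() {
	counts := []map[string]int{
		{"tall": 3, "rich": 1}, // class 0
		{"poor": 2, "tall": 2}, // class 1
	}
	freqs := wordFrequencies(counts, []string{"tall", "poor"})
	fmt.Println(freqs[0][0], freqs[1][1]) // 0.75 0.5
}
```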
func (*Classifier) WordsByClass ¶
func (c *Classifier) WordsByClass(class Class) (freqMap map[string]float64)
WordsByClass returns a map of words and their probability of appearing in the given class.
func (*Classifier) WriteClassToFile ¶
func (c *Classifier) WriteClassToFile(name Class, rootPath string) (err error)
WriteClassToFile writes a single class to file.
func (*Classifier) WriteClassesToFile ¶
func (c *Classifier) WriteClassesToFile(rootPath string) (err error)
WriteClassesToFile writes all classes to files.
func (*Classifier) WriteTo ¶
func (c *Classifier) WriteTo(w io.Writer) (err error)
WriteTo serializes this classifier to gob and writes it to w.
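For readers new to gob, the round trip between a writer and a reader looks roughly like the sketch below, which uses only the standard encoding/gob package. The model type and the writeTo, readFrom, and roundTrip helpers are hypothetical stand-ins; the classifier's real serialized fields are not shown in this documentation.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"io"
)

// model is a hypothetical, minimal stand-in for serializable state.
// gob only encodes exported fields.
type model struct {
	Classes []string
	Freqs   map[string]map[string]float64
}

// writeTo gob-encodes m to w.
func writeTo(m *model, w io.Writer) error {
	return gob.NewEncoder(w).Encode(m)
}

// readFrom decodes a gob-encoded model from r.
func readFrom(r io.Reader) (*model, error) {
	m := new(model)
	err := gob.NewDecoder(r).Decode(m)
	return m, err
}

// roundTrip encodes m into an in-memory buffer and decodes it again.
func roundTrip(m *model) (*model, error) {
	var buf bytes.Buffer
	if err := writeTo(m, &buf); err != nil {
		return nil, err
	}
	return readFrom(&buf)
}

func main() {
	in := &model{
		Classes: []string{"Good", "Bad"},
		Freqs:   map[string]map[string]float64{"Good": {"tall": 0.75}},
	}
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Classes, out.Freqs["Good"]["tall"]) // [Good Bad] 0.75
}
```

Because the helpers accept io.Writer and io.Reader, the same code works with files, network connections, or in-memory buffers.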
func (*Classifier) WriteToFile ¶
func (c *Classifier) WriteToFile(name string) (err error)
WriteToFile serializes this classifier to a file.