Documentation
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type TFIDF
type TFIDF struct {
	// train document index in TermFreqs
	DocIndex map[string]int
	// term frequency for each train document
	TermFreqs []map[string]int
	// number of documents for each term in train data
	TermDocs map[string]int
	// number of documents in train data
	N int
	// words to be filtered
	StopWords map[string]struct{}
	// tokenizer, space is used as default
	Tokenizer string
}
TFIDF is a Term Frequency - Inverse Document Frequency model that is created from a trained NaiveBayes model (the two are very similar, so you can just train NaiveBayes and convert it into TFIDF).
This is not necessarily a probabilistic model, and it doesn't give classification. It can, however, be used to determine the 'importance' of a word in a document, which is useful in, say, keyword tagging.
Term frequency is basically just the adjusted frequency of a word within a document/sentence: termFrequency(word, doc) = 0.5 + 0.5 * word.Count / max{ w.Count | w ∈ doc }
Inverse document frequency is basically how little the term is mentioned across all of your documents: invDocumentFrequency(word, Docs) = log( len(Docs) ) - log( 1 + |{ d ∈ Docs | word ∈ d }| )
TFIDF is the product of those two functions, giving you a value that is larger when the word is more important and smaller when the word is less important.
TFIDF is the tf-idf model.
func NewTokenizer
func NewTokenizer(tokenizer tokenizers.Tokenizer) *TFIDF
NewTokenizer creates a new model with the specified tokenizer; works well in GOLD.
func (*TFIDF) AddStopWords
AddStopWords adds stop words to be filtered.
func (*TFIDF) AddStopWordsFile
AddStopWordsFile adds a stop-words file to be filtered, with one word per line.