Documentation
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type TFIDF
type TFIDF struct {
	// train document index in TermFreqs
	DocIndex map[string]int
	// term frequency for each train document
	TermFreqs []map[string]int
	// number of documents for each term in train data
	TermDocs map[string]int
	// number of documents in train data
	N int
	// words to be filtered
	StopWords map[string]struct{}
	// tokenizer, space is used as default
	Tokenizer string
}
TFIDF is a Term Frequency - Inverse Document Frequency model that is created from a trained NaiveBayes model (the two are very similar, so you can just train NaiveBayes and convert it into TFIDF).
This is not necessarily a probabilistic model, and it doesn't give classification. It can, however, be used to determine the 'importance' of a word in a document, which is useful in, say, keyword tagging.
Term frequency is basically just the adjusted frequency of a word within a document/sentence: termFrequency(word, doc) = 0.5 + 0.5 * word.Count / max{ w.Count | w ∈ doc }
Inverse document frequency is basically how little the term is mentioned across all of your documents: invDocumentFrequency(word, Docs) = log( len(Docs) ) - log( 1 + |{ d ∈ Docs | word ∈ d }| )
TFIDF is the product of those two functions, giving you a value that is larger when the word is more important and smaller when the word is less important.
TFIDF is the tf-idf model.
func NewTokenizer
func NewTokenizer(tokenizer tokenizers.Tokenizer) *TFIDF
NewTokenizer creates a new model with the specified tokenizer; works well in GOLD.
func (*TFIDF) AddStopWords
AddStopWords adds stop words to be filtered.
func (*TFIDF) AddStopWordsFile
AddStopWordsFile adds a stop-words file to be filtered, with one word per line.