nlp

package
v0.1.15
Published: Oct 30, 2022 License: MIT Imports: 8 Imported by: 2

Documentation

Overview

Package nlp provides basic NLP utilities.

Index

Constants

This section is empty.

Variables

var LdaVerbose = false

LdaVerbose determines whether progress information should be printed during LDA. For debugging.

var StopWords = map[string]bool{ /* 569 elements not displayed */ }

StopWords is a map of stop words, for token filtering. Modifying this map will affect the Tokenize function.

Taken from: http://www.ranks.nl/stopwords
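
For illustration, a minimal sketch of adding a custom stop word before tokenizing. The import path "example.com/nlp" is a placeholder, not the package's actual module path:

package main

import (
	"fmt"

	nlp "example.com/nlp" // placeholder import path; use this package's actual module path
)

func main() {
	// Adding an entry makes Tokenize treat "hello" as a stop word.
	nlp.StopWords["hello"] = true
	fmt.Println(nlp.Tokenize("hello there world", false))
	// "hello" is now dropped along with the built-in stop words.
}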

var Tokenizer = regexp.MustCompile(`\w([\w']*\w)?`)

Tokenizer splits text into tokens. This regexp represents a single word. Changing this regexp will affect the Tokenize function.
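
As a sketch of that effect, the pattern below additionally allows hyphens inside words; the variant regexp and the import path are assumptions, not part of the package:

package main

import (
	"fmt"
	"regexp"

	nlp "example.com/nlp" // placeholder import path; use this package's actual module path
)

func main() {
	// Assumed variant of the default `\w([\w']*\w)?` that also
	// permits hyphens inside a word.
	nlp.Tokenizer = regexp.MustCompile(`\w([\w'-]*\w)?`)
	fmt.Println(nlp.Tokenize("state-of-the-art systems", true))
}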

Functions

func Lda

func Lda(docTokens [][]string, k int) (map[string][]float64, [][]int)

Lda performs LDA on the given data. docTokens should contain tokenized documents, such that docTokens[i][j] is the j'th token in the i'th document. k is the number of topics. Returns the topics and the token-topic assignments, corresponding to docTokens.

Topics are returned as a map from word to a probability vector, such that the i'th position is the probability of the i'th topic generating that word. For each i, the values at the i'th position, summed over all words, equal 1.
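
A minimal sketch of calling Lda on a small corpus; the import path is a placeholder:

package main

import (
	"fmt"

	nlp "example.com/nlp" // placeholder import path; use this package's actual module path
)

func main() {
	docs := [][]string{
		nlp.Tokenize("cats chase mice and purr", false),
		nlp.Tokenize("dogs chase cats and bark", false),
		nlp.Tokenize("markets rise and stocks fall", false),
	}
	topics, assignments := nlp.Lda(docs, 2)

	// topics[w][i] is the probability of topic i generating word w
	// (keys are stemmed, since Tokenize stems its output).
	for word, probs := range topics {
		fmt.Println(word, probs)
	}
	// assignments[i][j] is the topic assigned to docs[i][j].
	fmt.Println(assignments)
}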

func LdaThreads

func LdaThreads(docTokens [][]string, k, numThreads int) (map[string][]float64, [][]int)

LdaThreads is like Lda but runs on multiple goroutines. Calling this function with numThreads=1 is equivalent to calling Lda.
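
A brief sketch of the threaded variant, using one goroutine per logical CPU; the import path is a placeholder:

package main

import (
	"fmt"
	"runtime"

	nlp "example.com/nlp" // placeholder import path; use this package's actual module path
)

func main() {
	docs := [][]string{
		nlp.Tokenize("cats chase mice", false),
		nlp.Tokenize("dogs chase cats", false),
	}
	// One goroutine per logical CPU.
	topics, _ := nlp.LdaThreads(docs, 2, runtime.NumCPU())
	fmt.Println(len(topics))
}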

func Stem

func Stem(s string) string

Stem applies the Porter stemming algorithm to the given word.
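
A minimal usage sketch; the comments show the expected Porter stemmer outputs, and the import path is a placeholder:

package main

import (
	"fmt"

	nlp "example.com/nlp" // placeholder import path; use this package's actual module path
)

func main() {
	fmt.Println(nlp.Stem("running"))    // "run"
	fmt.Println(nlp.Stem("connection")) // "connect"
}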

func TfIdf

func TfIdf(docTokens [][]string) []map[string]float64

TfIdf returns the TF-IDF scores of the given corpus. For each document, returns a map from token to TF-IDF score.

TF = count(token in document) / count(all tokens in document)

IDF = log(count(documents) / count(documents with token))
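
A worked sketch of those formulas on a two-document corpus, assuming a natural logarithm (the base is not specified above); the import path is a placeholder:

package main

import (
	"fmt"

	nlp "example.com/nlp" // placeholder import path; use this package's actual module path
)

func main() {
	docs := [][]string{
		{"cat", "dog", "cat"},
		{"dog", "fish"},
	}
	scores := nlp.TfIdf(docs)
	// For "cat" in the first document: TF = 2/3, IDF = log(2/1),
	// so the score is (2.0/3.0)*log(2) ≈ 0.46 with a natural log.
	fmt.Println(scores[0]["cat"])
}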

func Tokenize

func Tokenize(s string, keepStopWords bool) []string

Tokenize splits the given text into a slice of stemmed, lowercase words. If keepStopWords is false, stop words are dropped.
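
A minimal usage sketch; the import path is a placeholder:

package main

import (
	"fmt"

	nlp "example.com/nlp" // placeholder import path; use this package's actual module path
)

func main() {
	fmt.Println(nlp.Tokenize("The cats are running!", false))
	// Stop words ("the", "are") are dropped and the rest stemmed,
	// giving something like [cat run].
}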

Types

This section is empty.

Directories

Path		Synopsis
lda-tool	Command lda-tool performs LDA on the input documents.
wordnet		Package wordnet provides a WordNet parser and interface.
