Documentation ¶
Overview ¶
Package stopwords removes stop words from text and lets you customize the list of stop words used for each language.
It also contains text comparison algorithms: the Levenshtein distance, which evaluates the difference between two strings, and Charikar's simhash algorithm, which generates a 64-bit fingerprint of a given document.
Index ¶
- func Clean(content []byte, langCode string, cleanHTML bool) []byte
- func CleanString(content string, langCode string, cleanHTML bool) string
- func CompareSimhash(a uint64, b uint64) uint8
- func DontStripDigits()
- func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int
- func LoadStopWordsFromFile(filePath string, langCode string, sep string)
- func LoadStopWordsFromString(wordsList string, langCode string, sep string)
- func OverwriteWordSegmenter(expression string)
- func Simhash(content []byte, langCode string, cleanHTML bool) uint64
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Clean ¶
func Clean(content []byte, langCode string, cleanHTML bool) []byte
Clean removes superfluous spaces and stop words from a byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, English filters are applied. If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
func CleanString ¶
func CleanString(content string, langCode string, cleanHTML bool) string
CleanString removes superfluous spaces and stop words from a string. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, English filters are applied. If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
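To illustrate the kind of filtering described above, here is a minimal sketch written with the standard library only. It is not the package's implementation: the tiny `stop` set, the `removeStopWords` helper, the lowercasing step, and the whitespace tokenization are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// stop is a tiny hypothetical stop word list used only for this example.
var stop = map[string]struct{}{
	"a": {}, "an": {}, "the": {}, "of": {}, "and": {}, "is": {},
}

// removeStopWords lowercases s, splits it on whitespace (which also
// collapses repeated spaces), drops every token found in stop, and joins
// the remaining tokens with single spaces.
func removeStopWords(s string) string {
	var kept []string
	for _, w := range strings.Fields(strings.ToLower(s)) {
		if _, found := stop[w]; !found {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	fmt.Println(removeStopWords("The  history of   an idea")) // prints "history idea"
}
```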
func CompareSimhash ¶
func CompareSimhash(a uint64, b uint64) uint8
CompareSimhash calculates the Hamming distance between two 64-bit integers using the Kernighan method.
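The Kernighan method mentioned above can be sketched as follows. This is an illustrative stand-alone version, not the package's source; `hammingDistance` is a hypothetical name chosen for the example.

```go
package main

import "fmt"

// hammingDistance counts the bit positions in which a and b differ.
// It XORs the inputs, then repeatedly clears the lowest set bit
// (Kernighan's method), so the loop runs once per differing bit.
func hammingDistance(a, b uint64) uint8 {
	x := a ^ b
	var n uint8
	for x != 0 {
		x &= x - 1 // clear the lowest set bit
		n++
	}
	return n
}

func main() {
	fmt.Println(hammingDistance(0b1011, 0b0010)) // 1011 XOR 0010 = 1001, prints 2
	fmt.Println(hammingDistance(0, ^uint64(0)))  // all bits differ, prints 64
}
```

A small distance between two simhash fingerprints suggests that the underlying documents are similar; the threshold to use depends on the application.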
func DontStripDigits ¶
func DontStripDigits()
DontStripDigits changes the behaviour of the default word segmenter so that characters in the 'Number, Decimal Digit' Unicode category are treated as part of words.
func LevenshteinDistance ¶
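func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int
LevenshteinDistance computes the Levenshtein distance between two byte slices after removing superfluous spaces and stop words from each of them. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, English filters are applied. If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.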
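For reference, the distance itself can be computed with the classic two-row dynamic programming scheme. The sketch below compares raw strings rune by rune and skips the cleaning step; whether the package compares bytes, runes, or whole words is not stated here, so treat that choice, and the names `levenshtein` and `min3`, as assumptions.

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-rune insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1) // distances for the previous row
	curr := make([]int, len(rb)+1) // distances for the row being filled
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // prints 3
}
```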
func LoadStopWordsFromFile ¶
func LoadStopWordsFromFile(filePath string, langCode string, sep string)
LoadStopWordsFromFile loads a list of stop words from a file. filePath is the full path to the file to be loaded. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).
func LoadStopWordsFromString ¶
func LoadStopWordsFromString(wordsList string, langCode string, sep string)
LoadStopWordsFromString loads a list of stop words from a string. wordsList is the string containing the stop words, separated by sep. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).
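The role of the sep parameter can be pictured with a small stand-alone sketch. The `parseStopWords` helper below is hypothetical and only shows one plausible way to turn a separated list into a lookup set; the trimming and lowercasing are assumptions, not documented behaviour of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStopWords splits list on sep and returns the non-empty entries,
// trimmed and lowercased, as a set for constant-time lookups.
func parseStopWords(list, sep string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, w := range strings.Split(list, sep) {
		w = strings.ToLower(strings.TrimSpace(w))
		if w != "" {
			set[w] = struct{}{}
		}
	}
	return set
}

func main() {
	set := parseStopWords("the\nof\nAnd\n", "\n")
	_, ok := set["and"]
	fmt.Println(len(set), ok) // prints "3 true"
}
```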
func OverwriteWordSegmenter ¶
func OverwriteWordSegmenter(expression string)
OverwriteWordSegmenter allows you to overwrite the default word segmenter with your own regular expression.
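As a rough illustration of regular-expression word segmentation, the sketch below extracts words with two hypothetical patterns, one matching letters only and one that also keeps decimal digits. These patterns are examples, not the package's default expression, and how the package applies the expression internally is an assumption here.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// lettersOnly matches runs of Unicode letters.
	lettersOnly = regexp.MustCompile(`\pL+`)
	// lettersAndDigits also accepts the 'Number, Decimal Digit' category (Nd).
	lettersAndDigits = regexp.MustCompile(`[\pL\p{Nd}]+`)
)

// segment returns every non-overlapping match of re in s, in order.
func segment(re *regexp.Regexp, s string) []string {
	return re.FindAllString(s, -1)
}

func main() {
	s := "Route 66 opened in 1926"
	fmt.Println(segment(lettersOnly, s))      // prints "[Route opened in]"
	fmt.Println(segment(lettersAndDigits, s)) // prints "[Route 66 opened in 1926]"
}
```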
func Simhash ¶
func Simhash(content []byte, langCode string, cleanHTML bool) uint64
Simhash returns a 64-bit simhash representing the content, computed after removing superfluous spaces and stop words from the byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, English filters are applied. If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
Types ¶
This section is empty.