stopwords

Version: v1.0.2 | Published: May 23, 2021 | License: BSD-2-Clause | Imports: 8 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/anhcraft/stopwords

Documentation ¶

Overview ¶

Package stopwords allows you to customize the list of stopwords

Package stopwords implements the Levenshtein distance algorithm to evaluate the difference between two strings.

Package stopwords implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.

Package stopwords contains various algorithms for text comparison (Simhash, Levenshtein).

Index ¶

func Clean(content []byte, langCode string, cleanHTML bool) []byte
func CleanString(content string, langCode string, cleanHTML bool) string
func CompareSimhash(a uint64, b uint64) uint8
func DontStripDigits()
func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int
func LoadStopWordsFromFile(filePath string, langCode string, sep string)
func LoadStopWordsFromString(wordsList string, langCode string, sep string)
func OverwriteWordSegmenter(expression string)
func Simhash(content []byte, langCode string, cleanHTML bool) uint64

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Clean ¶

func Clean(content []byte, langCode string, cleanHTML bool) []byte

Clean removes redundant spaces and stop words from a byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.

func CleanString ¶

func CleanString(content string, langCode string, cleanHTML bool) string

CleanString removes redundant spaces and stop words from a string. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
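
To illustrate the idea behind CleanString, here is a minimal sketch of whitespace collapsing and stop word filtering. It is not the package's implementation: the helper name cleanString, the tiny stop word set, and the case-insensitive match are all assumptions made for the example, and HTML handling is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanString is a hypothetical helper, not part of the stopwords package.
// It splits content on whitespace, drops every word found in stop
// (compared in lower case), and rejoins the rest with single spaces.
func cleanString(content string, stop map[string]bool) string {
	var kept []string
	for _, w := range strings.Fields(content) {
		if !stop[strings.ToLower(w)] {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	// Illustrative stop word set; a real list would be much longer.
	stop := map[string]bool{"the": true, "a": true, "of": true}
	fmt.Printf("%q\n", cleanString("  The  sound of   a bell ", stop))
}
```

Because strings.Fields splits on any run of whitespace, leading, trailing, and repeated spaces disappear as a side effect of the split and join.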

func CompareSimhash ¶

func CompareSimhash(a uint64, b uint64) uint8

CompareSimhash calculates the Hamming distance between two 64-bit integers using the Kernighan method.
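
The Kernighan method mentioned above can be sketched as follows. The function name hammingDistance is an assumption for this example; it shows the general technique rather than the package's source.

```go
package main

import "fmt"

// hammingDistance is a hypothetical helper that counts the bit positions
// in which a and b differ. The XOR has a 1 exactly at those positions.
// Kernighan's trick: x &= x - 1 clears the lowest set bit, so the loop
// body runs once per differing bit.
func hammingDistance(a, b uint64) uint8 {
	x := a ^ b
	var n uint8
	for x != 0 {
		x &= x - 1
		n++
	}
	return n
}

func main() {
	// 1011 and 0001 differ in two bit positions.
	fmt.Println(hammingDistance(0b1011, 0b0001)) // prints 2
}
```

For 64-bit inputs the result is at most 64, which is why a uint8 return type is sufficient. A small distance between two simhash fingerprints suggests similar documents.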

func DontStripDigits ¶

func DontStripDigits()

DontStripDigits changes the behaviour of the default word segmenter so that characters in the Unicode category 'Number, Decimal Digit' (Nd) are treated as part of words.

func LevenshteinDistance ¶

func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int

LevenshteinDistance computes the Levenshtein distance between two strings. Before comparing, it removes redundant spaces and stop words from the inputs. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
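
For reference, the Levenshtein distance itself can be computed with the classic dynamic programming recurrence. The sketch below works on plain strings and skips the stop word cleaning step; the names levenshtein and min3 are assumptions for the example, not package API.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein is a hypothetical helper that returns the minimum number of
// single-rune insertions, deletions, and substitutions needed to turn a
// into b. It keeps only two rows of the DP table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // prints 3
}
```

Comparing runes rather than bytes keeps multi-byte characters from being counted as several edits.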

func LoadStopWordsFromFile ¶

func LoadStopWordsFromFile(filePath string, langCode string, sep string)

LoadStopWordsFromFile loads a list of stop words from a file. filePath is the full path to the file to be loaded. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).

func LoadStopWordsFromString ¶

func LoadStopWordsFromString(wordsList string, langCode string, sep string)

LoadStopWordsFromString loads a list of stop words from a string. wordsList is the string containing the stop words. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).
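
As a rough illustration of what loading a separator-delimited list involves, the sketch below splits a string on sep and builds a lookup set, which the caller then stores under a language code. The helper parseStopWords and the trimming and lowercasing it applies are assumptions for the example; the package may normalize words differently.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStopWords is a hypothetical helper, not part of the package.
// It splits wordsList on sep, trims surrounding whitespace, lowercases
// each entry, and skips empty entries.
func parseStopWords(wordsList, sep string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Split(wordsList, sep) {
		w = strings.ToLower(strings.TrimSpace(w))
		if w != "" {
			set[w] = true
		}
	}
	return set
}

func main() {
	// Registry keyed by language code, mirroring the langCode parameter.
	byLang := map[string]map[string]bool{}
	byLang["en"] = parseStopWords("a\nthe\nof\n", "\n")
	fmt.Println(len(byLang["en"]), byLang["en"]["the"])
}
```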

func OverwriteWordSegmenter ¶

func OverwriteWordSegmenter(expression string)

OverwriteWordSegmenter allows you to overwrite the default word segmenter with your own regular expression
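
A regular-expression word segmenter can be sketched with the standard regexp package. The two expressions below are illustrative assumptions, not the package's defaults: the first matches runs of Unicode letters only, the second also accepts characters in the 'Number, Decimal Digit' (Nd) category.

```go
package main

import (
	"fmt"
	"regexp"
)

// segment is a hypothetical helper that returns every non-overlapping
// match of expr in text, treating each match as one word.
func segment(expr, text string) []string {
	return regexp.MustCompile(expr).FindAllString(text, -1)
}

func main() {
	text := "room 42 is open"
	fmt.Println(segment(`\p{L}+`, text))          // letters only
	fmt.Println(segment(`[\p{L}\p{Nd}]+`, text)) // letters and decimal digits
}
```

With the letters-only expression the token "42" is dropped, while the second expression keeps it. regexp.MustCompile panics on an invalid expression, so code that accepts user-supplied patterns may prefer regexp.Compile and handle the error.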

func Simhash ¶

func Simhash(content []byte, langCode string, cleanHTML bool) uint64

Simhash returns a 64-bit simhash representing content. Before hashing, it removes redundant spaces and stop words from the byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
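
The general shape of a simhash computation can be sketched as below. This is an unweighted variant written for illustration: it hashes each token with FNV-1a from the standard library, and both that choice and the name simhash are assumptions. The package's own feature extraction and hash function may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// simhash is a hypothetical helper that builds a 64-bit fingerprint.
// Each token votes +1 on every bit set in its hash and -1 on every bit
// that is clear; the fingerprint has a 1 wherever the total is positive.
func simhash(tokens []string) uint64 {
	var votes [64]int
	for _, tok := range tokens {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(uint64(1)<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= uint64(1) << uint(i)
		}
	}
	return fp
}

func main() {
	a := simhash(strings.Fields("the quick brown fox jumps over the lazy dog"))
	b := simhash(strings.Fields("the quick brown fox jumped over the lazy dog"))
	fmt.Printf("%016x\n%016x\n", a, b)
}
```

Documents that share most tokens cast mostly the same votes, so their fingerprints tend to differ in only a few bits. That difference is what a Hamming distance comparison measures.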

Types ¶

This section is empty.
