Documentation ¶
Overview ¶
Package stopwords implements the Levenshtein distance algorithm to evaluate the difference between two strings.
It also implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.
The stopwords package removes the most frequent words (stop words) from text content. It can be used, for example, to improve the accuracy of the simhash algorithm. It uses lists of the most frequent words in the following languages:
Arabic, Bulgarian, Czech, Danish, English, Finnish, French, German, Hungarian, Italian, Japanese, Latvian, Norwegian, Persian, Polish,
Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish
It contains several text comparison algorithms (Simhash and Levenshtein).
Index ¶
- func Clean(content []byte, langCode string, cleanHTML bool) []byte
- func CleanString(content string, langCode string, cleanHTML bool) string
- func CompareSimhash(a uint64, b uint64) uint8
- func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int
- func Simhash(content []byte, langCode string, cleanHTML bool) uint64
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Clean ¶
Clean removes redundant spaces and stop words from a byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
func CleanString ¶
CleanString removes redundant spaces and stop words from string content. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
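The stop-word filtering described above can be illustrated with a minimal, self-contained sketch. It does not call this package: `removeStopWords` and `englishStopWords` are hypothetical names, the word list is a tiny sample, and splitting on whitespace with case-insensitive matching is an assumption about how such a filter might work, not a description of the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// englishStopWords is a tiny illustrative sample, not the package's real list.
var englishStopWords = map[string]struct{}{
	"a": {}, "an": {}, "and": {}, "is": {}, "of": {}, "the": {}, "to": {},
}

// removeStopWords is a hypothetical helper. It splits text on whitespace,
// drops every word found in stop (compared case-insensitively), and joins
// the remaining words with single spaces, which also collapses extra spaces.
func removeStopWords(text string, stop map[string]struct{}) string {
	kept := make([]string, 0)
	for _, word := range strings.Fields(text) {
		if _, isStop := stop[strings.ToLower(word)]; isStop {
			continue
		}
		kept = append(kept, word)
	}
	return strings.Join(kept, " ")
}

func main() {
	// Prints: quick fox friend dog
	fmt.Println(removeStopWords("The  quick fox is   a friend of the dog", englishStopWords))
}
```

With the real package, the equivalent step would presumably be a single call to CleanString with the content, a language code such as "en", and the cleanHTML flag.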
func CompareSimhash ¶
CompareSimhash calculates the Hamming distance between two 64-bit integers using the Kernighan method.
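As a rough illustration of the Kernighan method, the sketch below XORs the two values and repeatedly clears the lowest set bit, counting iterations. The name `hammingDistance` is hypothetical and the code is written for this example only; it is not taken from the package source.

```go
package main

import "fmt"

// hammingDistance is a hypothetical stand-in that counts the bits in which
// a and b differ. The result is at most 64, so it fits in a uint8.
func hammingDistance(a, b uint64) uint8 {
	diff := a ^ b // bits set where a and b differ
	var count uint8
	for diff != 0 {
		diff &= diff - 1 // Kernighan's trick: clear the lowest set bit
		count++
	}
	return count
}

func main() {
	// 0b1011 XOR 0b0010 = 0b1001, which has two set bits.
	fmt.Println(hammingDistance(0b1011, 0b0010)) // prints 2
}
```

The loop runs once per differing bit rather than once per bit position, so it is cheap when two fingerprints are close. A small distance between two simhash fingerprints suggests similar documents.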
func LevenshteinDistance ¶
LevenshteinDistance computes the Levenshtein distance between two pieces of content. Before comparing, it removes redundant spaces and stop words from each byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
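For readers unfamiliar with the metric, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. It skips the cleaning step entirely and compares runes; whether the package compares bytes, runes, or words is not stated here, so treat that choice as an assumption. The names `levenshtein` and `min3` are hypothetical.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein is a hypothetical helper returning the minimum number of
// single-rune insertions, deletions, and substitutions that turn a into b.
// It keeps only two rows of the DP table at a time.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // prints 3
}
```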
func Simhash ¶
Simhash returns a 64-bit simhash representing the content. Before hashing, it removes redundant spaces and stop words from the byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
Types ¶
This section is empty.
Source Files ¶
- levenshtein.go
- simhash.go
- stopwords.go
- stopwords_ar.go
- stopwords_bg.go
- stopwords_cs.go
- stopwords_da.go
- stopwords_de.go
- stopwords_el.go
- stopwords_en.go
- stopwords_es.go
- stopwords_fa.go
- stopwords_fi.go
- stopwords_fr.go
- stopwords_hu.go
- stopwords_it.go
- stopwords_ja.go
- stopwords_lv.go
- stopwords_nl.go
- stopwords_no.go
- stopwords_pl.go
- stopwords_pt.go
- stopwords_ro.go
- stopwords_ru.go
- stopwords_sk.go
- stopwords_sv.go
- stopwords_th.go
- stopwords_tr.go