stopwords

package module
v0.0.0-...-881d3d3 Latest Latest
Published: Feb 10, 2016 License: BSD-2-Clause Imports: 6 Imported by: 0

README

stopwords is a Go package that removes stop words from text content. If instructed to do so, it also removes HTML tags and unescapes HTML entities. The goal is to prepare text for use by natural language processing algorithms or by text comparison algorithms such as SimHash.

It uses a curated list of the most frequent words used in these languages:

  • arabic
  • bulgarian
  • czech
  • danish
  • english
  • finnish
  • french
  • german
  • hungarian
  • italian
  • japanese
  • latvian
  • norwegian
  • persian
  • polish
  • portuguese
  • romanian
  • russian
  • slovak
  • spanish
  • swedish
  • thai
  • turkish

If a function is called with an unsupported language, it does not fail; it applies the English filter to the content instead.
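The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the package's code: `stopLists` and `removeStopWords` are hypothetical names, and the tiny word lists stand in for the curated ones.

```go
package main

import (
	"fmt"
	"strings"
)

// stopLists is a hypothetical, tiny stand-in for the curated lists.
var stopLists = map[string]map[string]bool{
	"en": {"the": true, "a": true, "of": true},
	"fr": {"la": true, "un": true, "le": true},
}

// removeStopWords drops stop words and collapses runs of whitespace.
// An unknown langCode falls back to the English list instead of failing.
func removeStopWords(content, langCode string) string {
	list, ok := stopLists[langCode]
	if !ok {
		list = stopLists["en"] // assumed fallback, mirroring the text above
	}
	var kept []string
	for _, w := range strings.Fields(content) {
		if !list[strings.ToLower(w)] {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	fmt.Println(removeStopWords("la  fin   un bel jour", "fr"))
	fmt.Println(removeStopWords("the end of a day", "xx")) // "xx" is unsupported
}
```

A map lookup with the two-value form keeps the fallback explicit, so an unsupported code never yields an empty filter.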

How to use this package

An example is available at https://github.com/bbalet/gorelated, where the stopwords package is used together with the SimHash algorithm to find a list of related content for a static website generator:

package main

import (
	"fmt"

	"github.com/bbalet/stopwords"
)

func main() {
	// Example with 2 strings containing <p> HTML tags.
	// "la", "un", etc. are stop words without lexical value in French.
	string1 := []byte("<p>la fin d'un bel après-midi d'été</p>")
	string2 := []byte("<p>cet été, nous avons eu un bel après-midi</p>")

	// Return a byte slice where HTML tags and French stop words have been removed.
	cleanContent := stopwords.Clean(string1, "fr", true)
	fmt.Println(string(cleanContent))

	// Get two simhashes representing the content of each string.
	hash1 := stopwords.Simhash(string1, "fr", true)
	hash2 := stopwords.Simhash(string2, "fr", true)

	// Hamming distance between the two hashes (difference between contents).
	distance := stopwords.CompareSimhash(hash1, hash2)
	fmt.Println(distance)

	// Clean the content of string1 and string2, then compute the Levenshtein distance.
	fmt.Println(stopwords.LevenshteinDistance(string1, string2, "fr", true))
}

Here, fr is the ISO 639-1 code for French (a BCP 47 tag is accepted as well). See https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

Credits

Most of the lists were built by IR Multilingual Resources at UniNE: http://members.unine.ch/jacques.savoy/clef/index.html

License

stopwords is released under the BSD license.

Documentation

Overview

Implements the Levenshtein distance algorithm to evaluate the difference between two strings.

Implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.

The stopwords package removes the most frequent words from text content. It can be used, for example, to improve the accuracy of the SimHash algorithm. It uses lists of the most frequent words in various languages:

arabic, bulgarian, czech, danish, english, finnish, french, german, hungarian, italian, japanese, latvian, norwegian, persian, polish, portuguese, romanian, russian, slovak, spanish, swedish, turkish

It also provides text comparison algorithms (Simhash, Levenshtein).

Functions

func Clean

func Clean(content []byte, langCode string, cleanHTML bool) []byte

Clean removes superfluous spaces and stop words from a byte slice. langCode is a BCP 47 or ISO 639-1 language code (if it is unknown, the English filters are applied). If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
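To illustrate what the cleanHTML option implies, here is a rough sketch built only on the standard library. It is an assumption about the general approach, not the package's implementation: `stripHTML` and `tagPattern` are hypothetical names, and the regular expression is a deliberately naive tag matcher that would not cope with every HTML document.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// tagPattern naively matches anything between '<' and '>'.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// stripHTML replaces tags with spaces, unescapes entities such as
// &eacute;, and collapses the remaining whitespace.
func stripHTML(content string) string {
	noTags := tagPattern.ReplaceAllString(content, " ")
	unescaped := html.UnescapeString(noTags)
	return strings.Join(strings.Fields(unescaped), " ")
}

func main() {
	fmt.Println(stripHTML("<p>caf&eacute; &amp; th&eacute;</p>"))
}
```

Tags are removed before entities are unescaped so that an escaped `&lt;` in the text is not mistaken for markup.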

func CleanString

func CleanString(content string, langCode string, cleanHTML bool) string

CleanString removes superfluous spaces and stop words from a string. langCode is a BCP 47 or ISO 639-1 language code (if it is unknown, the English filters are applied). If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.

func CompareSimhash

func CompareSimhash(a uint64, b uint64) uint8

CompareSimhash calculates the Hamming distance between two 64-bit integers using the Kernighan method.
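For readers unfamiliar with the Kernighan method, the sketch below shows the idea: XOR the two values, then repeatedly clear the lowest set bit and count the iterations. `hammingDistance` is a hypothetical stand-in written for illustration, not the package's source.

```go
package main

import "fmt"

// hammingDistance counts the bits that differ between a and b.
// Each pass of the loop clears the lowest set bit of v (v &= v - 1),
// so the loop runs once per differing bit.
func hammingDistance(a, b uint64) uint8 {
	v := a ^ b
	var count uint8
	for v != 0 {
		v &= v - 1
		count++
	}
	return count
}

func main() {
	// 0xB is 1011 and 0x2 is 0010; they differ in two bit positions.
	fmt.Println(hammingDistance(0xB, 0x2))
}
```

Since at most 64 bits can differ, a uint8 result is always large enough.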

func LevenshteinDistance

func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int

LevenshteinDistance computes the Levenshtein distance between two byte slices after removing superfluous spaces and stop words from each of them. langCode is a BCP 47 or ISO 639-1 language code (if it is unknown, the English filters are applied). If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
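The distance computation itself can be sketched with the classic two-row dynamic programming table. This is an illustrative version that skips the cleaning step and compares runes; `levenshtein` and `min3` are hypothetical helpers, and the package may implement the algorithm differently.

```go
package main

import "fmt"

// levenshtein returns the number of single-rune insertions, deletions,
// and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))
}
```

Working on runes rather than bytes keeps accented characters such as "é" from counting as two edits.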

func Simhash

func Simhash(content []byte, langCode string, cleanHTML bool) uint64

Simhash returns a 64-bit simhash representing the content after removing superfluous spaces and stop words from the byte slice. langCode is a BCP 47 or ISO 639-1 language code (if it is unknown, the English filters are applied). If cleanHTML is true, HTML tags are removed from the content and HTML entities are unescaped.
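As a rough picture of how a Charikar-style simhash can be built, the sketch below hashes each whitespace-separated token with FNV-1a and lets every token vote on each of the 64 bits. The choice of features (plain tokens) and of FNV-1a are assumptions made for this example; the package may select features and hash them differently, and `simhash` here is a hypothetical function.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// simhash builds a 64-bit fingerprint: for every bit position, tokens
// whose hash has that bit set vote +1 and the others vote -1. The
// fingerprint bit is 1 where the total is positive.
func simhash(text string) uint64 {
	var weights [64]int
	for _, tok := range strings.Fields(text) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(uint64(1)<<uint(i)) != 0 {
				weights[i]++
			} else {
				weights[i]--
			}
		}
	}
	var fingerprint uint64
	for i, w := range weights {
		if w > 0 {
			fingerprint |= uint64(1) << uint(i)
		}
	}
	return fingerprint
}

func main() {
	a := simhash("fin bel après-midi été")
	b := simhash("été bel après-midi fin")
	fmt.Println(a == b) // token order does not affect the votes
}
```

Because similar documents share most tokens, their fingerprints differ in few bits, which is why the Hamming distance between two simhashes is a useful similarity measure.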
