stringosim

package module
v0.0.0-...-9d0b3e9 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 22, 2017 License: MIT Imports: 5 Imported by: 5

README

stringosim

The plan for this package is to have Go implementation of different string distance/similarity functions, like Levenshtein (normalized, weighted, Damerau), Jaro-Winkler, Jaccard index, Euclidean distance, Hamming distance...

Currently it has implemented:

  • Levenshtein
  • Jaccard
  • Hamming
  • LCS
  • Q-gram
  • n-gram based Cosine distanc

Work in progress...

Import and installation

To get the library just run:

    go get github.com/dexyk/stringosim

To use the library just import it in your code:

    import "github.com/dexyk/stringosim"

To run the tests, go to the directory where stringosim package is installed and run:

    go test

Usage

Currently only Levenshtein, Jaccard, Hamming, LCS string, Q-gram and Cosine distances are implemented.

Levenshtein

Levenshtein distance can be calculated with default parameters (use DefaultSimilarityOptions) where cost of insert, delete and substitute operation are 1. You can also use it with other parameters by using SimilarityOptions type. Setting CaseInsensitive to true in SimilarityOptions the comparison will be done without considering character cases.

Example:

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("stingobim")))

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("stingobim"),
    stringosim.LevenshteinSimilarityOptions{
        InsertCost:     3,
        DeleteCost:     5,
        SubstituteCost: 2,
    }))

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("STRINGOSIM"),
    stringosim.LevenshteinSimilarityOptions{
        InsertCost:      3,
        DeleteCost:      4,
        SubstituteCost:  5,
        CaseInsensitive: true,
    }))
Jaccard

Jaccard distance can be calculated by setting the size of the n-gram which will be used for comparison. If the size is omitted the default value of 1 will be used.

Example:

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim")))

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim"), []int{2}))

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim"), []int{3}))
Hamming

Hamming distance can be calculated with options. Default function will calculate standard hamming distance with case sensitive option. It can be also used without case sensitive option.

If the strings to compare have different lengths, the error will be returned.

Example:

    dis, _ := stringosim.Hamming([]rune("testing"), []rune("restink"))
    fmt.Println(dis)

    dis, _ = stringosim.Hamming([]rune("testing"), []rune("FESTING"), stringosim.HammingSimilarityOptions{
        CaseInsensitive: true,
    })
    fmt.Println(dis)

    _, err := stringosim.Hamming([]rune("testing"), []rune("testin"))
    fmt.Println(err)
Longest Common Subsequence (LCS)

LCS between two strings can be calculated with options. Default function will calculate the LCS with case insensitive option. It can be also used without case sensitive option.

Example:

    fmt.Println(stringosim.LCS([]rune("testing lcs algorithm"), []rune("another l c s example")))

    fmt.Println(stringosim.LCS([]rune("testing lcs algorithm"), []rune("ANOTHER L C S EXAMPLE"),
    stringosim.LCSSimilarityOptions{
        CaseInsensitive: true,
    }))
Jaro and Jaro-Winkler

Jaro and Jaro-Winkler can be calculated with options: case insensitive, and specific values for Jaro-Winkler - threshold, p value and l value.

Example:

    fmt.Println(stringosim.Jaro([]rune("abaccbabaacbcb"), []rune("bababbcabbaaca")))
    fmt.Println(stringosim.Jaro([]rune("abaccbabaacbcb"), []rune("ABABAbbCABbaACA"),
    stringosim.JaroSimilarityOptions{
        CaseInsensitive: true,
    }))

    fmt.Println(stringosim.JaroWinkler([]rune("abaccbabaacbcb"), []rune("bababbcabbaaca")))
    fmt.Println(stringosim.JaroWinkler([]rune("abaccbabaacbcb"), []rune("BABAbbCABbaACA"),
    stringosim.JaroSimilarityOptions{
        CaseInsensitive: true,
        Threshold:       0.7,
        PValue:          0.1,
        LValue:          4,
    }))
Q-gram

Q-gram distance can be calculated using default options (DefaultQGramOptions): length of q-grams is 2 and comparison is case sensitive. Using QGramSimilarityOptions as the parameter of the function we can set custom q-gram length and if the comparison is case sensitive or not.

Example:

    fmt.Println(stringosim.QGram([]rune("abcde"), []rune("abdcde")))

    fmt.Println(stringosim.QGram([]rune("abcde"), []rune("ABDCDE"),
    stringosim.QGramSimilarityOptions{
        CaseInsensitive: true,
        NGramSizes:     []int{3},
    }))
Cosine

Cosine distance can be calculated using default options (DefaultCosineOptions): length of n-grams is 2 and comparison is case sensitive. Using CosineSimilarityOptions as the parameter of the function we can set custom n-gram length and if the comparison is case sensitive or not.

Example:

    fmt.Println(stringosim.Cosine([]rune("abcde"), []rune("abdcde")))

    fmt.Println(stringosim.Cosine(Cosine[]rune("abcde"), []rune("ABDCDE"),
    stringosim.CosineSimilarityOptions{
        CaseInsensitive: true,
        NGramSizes:     []int{3},
    }))

Documentation

Index

Constants

View Source
const EPS = 0.000000001

Variables

View Source
var DefaultCosineSimilarityOptions = CosineSimilarityOptions{
	CaseInsensitive: false,
	NGramSizes:      []int{2},
}
View Source
var DefaultHammingSimilarityOptions = HammingSimilarityOptions{
	CaseInsensitive: false,
}
View Source
var DefaultJaroSimilarityOptions = JaroSimilarityOptions{
	CaseInsensitive: false,
	Threshold:       0.7,
	PValue:          0.1,
	LValue:          4,
}
View Source
var DefaultLCSSimilarityOptions = LCSSimilarityOptions{
	CaseInsensitive: false,
}
View Source
var DefaultLevenshteinSimilarityOptions = LevenshteinSimilarityOptions{
	InsertCost:     1,
	DeleteCost:     1,
	SubstituteCost: 1,
}
View Source
var DefaultQGramSimilarityOptions = QGramSimilarityOptions{
	CaseInsensitive: false,
	NGramSizes:      []int{2},
}
View Source
var HAMMING_ERROR_DIFFERENT_LENGTH = errors.New("Can't compare strings of different lengths")

Functions

func AbsInt

func AbsInt(v int) int

func CompareErrors

func CompareErrors(e1 error, e2 error) bool

func Cosine

func Cosine(s []rune, t []rune, options ...CosineSimilarityOptions) float64

func DotProductNGrams

func DotProductNGrams(m1, m2 map[string]int) int

func EqualFloat64

func EqualFloat64(x, y float64) bool

func GetNGram

func GetNGram(s string, NGramSizes []int) map[string]int

func Hamming

func Hamming(s []rune, t []rune, options ...HammingSimilarityOptions) (int, error)

func Jaccard

func Jaccard(s []rune, t []rune, NGramSizes []int) float64

func Jaro

func Jaro(s []rune, t []rune, options ...JaroSimilarityOptions) float64

func JaroWinkler

func JaroWinkler(s []rune, t []rune, options ...JaroSimilarityOptions) float64

func LCS

func LCS(s []rune, t []rune, options ...LCSSimilarityOptions) int

func Levenshtein

func Levenshtein(s []rune, t []rune, options ...LevenshteinSimilarityOptions) int

func Max

func Max(a int, b int) int

func Min

func Min(a int, b int) int

func NormNGram

func NormNGram(m map[string]int) float64

func QGram

func QGram(s []rune, t []rune, options ...QGramSimilarityOptions) int

func SameRune

func SameRune(a rune, b rune, caseInsensitive bool) bool

Types

type CosineSimilarityOptions

type CosineSimilarityOptions struct {
	CaseInsensitive bool
	NGramSizes      []int
}

type HammingSimilarityOptions

type HammingSimilarityOptions struct {
	CaseInsensitive bool
}

type JaroSimilarityOptions

type JaroSimilarityOptions struct {
	Threshold       float64
	PValue          float64
	LValue          float64
	CaseInsensitive bool
}

type LCSSimilarityOptions

type LCSSimilarityOptions struct {
	CaseInsensitive bool
}

type LevenshteinSimilarityOptions

type LevenshteinSimilarityOptions struct {
	InsertCost      int
	DeleteCost      int
	SubstituteCost  int
	CaseInsensitive bool
}

type QGramSimilarityOptions

type QGramSimilarityOptions struct {
	CaseInsensitive bool
	NGramSizes      []int
}

type SimilarityOptions

type SimilarityOptions struct {
	CaseInsensitive bool
}

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL