strutil

v0.3.1
Published: Sep 27, 2023 License: MIT Imports: 2 Imported by: 59

README

strutil


strutil provides a collection of string metrics for calculating string similarity as well as other string utility functions.
Full documentation can be found at https://pkg.go.dev/github.com/adrg/strutil.

Installation

go get github.com/adrg/strutil

String metrics

The package defines the StringMetric interface, which every string metric implements. Pass a metric to the Similarity function to calculate the similarity of two strings. The result is a number between 0 and 1, and larger values indicate closer matches.

type StringMetric interface {
    Compare(a, b string) float64
}

// Similarity returns the similarity of a and b, computed using the
// specified string metric.
func Similarity(a, b string, metric StringMetric) float64

All defined string metrics can be found in the metrics package.
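
Because Similarity accepts any value with a Compare method, a custom metric can be used alongside the ones from the metrics package. The sketch below is illustrative only: it redeclares the interface and a stand-in Similarity locally so that it runs without the module, and ExactMatch is a hypothetical metric that is not part of strutil. That the stand-in simply delegates to Compare is an assumption about how the real function behaves.

```go
package main

import (
	"fmt"
	"strings"
)

// StringMetric mirrors the interface shown above so that the sketch
// compiles without importing the package.
type StringMetric interface {
	Compare(a, b string) float64
}

// Similarity is a local stand-in with the same signature as the package
// function. Delegating to metric.Compare is an assumption.
func Similarity(a, b string, metric StringMetric) float64 {
	return metric.Compare(a, b)
}

// ExactMatch is a hypothetical metric (not part of strutil). It returns 1
// when the strings are equal and 0 otherwise.
type ExactMatch struct {
	CaseSensitive bool
}

// Compare implements StringMetric.
func (m ExactMatch) Compare(a, b string) float64 {
	if !m.CaseSensitive {
		a, b = strings.ToLower(a), strings.ToLower(b)
	}
	if a == b {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(Similarity("Text", "text", ExactMatch{}))                    // 1
	fmt.Println(Similarity("Text", "text", ExactMatch{CaseSensitive: true})) // 0
}
```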

Hamming

Calculate similarity.

similarity := strutil.Similarity("text", "test", metrics.NewHamming())
fmt.Printf("%.2f\n", similarity) // Output: 0.75

Calculate distance.

ham := metrics.NewHamming()
fmt.Printf("%d\n", ham.Distance("one", "once")) // Output: 2

More information and additional examples can be found on pkg.go.dev.
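
To see where the numbers above come from, here is a minimal stdlib-only sketch of a Hamming-style distance. It is not the package's implementation. Counting the extra characters of the longer string as differences, and dividing by the longer length to get a similarity, are assumptions chosen because they reproduce the two outputs shown above. The function names are hypothetical.

```go
package main

import "fmt"

// hammingDistance counts the positions at which a and b differ. Runes past
// the end of the shorter string are counted as differences (an assumption
// made to match the ("one", "once") example).
func hammingDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) > len(rb) {
		ra, rb = rb, ra
	}

	dist := len(rb) - len(ra)
	for i := range ra {
		if ra[i] != rb[i] {
			dist++
		}
	}
	return dist
}

// hammingSimilarity normalizes the distance by the length of the longer
// string (an assumption). Two empty strings are treated as identical.
func hammingSimilarity(a, b string) float64 {
	n := len([]rune(a))
	if m := len([]rune(b)); m > n {
		n = m
	}
	if n == 0 {
		return 1
	}
	return 1 - float64(hammingDistance(a, b))/float64(n)
}

func main() {
	fmt.Println(hammingDistance("one", "once"))             // 2
	fmt.Printf("%.2f\n", hammingSimilarity("text", "test")) // 0.75
}
```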

Levenshtein

Calculate similarity using default options.

similarity := strutil.Similarity("graph", "giraffe", metrics.NewLevenshtein())
fmt.Printf("%.2f\n", similarity) // Output: 0.43

Configure edit operation costs.

lev := metrics.NewLevenshtein()
lev.CaseSensitive = false
lev.InsertCost = 1
lev.ReplaceCost = 2
lev.DeleteCost = 1

similarity := strutil.Similarity("make", "Cake", lev)
fmt.Printf("%.2f\n", similarity) // Output: 0.50

Calculate distance.

lev := metrics.NewLevenshtein()
fmt.Printf("%d\n", lev.Distance("graph", "giraffe")) // Output: 4

More information and additional examples can be found on pkg.go.dev.
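
For readers who want to check the default-option results by hand, the following is a sketch of the classic dynamic-programming edit distance with a cost of 1 per operation. It is not the package's implementation and ignores the configurable costs and the CaseSensitive option. Dividing the distance by the length of the longer string is an assumption that happens to reproduce the 0.43 shown above. All function names are hypothetical.

```go
package main

import "fmt"

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, charging 1 for
// every insertion, deletion and replacement. Only two rows of the DP
// table are kept in memory.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}

	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// similarity normalizes the distance by the longer length (an assumption).
func similarity(a, b string) float64 {
	n := len([]rune(a))
	if m := len([]rune(b)); m > n {
		n = m
	}
	if n == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(n)
}

func main() {
	fmt.Println(levenshtein("graph", "giraffe"))          // 4
	fmt.Printf("%.2f\n", similarity("graph", "giraffe")) // 0.43
}
```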

Jaro

similarity := strutil.Similarity("think", "tank", metrics.NewJaro())
fmt.Printf("%.2f\n", similarity) // Output: 0.78

More information and additional examples can be found on pkg.go.dev.

Jaro-Winkler

similarity := strutil.Similarity("think", "tank", metrics.NewJaroWinkler())
fmt.Printf("%.2f\n", similarity) // Output: 0.80

More information and additional examples can be found on pkg.go.dev.

Smith-Waterman-Gotoh

Calculate similarity using default options.

swg := metrics.NewSmithWatermanGotoh()
similarity := strutil.Similarity("times roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.82

Customize gap penalty and substitution function.

swg := metrics.NewSmithWatermanGotoh()
swg.CaseSensitive = false
swg.GapPenalty = -0.1
swg.Substitution = metrics.MatchMismatch {
    Match:    1,
    Mismatch: -0.5,
}

similarity := strutil.Similarity("Times Roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.96

More information and additional examples can be found on pkg.go.dev.

Sorensen-Dice

Calculate similarity using default options.

sd := metrics.NewSorensenDice()
similarity := strutil.Similarity("time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.62

Customize n-gram size.

sd := metrics.NewSorensenDice()
sd.CaseSensitive = false
sd.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.53

More information and additional examples can be found on pkg.go.dev.

Jaccard

Calculate similarity using default options.

j := metrics.NewJaccard()
similarity := strutil.Similarity("time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.45

Customize n-gram size.

j := metrics.NewJaccard()
j.CaseSensitive = false
j.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.36

The Jaccard examples use the same input as the Sorensen-Dice examples because the two metrics are closely related: either coefficient can be calculated from the other.

Sorensen-Dice to Jaccard.

J = SD/(2-SD)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.

Jaccard to Sorensen-Dice.

SD = 2*J/(1+J)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.
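
The two formulas can be checked against the example outputs above. The sketch below applies them to the rounded values printed by the Sorensen-Dice and Jaccard examples. The helper names are hypothetical and not part of the package.

```go
package main

import "fmt"

// diceToJaccard converts a Sorensen-Dice coefficient to a Jaccard index
// using J = SD/(2-SD).
func diceToJaccard(sd float64) float64 {
	return sd / (2 - sd)
}

// jaccardToDice converts a Jaccard index to a Sorensen-Dice coefficient
// using SD = 2*J/(1+J).
func jaccardToDice(j float64) float64 {
	return 2 * j / (1 + j)
}

func main() {
	// Default options: Sorensen-Dice 0.62, Jaccard 0.45.
	fmt.Printf("%.2f\n", diceToJaccard(0.62)) // 0.45
	fmt.Printf("%.2f\n", jaccardToDice(0.45)) // 0.62

	// N-gram size 3: Sorensen-Dice 0.53, Jaccard 0.36.
	fmt.Printf("%.2f\n", diceToJaccard(0.53)) // 0.36
	fmt.Printf("%.2f\n", jaccardToDice(0.36)) // 0.53
}
```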

More information and additional examples can be found on pkg.go.dev.

Overlap Coefficient

Calculate similarity using default options.

oc := metrics.NewOverlapCoefficient()
similarity := strutil.Similarity("time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.67

Customize n-gram size.

oc := metrics.NewOverlapCoefficient()
oc.CaseSensitive = false
oc.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.57

More information and additional examples can be found on pkg.go.dev.
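
Sorensen-Dice, Jaccard and the overlap coefficient can all be derived from the same three numbers: the count of shared n-grams and the n-gram totals of each string. The sketch below uses the textbook definitions of the three coefficients with 2-character n-grams, which reproduces the default-option outputs shown above. That the package computes them in exactly this way, and that 2 is the default n-gram size, are assumptions. The helper names are hypothetical and the size argument is assumed to be positive.

```go
package main

import "fmt"

// ngramCounts returns the frequency of every n-gram of the given size in
// term, together with the total number of n-grams.
func ngramCounts(term string, size int) (map[string]int, int) {
	runes := []rune(term)
	counts := make(map[string]int)
	total := 0
	for i := 0; i+size <= len(runes); i++ {
		counts[string(runes[i:i+size])]++
		total++
	}
	return counts, total
}

// shared returns the number of n-grams common to a and b, counting each
// n-gram min(frequency in a, frequency in b) times, plus both totals.
func shared(a, b string, size int) (common, totalA, totalB int) {
	ca, ta := ngramCounts(a, size)
	cb, tb := ngramCounts(b, size)
	for gram, n := range ca {
		if m := cb[gram]; m < n {
			common += m
		} else {
			common += n
		}
	}
	return common, ta, tb
}

func main() {
	c, ta, tb := shared("time to make haste", "no time to waste", 2)
	smaller := ta
	if tb < smaller {
		smaller = tb
	}

	fmt.Println(c, ta, tb)                                             // 10 17 15
	fmt.Printf("Sorensen-Dice: %.2f\n", 2*float64(c)/float64(ta+tb)) // 0.62
	fmt.Printf("Jaccard: %.2f\n", float64(c)/float64(ta+tb-c))       // 0.45
	fmt.Printf("Overlap: %.2f\n", float64(c)/float64(smaller))       // 0.67
}
```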

Contributing

Contributions in the form of pull requests, issues, or general feedback are always welcome.
See CONTRIBUTING.MD.

License

Copyright (c) 2019 Adrian-George Bostan.

This project is licensed under the MIT license. See LICENSE for more details.

Documentation

Overview

Package strutil provides string metrics for calculating string similarity as well as other string utility functions. Documentation for all the metrics can be found at https://pkg.go.dev/github.com/adrg/strutil/metrics.

Included string metrics:

  • Hamming
  • Jaro
  • Jaro-Winkler
  • Levenshtein
  • Smith-Waterman-Gotoh
  • Sorensen-Dice
  • Jaccard
  • Overlap coefficient

Functions

func CommonPrefix

func CommonPrefix(a, b string) string

CommonPrefix returns the common prefix of the specified strings. An empty string is returned if the parameters have no prefix in common.

Example
package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	fmt.Println("(answer, anvil):", strutil.CommonPrefix("answer", "anvil"))
}
Output:

(answer, anvil): an
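
The behavior described above can be sketched with the standard library alone. This is an illustration, not the package's source: commonPrefix is a hypothetical name, and comparing rune by rune (rather than byte by byte) is an assumption.

```go
package main

import "fmt"

// commonPrefix returns the longest prefix shared by a and b, or an empty
// string if they have none. Strings are compared rune by rune.
func commonPrefix(a, b string) string {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return string(ra[:n])
}

func main() {
	fmt.Println("(answer, anvil):", commonPrefix("answer", "anvil")) // (answer, anvil): an
	fmt.Printf("(abc, xyz): %q\n", commonPrefix("abc", "xyz"))       // (abc, xyz): ""
}
```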

func NgramCount (added in v0.3.0)

func NgramCount(term string, size int) int

NgramCount returns the n-gram count of the specified size for the provided term. An n-gram size of 1 is used if the provided size is less than or equal to 0.

Example
package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	fmt.Println("abbcd n-gram count (size 2):", strutil.NgramCount("abbcd", 2))
	fmt.Println("abbcd n-gram count (size 3):", strutil.NgramCount("abbcd", 3))
}
Output:

abbcd n-gram count (size 2): 4
abbcd n-gram count (size 3): 3

func NgramIntersection (added in v0.2.0)

func NgramIntersection(a, b string, size int) (map[string]int, int, int, int)

NgramIntersection returns a map of the n-grams of the specified size found in both terms, along with their frequency. The function also returns the number of common n-grams (the sum of all the values in the output map), the total number of n-grams in the first term and the total number of n-grams in the second term. An n-gram size of 1 is used if the provided size is less than or equal to 0.

Example
package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	ngrams, common, totalA, totalB := strutil.NgramIntersection("ababc", "ababd", 2)
	fmt.Printf("(ababc, ababd) n-gram intersection: %v (%d/%d n-grams)\n",
		ngrams, common, totalA+totalB)
}
Output:

(ababc, ababd) n-gram intersection: map[ab:2 ba:1] (3/8 n-grams)

func NgramMap (added in v0.2.0)

func NgramMap(term string, size int) (map[string]int, int)

NgramMap returns a map of all n-grams of the specified size for the provided term, along with their frequency. The function also returns the total number of n-grams, which is the sum of all the values in the output map. An n-gram size of 1 is used if the provided size is less than or equal to 0.

Example
package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	// 2 character n-gram map.
	ngrams, total := strutil.NgramMap("abbcabb", 2)
	fmt.Printf("abbcabb n-gram map (size 2): %v (%d ngrams)\n", ngrams, total)

	// 3 character n-gram map.
	ngrams, total = strutil.NgramMap("abbcabb", 3)
	fmt.Printf("abbcabb n-gram map (size 3): %v (%d ngrams)\n", ngrams, total)
}
Output:

abbcabb n-gram map (size 2): map[ab:2 bb:2 bc:1 ca:1] (6 ngrams)
abbcabb n-gram map (size 3): map[abb:2 bbc:1 bca:1 cab:1] (5 ngrams)

func Ngrams (added in v0.2.0)

func Ngrams(term string, size int) []string

Ngrams returns all the n-grams of the specified size for the provided term. The n-grams in the output slice are in the order in which they occur in the input term. An n-gram size of 1 is used if the provided size is less than or equal to 0.

Example
package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	fmt.Println("abbcd n-grams (size 2):", strutil.Ngrams("abbcd", 2))
	fmt.Println("abbcd n-grams (size 3):", strutil.Ngrams("abbcd", 3))
}
Output:

abbcd n-grams (size 2): [ab bb bc cd]
abbcd n-grams (size 3): [abb bbc bcd]
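
The output above is what a sliding window over the term produces: a term of n characters yields n-size+1 n-grams when size is at most n. The following stdlib-only sketch shows the idea. It is not the package's source, ngrams is a hypothetical name, and returning an empty result when the term is shorter than the requested size is an assumption that may differ from the package's behavior.

```go
package main

import "fmt"

// ngrams slides a window of the given size over term and returns every
// n-gram in order of occurrence. A size of 1 is used when size <= 0.
// Terms shorter than size yield no n-grams (an assumption).
func ngrams(term string, size int) []string {
	if size <= 0 {
		size = 1
	}

	runes := []rune(term)
	var out []string
	for i := 0; i+size <= len(runes); i++ {
		out = append(out, string(runes[i:i+size]))
	}
	return out
}

func main() {
	fmt.Println("abbcd n-grams (size 2):", ngrams("abbcd", 2)) // [ab bb bc cd]
	fmt.Println("abbcd n-grams (size 3):", ngrams("abbcd", 3)) // [abb bbc bcd]
	fmt.Println("count (size 2):", len(ngrams("abbcd", 2)))    // 4
}
```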

func Similarity

func Similarity(a, b string, metric StringMetric) float64

Similarity returns the similarity of a and b, computed using the specified string metric. The returned similarity is a number between 0 and 1. Larger similarity numbers indicate closer matches.

Example
package main

import (
	"fmt"

	"github.com/adrg/strutil"
	"github.com/adrg/strutil/metrics"
)

func main() {
	sim := strutil.Similarity("riddle", "needle", metrics.NewJaroWinkler())
	fmt.Printf("(riddle, needle) similarity: %.2f\n", sim)
}
Output:

(riddle, needle) similarity: 0.56

func SliceContains

func SliceContains(terms []string, q string) bool

SliceContains returns true if terms contains q, or false otherwise.

Example
package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	terms := []string{"a", "b", "c"}
	fmt.Println("([a b c], b):", strutil.SliceContains(terms, "b"))
	fmt.Println("([a b c], d):", strutil.SliceContains(terms, "d"))
}
Output:

([a b c], b): true
([a b c], d): false

func UniqueSlice

func UniqueSlice(items []string) []string

UniqueSlice returns a slice containing the unique items from the specified string slice. The items in the output slice are in the order in which they occur in the input slice.

Example
package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	sample := []string{"a", "b", "a", "b", "b", "c"}
	fmt.Println("[a b a b b c]:", strutil.UniqueSlice(sample))
}
Output:

[a b a b b c]: [a b c]
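
An order-preserving deduplication like the one described above is commonly written with a set of already-seen items. The sketch below shows that approach using only the standard library. It is an illustration under that assumption, not the package's source, and uniqueSlice is a hypothetical name.

```go
package main

import "fmt"

// uniqueSlice returns the distinct items of the input slice, keeping the
// position of each item's first occurrence.
func uniqueSlice(items []string) []string {
	seen := make(map[string]struct{}, len(items))
	out := make([]string, 0, len(items))
	for _, item := range items {
		if _, ok := seen[item]; ok {
			continue // already emitted
		}
		seen[item] = struct{}{}
		out = append(out, item)
	}
	return out
}

func main() {
	sample := []string{"a", "b", "a", "b", "b", "c"}
	fmt.Println("[a b a b b c]:", uniqueSlice(sample)) // [a b a b b c]: [a b c]
}
```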

Types

type StringMetric

type StringMetric interface {
	Compare(a, b string) float64
}

StringMetric represents a metric for measuring the similarity between strings. The metrics package implements the following string metrics:

  • Hamming
  • Jaro
  • Jaro-Winkler
  • Levenshtein
  • Smith-Waterman-Gotoh
  • Sorensen-Dice
  • Jaccard
  • Overlap coefficient

For more information see https://pkg.go.dev/github.com/adrg/strutil/metrics.
