strutil

Version: v0.3.1 (latest), Published: Sep 27, 2023, License: MIT, Imports: 2, Imported by: 57

Repository

github.com/adrg/strutil

README ¶

strutil

strutil provides a collection of string metrics for calculating string similarity as well as other string utility functions.
Full documentation can be found at https://pkg.go.dev/github.com/adrg/strutil.

Installation

go get github.com/adrg/strutil

The package defines the StringMetric interface, which is implemented by all the string metrics. The interface is used with the Similarity function, which calculates the similarity between the specified strings, using the provided string metric.

type StringMetric interface {
    Compare(a, b string) float64
}

func Similarity(a, b string, metric StringMetric) float64

All defined string metrics can be found in the metrics package.
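
Because StringMetric has a single method, any type with a matching Compare method can be used as a metric. The following is a minimal sketch of that idea, not code from the package: ExactMatch is a hypothetical metric invented for illustration, and the interface and a stand-in similarity function are redeclared locally so that the snippet compiles without the module. The stand-in simply delegates to the metric, which is an assumption about how Similarity behaves.

```go
package main

import (
	"fmt"
	"strings"
)

// StringMetric mirrors the interface shown above. It is redeclared here
// only so that the sketch builds without importing the module.
type StringMetric interface {
	Compare(a, b string) float64
}

// ExactMatch is a hypothetical metric: 1 for equal strings, 0 otherwise.
type ExactMatch struct {
	CaseSensitive bool
}

// Compare implements StringMetric.
func (m ExactMatch) Compare(a, b string) float64 {
	if !m.CaseSensitive {
		a, b = strings.ToLower(a), strings.ToLower(b)
	}
	if a == b {
		return 1
	}
	return 0
}

// similarity is a local stand-in for strutil.Similarity. It is assumed
// here to delegate to the metric's Compare method.
func similarity(a, b string, metric StringMetric) float64 {
	return metric.Compare(a, b)
}

func main() {
	fmt.Printf("%.2f\n", similarity("Go", "go", ExactMatch{}))                    // 1.00
	fmt.Printf("%.2f\n", similarity("Go", "go", ExactMatch{CaseSensitive: true})) // 0.00
}
```

In real code, a value such as ExactMatch{} would be passed to strutil.Similarity in place of one of the metrics from the metrics package.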

Hamming

Calculate similarity.

similarity := strutil.Similarity("text", "test", metrics.NewHamming())
fmt.Printf("%.2f\n", similarity) // Output: 0.75

Calculate distance.

ham := metrics.NewHamming()
fmt.Printf("%d\n", ham.Distance("one", "once")) // Output: 2

More information and additional examples can be found on pkg.go.dev.
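
The two outputs above can be reproduced with a short, self-contained sketch. It is not the package's implementation: hammingDistance and hammingSimilarity are illustrative names, and the handling of unequal lengths (each extra character counts as one mismatch) and of two empty strings (similarity 1) are assumptions chosen to agree with the examples.

```go
package main

import "fmt"

// hammingDistance counts the positions at which a and b differ.
// Assumption: every character beyond the shorter string is one mismatch.
func hammingDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) > len(rb) {
		ra, rb = rb, ra // make ra the shorter string
	}
	dist := len(rb) - len(ra)
	for i := range ra {
		if ra[i] != rb[i] {
			dist++
		}
	}
	return dist
}

// hammingSimilarity normalizes the distance by the longer length.
func hammingSimilarity(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1 // assumption: two empty strings are identical
	}
	return 1 - float64(hammingDistance(a, b))/float64(longest)
}

func main() {
	fmt.Println(hammingDistance("one", "once"))             // 2
	fmt.Printf("%.2f\n", hammingSimilarity("text", "test")) // 0.75
}
```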

Levenshtein

Calculate similarity using default options.

similarity := strutil.Similarity("graph", "giraffe", metrics.NewLevenshtein())
fmt.Printf("%.2f\n", similarity) // Output: 0.43

Configure edit operation costs.

lev := metrics.NewLevenshtein()
lev.CaseSensitive = false
lev.InsertCost = 1
lev.ReplaceCost = 2
lev.DeleteCost = 1

similarity := strutil.Similarity("make", "Cake", lev)
fmt.Printf("%.2f\n", similarity) // Output: 0.50

Calculate distance.

lev := metrics.NewLevenshtein()
fmt.Printf("%d\n", lev.Distance("graph", "giraffe")) // Output: 4

More information and additional examples can be found on pkg.go.dev.
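
To show how the edit operation costs enter the distance, here is a minimal sketch of a weighted Levenshtein distance using the standard two-row dynamic programming table. It is an illustration under stated assumptions rather than the package's code: weightedLevenshtein and min3 are hypothetical names, costs are plain integers, and case folding is left out.

```go
package main

import "fmt"

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// weightedLevenshtein returns the minimum total cost of turning a into b
// using insertions, deletions and replacements with the given costs.
func weightedLevenshtein(a, b string, insertCost, deleteCost, replaceCost int) int {
	ra, rb := []rune(a), []rune(b)

	// prev[j] is the cost of turning the processed prefix of a into b[:j].
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j * insertCost
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i * deleteCost
		for j := 1; j <= len(rb); j++ {
			sub := prev[j-1]
			if ra[i-1] != rb[j-1] {
				sub += replaceCost
			}
			cur[j] = min3(sub, prev[j]+deleteCost, cur[j-1]+insertCost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(weightedLevenshtein("graph", "giraffe", 1, 1, 1)) // 4
	fmt.Println(weightedLevenshtein("make", "cake", 1, 1, 2))     // 2
}
```

With a replacement cost of 2, replacing a character costs the same as deleting it and inserting another, which is why the second distance is 2 rather than 1.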

Jaro

similarity := strutil.Similarity("think", "tank", metrics.NewJaro())
fmt.Printf("%.2f\n", similarity) // Output: 0.78

More information and additional examples can be found on pkg.go.dev.

Jaro-Winkler

similarity := strutil.Similarity("think", "tank", metrics.NewJaroWinkler())
fmt.Printf("%.2f\n", similarity) // Output: 0.80

More information and additional examples can be found on pkg.go.dev.

Smith-Waterman-Gotoh

Calculate similarity using default options.

swg := metrics.NewSmithWatermanGotoh()
similarity := strutil.Similarity("times roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.82

Customize gap penalty and substitution function.

swg := metrics.NewSmithWatermanGotoh()
swg.CaseSensitive = false
swg.GapPenalty = -0.1
swg.Substitution = metrics.MatchMismatch{
    Match:    1,
    Mismatch: -0.5,
}

similarity := strutil.Similarity("Times Roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.96

More information and additional examples can be found on pkg.go.dev.

Sorensen-Dice

Calculate similarity using default options.

sd := metrics.NewSorensenDice()
similarity := strutil.Similarity("time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.62

Customize n-gram size.

sd := metrics.NewSorensenDice()
sd.CaseSensitive = false
sd.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.53

More information and additional examples can be found on pkg.go.dev.

Jaccard

Calculate similarity using default options.

j := metrics.NewJaccard()
similarity := strutil.Similarity("time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.45

Customize n-gram size.

j := metrics.NewJaccard()
j.CaseSensitive = false
j.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.36

The Jaccard examples use the same input as the Sorensen-Dice examples because the two metrics are closely related. In fact, each coefficient can be computed from the other.

Sorensen-Dice to Jaccard.

J = SD/(2-SD)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.

Jaccard to Sorensen-Dice.

SD = 2*J/(1+J)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.
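
The two conversions can be checked numerically with a small sketch. The function names diceToJaccard and jaccardToDice are illustrative and are not part of the package. The inputs are the rounded Sorensen-Dice values printed in the examples above, so the results match the Jaccard examples only to two decimal places.

```go
package main

import "fmt"

// diceToJaccard converts a Sorensen-Dice coefficient to a Jaccard index.
func diceToJaccard(sd float64) float64 {
	return sd / (2 - sd)
}

// jaccardToDice converts a Jaccard index to a Sorensen-Dice coefficient.
func jaccardToDice(j float64) float64 {
	return 2 * j / (1 + j)
}

func main() {
	fmt.Printf("%.2f\n", diceToJaccard(0.62)) // 0.45
	fmt.Printf("%.2f\n", diceToJaccard(0.53)) // 0.36
	fmt.Printf("%.2f\n", jaccardToDice(0.45)) // 0.62
}
```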

More information and additional examples can be found on pkg.go.dev.

Overlap Coefficient

Calculate similarity using default options.

oc := metrics.NewOverlapCoefficient()
similarity := strutil.Similarity("time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.67

Customize n-gram size.

oc := metrics.NewOverlapCoefficient()
oc.CaseSensitive = false
oc.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.57

More information and additional examples can be found on pkg.go.dev.

Contributing

Contributions in the form of pull requests, issues, or general feedback are always welcome.
See CONTRIBUTING.MD.

License

This project is licensed under the MIT license. See LICENSE for more details.

Documentation ¶

Overview ¶

Package strutil provides string metrics for calculating string similarity as well as other string utility functions. Documentation for all the metrics can be found at https://pkg.go.dev/github.com/adrg/strutil/metrics.

Included string metrics:

Hamming
Jaro
Jaro-Winkler
Levenshtein
Smith-Waterman-Gotoh
Sorensen-Dice
Jaccard
Overlap coefficient

Index ¶

func CommonPrefix(a, b string) string
func NgramCount(term string, size int) int
func NgramIntersection(a, b string, size int) (map[string]int, int, int, int)
func NgramMap(term string, size int) (map[string]int, int)
func Ngrams(term string, size int) []string
func Similarity(a, b string, metric StringMetric) float64
func SliceContains(terms []string, q string) bool
func UniqueSlice(items []string) []string
type StringMetric

Examples ¶

CommonPrefix
NgramCount
NgramIntersection
NgramMap
Ngrams
Similarity
SliceContains
UniqueSlice

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CommonPrefix ¶

func CommonPrefix(a, b string) string

CommonPrefix returns the common prefix of the specified strings. An empty string is returned if the parameters have no prefix in common.

Example ¶

package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	fmt.Println("(answer, anvil):", strutil.CommonPrefix("answer", "anvil"))
}

Output:

(answer, anvil): an
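
As a rough illustration of the behavior described above, a common prefix can be computed by walking both strings in step until they diverge. This is a sketch, not the package's implementation. The name commonPrefix is hypothetical, and comparing by rune rather than by byte is an assumption.

```go
package main

import "fmt"

// commonPrefix returns the longest prefix shared by a and b.
// Assumption: strings are compared rune by rune, not byte by byte.
func commonPrefix(a, b string) string {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return string(ra[:n])
}

func main() {
	fmt.Println(commonPrefix("answer", "anvil")) // an
	fmt.Println(commonPrefix("go", "rust") == "") // true
}
```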

func NgramCount ¶ added in v0.3.0

func NgramCount(term string, size int) int

NgramCount returns the n-gram count of the specified size for the provided term. An n-gram size of 1 is used if the provided size is less than or equal to 0.

Example ¶

package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	fmt.Println("abbcd n-gram count (size 2):", strutil.NgramCount("abbcd", 2))
	fmt.Println("abbcd n-gram count (size 3):", strutil.NgramCount("abbcd", 3))
}

Output:

abbcd n-gram count (size 2): 4
abbcd n-gram count (size 3): 3

func NgramIntersection ¶ added in v0.2.0

func NgramIntersection(a, b string, size int) (map[string]int, int, int, int)

NgramIntersection returns a map of the n-grams of the specified size found in both terms, along with their frequency. The function also returns the number of common n-grams (the sum of all the values in the output map), the total number of n-grams in the first term and the total number of n-grams in the second term. An n-gram size of 1 is used if the provided size is less than or equal to 0.

Example ¶

package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	ngrams, common, totalA, totalB := strutil.NgramIntersection("ababc", "ababd", 2)
	fmt.Printf("(ababc, ababd) n-gram intersection: %v (%d/%d n-grams)\n",
		ngrams, common, totalA+totalB)
}

Output:

(ababc, ababd) n-gram intersection: map[ab:2 ba:1] (3/8 n-grams)
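
The output above is consistent with a multiset intersection, in which each shared n-gram is counted as many times as it occurs in the term where it is rarer. The sketch below reproduces the example under that assumption. It is not the package's code: ngramCounts and ngramIntersection are illustrative names, and a term shorter than the n-gram size yields no n-grams here, which may differ from the library.

```go
package main

import "fmt"

// ngramCounts maps each n-gram of the given size to its frequency in term
// and returns the total number of n-grams.
// Assumption: a term shorter than size produces no n-grams.
func ngramCounts(term string, size int) (map[string]int, int) {
	if size <= 0 {
		size = 1
	}
	runes := []rune(term)
	counts := make(map[string]int)
	total := 0
	for i := 0; i+size <= len(runes); i++ {
		counts[string(runes[i:i+size])]++
		total++
	}
	return counts, total
}

// ngramIntersection returns the n-grams found in both terms with the
// smaller of their two frequencies, the number of shared n-grams, and
// the n-gram totals of a and b.
func ngramIntersection(a, b string, size int) (map[string]int, int, int, int) {
	countsA, totalA := ngramCounts(a, size)
	countsB, totalB := ngramCounts(b, size)

	common := make(map[string]int)
	shared := 0
	for ngram, nA := range countsA {
		nB, ok := countsB[ngram]
		if !ok {
			continue
		}
		if nB < nA {
			nA = nB
		}
		common[ngram] = nA
		shared += nA
	}
	return common, shared, totalA, totalB
}

func main() {
	ngrams, shared, totalA, totalB := ngramIntersection("ababc", "ababd", 2)
	fmt.Println(ngrams, shared, totalA, totalB) // map[ab:2 ba:1] 3 4 4
}
```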

func NgramMap ¶ added in v0.2.0

func NgramMap(term string, size int) (map[string]int, int)

NgramMap returns a map of all n-grams of the specified size for the provided term, along with their frequency. The function also returns the total number of n-grams, which is the sum of all the values in the output map. An n-gram size of 1 is used if the provided size is less than or equal to 0.

Example ¶

package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	// 2 character n-gram map.
	ngrams, total := strutil.NgramMap("abbcabb", 2)
	fmt.Printf("abbcabb n-gram map (size 2): %v (%d ngrams)\n", ngrams, total)

	// 3 character n-gram map.
	ngrams, total = strutil.NgramMap("abbcabb", 3)
	fmt.Printf("abbcabb n-gram map (size 3): %v (%d ngrams)\n", ngrams, total)
}

Output:

abbcabb n-gram map (size 2): map[ab:2 bb:2 bc:1 ca:1] (6 ngrams)
abbcabb n-gram map (size 3): map[abb:2 bbc:1 bca:1 cab:1] (5 ngrams)

func Ngrams ¶ added in v0.2.0

func Ngrams(term string, size int) []string

Ngrams returns all the n-grams of the specified size for the provided term. The n-grams in the output slice are in the order in which they occur in the input term. An n-gram size of 1 is used if the provided size is less than or equal to 0.

Example ¶

package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	fmt.Println("abbcd n-grams (size 2):", strutil.Ngrams("abbcd", 2))
	fmt.Println("abbcd n-grams (size 3):", strutil.Ngrams("abbcd", 3))
}

Output:

abbcd n-grams (size 2): [ab bb bc cd]
abbcd n-grams (size 3): [abb bbc bcd]

func Similarity ¶

func Similarity(a, b string, metric StringMetric) float64

Similarity returns the similarity of a and b, computed using the specified string metric. The returned similarity is a number between 0 and 1. Larger similarity numbers indicate closer matches.

Example ¶

package main

import (
	"fmt"

	"github.com/adrg/strutil"
	"github.com/adrg/strutil/metrics"
)

func main() {
	sim := strutil.Similarity("riddle", "needle", metrics.NewJaroWinkler())
	fmt.Printf("(riddle, needle) similarity: %.2f\n", sim)
}

Output:

(riddle, needle) similarity: 0.56

func SliceContains ¶

func SliceContains(terms []string, q string) bool

SliceContains returns true if terms contains q, or false otherwise.

Example ¶

package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	terms := []string{"a", "b", "c"}
	fmt.Println("([a b c], b):", strutil.SliceContains(terms, "b"))
	fmt.Println("([a b c], d):", strutil.SliceContains(terms, "d"))
}

Output:

([a b c], b): true
([a b c], d): false

func UniqueSlice ¶

func UniqueSlice(items []string) []string

UniqueSlice returns a slice containing the unique items from the specified string slice. The items in the output slice are in the order in which they occur in the input slice.

Example ¶

package main

import (
	"fmt"

	"github.com/adrg/strutil"
)

func main() {
	sample := []string{"a", "b", "a", "b", "b", "c"}
	fmt.Println("[a b a b b c]:", strutil.UniqueSlice(sample))
}

Output:

[a b a b b c]: [a b c]
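
One common way to obtain this order-preserving behavior is to track the items already seen in a set while appending first occurrences to the output. The sketch below illustrates that approach. The name uniqueSlice is hypothetical, and the package may implement the function differently.

```go
package main

import "fmt"

// uniqueSlice returns the distinct items of the input, keeping the order
// of their first occurrence.
func uniqueSlice(items []string) []string {
	seen := make(map[string]struct{}, len(items))
	var out []string
	for _, item := range items {
		if _, ok := seen[item]; ok {
			continue // already emitted
		}
		seen[item] = struct{}{}
		out = append(out, item)
	}
	return out
}

func main() {
	fmt.Println(uniqueSlice([]string{"a", "b", "a", "b", "b", "c"})) // [a b c]
}
```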

Types ¶

type StringMetric ¶

type StringMetric interface {
	Compare(a, b string) float64
}

StringMetric represents a metric for measuring the similarity between strings. The metrics package implements the following string metrics:

Hamming
Jaro
Jaro-Winkler
Levenshtein
Smith-Waterman-Gotoh
Sorensen-Dice
Jaccard
Overlap coefficient

For more information see https://pkg.go.dev/github.com/adrg/strutil/metrics.

Source Files ¶

strutil.go

Directories ¶

Path
internal
	mathutil
	ngram
	stringutil
metrics
