levenshtein

package module
v0.4.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 29, 2023 License: MIT Imports: 6 Imported by: 1

README

levenshtein

The package calculates edit distance between two strings according to Levenshtein algorithm. It also can show differences beteen two strings. This package extends an excellent levenshtein implementation(https://github.com/agnivade/levenshtein) by @agnivade.

Installation

Installation of a binary file

You can download binary fzdiff from the latest release and install it somewhere in your PATH.

Installation with Go

If you have Go installed on your system, run:

go get github.com/gnames/levenshtein/fzdiff

Usage

Command line interface
  • Run fzdiff on two strings.

    fzdiff "Something" "smoething"
    # output:
    String1,String2,Tags1,Tags2,EditDistance,Aborted
    Something,smoething,,,3,false
    
  • Change output.

    fzdiff "Something" "smoething" -f compact
    {"string1":"Something","string2":"smoething","editDistance":3}
    
    
    fzdiff "Something" "smoething" -f pretty
    {
      "string1": "Something",
      "string2": "smoething",
      "editDistance": 3
    }
    
  • Run fzdiff with max edit distance constraint:

    fzdiff "Something" "smoething" -m 1
    String1,String2,Tags1,Tags2,EditDistance,Aborted
    Something,smoething,,,1,true
    
  • Run fzdiff with tags output:

    fzdiff "Something" "smoething" -t
    String1,String2,Tags1,Tags2,EditDistance,Aborted
    Something,smoething,<subst>Som</subst>ething,<subst>smo</subst>ething,3,false
    
  • Run fzdiff on a CSV file to compare the first 2 fields.

    fzdiff strings.csv -t
    
  • Run fzdiff using pipes with STDIN, STDOUT

    echo "Something,smoething" | fzdiff -t
    Id,Verbatim,Cardinality,CanonicalFull,CanonicalSimple,CanonicalStem,Authorship,Year,Quality
    Something,smoething,<subst>Som</subst>ething,<subst>smo</subst>ething,3,false
    
    # or
    
    cat strings.csv | fzdiff -t > diffs.csv
    
Usage as a library
package main

import (
 "fmt"

 "github.com/gnames/levenshtein"
)

func main() {
 opts := []levenshtein.Option{
  levenshtein.OptWithDiff(true),
  levenshtein.OptMaxEditDist(1),
 }
 l := levenshtein.NewLevenshtein(opts...)
 out := l.Compare("Something", "smoething")
 fmt.Printf("%+v\n", out)

 strs := []levenshtein.Strings{
  {String1: "Something", String2: "smoething"},
  {String1: "one", String2: "two"},
 }

 outs := l.CompareMult(strs)
 for _, out := range outs {
  fmt.Printf("%+v\n", out)
 }
}

Testing

From the root of the project:

go test ./...
Benchmarking

You need to install benchstat for more readable restuls.

cd ent/editdist
go test -bench=. -benchmem -count=10 -run=XXX > bench.txt
benchstat bench.txt

An example of the benchmarking:

cd ent/editdist
benchstat bench.txt

name                        time/op
Dist/CompareOnceMaxOff-8     740ns ± 1%
Dist/CompareOnceMax-8        580ns ± 1%
Dist/CompareDiffOffEqual-8  3.93ns ± 1%
Dist/CompareDiffOnEqual-8   3.85ns ± 5%
Dist/CompareDiffOff-8        488ns ± 1%
Dist/CompareDiffOn-8        2.37µs ± 6%

name                        alloc/op
Dist/CompareOnceMaxOff-8     32.0B ± 0%
Dist/CompareOnceMax-8        32.0B ± 0%
Dist/CompareDiffOffEqual-8   0.00B
Dist/CompareDiffOnEqual-8    0.00B
Dist/CompareDiffOff-8        32.0B ± 0%
Dist/CompareDiffOn-8        1.17kB ± 0%

name                        allocs/op
Dist/CompareOnceMaxOff-8      1.00 ± 0%
Dist/CompareOnceMax-8         1.00 ± 0%
Dist/CompareDiffOffEqual-8    0.00
Dist/CompareDiffOnEqual-8     0.00
Dist/CompareDiffOff-8         1.00 ± 0%
Dist/CompareDiffOn-8          7.00 ± 0%

License

Released under MIT license

Documentation

Index

Constants

This section is empty.

Variables

View Source
var (
	Version = "v0.4.0"
	Build   = "N/A"
)
View Source
var Batch = 10_000

Batch can be used to break a very large input into chunks.

Functions

func Example added in v0.1.0

func Example()

func NewLevenshtein

func NewLevenshtein(opts ...Option) levenshtein

NewLevenshtein returns an object that implements Levenshtein interface.

Types

type Levenshtein

type Levenshtein interface {
	// Compare calculates edit distance between two strings.
	Compare(str1, str2 string) presenter.Output

	// CompareMult, calculates edit distance between many name-strings.
	// The job is parallelized, and then reassembled in the same order as
	// the input.
	CompareMult(input []Strings) []presenter.Output

	// Option returns back options applied to the Levenshtein implementation.
	Opts() []Option
}

Levenshtein is the main API interface for the library. It provides methods for calculating edit distance between two strings, or between a slice of two strings.

type Option

type Option func(*levenshtein)

Option is an 'interface' for creating Options for Levensthsein, that modify its behavior.

func OptMaxEditDist

func OptMaxEditDist(i int) Option

OptMaxEditDist sets the maximum edit distance after which the progression of calculations will be aborted. Such constraint does not seem to be always beneficial.

func OptWithDiff

func OptWithDiff(b bool) Option

OptWithDiff if set to true will return restuls where difference between strings will be marked by "ins", "del", "subst" tags. The option will slowdown calculations 4-5 times.

type Strings

type Strings struct {
	String1, String2 string
}

Strings structure is used to feed CompareMult function with data.

Directories

Path Synopsis
ent
editdist
Package editdist includes a Levenshtein automaton as well as a traditional implementation to calculate Levenshtein Distance.
Package editdist includes a Levenshtein automaton as well as a traditional implementation to calculate Levenshtein Distance.
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL