approx
A clone of python fuzzysearch library in Go.
It is based off the v0.1.0 version of the Python fuzzysearch library by Tal Einat.
Synopsis
fuzzyfind uses a modified levenshtein algorithm to find approximate matches of a subsequence in a sequence. My reason for using this tools is for extracting regions of sequencing reads that might have mutations.
Benchmarks
Short Pattern Short Text:
BenchmarkShortPShortTApproxFind-8 300000 4301 ns/op
BenchmarkShortPShortTapproxPigeon-8 100000 13708 ns/op
Short Pattern Long Text:
BenchmarkShortPLongTApproxFind-8 100000 18044 ns/op
BenchmarkShortPLongTapproxPigeon-8 100000 17393 ns/op
Long Pattern Short Text:
BenchmarkLongPShortTApproxFind-8 300000 4567 ns/op
BenchmarkLongPShortTapproxPigeon-8 200000 10575 ns/op
Long Pattern Long Text:
BenchmarkLongPLongTApproxFind-8 100000 18173 ns/op
BenchmarkLongPLongTapproxPigeon-8 30000 46241 ns/op
This is a work in progress
For now just use the ApproxFind function, it has solid performance and is the most correct.
Install
go get github.com/sstadick/fuzzyfind
Usage in code
package main
import (
"github.com/davecgh/go-spew/spew"
"github.com/sstadick/fuzzyfind/approx"
)
func main() {
haystack := "GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACATTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACACAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAGGGGATAGG"
needle := "TGCACTGTAGGGATAACAAT" // distance 1
maxDist := 2
result, _ := approx.ApproxFind(needle, haystack, maxDist, approx.DefaultOptions)
spew.Dump(result)
}
Output
([]fuzzyfind.Match) (len=4 cap=4) {
(fuzzyfind.Match) {
Start: (int) 6,
End: (int) 24,
Dist: (int) 2
},
(fuzzyfind.Match) {
Start: (int) 5,
End: (int) 24,
Dist: (int) 1
},
(fuzzyfind.Match) {
Start: (int) 3,
End: (int) 24,
Dist: (int) 1
},
(fuzzyfind.Match) {
Start: (int) 3,
End: (int) 24,
Dist: (int) 2
}
}
If you want to .... just get going:
Use the approx method, it will choose the best method for you depending on your pattern and text sizes
If you want to .... match the same pattern against multiple texts:
This has yet to be implemented. It will likely use boyer moore to create a lookup table for the pattern.
If you want to .... match multiple patterns against the same text:
This has yet to be implemented. It will likely use a kmer index of the text
If you want to .... specifically use just the modified levenshtien algorithm:
Use ApproxFind. This should work best on short patterns.
if you want to .... specifically use the pigeonhole method:
use approxPigeon. This will only work when you have a len(pattern) / (maxDist + 1) >= 1
and should really only be used when greater than 3.
This is also not yet implemented.
Futher readings
Notes
- Allow a custom penalty matrix like smith-waterman and co?