zdpgo_sim

v0.0.0-...-9d185f5
Published: Aug 2, 2022 License: MIT Imports: 4 Imported by: 0

README

zdpgo_sim

Computes text similarity.

Supported algorithms

Reference project: https://github.com/antlabs/strsim

  • Levenshtein edit distance
  • Dice's coefficient
  • Jaro
  • JaroWinkler
  • Hamming
  • Cosine
  • Simhash

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Compare

func Compare(s1, s2 string, opts ...Option) float64

Compare compares the similarity of two strings.

func DamerauLevenshteinDistance

func DamerauLevenshteinDistance(s1, s2 string) int

DamerauLevenshteinDistance is an extension of the Levenshtein algorithm. It solves the edit distance problem between a source string and a target string with the following operations: character insertion, character deletion, character replacement, and adjacent character swap.

Read https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

func FindBestMatch

func FindBestMatch(s string, targets []string, opts ...Option) *similarity.MatchResult

FindBestMatch returns the string with the highest similarity, along with its index position.
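
The selection step described above can be sketched as a linear scan that keeps the highest score. This is only an illustration: `bestMatch`, `findBest` and `positionalScore` are hypothetical names, the fields of `similarity.MatchResult` are not shown in this documentation, and the toy scorer stands in for whichever algorithm the options select.

```go
package main

import "fmt"

// bestMatch is a hypothetical result type, not the package's MatchResult.
type bestMatch struct {
	Index int
	Value string
	Score float64
}

// findBest scores s against every target and keeps the highest score.
// The second return value is false when targets is empty.
func findBest(s string, targets []string, score func(a, b string) float64) (bestMatch, bool) {
	if len(targets) == 0 {
		return bestMatch{}, false
	}
	best := bestMatch{Index: 0, Value: targets[0], Score: score(s, targets[0])}
	for i := 1; i < len(targets); i++ {
		if sc := score(s, targets[i]); sc > best.Score {
			best = bestMatch{Index: i, Value: targets[i], Score: sc}
		}
	}
	return best, true
}

// positionalScore is a toy stand-in scorer: the share of byte positions
// at which a and b agree, relative to the longer string.
func positionalScore(a, b string) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	if longest == 0 {
		return 1
	}
	same := 0
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] == b[i] {
			same++
		}
	}
	return float64(same) / float64(longest)
}

func main() {
	m, ok := findBest("apple", []string{"maple", "apply", "grape"}, positionalScore)
	fmt.Println(m.Index, m.Value, ok) // prints: 1 apply true
}
```

On ties the scan keeps the earliest target; whether the package does the same is not stated here.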

func FindBestMatchOne

func FindBestMatchOne(s string, targets []string, opts ...Option) *similarity.Match

FindBestMatchOne returns the string with the highest similarity.

func JaroDistance

func JaroDistance(s1, s2 string) float32

JaroDistance computes the Jaro measure between two words, which is based on the number of characters the words have in common and the number of transpositions among those characters.
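
For orientation, the textbook Jaro similarity can be sketched as follows. This is not the package's implementation: the function name `jaro`, the float64 return type, and the match window of half the longer length minus one are assumptions taken from the usual definition.

```go
package main

import "fmt"

// jaro returns the textbook Jaro similarity in [0, 1].
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 || len(rb) == 0 {
		if len(ra) == len(rb) {
			return 1 // two empty strings are treated as identical
		}
		return 0
	}
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}
	matchedA := make([]bool, len(ra))
	matchedB := make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Walk the matched runes of both strings in order; each pair that
	// disagrees is half a transposition.
	half, k := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[k] {
			k++
		}
		if ra[i] != rb[k] {
			half++
		}
		k++
	}
	m, t := float64(matches), float64(half/2)
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-t)/m) / 3
}

func main() {
	fmt.Printf("%.4f\n", jaro("MARTHA", "MARHTA")) // prints 0.9444
}
```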

func JaroWinklerDistance

func JaroWinklerDistance(s1, s2 string, p float32) float32

JaroWinklerDistance uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length.

The p argument is a constant scaling factor that controls how much the score is adjusted upwards for common prefixes. The standard value for this constant in Winkler’s work is p = 0.1.
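
The role of p can be illustrated with the usual Winkler adjustment, score + l·p·(1 − score), where l is the length of the common prefix. The sketch below assumes a Jaro similarity in [0, 1] is already available and caps the prefix at 4 characters, as is conventional; the names `winklerAdjust` and `commonPrefixLen` and the cap are assumptions, not part of the package.

```go
package main

import "fmt"

// commonPrefixLen counts the leading runes shared by a and b, capped at limit.
func commonPrefixLen(a, b string, limit int) int {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && n < limit && ra[n] == rb[n] {
		n++
	}
	return n
}

// winklerAdjust raises a Jaro score according to the shared prefix.
// jaroScore is assumed to be a similarity in [0, 1].
func winklerAdjust(jaroScore float64, a, b string, p float64) float64 {
	const maxPrefix = 4 // assumed cap on the prefix length
	l := float64(commonPrefixLen(a, b, maxPrefix))
	return jaroScore + l*p*(1-jaroScore)
}

func main() {
	// 0.9444 is the Jaro similarity of MARTHA and MARHTA; the shared prefix is "MAR".
	fmt.Printf("%.4f\n", winklerAdjust(0.9444, "MARTHA", "MARHTA", 0.1)) // prints 0.9611
}
```

With a cap of 4, p should not exceed 0.25, otherwise the adjusted score can rise above 1.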

func Levenshtein

func Levenshtein(str1, str2 string, costIns, costRep, costDel int) int

Levenshtein computes the Levenshtein distance, levenshtein(), with a separate cost for each operation.

costIns defines the cost of insertion.
costRep defines the cost of replacement.
costDel defines the cost of deletion.
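
A minimal sketch of an edit distance with per-operation costs is shown below. It works on runes and keeps two rows of the dynamic programming table; the package's own function may differ (for example by working on bytes), and the names `weightedLevenshtein` and `minInt` are illustrative.

```go
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// weightedLevenshtein returns the cheapest cost of turning src into dst
// when insertion, replacement and deletion each have their own cost.
func weightedLevenshtein(src, dst string, costIns, costRep, costDel int) int {
	a, b := []rune(src), []rune(dst)
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j * costIns // build dst[:j] from the empty string
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i * costDel // delete all of src[:i]
		for j := 1; j <= len(b); j++ {
			rep := prev[j-1]
			if a[i-1] != b[j-1] {
				rep += costRep
			}
			cur[j] = minInt(rep, minInt(prev[j]+costDel, cur[j-1]+costIns))
		}
		prev = cur
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(weightedLevenshtein("kitten", "sitting", 1, 1, 1)) // prints 3
	// With replacement dearer than delete plus insert, replacements are avoided.
	fmt.Println(weightedLevenshtein("kitten", "sitting", 1, 3, 1)) // prints 5
}
```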

func LevenshteinDistance

func LevenshteinDistance(s1, s2 string) int

LevenshteinDistance is the minimum number of single-character edits required to change one word into the other. The result is a non-negative integer that is sensitive to string length, which makes it harder to draw patterns from.

Read https://github.com/mhutter/string-similarity and https://en.wikipedia.org/wiki/Levenshtein_distance
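
The length sensitivity mentioned above shows up when two pairs of very different length share the same distance. The sketch below computes the plain distance and, as one possible remedy, divides it by the longer length; this normalization is an assumption for illustration and is not documented as something the package does.

```go
package main

import "fmt"

// levenshtein returns the number of single-rune insertions, deletions
// and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			best := prev[j-1] // substitution, or free when the runes match
			if ra[i-1] != rb[j-1] {
				best++
			}
			if prev[j]+1 < best {
				best = prev[j] + 1 // deletion
			}
			if cur[j-1]+1 < best {
				best = cur[j-1] + 1 // insertion
			}
			cur[j] = best
		}
		prev = cur
	}
	return prev[len(rb)]
}

// normalizedSimilarity maps the distance to [0, 1] using the longer length.
func normalizedSimilarity(a, b string) float64 {
	la, lb := len([]rune(a)), len([]rune(b))
	longest := la
	if lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	// Both pairs are 3 edits apart, yet they are not equally similar.
	fmt.Println(levenshtein("cat", "dog"), levenshtein("kitten", "sitting"))
	fmt.Printf("%.2f %.2f\n", normalizedSimilarity("cat", "dog"), normalizedSimilarity("kitten", "sitting"))
}
```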

func SimilarText

func SimilarText(first, second string, percent *float64) int

SimilarText implements PHP's similar_text function, which compares the similarity of two texts.
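
PHP's similar_text finds the longest common substring, then recurses on the pieces to its left and to its right and sums the lengths. A byte-based sketch of that idea follows; the names `similarText` and `similarTextPercent` are illustrative, and returning the percentage as a second value instead of writing through a pointer is a simplification of the signature above.

```go
package main

import "fmt"

// similarText sums the lengths of the common substrings found by
// repeatedly splitting around the first longest common substring.
func similarText(a, b string) int {
	if len(a) == 0 || len(b) == 0 {
		return 0
	}
	posA, posB, best := 0, 0, 0
	for i := 0; i < len(a); i++ {
		for j := 0; j < len(b); j++ {
			k := 0
			for i+k < len(a) && j+k < len(b) && a[i+k] == b[j+k] {
				k++
			}
			if k > best {
				best, posA, posB = k, i, j
			}
		}
	}
	if best == 0 {
		return 0
	}
	return best +
		similarText(a[:posA], b[:posB]) +
		similarText(a[posA+best:], b[posB+best:])
}

// similarTextPercent also reports the score as a percentage of the
// combined length of both inputs.
func similarTextPercent(a, b string) (int, float64) {
	sim := similarText(a, b)
	if len(a)+len(b) == 0 {
		return sim, 0
	}
	return sim, float64(sim*2) * 100 / float64(len(a)+len(b))
}

func main() {
	sim, pct := similarTextPercent("World", "Word")
	fmt.Printf("%d %.2f\n", sim, pct) // prints 4 88.89
}
```

Because the split happens at the first longest match, the result can depend on argument order: in this sketch "test" against "wert" gives 1, while "wert" against "test" gives 2.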

func TrigramCompare

func TrigramCompare(s1, s2 string) float32

TrigramCompare compares two strings by their trigrams. A trigram is a special case of an n-gram: a contiguous sequence of n (here, three) items from a given sample. In the original use case the sample is an application name and each item is a character.
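
One common way to turn trigrams into a score is to collect the distinct trigrams of each string and divide the number they share by the number they have in total (Jaccard similarity). The sketch below does that; the exact formula used by TrigramCompare is not given in this documentation, so the scoring and the names `trigrams` and `trigramSimilarity` are assumptions.

```go
package main

import "fmt"

// trigrams returns the set of distinct three-rune windows in s.
func trigrams(s string) map[string]struct{} {
	r := []rune(s)
	set := make(map[string]struct{})
	for i := 0; i+3 <= len(r); i++ {
		set[string(r[i:i+3])] = struct{}{}
	}
	return set
}

// trigramSimilarity is the Jaccard similarity of the two trigram sets.
func trigramSimilarity(a, b string) float64 {
	ta, tb := trigrams(a), trigrams(b)
	shared := 0
	for g := range ta {
		if _, ok := tb[g]; ok {
			shared++
		}
	}
	union := len(ta) + len(tb) - shared
	if union == 0 {
		return 0 // neither string is long enough to contain a trigram
	}
	return float64(shared) / float64(union)
}

func main() {
	// "hello" has hel, ell, llo; "help" has hel, elp; one of four is shared.
	fmt.Println(trigramSimilarity("hello", "help")) // prints 0.25
}
```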

Types

type Option

type Option interface {
	Apply(*option) // the argument is an object, and that object must have an Apply method
}

Option is the option interface.

type OptionFunc

type OptionFunc func(*option)

OptionFunc is the function type for options.

func Cosine

func Cosine() OptionFunc

Cosine: CosineConf is the configuration struct for cosine similarity.
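
As background, cosine similarity treats each text as a vector of term counts and measures the angle between the two vectors. The sketch below assumes whitespace-separated words as terms; how the package tokenizes its input is not stated here, and `termCounts` and `cosine` are illustrative names.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// termCounts builds a term-frequency vector from whitespace-separated words.
func termCounts(s string) map[string]float64 {
	counts := make(map[string]float64)
	for _, w := range strings.Fields(s) {
		counts[w]++
	}
	return counts
}

// cosine returns dot(a, b) / (|a| * |b|) over the two term vectors.
func cosine(a, b string) float64 {
	va, vb := termCounts(a), termCounts(b)
	var dot, normA, normB float64
	for w, x := range va {
		dot += x * vb[w] // a missing term contributes zero
		normA += x * x
	}
	for _, y := range vb {
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	fmt.Printf("%.2f\n", cosine("the quick fox", "the lazy fox"))
}
```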

func Default

func Default() OptionFunc

Default returns the default options.

func DiceCoefficient

func DiceCoefficient(ngram ...int) OptionFunc

DiceCoefficient: ngram is a value required by Dice's coefficient.
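
To show what the ngram value controls, here is a sketch of the Sørensen–Dice coefficient over character n-grams: twice the number of shared n-grams divided by the total number of n-grams in both strings. Falling back to n = 2 for invalid sizes is an assumption, as are the names `ngramCounts` and `dice`.

```go
package main

import "fmt"

// ngramCounts returns how often each n-rune window occurs in s,
// together with the total number of windows.
func ngramCounts(s string, n int) (map[string]int, int) {
	r := []rune(s)
	counts := make(map[string]int)
	total := 0
	for i := 0; i+n <= len(r); i++ {
		counts[string(r[i:i+n])]++
		total++
	}
	return counts, total
}

// dice returns 2*|shared| / (|A| + |B|) over the n-gram multisets.
func dice(a, b string, n int) float64 {
	if n < 1 {
		n = 2 // assumed default: bigrams
	}
	ca, totalA := ngramCounts(a, n)
	cb, totalB := ngramCounts(b, n)
	if totalA+totalB == 0 {
		return 0
	}
	shared := 0
	for g, na := range ca {
		nb := cb[g]
		if nb < na {
			shared += nb
		} else {
			shared += na
		}
	}
	return 2 * float64(shared) / float64(totalA+totalB)
}

func main() {
	// night: ni, ig, gh, ht; nacht: na, ac, ch, ht; only "ht" is shared.
	fmt.Println(dice("night", "nacht", 2)) // prints 0.25
}
```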

func Hamming

func Hamming() OptionFunc

Hamming: Hamming distance.
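
Hamming distance counts the positions at which two strings of equal length differ. A minimal sketch follows; rejecting inputs of different length with an error is an assumption about how to handle that case, not the package's documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// hamming counts the rune positions at which a and b differ.
// It returns an error when the inputs have different lengths.
func hamming(a, b string) (int, error) {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return 0, errors.New("hamming: inputs must have equal length")
	}
	diff := 0
	for i := range ra {
		if ra[i] != rb[i] {
			diff++
		}
	}
	return diff, nil
}

func main() {
	d, err := hamming("karolin", "kathrin")
	fmt.Println(d, err) // prints: 3 <nil>
}
```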

func IgnoreCase

func IgnoreCase() OptionFunc

IgnoreCase ignores letter case.

func IgnoreSpace

func IgnoreSpace() OptionFunc

IgnoreSpace ignores whitespace characters.

func Jaro

func Jaro(matchWindow ...int) OptionFunc

Jaro selects the Jaro algorithm; matchWindow is an optional value used by it.

func JaroWinkler

func JaroWinkler(matchWindow ...int) OptionFunc

JaroWinkler selects the Jaro-Winkler algorithm; matchWindow is an optional value used by it.

func SimHash

func SimHash() OptionFunc

func UseASCII

func UseASCII() OptionFunc

UseASCII uses ASCII encoding.

func UseBase64

func UseBase64() OptionFunc

UseBase64 uses base64 encoding.

func (OptionFunc) Apply

func (o OptionFunc) Apply(opt *option)

Apply executes the option function.
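
Option, OptionFunc and Apply together form the functional options pattern. The sketch below reproduces the pattern in isolation: the fields of the unexported option struct and the `normalize` helper are hypothetical, since the package's internals are not shown in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// option holds the settings; these two fields are hypothetical.
type option struct {
	ignoreCase  bool
	ignoreSpace bool
}

// Option is anything that can modify the settings.
type Option interface {
	Apply(*option)
}

// OptionFunc adapts a plain function to the Option interface.
type OptionFunc func(*option)

// Apply calls the function on the settings.
func (f OptionFunc) Apply(o *option) { f(o) }

func IgnoreCase() OptionFunc  { return func(o *option) { o.ignoreCase = true } }
func IgnoreSpace() OptionFunc { return func(o *option) { o.ignoreSpace = true } }

// normalize applies every option to a zero-valued option struct and then
// preprocesses s accordingly (hypothetical helper).
func normalize(s string, opts ...Option) string {
	var o option
	for _, opt := range opts {
		opt.Apply(&o)
	}
	if o.ignoreCase {
		s = strings.ToLower(s)
	}
	if o.ignoreSpace {
		s = strings.Join(strings.Fields(s), "")
	}
	return s
}

func main() {
	fmt.Println(normalize("Hello World", IgnoreCase(), IgnoreSpace())) // prints helloworld
}
```

The benefit of the pattern is that a variadic opts parameter, as in Compare, can gain new settings without changing its signature.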

type StringDiff

type StringDiff struct {
	S1 string
	S2 string
}

StringDiff is a utility struct to compare the similarity between two strings.

Read https://medium.com/@appaloosastore/string-similarity-algorithms-compared-3f7b4d12f0ff

func NewStringDiff

func NewStringDiff(s1, s2 string) *StringDiff

NewStringDiff will create a new instance of StringDiff

func (*StringDiff) DamerauLevenshteinDistance

func (sd *StringDiff) DamerauLevenshteinDistance(deleteCost, insertCost,
	replaceCost, swapCost int) int

DamerauLevenshteinDistance Algorithm is an extension to the Levenshtein Algorithm which solves the edit distance problem between a source string and a target string with the following operations:

- Character Insertion
- Character Deletion
- Character Replacement
- Adjacent Character Swap

Note that the adjacent character swap operation is an edit that may be applied when two adjacent characters in the source string match two adjacent characters in the target string, but in reverse order, rather than a general allowance for adjacent character swaps.

This implementation allows the client to specify the costs of the various edit operations with the restriction that the cost of two swap operations must not be less than the cost of a delete operation followed by an insert operation. This restriction is required to preclude two swaps involving the same character being required for optimality which, in turn, enables a fast dynamic programming solution.

The running time of the Damerau-Levenshtein algorithm is O(n*m) where n is the length of the source string and m is the length of the target string. This implementation consumes O(n*m) space.

This code is an adaptation from https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java
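
For a feel of how the swap operation enters the recurrence, here is a sketch of the simpler optimal string alignment variant with unit costs. It is not the adapted algorithm described above: it has no configurable costs, and it never edits a swapped pair again, so it can return a larger value than the full Damerau-Levenshtein distance. The name `osaDistance` is illustrative.

```go
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// osaDistance is the optimal string alignment distance with unit costs:
// Levenshtein plus a swap of two adjacent runes.
func osaDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := minInt(d[i-1][j]+1, d[i][j-1]+1) // deletion, insertion
			best = minInt(best, d[i-1][j-1]+cost)    // replacement or match
			// Swap: the last two runes of each prefix match in reverse order.
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				best = minInt(best, d[i-2][j-2]+1)
			}
			d[i][j] = best
		}
	}
	return d[len(ra)][len(rb)]
}

func main() {
	fmt.Println(osaDistance("ab", "ba"), osaDistance("ca", "abc")) // prints: 1 3
}
```

Like the implementation described above, this sketch uses a full table and therefore O(n*m) time and space.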

func (*StringDiff) JaroDistance

func (sd *StringDiff) JaroDistance() float32

JaroDistance computes the Jaro measure between two words, which is based on the number of characters the words have in common and the number of transpositions among those characters.

func (*StringDiff) JaroWinklerDistance

func (sd *StringDiff) JaroWinklerDistance(p float32) float32

JaroWinklerDistance uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length.

The p argument is a constant scaling factor that controls how much the score is adjusted upwards for common prefixes. The standard value for this constant in Winkler’s work is p = 0.1.

Read https://github.com/flori/amatch
Read https://fr.wikipedia.org/wiki/Distance_de_Jaro-Winkler
Read https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

func (*StringDiff) LevenshteinDistance

func (sd *StringDiff) LevenshteinDistance() int

LevenshteinDistance is the minimum number of single-character edits required to change one word into the other. The result is a non-negative integer that is sensitive to string length, which makes it harder to draw patterns from.

Read https://github.com/mhutter/string-similarity and https://en.wikipedia.org/wiki/Levenshtein_distance

func (*StringDiff) TrigramCompare

func (sd *StringDiff) TrigramCompare() float32

TrigramCompare compares the two strings by their trigrams. A trigram is a special case of an n-gram: a contiguous sequence of n (here, three) items from a given sample. In the original use case the sample is an application name and each item is a character.

Read https://github.com/milk1000cc/trigram/blob/master/lib/trigram.rb
Read http://search.cpan.org/dist/String-Trigram/Trigram.pm
Read https://en.wikipedia.org/wiki/N-gram

Directories

Path Synopsis
examples
simhash package implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.
