Documentation
Overview
Package textsimilarity provides features to analyze files for similarities between them, such as lines of text copied and pasted.
Constants
const (
	// IgnoreWhitespaceFlag specifies that leading and trailing whitespace of text lines should be ignored.
	IgnoreWhitespaceFlag = Flag(1 << iota)

	// IgnoreBlankLinesFlag specifies that blank lines should be ignored.
	IgnoreBlankLinesFlag
)
const (
	// SimilarSimilarityLevel is the similarity level used for lines or occurrences that are similar, but not completely equal.
	SimilarSimilarityLevel

	// EqualSimilarityLevel is the similarity level used for lines or occurrences that are completely equal.
	EqualSimilarityLevel
)
const DefaultMaxEditDistance = 5
DefaultMaxEditDistance is the Levenshtein distance used when Options.MaxEditDistance <= 0.
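The fallback rule and the distance measure can be illustrated with a minimal sketch. The names defaultMaxEditDistance, effectiveMaxEditDistance and levenshtein below are hypothetical and local to the example; the package's own implementation is not shown in this documentation, and counting the distance in runes rather than bytes is an assumption.

```go
package main

import "fmt"

// defaultMaxEditDistance mirrors the documented DefaultMaxEditDistance value.
const defaultMaxEditDistance = 5

// effectiveMaxEditDistance is a hypothetical helper that applies the documented
// rule: a configured value <= 0 falls back to the default.
func effectiveMaxEditDistance(configured int) int {
	if configured <= 0 {
		return defaultMaxEditDistance
	}
	return configured
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the edit distance between a and b, counted in runes
// (an assumption; the package may count differently).
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

func main() {
	limit := effectiveMaxEditDistance(0)
	d := levenshtein("return foo(a, b)", "return foo(a, c)")
	fmt.Println("limit:", limit, "distance:", d, "similar:", d <= limit)
}
```

Under this reading, two lines count as "similar" when their distance does not exceed the effective limit.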
Variables
This section is empty.
Functions
func Similarities
func Similarities(ctx context.Context, files []*File, opts *Options) (<-chan *Similarity, <-chan Progress, error)
Similarities scans files for similarities between them, according to opts. Detected similarities are sent on the first returned channel, and progress is reported on the second. The caller must drain both channels.
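Because both channels must be drained, a caller typically reads them concurrently. The sketch below shows one way to do that. It does not call the real package: similarity, progress and fakeSimilarities are simplified, hypothetical stand-ins with the same channel shape, and the sketch assumes both channels are closed once the scan finishes, which this documentation does not state explicitly.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// similarity and progress are simplified stand-ins for the package's
// Similarity and Progress types; they are not the real definitions.
type similarity struct{ Lines int }

type progress struct {
	Done float64
	Err  error
}

// fakeSimilarities is a hypothetical producer shaped like Similarities.
// It sends n results and n progress updates, then closes both channels.
func fakeSimilarities(ctx context.Context, n int) (<-chan *similarity, <-chan progress, error) {
	simCh := make(chan *similarity)
	progCh := make(chan progress)
	go func() {
		defer close(simCh)
		defer close(progCh)
		for i := 1; i <= n; i++ {
			select {
			case simCh <- &similarity{Lines: i}:
			case <-ctx.Done():
				return
			}
			select {
			case progCh <- progress{Done: float64(i) / float64(n)}:
			case <-ctx.Done():
				return
			}
		}
	}()
	return simCh, progCh, nil
}

// drain reads both channels until they are closed. Progress is consumed in
// its own goroutine so that neither channel can block the producer.
func drain(simCh <-chan *similarity, progCh <-chan progress) (sims int, done float64, err error) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for p := range progCh {
			done = p.Done
			if p.Err != nil && err == nil {
				err = p.Err // remember the first reported error
			}
		}
	}()
	for range simCh {
		sims++
	}
	wg.Wait() // done and err are safe to read after this point
	return sims, done, err
}

// runDemo wires the stand-in producer to drain.
func runDemo(n int) (int, float64, error) {
	simCh, progCh, err := fakeSimilarities(context.Background(), n)
	if err != nil {
		return 0, 0, err
	}
	return drain(simCh, progCh)
}

func main() {
	sims, done, err := runDemo(3)
	fmt.Println("similarities:", sims, "done:", done, "err:", err)
}
```

Reading only one of the two channels in a single loop risks stalling an unbuffered producer, which is why the sketch uses a separate goroutine for progress.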
Types
type File
type File struct {
	// Name is an arbitrary name for the file.
	Name string

	// R is the reader from which the file's contents are read. The contents are expected to be UTF-8 text.
	R io.Reader
	// contains filtered or unexported fields
}
A File is a source of text lines read from a Reader.
type FileOccurrence
type FileOccurrence struct {
	// File is the file the range of text was found in.
	File *File

	// Start is the starting line number (zero-based).
	Start int

	// End is the ending line number (zero-based, exclusive).
	End int
	// contains filtered or unexported fields
}
A FileOccurrence is a range of text within a single File.
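Since Start is zero-based and End is exclusive, the range maps directly onto a Go slice expression. The following sketch uses a hypothetical stand-in type, occurrence, with only those two fields; linesOf and displayRange are illustrative helpers, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// occurrence is a stand-in carrying only the Start and End fields
// documented for FileOccurrence.
type occurrence struct {
	Start int // zero-based
	End   int // zero-based, exclusive
}

// linesOf returns the lines of text covered by occ.
func linesOf(text string, occ occurrence) []string {
	lines := strings.Split(text, "\n")
	return lines[occ.Start:occ.End]
}

// displayRange converts the range to one-based, inclusive line numbers,
// as an editor would show them.
func displayRange(occ occurrence) (first, last int) {
	return occ.Start + 1, occ.End
}

func main() {
	text := "alpha\nbeta\ngamma\ndelta"
	occ := occurrence{Start: 1, End: 3}
	first, last := displayRange(occ)
	fmt.Printf("lines %d-%d: %q\n", first, last, linesOf(text, occ))
}
```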
type Flag
type Flag uint8
A Flag is a single flag (a single set bit), or a set of flags (multiple set bits), depending on the context.
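A short sketch of how single flags combine into a set. The Flag type and the two constants are redeclared locally to mirror the declarations documented above so that the example is self-contained; the has method is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Flag and the constants below are local copies of the documented declarations.
type Flag uint8

const (
	IgnoreWhitespaceFlag = Flag(1 << iota) // bit 0
	IgnoreBlankLinesFlag                   // bit 1
)

// has reports whether every bit of flag is set in f (hypothetical helper).
func (f Flag) has(flag Flag) bool {
	return f&flag == flag
}

func main() {
	// Combine single flags into a set with bitwise OR.
	flags := IgnoreWhitespaceFlag | IgnoreBlankLinesFlag
	fmt.Println(flags.has(IgnoreWhitespaceFlag), flags.has(IgnoreBlankLinesFlag))

	// Clear one flag from the set with AND NOT.
	flags &^= IgnoreBlankLinesFlag
	fmt.Println(flags.has(IgnoreWhitespaceFlag), flags.has(IgnoreBlankLinesFlag))
}
```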
type Options
type Options struct {
	// Flags is a set of flags specifying different behaviour in determining similarities, such as ignoring
	// whitespace or blank lines.
	Flags Flag

	// MinLineLength is the minimum length of a line to be considered (in runes). Shorter lines will be ignored.
	MinLineLength int

	// MinSimilarLines is the minimum number of lines a similarity between files must have. Similarities with
	// fewer lines will not be reported.
	MinSimilarLines int

	// MaxEditDistance is the maximum Levenshtein distance between similar lines that will be considered "similar."
	// Lines that have a larger distance between them will be considered different.
	MaxEditDistance int

	// IgnoreLineRegex, if set, is an expression that a line must match to be ignored. Note that leading/trailing
	// whitespace on lines as well as blank lines may be ignored by using Flags.
	IgnoreLineRegex *regexp.Regexp
}
Options specifies several options for determining similarities.
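One plausible reading of how the per-line options interact is sketched below. The types lineFlag and lineOptions are local stand-ins for Flag and a subset of Options, keepLine and demoKeep are hypothetical helpers, and the order in which the checks are applied is an assumption rather than the package's documented behaviour.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// lineFlag and lineOptions are local stand-ins for Flag and a subset of Options.
type lineFlag uint8

const (
	ignoreWhitespace = lineFlag(1 << iota)
	ignoreBlankLines
)

type lineOptions struct {
	Flags           lineFlag
	MinLineLength   int
	IgnoreLineRegex *regexp.Regexp
}

// keepLine reports whether a line would take part in the comparison.
// The order of the checks is an assumption made for this sketch.
func keepLine(line string, opts *lineOptions) bool {
	if opts.Flags&ignoreWhitespace != 0 {
		line = strings.TrimSpace(line)
	}
	if opts.Flags&ignoreBlankLines != 0 && strings.TrimSpace(line) == "" {
		return false
	}
	// Length is measured in runes, not bytes.
	if utf8.RuneCountInString(line) < opts.MinLineLength {
		return false
	}
	if opts.IgnoreLineRegex != nil && opts.IgnoreLineRegex.MatchString(line) {
		return false
	}
	return true
}

// demoKeep applies a fixed example configuration to a single line.
func demoKeep(line string) bool {
	opts := &lineOptions{
		Flags:           ignoreWhitespace | ignoreBlankLines,
		MinLineLength:   3,
		IgnoreLineRegex: regexp.MustCompile(`^\s*//`), // skip comment lines
	}
	return keepLine(line, opts)
}

func main() {
	for _, line := range []string{"x := 1", "  // comment", "}", "   ", "héé"} {
		fmt.Printf("%q kept: %v\n", line, demoKeep(line))
	}
}
```

MinSimilarLines and MaxEditDistance apply to whole matches and line pairs respectively, so they are outside the scope of this per-line sketch.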
type Progress
type Progress struct {
	// File is the file that has just been processed.
	File *File

	// Done is the overall progress as a fraction from 0 to 1.
	Done float64

	// ETA is an estimate of the time of completion.
	ETA time.Time

	Err error
}
Progress is reported when determining similarities.
type Similarity
type Similarity struct {
	// Occurrences is a set of text ranges in files.
	Occurrences []*FileOccurrence

	// Level is the level of similarity between Occurrences.
	Level SimilarityLevel
}
A Similarity is a match of ranges of text between different Files.
type SimilarityLevel
type SimilarityLevel int
SimilarityLevel is the level of similarity between ranges of text.