textsimilarity

v0.2.2
Published: Feb 7, 2024 License: MIT Imports: 17 Imported by: 0

README

textsimilarity

A Go package to analyze files for copied and pasted (and possibly slightly modified) text.

It finds not only exactly equal occurrences but also occurrences with slight modifications, such as renamed variables in source code.

Command Line Usage

The cmd/textsimilarity/ folder provides a standalone command line utility. You can build it like this:

go build -o textsimilarity ./cmd/textsimilarity/

To install globally, run the following:

go install github.com/blizzy78/textsimilarity/cmd/textsimilarity@latest

You can also run it directly without installing it:

go run github.com/blizzy78/textsimilarity/cmd/textsimilarity@latest ...

Usage Example

After installation, you can run it against the restic source code, using icdiff as the diff tool:

$ git clone https://github.com/restic/restic.git
$ cd restic

$ textsimilarity -progress \
	-ignoreWS -ignoreBlank -minLen 6 -minLines 10 -maxDist 3 \
	-ignoreRE '^(package|import|[ \t]*(\*|//))' \
	-diffTool 'icdiff --cols=150 --tabsize=4 --no-headers -W {{.File1}} {{.File2}}' \
	-ignoreDiffToolRC \
	-printEqual \
	$(find . -type f -name '*.go' | grep -Ev '_test\.go|build\.go|helpers/')

This produces the following output (only part of it is shown):

[Screenshot: Sample Output]

License

This package is licensed under the MIT license.

Documentation

Overview

Package textsimilarity provides features to analyze files for similarities between them, such as lines of text copied and pasted.

Index

Constants

const (
	// IgnoreWhitespaceFlag specifies that leading and trailing whitespace of text lines should be ignored.
	IgnoreWhitespaceFlag = Flag(1 << iota)

	// IgnoreBlankLinesFlag specifies that blank lines should be ignored.
	IgnoreBlankLinesFlag
)
const (

	// SimilarSimilarityLevel is the similarity level used for lines or occurrences that are similar, but not completely equal.
	SimilarSimilarityLevel

	// EqualSimilarityLevel is the similarity level used for lines or occurrences that are completely equal.
	EqualSimilarityLevel
)
const DefaultMaxEditDistance = 5

DefaultMaxEditDistance is the Levenshtein distance used when Options.MaxEditDistance <= 0.
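To make the threshold concrete, here is a minimal sketch of Levenshtein distance together with the documented fallback to the default. The helpers `levenshtein` and `effectiveMaxDist` are hypothetical and written only for illustration; the package's own implementation is not shown in this documentation and may differ, for example in whether it counts runes or bytes.

```go
package main

import "fmt"

// defaultMaxEditDistance mirrors the documented DefaultMaxEditDistance constant.
const defaultMaxEditDistance = 5

// effectiveMaxDist applies the documented fallback: values <= 0 select the default.
func effectiveMaxDist(maxDist int) int {
	if maxDist <= 0 {
		return defaultMaxEditDistance
	}
	return maxDist
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the number of single-rune insertions, deletions, and
// substitutions needed to turn a into b (two-row dynamic programming).
// Counting in runes is an assumption made for this sketch.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	// Two lines that differ only in renamed variables.
	d := levenshtein("x := foo(a)", "y := foo(b)")
	fmt.Println(d, d <= effectiveMaxDist(0)) // prints "2 true"
}
```

Two substituted characters give a distance of 2, which is within the default of 5, so under these assumptions the two lines would count as similar rather than different.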

Variables

This section is empty.

Functions

func Similarities

func Similarities(ctx context.Context, files []*File, opts *Options) (<-chan *Similarity, <-chan Progress, error)

Similarities scans files for similarities between them, according to opts. Detected similarities will be sent into the returned channel. Progress is reported via the returned progress channel. Both channels must be drained by the caller.
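Because both channels must be drained, a caller generally has to receive from them concurrently rather than one after the other. The following is a minimal sketch of one way to do that with `select`. It assumes both channels are closed once the scan finishes, which the documentation above does not state explicitly. `drainBoth` is a hypothetical helper, and the channels in `main` are stand-ins for the two channels returned by Similarities.

```go
package main

import "fmt"

// drainBoth receives from both channels until each one has been closed.
// A closed channel is set to nil so that select no longer chooses it.
func drainBoth[S, P any](sims <-chan S, progress <-chan P, onSim func(S), onProgress func(P)) {
	for sims != nil || progress != nil {
		select {
		case s, ok := <-sims:
			if !ok {
				sims = nil
				continue
			}
			onSim(s)
		case p, ok := <-progress:
			if !ok {
				progress = nil
				continue
			}
			onProgress(p)
		}
	}
}

func main() {
	// Stand-in channels with made-up values, used only to exercise the helper.
	sims := make(chan string)
	progress := make(chan float64)

	go func() {
		defer close(sims)
		defer close(progress)
		progress <- 0.5
		sims <- "a.go and b.go share a range"
		progress <- 1.0
	}()

	drainBoth[string, float64](sims, progress,
		func(s string) { fmt.Println("similarity:", s) },
		func(p float64) { fmt.Printf("progress: %.0f%%\n", p*100) },
	)
}
```

With the real package, the two arguments would be the `<-chan *Similarity` and `<-chan Progress` values returned by Similarities, and the progress callback would be a natural place to check the `Err` field of each Progress value.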

Types

type File

type File struct {
	// Name is an arbitrary name for the file.
	Name string

	// R is read to obtain the file's contents, which are expected to be UTF-8 text.
	R io.Reader
	// contains filtered or unexported fields
}

A File is a source of text lines read from a Reader.

type FileOccurrence

type FileOccurrence struct {
	// File is the file the range of text was found in.
	File *File

	// Start is the starting line number (zero-based.)
	Start int

	// End is the ending line number (zero-based, exclusive.)
	End int
	// contains filtered or unexported fields
}

A FileOccurrence is a range of text within a single File.
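Since Start is inclusive and End is exclusive, the range maps directly onto a Go slice expression. Here is a minimal sketch, assuming the file's lines are already held in a slice; the `occurrence` type is a local stand-in that copies only the two documented fields, and the `lines` and `display` methods are hypothetical.

```go
package main

import "fmt"

// occurrence mirrors the documented Start and End fields of FileOccurrence.
type occurrence struct {
	Start int // zero-based, inclusive
	End   int // zero-based, exclusive
}

// lines returns the lines covered by o using the half-open range [Start, End).
func (o occurrence) lines(all []string) []string {
	return all[o.Start:o.End]
}

// display formats the range with one-based, inclusive line numbers,
// which is how most editors number lines.
func (o occurrence) display() string {
	return fmt.Sprintf("%d-%d", o.Start+1, o.End)
}

func main() {
	all := []string{"a", "b", "c", "d", "e"}
	o := occurrence{Start: 1, End: 4}
	fmt.Println(o.lines(all), o.display()) // prints "[b c d] 2-4"
}
```

The number of lines in a range is simply `End - Start`, with no off-by-one adjustment.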

type Flag

type Flag uint8

A Flag is a single flag (a single set bit), or a set of flags (multiple set bits), depending on the context.
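As a short illustration of the bit-set idea, flags can be combined with `|` and tested with `&`. The sketch below redeclares Flag and the two documented constants locally so that it runs on its own; `has` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Flag and the two constants are local copies of the documented declarations.
type Flag uint8

const (
	IgnoreWhitespaceFlag = Flag(1 << iota)
	IgnoreBlankLinesFlag
)

// has reports whether every bit of want is set in f.
func has(f, want Flag) bool {
	return f&want == want
}

func main() {
	// Combine two single flags into a set.
	flags := IgnoreWhitespaceFlag | IgnoreBlankLinesFlag

	fmt.Println(has(flags, IgnoreBlankLinesFlag))                // true
	fmt.Println(has(IgnoreWhitespaceFlag, IgnoreBlankLinesFlag)) // false
}
```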

type Options

type Options struct {
	// Flags is a set of flags specifying different behaviour in determining similarities, such as ignoring whitespace or blank lines.
	Flags Flag

	// MinLineLength is the minimum length of a line to be considered (in runes.) Lines shorter than that will be ignored.
	MinLineLength int

	// MinSimilarLines is the minimum number of lines a similarity between files must have. Similarities with
	// fewer lines will not be reported.
	MinSimilarLines int

	// MaxEditDistance is the maximum Levenshtein distance between similar lines that will be considered "similar."
	// Lines that have a larger distance between them will be considered different.
	MaxEditDistance int

	// IgnoreLineRegex, if set, is an expression that a line must match to be ignored. Note that leading/trailing
	// whitespace on lines as well as blank lines may be ignored by using Flags.
	IgnoreLineRegex *regexp.Regexp
}

Options specifies several options for determining similarities.
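To show what a value for IgnoreLineRegex might look like, the sketch below compiles the pattern from the command line usage example and tests a few lines against it. It assumes the `-ignoreRE` flag populates this field and that lines are tested with an unanchored match such as `MatchString`; neither is confirmed by this documentation.

```go
package main

import (
	"fmt"
	"regexp"
)

// ignoreRE is the pattern passed to -ignoreRE in the usage example.
var ignoreRE = regexp.MustCompile(`^(package|import|[ \t]*(\*|//))`)

func main() {
	lines := []string{
		"package main",
		"\t// a comment",
		" * part of a block comment",
		"x := 1",
		"important := true", // starts with "import", so it matches as well
	}
	for _, line := range lines {
		fmt.Printf("%q ignored=%v\n", line, ignoreRE.MatchString(line))
	}
}
```

Note that the pattern has no word boundary after `package` or `import`, so it also matches lines that merely begin with those letters. Appending `\b` to those alternatives would tighten it.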

type Progress

type Progress struct {
	// File is the file that has just been processed.
	File *File

	// Done is the overall progress as a fraction from 0 to 1.
	Done float64

	// ETA is an estimate of the time of completion.
	ETA time.Time

	Err error
}

Progress is reported when determining similarities.

type Similarity

type Similarity struct {
	// Occurrences is a set of text ranges in files.
	Occurrences []*FileOccurrence

	// Level is the level of similarity between Occurrences.
	Level SimilarityLevel
}

A Similarity is a match of ranges of text between different Files.

type SimilarityLevel

type SimilarityLevel int

SimilarityLevel is the level of similarity between ranges of text.

Directories

Path Synopsis
cmd
internal
io
Package io contains helpful I/O functions.
