Documentation ¶
Overview ¶
Package searchset generates hashes for all substrings of a text. Potential matches between two SearchSet objects can then be determined quickly. Generating the hashes can be expensive, so it's best to perform it once. If the text is part of a known corpus, then the SearchSet can be serialized and kept in an archive.
Matching occurs by "mapping" ranges from the source text into the target text but still retaining the source order:
SOURCE: |-----------------------------| TARGET: |*****************************************| MAP SOURCE SECTIONS ONTO TARGET IN SOURCE ORDER: S: |-[--]-----[---]------[----]------| / | \ |---| |---------| |-------------| T: |*****************************************|
Note that a single source range may match many different ranges in the target. The matching algorithm untangles these so that all matched ranges are in order with respect to the source ranges. This is especially important since the source text may occur more than once in the target text. The algorithm finds each potential occurrence of S in T and returns all as potential matched ranges.
Index ¶
Constants ¶
const DefaultGranularity = 3
DefaultGranularity is the minimum size (in words) of the hash chunks.
Variables ¶
This section is empty.
Functions ¶
Types ¶
type MatchRange ¶
type MatchRange struct {
// Offsets into the source tokens.
SrcStart, SrcEnd int
// Offsets into the target tokens.
TargetStart, TargetEnd int
}
MatchRange is the range within the source text that is a match to the range in the target text.
func (*MatchRange) String ¶
func (m *MatchRange) String() string
type MatchRanges ¶
type MatchRanges []*MatchRange
MatchRanges is a list of "MatchRange"s. The ranges are monotonically increasing in value and indicate a single potential occurrence of the source text in the target text.
func FindPotentialMatches ¶
func FindPotentialMatches(src, target *SearchSet) []MatchRanges
FindPotentialMatches returns the ranges in the target (unknown) text that are best potential matches to the source (known) text.
func (MatchRanges) Len ¶
func (m MatchRanges) Len() int
func (MatchRanges) Less ¶
func (m MatchRanges) Less(i, j int) bool
func (MatchRanges) Size ¶
func (m MatchRanges) Size() int
Size is the number of source tokens that were matched.
func (MatchRanges) Swap ¶
func (m MatchRanges) Swap(i, j int)
func (MatchRanges) TargetRange ¶
func (m MatchRanges) TargetRange(target *SearchSet) (start, end int)
TargetRange is the start and stop token offsets into the target text.
type SearchSet ¶
type SearchSet struct { // Tokens is a tokenized list of the original input string. Tokens tokenizer.Tokens // Hashes is a map of checksums to a range of tokens. Hashes tokenizer.Hash // Checksums is a list of checksums ordered from longest range to // shortest. Checksums []uint32 // ChecksumRanges are the token ranges for the above checksums. ChecksumRanges tokenizer.TokenRanges // contains filtered or unexported fields }
SearchSet is a set of substrings that have hashes associated with them, making it fast to search for potential matches.
func (*SearchSet) GenerateNodeList ¶
func (s *SearchSet) GenerateNodeList()
GenerateNodeList creates a node list out of the search set.