searchset

package

v0.0.0-...-e36dbc7 Latest Latest Go to latest Published: Apr 9, 2024 License: Apache-2.0, Apache-2.0 Imports: 5 Imported by: 0

Documentation ¶

Overview ¶

Package searchset generates hashes for all substrings of a text. Potential matches between two SearchSet objects can then be determined quickly. Generating the hashes can be expensive, so it's best to perform it once. If the text is part of a known corpus, then the SearchSet can be serialized and kept in an archive.

Matching occurs by "mapping" ranges from the source text into the target text but still retaining the source order:

SOURCE: |-----------------------------|

TARGET: |*****************************************|

MAP SOURCE SECTIONS ONTO TARGET IN SOURCE ORDER:

  S:  |-[--]-----[---]------[----]------|
         /         |           \
      |---|   |---------|   |-------------|
  T: |*****************************************|

Note that a single source range may match many different ranges in the target. The matching algorithm untangles these so that all matched ranges are in order with respect to the source ranges. This is especially important since the source text may occur more than once in the target text. The algorithm finds each potential occurrence of S in T and returns all as potential matched ranges.

Index ¶

Constants
func Deserialize(r io.Reader, s *SearchSet) error
type MatchRange
- func (m *MatchRange) String() string
type MatchRanges
- func FindPotentialMatches(src, target *SearchSet) []MatchRanges
type SearchSet
- func New(s string, granularity int) *SearchSet
- func (s *SearchSet) GenerateNodeList()
- func (s *SearchSet) Serialize(w io.Writer) error

Constants ¶

View Source

const DefaultGranularity = 3

DefaultGranularity is the minimum size (in words) of the hash chunks.

Variables ¶

This section is empty.

Functions ¶

func Deserialize ¶

func Deserialize(r io.Reader, s *SearchSet) error

Deserialize reads a file with a serialized SearchSet in it and reconstructs it.

Types ¶

type MatchRange ¶

type MatchRange struct {
	// Offsets into the source tokens.
	SrcStart, SrcEnd int
	// Offsets into the target tokens.
	TargetStart, TargetEnd int
}

MatchRange is the range within the source text that is a match to the range in the target text.

func (*MatchRange) String ¶

func (m *MatchRange) String() string

type MatchRanges ¶

type MatchRanges []*MatchRange

MatchRanges is a list of "MatchRange"s. The ranges are monotonically increasing in value and indicate a single potential occurrence of the source text in the target text.

func FindPotentialMatches ¶

func FindPotentialMatches(src, target *SearchSet) []MatchRanges

FindPotentialMatches returns the ranges in the target (unknown) text that are best potential matches to the source (known) text.

func (MatchRanges) Len ¶

func (m MatchRanges) Len() int

func (MatchRanges) Less ¶

func (m MatchRanges) Less(i, j int) bool

func (MatchRanges) Size ¶

func (m MatchRanges) Size() int

Size is the number of source tokens that were matched.

func (MatchRanges) Swap ¶

func (m MatchRanges) Swap(i, j int)

func (MatchRanges) TargetRange ¶

func (m MatchRanges) TargetRange(target *SearchSet) (start, end int)

TargetRange is the start and stop token offsets into the target text.

type SearchSet ¶

type SearchSet struct {
	// Tokens is a tokenized list of the original input string.
	Tokens tokenizer.Tokens
	// Hashes is a map of checksums to a range of tokens.
	Hashes tokenizer.Hash
	// Checksums is a list of checksums ordered from longest range to
	// shortest.
	Checksums []uint32
	// ChecksumRanges are the token ranges for the above checksums.
	ChecksumRanges tokenizer.TokenRanges
	// contains filtered or unexported fields
}

SearchSet is a set of substrings that have hashes associated with them, making it fast to search for potential matches.

func New ¶

func New(s string, granularity int) *SearchSet

New creates a new SearchSet object. It generates a hash for each substring of "s".

func (*SearchSet) GenerateNodeList ¶

func (s *SearchSet) GenerateNodeList()

GenerateNodeList creates a node list out of the search set.

func (*SearchSet) Serialize ¶

func (s *SearchSet) Serialize(w io.Writer) error

Serialize emits the SearchSet out so that it can be recreated at a later time.

Source Files ¶

View all Source files

searchset.go

Directories ¶

Path	Synopsis
tokenizer Package tokenizer converts a text into a stream of tokens.	Package tokenizer converts a text into a stream of tokens.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL