searchset

package
v0.0.0-...-c1ed8fc Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 4, 2022 License: Apache-2.0, Apache-2.0 Imports: 5 Imported by: 3

Documentation

Overview

Package searchset generates hashes for all substrings of a text. Potential matches between two SearchSet objects can then be determined quickly. Generating the hashes can be expensive, so it's best to perform it once. If the text is part of a known corpus, then the SearchSet can be serialized and kept in an archive.

Matching occurs by "mapping" ranges from the source text into the target text but still retaining the source order:

SOURCE: |-----------------------------|

TARGET: |*****************************************|

MAP SOURCE SECTIONS ONTO TARGET IN SOURCE ORDER:

  S:  |-[--]-----[---]------[----]------|
         /         |           \
      |---|   |---------|   |-------------|
  T: |*****************************************|

Note that a single source range may match many different ranges in the target. The matching algorithm untangles these so that all matched ranges are in order with respect to the source ranges. This is especially important since the source text may occur more than once in the target text. The algorithm finds each potential occurrence of S in T and returns all as potential matched ranges.

Index

Constants

View Source
const DefaultGranularity = 3

DefaultGranularity is the minimum size (in words) of the hash chunks.

Variables

This section is empty.

Functions

func Deserialize

func Deserialize(r io.Reader, s *SearchSet) error

Deserialize reads a file with a serialized SearchSet in it and reconstructs it.

Types

type MatchRange

type MatchRange struct {
	// Offsets into the source tokens.
	SrcStart, SrcEnd int
	// Offsets into the target tokens.
	TargetStart, TargetEnd int
}

MatchRange is the range within the source text that is a match to the range in the target text.

func (*MatchRange) String

func (m *MatchRange) String() string

type MatchRanges

type MatchRanges []*MatchRange

MatchRanges is a list of "MatchRange"s. The ranges are monotonically increasing in value and indicate a single potential occurrence of the source text in the target text.

func FindPotentialMatches

func FindPotentialMatches(src, target *SearchSet) []MatchRanges

FindPotentialMatches returns the ranges in the target (unknown) text that are best potential matches to the source (known) text.

func (MatchRanges) Len

func (m MatchRanges) Len() int

func (MatchRanges) Less

func (m MatchRanges) Less(i, j int) bool

func (MatchRanges) Size

func (m MatchRanges) Size() int

Size is the number of source tokens that were matched.

func (MatchRanges) Swap

func (m MatchRanges) Swap(i, j int)

func (MatchRanges) TargetRange

func (m MatchRanges) TargetRange(target *SearchSet) (start, end int)

TargetRange is the start and stop token offsets into the target text.

type SearchSet

type SearchSet struct {
	// Tokens is a tokenized list of the original input string.
	Tokens tokenizer.Tokens
	// Hashes is a map of checksums to a range of tokens.
	Hashes tokenizer.Hash
	// Checksums is a list of checksums ordered from longest range to
	// shortest.
	Checksums []uint32
	// ChecksumRanges are the token ranges for the above checksums.
	ChecksumRanges tokenizer.TokenRanges
	// contains filtered or unexported fields
}

SearchSet is a set of substrings that have hashes associated with them, making it fast to search for potential matches.

func New

func New(s string, granularity int) *SearchSet

New creates a new SearchSet object. It generates a hash for each substring of "s".

func (*SearchSet) GenerateNodeList

func (s *SearchSet) GenerateNodeList()

GenerateNodeList creates a node list out of the search set.

func (*SearchSet) Serialize

func (s *SearchSet) Serialize(w io.Writer) error

Serialize emits the SearchSet out so that it can be recreated at a later time.

Directories

Path Synopsis
Package tokenizer converts a text into a stream of tokens.
Package tokenizer converts a text into a stream of tokens.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL