Documentation ¶
Overview ¶
Package tokenizer converts text into a stream of tokens.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Hash ¶
type Hash map[uint32]TokenRanges
Hash is a map of the hashes of a section of text to the token range covering that text.
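A minimal sketch of how a Hash map might be populated and queried. The Start/End fields on TokenRange are assumptions for illustration; the real struct's fields are not shown in this documentation. Because distinct texts can occasionally share a checksum, ranges for the same hash are appended rather than overwritten:

```go
package main

import "fmt"

// Hypothetical stand-ins for the package's types, for illustration only.
type TokenRange struct{ Start, End int } // assumed fields
type TokenRanges []*TokenRange
type Hash map[uint32]TokenRanges

func main() {
	h := Hash{}
	// Two different token ranges can (rarely) hash to the same checksum,
	// so each map value is a list rather than a single range.
	h[0xdeadbeef] = append(h[0xdeadbeef], &TokenRange{Start: 0, End: 4})
	h[0xdeadbeef] = append(h[0xdeadbeef], &TokenRange{Start: 9, End: 13})

	fmt.Println(len(h[0xdeadbeef]))
}
```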
type TokenRange ¶
TokenRange indicates the range of tokens that map to a particular checksum.
func (*TokenRange) String ¶
func (t *TokenRange) String() string
type TokenRanges ¶
type TokenRanges []*TokenRange
TokenRanges is a list of TokenRange objects. The chance that two different strings map to the same checksum is very small but not zero, so we use this list rather than assuming every checksum is unique.
func (TokenRanges) CombineUnique ¶
func (t TokenRanges) CombineUnique(other TokenRanges) TokenRanges
CombineUnique returns the combination of both token ranges with no duplicates.
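One way CombineUnique could behave is sketched below; the dedup-by-value approach and the Start/End fields are assumptions, not the package's actual implementation:

```go
package main

import "fmt"

type TokenRange struct{ Start, End int } // assumed fields, for illustration
type TokenRanges []*TokenRange

// CombineUnique sketch: merge two lists, dropping duplicate ranges by value.
func (t TokenRanges) CombineUnique(other TokenRanges) TokenRanges {
	seen := make(map[TokenRange]bool)
	var out TokenRanges
	for _, list := range []TokenRanges{t, other} {
		for _, r := range list {
			if !seen[*r] {
				seen[*r] = true
				out = append(out, r)
			}
		}
	}
	return out
}

func main() {
	a := TokenRanges{{0, 4}, {9, 13}}
	b := TokenRanges{{9, 13}, {20, 24}} // {9, 13} duplicates an entry in a
	fmt.Println(len(a.CombineUnique(b)))
}
```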
func (TokenRanges) Len ¶
func (t TokenRanges) Len() int
func (TokenRanges) Less ¶
func (t TokenRanges) Less(i, j int) bool
func (TokenRanges) Swap ¶
func (t TokenRanges) Swap(i, j int)
type Tokens ¶
type Tokens []*token
Tokens is a list of Token objects.
func (Tokens) GenerateHashes ¶
func (t Tokens) GenerateHashes(h Hash, size int) ([]uint32, TokenRanges)
GenerateHashes generates hashes for substrings of length "size". Because the "stringifyTokens" call is expensive, not every substring is hashed; some of the smaller substrings are skipped.
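To illustrate the windowing idea, the sketch below hashes every size-length window of a token list. Plain strings stand in for the package's unexported token type, and CRC-32 stands in for whatever checksum the package actually uses; both are assumptions, and this sketch hashes every window rather than skipping the smaller substrings:

```go
package main

import (
	"fmt"
	"hash/crc32"
	"strings"
)

// windowHashes is a hypothetical helper: it computes a checksum for each
// size-length window of tokens. Strings and CRC-32 are stand-ins here.
func windowHashes(tokens []string, size int) []uint32 {
	var hashes []uint32
	for i := 0; i+size <= len(tokens); i++ {
		text := strings.Join(tokens[i:i+size], " ")
		hashes = append(hashes, crc32.ChecksumIEEE([]byte(text)))
	}
	return hashes
}

func main() {
	// Four tokens and size 2 yield three windows: ab, bc, cd.
	hs := windowHashes([]string{"a", "b", "c", "d"}, 2)
	fmt.Println(len(hs))
}
```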