Documentation ¶
Overview ¶
Package tokenizer converts text into a stream of tokens.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Hash ¶
type Hash map[uint32]TokenRanges
Hash is a map of the hashes of a section of text to the token range covering that text.
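A minimal sketch of how a Hash map might be populated and queried. The Start/End fields on TokenRange are assumptions for illustration; the real struct's fields are not shown in this documentation. Because distinct texts can occasionally share a checksum, ranges for the same hash are appended rather than overwritten:

```go
package main

import "fmt"

// Hypothetical stand-ins for the package's types, for illustration only.
type TokenRange struct{ Start, End int } // assumed fields
type TokenRanges []*TokenRange
type Hash map[uint32]TokenRanges

func main() {
	h := Hash{}
	// Two different token ranges can (rarely) hash to the same checksum,
	// so each map value is a list rather than a single range.
	h[0xdeadbeef] = append(h[0xdeadbeef], &TokenRange{Start: 0, End: 4})
	h[0xdeadbeef] = append(h[0xdeadbeef], &TokenRange{Start: 9, End: 13})

	fmt.Println(len(h[0xdeadbeef]))
}
```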
type TokenRange ¶
TokenRange indicates the range of tokens that map to a particular checksum.
func (*TokenRange) String ¶
func (t *TokenRange) String() string
type TokenRanges ¶
type TokenRanges []*TokenRange
TokenRanges is a list of TokenRange objects. The chance that two different strings map to the same checksum is very small but not zero, so we use this list rather than assuming every checksum is unique.
func (TokenRanges) CombineUnique ¶
func (t TokenRanges) CombineUnique(other TokenRanges) TokenRanges
CombineUnique returns the combination of both token ranges with no duplicates.
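One way CombineUnique could behave is sketched below; the dedup-by-value approach and the Start/End fields are assumptions, not the package's actual implementation:

```go
package main

import "fmt"

type TokenRange struct{ Start, End int } // assumed fields, for illustration
type TokenRanges []*TokenRange

// CombineUnique sketch: merge two lists, dropping duplicate ranges by value.
func (t TokenRanges) CombineUnique(other TokenRanges) TokenRanges {
	seen := make(map[TokenRange]bool)
	var out TokenRanges
	for _, list := range []TokenRanges{t, other} {
		for _, r := range list {
			if !seen[*r] {
				seen[*r] = true
				out = append(out, r)
			}
		}
	}
	return out
}

func main() {
	a := TokenRanges{{0, 4}, {9, 13}}
	b := TokenRanges{{9, 13}, {20, 24}} // {9, 13} duplicates an entry in a
	fmt.Println(len(a.CombineUnique(b)))
}
```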
func (TokenRanges) Len ¶
func (t TokenRanges) Len() int
func (TokenRanges) Less ¶
func (t TokenRanges) Less(i, j int) bool
func (TokenRanges) Swap ¶
func (t TokenRanges) Swap(i, j int)
type Tokens ¶
type Tokens []*token
Tokens is a list of Token objects.
func (Tokens) GenerateHashes ¶
func (t Tokens) GenerateHashes(h Hash, size int) ([]uint32, TokenRanges)
GenerateHashes generates hashes for substrings of length "size". Because the "stringifyTokens" call is expensive, not every substring is hashed; some of the smaller substrings are skipped.
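To illustrate the windowing idea, the sketch below hashes every size-length window of a token list. Plain strings stand in for the package's unexported token type, and CRC-32 stands in for whatever checksum the package actually uses; both are assumptions, and this sketch hashes every window rather than skipping the smaller substrings:

```go
package main

import (
	"fmt"
	"hash/crc32"
	"strings"
)

// windowHashes is a hypothetical helper: it computes a checksum for each
// size-length window of tokens. Strings and CRC-32 are stand-ins here.
func windowHashes(tokens []string, size int) []uint32 {
	var hashes []uint32
	for i := 0; i+size <= len(tokens); i++ {
		text := strings.Join(tokens[i:i+size], " ")
		hashes = append(hashes, crc32.ChecksumIEEE([]byte(text)))
	}
	return hashes
}

func main() {
	// Four tokens and size 2 yield three windows: ab, bc, cd.
	hs := windowHashes([]string{"a", "b", "c", "d"}, 2)
	fmt.Println(len(hs))
}
```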