Documentation ¶
Overview ¶
Package tokenizers is an interim solution while developing `gotokenizers` (https://github.com/nlpodyssey/gotokenizers). APIs and implementations may be subject to frequent refactoring.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GetStrings ¶
func GetStrings(tokens []StringOffsetsPair) []string
GetStrings returns a sequence of string values from the given slice of StringOffsetsPair.
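For illustration, a minimal usage sketch (the import path github.com/nlpodyssey/spago/pkg/nlp/tokenizers is an assumption based on the repository layout, and the token literals are made up):

package main

import (
	"fmt"

	"github.com/nlpodyssey/spago/pkg/nlp/tokenizers" // assumed import path
)

func main() {
	// Two hand-built token/offsets pairs standing in for a tokenizer's output.
	tokens := []tokenizers.StringOffsetsPair{
		{String: "Hello", Offsets: tokenizers.OffsetsType{Start: 0, End: 5}},
		{String: "world", Offsets: tokenizers.OffsetsType{Start: 6, End: 11}},
	}
	fmt.Println(tokenizers.GetStrings(tokens)) // [Hello world]
}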
Types ¶
type OffsetsType ¶
type OffsetsType struct {
	Start int
	End   int
}
OffsetsType represents a (start, end) offsets pair. It usually represents a lower inclusive index position and an upper exclusive position.
func GetOffsets ¶
func GetOffsets(tokens []StringOffsetsPair) []OffsetsType
GetOffsets returns a sequence of offsets values from the given slice of StringOffsetsPair.
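Since the offsets are half-open, each pair can be used to slice the original text directly. A minimal sketch under the same assumptions as above (import path and Start/End field names):

package main

import (
	"fmt"

	"github.com/nlpodyssey/spago/pkg/nlp/tokenizers" // assumed import path
)

func main() {
	text := "Hello world"
	tokens := []tokenizers.StringOffsetsPair{
		{String: "Hello", Offsets: tokenizers.OffsetsType{Start: 0, End: 5}},
		{String: "world", Offsets: tokenizers.OffsetsType{Start: 6, End: 11}},
	}
	// Each offsets pair recovers its token: lower bound inclusive, upper exclusive.
	for _, off := range tokenizers.GetOffsets(tokens) {
		fmt.Println(text[off.Start:off.End]) // Hello, then world
	}
}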
type StringOffsetsPair ¶
type StringOffsetsPair struct {
	String  string
	Offsets OffsetsType
}
StringOffsetsPair represents a string value paired with offsets bounds. It usually represents a token string and its offsets positions in the original string.
type Tokenizer ¶
type Tokenizer interface {
	Tokenize(text string) []StringOffsetsPair
}
Tokenizer is implemented by any value that has the Tokenize method.
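Any type with a suitable Tokenize method satisfies the interface. As a sketch, a hypothetical whitespace splitter (WhitespaceTokenizer is not part of this package; the import path is assumed as above):

package main

import (
	"fmt"
	"strings"

	"github.com/nlpodyssey/spago/pkg/nlp/tokenizers" // assumed import path
)

// WhitespaceTokenizer is a hypothetical example type, not part of this package.
type WhitespaceTokenizer struct{}

// Tokenize splits on single spaces, recording half-open byte offsets.
func (WhitespaceTokenizer) Tokenize(text string) []tokenizers.StringOffsetsPair {
	var pairs []tokenizers.StringOffsetsPair
	start := 0
	for _, field := range strings.Split(text, " ") {
		if field != "" {
			pairs = append(pairs, tokenizers.StringOffsetsPair{
				String:  field,
				Offsets: tokenizers.OffsetsType{Start: start, End: start + len(field)},
			})
		}
		start += len(field) + 1 // skip the separating space
	}
	return pairs
}

func main() {
	var t tokenizers.Tokenizer = WhitespaceTokenizer{} // assign to the interface
	fmt.Println(tokenizers.GetStrings(t.Tokenize("Hello wide world"))) // [Hello wide world]
}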
Directories ¶
Path | Synopsis |
---|---|
basetokenizer | Package basetokenizer provides an implementation of a very simple tokenizer that splits by whitespace (and the like) and punctuation symbols. |
internal/sentencepiece | Package sentencepiece implements the SentencePiece encoder (Kudo and Richardson, 2018). |