Documentation ¶
Index ¶
Constants ¶
const (
	// DefaultClassToken is the default class token value for the WordPiece tokenizer.
	DefaultClassToken = "[CLS]"
	// DefaultSequenceSeparator is the default sequence separator value for the WordPiece tokenizer.
	DefaultSequenceSeparator = "[SEP]"
	// DefaultUnknownToken is the default unknown token value for the WordPiece tokenizer.
	DefaultUnknownToken = "[UNK]"
	// DefaultMaskToken is the default mask token value for the WordPiece tokenizer.
	DefaultMaskToken = "[MASK]"
	// DefaultSplitPrefix is the default split prefix value for the WordPiece tokenizer.
	DefaultSplitPrefix = "##"
	// DefaultMaxWordChars is the default maximum word length for the WordPiece tokenizer.
	DefaultMaxWordChars = 100
)
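These defaults follow the BERT conventions: a class token opens the input, a separator closes each sequence, and "##" prefixes word-internal pieces. As a brief sketch (not part of this package's API; the pieces slice is hypothetical), wrapping a tokenized sequence with the default special tokens looks like:

pieces := []string{"play", "##ing"}
sequence := append([]string{wordpiecetokenizer.DefaultClassToken}, pieces...)
sequence = append(sequence, wordpiecetokenizer.DefaultSequenceSeparator)
// sequence == []string{"[CLS]", "play", "##ing", "[SEP]"}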
Variables ¶
This section is empty.
Functions ¶
func GroupSubWords ¶
func GroupSubWords(tokens []tokenizers.StringOffsetsPair) []tokenizers.StringOffsetsPair
GroupSubWords returns a list of token ranges, each of which represents the start and end indices of the tokens that form a complete word.
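A minimal sketch of the grouping, assuming the tokenizers package's StringOffsetsPair carries an OffsetsType with Start and End fields (imports as in the Tokenize example below):

// Sub-word tokens for the text "playing rules" (hypothetical input).
tokens := []tokenizers.StringOffsetsPair{
	{String: "play", Offsets: tokenizers.OffsetsType{Start: 0, End: 4}},
	{String: "##ing", Offsets: tokenizers.OffsetsType{Start: 4, End: 7}},
	{String: "rules", Offsets: tokenizers.OffsetsType{Start: 8, End: 13}},
}
groups := wordpiecetokenizer.GroupSubWords(tokens)
// Two ranges are expected: one covering tokens 0-1 (the word "playing",
// i.e. "play" + "##ing") and one covering token 2 ("rules").
_ = groups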
func IsDefaultSpecial ¶
func IsDefaultSpecial(word string) bool
IsDefaultSpecial returns whether the word matches one of the default special tokens.
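For instance, assuming the signature above:

wordpiecetokenizer.IsDefaultSpecial("[CLS]") // true: matches DefaultClassToken
wordpiecetokenizer.IsDefaultSpecial("hello") // false: an ordinary word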
Types ¶
type WordPieceTokenizer ¶
type WordPieceTokenizer struct {
// contains filtered or unexported fields
}
WordPieceTokenizer is a tokenizer that breaks tokens into sub-word units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf Section 4.1 for details. WordPieceTokenizer uses a BaseTokenizer to preprocess the input text.
func New ¶
func New(vocabulary *vocabulary.Vocabulary) *WordPieceTokenizer
New returns a new WordPieceTokenizer.
func (*WordPieceTokenizer) Tokenize ¶
func (t *WordPieceTokenizer) Tokenize(text string) []tokenizers.StringOffsetsPair
Tokenize converts the input text into a slice of word or sub-word token units based on the supplied vocabulary. Each resulting token preserves its alignment with the portion of the original text it belongs to.
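A minimal end-to-end sketch. The import paths follow spaGO's repository layout and the toy vocabulary is purely illustrative (real vocabularies are typically loaded from a BERT-style vocab file), so adjust both to your setup:

package main

import (
	"fmt"

	"github.com/nlpodyssey/spago/pkg/nlp/tokenizers/wordpiecetokenizer"
	"github.com/nlpodyssey/spago/pkg/nlp/vocabulary"
)

func main() {
	// Toy vocabulary containing the pieces needed for this example.
	voc := vocabulary.New([]string{"[UNK]", "play", "##ing", "with", "words"})
	t := wordpiecetokenizer.New(voc)

	for _, pair := range t.Tokenize("playing with words") {
		fmt.Printf("%q [%d, %d)\n", pair.String, pair.Offsets.Start, pair.Offsets.End)
	}
	// Expected pieces: "play", "##ing", "with", "words",
	// each aligned with its span in the original text.
}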
func (*WordPieceTokenizer) WordPieceTokenize ¶
func (t *WordPieceTokenizer) WordPieceTokenize(tokens []tokenizers.StringOffsetsPair) []tokenizers.StringOffsetsPair
WordPieceTokenize transforms the input tokens into a new slice of word or sub-word units based on the supplied vocabulary. The resulting tokens preserve their alignment with the portions of the original text they belong to.
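A sketch of calling WordPieceTokenize directly on word-level tokens built by hand; in normal use these come from the base tokenizer inside Tokenize. The tokenizer t, vocabulary, and imports are as in the example above:

// One word-level token covering the whole input text.
words := []tokenizers.StringOffsetsPair{
	{String: "unaffordable", Offsets: tokenizers.OffsetsType{Start: 0, End: 12}},
}
pieces := t.WordPieceTokenize(words)
// Depending on the vocabulary, "unaffordable" splits into pieces such as
// "una", "##fford", "##able"; a word with no matching pieces maps to "[UNK]".
for _, p := range pieces {
	fmt.Println(p.String, p.Offsets.Start, p.Offsets.End)
}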