Documentation ¶
Index ¶
Constants ¶
const (
	// DefaultClassToken is the default class token value for the WordPiece tokenizer.
	DefaultClassToken = "[CLS]"
	// DefaultSequenceSeparator is the default sequence separator value for the WordPiece tokenizer.
	DefaultSequenceSeparator = "[SEP]"
	// DefaultUnknownToken is the default unknown token value for the WordPiece tokenizer.
	DefaultUnknownToken = "[UNK]"
	// DefaultMaskToken is the default mask token value for the WordPiece tokenizer.
	DefaultMaskToken = "[MASK]"
	// DefaultSplitPrefix is the default split prefix value for the WordPiece tokenizer.
	DefaultSplitPrefix = "##"
	// DefaultMaxWordChars is the default maximum word length for the WordPiece tokenizer.
	DefaultMaxWordChars = 100
)
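These defaults follow the BERT conventions: a class token opens the input, a separator closes each sequence, and "##" prefixes word-internal pieces. As a brief sketch (not part of this package's API; the pieces slice is hypothetical), wrapping a tokenized sequence with the default special tokens looks like:

pieces := []string{"play", "##ing"}
sequence := append([]string{wordpiecetokenizer.DefaultClassToken}, pieces...)
sequence = append(sequence, wordpiecetokenizer.DefaultSequenceSeparator)
// sequence == []string{"[CLS]", "play", "##ing", "[SEP]"}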
Variables ¶
This section is empty.
Functions ¶
func GroupSubWords ¶
func GroupSubWords(tokens []tokenizers.StringOffsetsPair) []tokenizers.StringOffsetsPair
GroupSubWords returns a list of token ranges, each of which represents the start and end indices of the tokens that form a complete word.
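A minimal sketch of the grouping, assuming the tokenizers package's StringOffsetsPair carries an OffsetsType with Start and End fields (imports as in the Tokenize example below):

// Sub-word tokens for the text "playing rules" (hypothetical input).
tokens := []tokenizers.StringOffsetsPair{
	{String: "play", Offsets: tokenizers.OffsetsType{Start: 0, End: 4}},
	{String: "##ing", Offsets: tokenizers.OffsetsType{Start: 4, End: 7}},
	{String: "rules", Offsets: tokenizers.OffsetsType{Start: 8, End: 13}},
}
groups := wordpiecetokenizer.GroupSubWords(tokens)
// Two ranges are expected: one covering tokens 0-1 (the word "playing",
// i.e. "play" + "##ing") and one covering token 2 ("rules").
_ = groups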
func IsDefaultSpecial ¶
func IsDefaultSpecial(word string) bool
IsDefaultSpecial returns whether the word matches one of the default special tokens.
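For instance, assuming the signature above:

wordpiecetokenizer.IsDefaultSpecial("[CLS]") // true: matches DefaultClassToken
wordpiecetokenizer.IsDefaultSpecial("hello") // false: an ordinary word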
Types ¶
type WordPieceTokenizer ¶
type WordPieceTokenizer struct {
// contains filtered or unexported fields
}
WordPieceTokenizer is a tokenizer that breaks tokens into sub-word units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf Section 4.1 for details. WordPieceTokenizer uses a BaseTokenizer to preprocess the input text.
func New ¶
func New(vocabulary *vocabulary.Vocabulary) *WordPieceTokenizer
New returns a new WordPieceTokenizer.
func (*WordPieceTokenizer) Tokenize ¶
func (t *WordPieceTokenizer) Tokenize(text string) []tokenizers.StringOffsetsPair
Tokenize converts the input text into a slice of word or sub-word token units based on the supplied vocabulary. Each resulting token preserves its alignment with the portion of the original text it belongs to.
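A minimal end-to-end sketch. The import paths follow spaGO's repository layout and the toy vocabulary is purely illustrative (real vocabularies are typically loaded from a BERT-style vocab file), so adjust both to your setup:

package main

import (
	"fmt"

	"github.com/nlpodyssey/spago/pkg/nlp/tokenizers/wordpiecetokenizer"
	"github.com/nlpodyssey/spago/pkg/nlp/vocabulary"
)

func main() {
	// Toy vocabulary containing the pieces needed for this example.
	voc := vocabulary.New([]string{"[UNK]", "play", "##ing", "with", "words"})
	t := wordpiecetokenizer.New(voc)

	for _, pair := range t.Tokenize("playing with words") {
		fmt.Printf("%q [%d, %d)\n", pair.String, pair.Offsets.Start, pair.Offsets.End)
	}
	// Expected pieces: "play", "##ing", "with", "words",
	// each aligned with its span in the original text.
}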
func (*WordPieceTokenizer) WordPieceTokenize ¶
func (t *WordPieceTokenizer) WordPieceTokenize(tokens []tokenizers.StringOffsetsPair) []tokenizers.StringOffsetsPair
WordPieceTokenize transforms the input tokens into a new slice of word or sub-word units based on the supplied vocabulary. The resulting tokens preserve their alignment with the portions of the original text they belong to.
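A sketch of calling WordPieceTokenize directly on word-level tokens built by hand; in normal use these come from the base tokenizer inside Tokenize. The tokenizer t, vocabulary, and imports are as in the example above:

// One word-level token covering the whole input text.
words := []tokenizers.StringOffsetsPair{
	{String: "unaffordable", Offsets: tokenizers.OffsetsType{Start: 0, End: 12}},
}
pieces := t.WordPieceTokenize(words)
// Depending on the vocabulary, "unaffordable" splits into pieces such as
// "una", "##fford", "##able"; a word with no matching pieces maps to "[UNK]".
for _, p := range pieces {
	fmt.Println(p.String, p.Offsets.Start, p.Offsets.End)
}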