Documentation ¶
Overview ¶
Package tokenize supplies tokenization operations for BERT. It ports the tokenizer.py functionality from the core BERT repository.
NOTE: All definitions are specific to BERT and may differ from Unicode definitions; for example, BERT considers '$' punctuation, but Unicode does not.
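A minimal end-to-end sketch (the import paths and the vocab.FromFile loader are assumptions to check against your copy of the package):

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // illustrative import path
	"github.com/buckhx/gobert/tokenize/vocab"
	"github.com/valyala/bytebufferpool"
)

func main() {
	// vocab.FromFile is assumed here; load the vocab.txt shipped with your model
	voc, err := vocab.FromFile("vocab.txt")
	if err != nil {
		panic(err)
	}
	buf := bytebufferpool.Get()
	defer bytebufferpool.Put(buf) // return the scratch buffer to the pool
	tkz := tokenize.NewTokenizer(voc, buf)
	fmt.Println(tkz.Tokenize("BERT considers $ punctuation."))
}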
Index ¶
- Constants
- type Basic
- type Feature
- type FeatureFactory
- type Full
- type Option
- type Tokenizer
- type VocabTokenizer
- type Wordpiece
- func (wp Wordpiece) CharLoop(text string)
- func (wp Wordpiece) CheckIsLargeThanMaxWordChars(text string) bool
- func (wp Wordpiece) SetMaxWordChars(c int)
- func (wp Wordpiece) SetUnknownToken(tok string)
- func (wp Wordpiece) SubTokenize(text string) bool
- func (wp Wordpiece) Tokenize(text string) []string
Constants ¶
const (
	ClassToken        = "[CLS]"
	SeparatorToken    = "[SEP]"
	SequenceSeparator = " ||| "
)
Static tokens
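A sketch of carrying a sentence pair in a single input text; that SequenceSeparator delimits the sequences reflected in Feature.TypeIDs (below) is an assumption:

// pairText joins two sequences with SequenceSeparator so they can be
// passed to FeatureFactory.Feature as one text; the TypeIDs of the
// resulting Feature presumably distinguish the two sequences.
func pairText(a, b string) string {
	return a + tokenize.SequenceSeparator + b
}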
const DefaultMaxWordChars = 200
DefaultMaxWordChars is the maximum length of a token for it to be tokenized; longer tokens are marked as unknown.
const DefaultUnknownToken = "[UNK]"
DefaultUnknownToken is the token used to signify an unknown token
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Basic ¶
type Basic struct {
	// Lower will apply a lower case filter to input
	Lower bool
}
Basic is a BasicTokenizer that runs basic tokenization (punctuation splitting, lower casing, etc.).
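A standalone sketch (that Basic satisfies the Tokenizer interface below is an assumption):

basic := tokenize.Basic{Lower: true}
// Punctuation is split into its own tokens; with Lower set, input is
// lowercased and, per the WithLower note below, accents are stripped:
// expect something like [hello , world !]
fmt.Println(basic.Tokenize("Héllo, World!"))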
type Feature ¶
type Feature struct {
	ID       int32
	Text     string
	Tokens   []string
	TokenIDs []int32
	Mask     []int32 // short?
	TypeIDs  []int32 // sequence ids, short?
}
Feature is an input feature for a BERT model. It maps to extract_features.InputFeature in the reference implementation.
type FeatureFactory ¶
type FeatureFactory struct {
	Tokenizer VocabTokenizer
	SeqLen    int32
	// contains filtered or unexported fields
}
FeatureFactory will create features with the supplied tokenizer and sequence length
func (*FeatureFactory) Feature ¶
func (ff *FeatureFactory) Feature(text string) Feature
Feature will create a single feature from the factory. ID creation is thread-safe and incremental.
func (*FeatureFactory) Features ¶
func (ff *FeatureFactory) Features(texts ...string) []Feature
Features will create multiple features with incremental IDs
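A sketch of batch feature creation (tkz is a VocabTokenizer built as in the overview example; a SeqLen of 128 is illustrative):

factory := &tokenize.FeatureFactory{Tokenizer: tkz, SeqLen: 128}
feats := factory.Features(
	"the quick brown fox",
	"jumps over the lazy dog",
)
for _, f := range feats {
	// IDs are incremental and their creation is thread-safe, so
	// Feature/Features may also be called from multiple goroutines.
	fmt.Println(f.ID, f.Tokens)
}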
type Full ¶
Full is a FullTokenizer comprising a Basic and a Wordpiece tokenizer.
type Option ¶
Option alters the behavior of the tokenizer. TODO: add tests for these behavior changes.
func WithLower ¶
WithLower will lowercase all input if set to true, or skip lowering if false. NOTE: a quirk inherited from the reference implementation is that lowering also strips accents.
func WithMaxChars ¶
WithMaxChars sets the maximum length of a token to be tokenized; longer tokens will be labeled as unknown.
func WithUnknownToken ¶
WithUnknownToken will alter the unknown token from the default [UNK].
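A sketch combining the three options when constructing a tokenizer (voc and buf as in the overview example; the parameter types are inferred from the descriptions above):

tkz := tokenize.NewTokenizer(voc, buf,
	tokenize.WithLower(false),          // keep case (and accents)
	tokenize.WithMaxChars(100),         // words longer than 100 chars become unknown
	tokenize.WithUnknownToken("<unk>"), // replace the default [UNK]
)
_ = tkz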
type Tokenizer ¶
Tokenizer is an interface for chunking a string into its tokens as per the BERT implementation.
type VocabTokenizer ¶
VocabTokenizer comprises a Tokenizer and a VocabProvider.
func NewTokenizer ¶
func NewTokenizer(voc vocab.Dict, buf *bytebufferpool.ByteBuffer, opts ...Option) VocabTokenizer
NewTokenizer returns a new FullTokenizer. Use the Option arguments to modify the default behavior.
type Wordpiece ¶
type Wordpiece struct {
// contains filtered or unexported fields
}
Wordpiece is a tokenizer that breaks tokens into sub-word units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf, Section 4.1, for details.
func NewWordpiece ¶
func NewWordpiece(voc vocab.Dict, buffer *bytebufferpool.ByteBuffer) Wordpiece
NewWordpiece returns a WordpieceTokenizer with the default settings. It should generally be used within a FullTokenizer.
func (Wordpiece) CheckIsLargeThanMaxWordChars ¶ added in v0.2.1
CheckIsLargeThanMaxWordChars reports whether text is longer than wp.maxWordChars.
func (Wordpiece) SetMaxWordChars ¶
SetMaxWordChars will set the max chars for a word to be tokenized. Generally this should be configured through the FullTokenizer.
func (Wordpiece) SetUnknownToken ¶
SetUnknownToken will set the unknown token. Generally this should be configured through the FullTokenizer.
func (Wordpiece) SubTokenize ¶ added in v0.2.0
SubTokenize is the implementation retained from the old method.
func (Wordpiece) Tokenize ¶
Tokenize will segment the text into sub-word tokens from the supplied vocabulary. NOTE: This implementation does not EXACTLY match the reference implementation and behaves slightly differently; see https://github.com/google-research/bert/issues/763.
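A sketch of sub-word output (voc and buf as under NewWordpiece; the exact pieces depend on the supplied vocabulary):

wp := tokenize.NewWordpiece(voc, buf)
// Out-of-vocabulary words are greedily split into known pieces, with
// continuation pieces prefixed by "##"; with a standard BERT vocabulary
// "unaffable" tokenizes to something like [un ##aff ##able].
fmt.Println(wp.Tokenize("unaffable"))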