Documentation ¶
Overview ¶
Package tokenize supplies tokenization operations for BERT. It ports the tokenizer.py functionality from the core BERT repository.
NOTE: All definitions are specific to BERT and may differ from Unicode definitions; for example, BERT considers '$' punctuation, but Unicode does not.
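A minimal end-to-end sketch (the import paths and the vocab.FromFile loader are assumptions to check against your copy of the package):

package main

import (
	"fmt"

	"github.com/buckhx/gobert/tokenize" // illustrative import path
	"github.com/buckhx/gobert/tokenize/vocab"
	"github.com/valyala/bytebufferpool"
)

func main() {
	// vocab.FromFile is assumed here; load the vocab.txt shipped with your model
	voc, err := vocab.FromFile("vocab.txt")
	if err != nil {
		panic(err)
	}
	buf := bytebufferpool.Get()
	defer bytebufferpool.Put(buf) // return the scratch buffer to the pool
	tkz := tokenize.NewTokenizer(voc, buf)
	fmt.Println(tkz.Tokenize("BERT considers $ punctuation."))
}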
Index ¶
- Constants
- type Basic
- type Feature
- type FeatureFactory
- type Full
- type Option
- type Tokenizer
- type VocabTokenizer
- type Wordpiece
- func (wp Wordpiece) CharLoop(text string)
- func (wp Wordpiece) CheckIsLargeThanMaxWordChars(text string) bool
- func (wp Wordpiece) SetMaxWordChars(c int)
- func (wp Wordpiece) SetUnknownToken(tok string)
- func (wp Wordpiece) SubTokenize(text string) bool
- func (wp Wordpiece) Tokenize(text string) []string
Constants ¶
const (
	ClassToken        = "[CLS]"
	SeparatorToken    = "[SEP]"
	SequenceSeparator = " ||| "
)
Static tokens
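A sketch of carrying a sentence pair in a single input text; that SequenceSeparator delimits the sequences reflected in Feature.TypeIDs (below) is an assumption:

// pairText joins two sequences with SequenceSeparator so they can be
// passed to FeatureFactory.Feature as one text; the TypeIDs of the
// resulting Feature presumably distinguish the two sequences.
func pairText(a, b string) string {
	return a + tokenize.SequenceSeparator + b
}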
const DefaultMaxWordChars = 200
DefaultMaxWordChars is the maximum length of a token for it to be tokenized; longer tokens are marked as unknown.
const DefaultUnknownToken = "[UNK]"
DefaultUnknownToken is the token used to signify an unknown token
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Basic ¶
type Basic struct {
	// Lower will apply a lower case filter to input
	Lower bool
}
Basic is a BasicTokenizer that runs basic tokenization (punctuation splitting, lower casing, etc.).
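A standalone sketch (that Basic satisfies the Tokenizer interface below is an assumption):

basic := tokenize.Basic{Lower: true}
// Punctuation is split into its own tokens; with Lower set, input is
// lowercased and, per the WithLower note below, accents are stripped:
// expect something like [hello , world !]
fmt.Println(basic.Tokenize("Héllo, World!"))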
type Feature ¶
type Feature struct {
	ID       int32
	Text     string
	Tokens   []string
	TokenIDs []int32
	Mask     []int32 // short?
	TypeIDs  []int32 // sequence ids, short?
}
Feature is an input feature for a BERT model. It maps to extract_features.InputFeature in the reference implementation.
type FeatureFactory ¶
type FeatureFactory struct {
	Tokenizer VocabTokenizer
	SeqLen    int32
	// contains filtered or unexported fields
}
FeatureFactory will create features with the supplied tokenizer and sequence length
func (*FeatureFactory) Feature ¶
func (ff *FeatureFactory) Feature(text string) Feature
Feature will create a single feature from the factory. ID creation is thread-safe and incremental.
func (*FeatureFactory) Features ¶
func (ff *FeatureFactory) Features(texts ...string) []Feature
Features will create multiple features with incremental IDs
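A sketch of batch feature creation (tkz is a VocabTokenizer built as in the overview example; a SeqLen of 128 is illustrative):

factory := &tokenize.FeatureFactory{Tokenizer: tkz, SeqLen: 128}
feats := factory.Features(
	"the quick brown fox",
	"jumps over the lazy dog",
)
for _, f := range feats {
	// IDs are incremental and their creation is thread-safe, so
	// Feature/Features may also be called from multiple goroutines.
	fmt.Println(f.ID, f.Tokens)
}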
type Full ¶
Full is a FullTokenizer comprising a Basic and a Wordpiece tokenizer.
type Option ¶
Option alters the behavior of the tokenizer. TODO: add tests for these behavior changes.
func WithLower ¶
WithLower will lowercase all input if set to true, or skip lowering if false. NOTE: a quirk inherited from the reference implementation is that lowering also strips accents.
func WithMaxChars ¶
WithMaxChars sets the maximum length of a token to be tokenized; longer tokens will be labeled as unknown.
func WithUnknownToken ¶
WithUnknownToken will alter the unknown token from the default [UNK].
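A sketch combining the three options when constructing a tokenizer (voc and buf as in the overview example; the parameter types are inferred from the descriptions above):

tkz := tokenize.NewTokenizer(voc, buf,
	tokenize.WithLower(false),          // keep case (and accents)
	tokenize.WithMaxChars(100),         // words longer than 100 chars become unknown
	tokenize.WithUnknownToken("<unk>"), // replace the default [UNK]
)
_ = tkz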
type Tokenizer ¶
Tokenizer is an interface for chunking a string into its tokens as per the BERT implementation.
type VocabTokenizer ¶
VocabTokenizer comprises a Tokenizer and a VocabProvider.
func NewTokenizer ¶
func NewTokenizer(voc vocab.Dict, buf *bytebufferpool.ByteBuffer, opts ...Option) VocabTokenizer
NewTokenizer returns a new FullTokenizer. Use the Option arguments to modify the default behavior.
type Wordpiece ¶
type Wordpiece struct {
// contains filtered or unexported fields
}
Wordpiece is a tokenizer that breaks tokens into sub-word units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf, Section 4.1, for details.
func NewWordpiece ¶
func NewWordpiece(voc vocab.Dict, buffer *bytebufferpool.ByteBuffer) Wordpiece
NewWordpiece returns a WordpieceTokenizer with the default settings. It should generally be used within a FullTokenizer.
func (Wordpiece) CheckIsLargeThanMaxWordChars ¶ added in v0.2.1
CheckIsLargeThanMaxWordChars reports whether text is longer than wp.maxWordChars.
func (Wordpiece) SetMaxWordChars ¶
SetMaxWordChars will set the max chars for a word to be tokenized. Generally this should be configured through the FullTokenizer.
func (Wordpiece) SetUnknownToken ¶
SetUnknownToken will set the unknown token. Generally this should be configured through the FullTokenizer.
func (Wordpiece) SubTokenize ¶ added in v0.2.0
SubTokenize is the implementation retained from the old method.
func (Wordpiece) Tokenize ¶
Tokenize will segment the text into sub-word tokens from the supplied vocabulary. NOTE: This implementation does not EXACTLY match the reference implementation and behaves slightly differently; see https://github.com/google-research/bert/issues/763.
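A sketch of sub-word output (voc and buf as under NewWordpiece; the exact pieces depend on the supplied vocabulary):

wp := tokenize.NewWordpiece(voc, buf)
// Out-of-vocabulary words are greedily split into known pieces, with
// continuation pieces prefixed by "##"; with a standard BERT vocabulary
// "unaffable" tokenizes to something like [un ##aff ##able].
fmt.Println(wp.Tokenize("unaffable"))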