tokenize

package
v0.2.4

This package is not in the latest version of its module.
Published: Sep 1, 2021 License: MIT Imports: 6 Imported by: 0

Documentation

Overview

Package tokenize supplies tokenization operations for BERT. It ports the tokenization.py functionality from the core BERT repository.

NOTE: All definitions follow BERT and may differ from the Unicode definitions. For example, BERT treats '$' as punctuation, but Unicode does not.
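As an illustration of the note above, here is a minimal sketch of a BERT-style punctuation check, assuming the rule from the reference implementation: every non-alphanumeric printable ASCII character counts as punctuation, plus anything in a Unicode P* category. The name isBertPunct is hypothetical and is not part of this package's exported API.

```go
package main

import (
	"fmt"
	"unicode"
)

// isBertPunct is a hypothetical helper. It reports whether r is punctuation
// under the rule assumed from the reference BERT tokenizer: the ASCII ranges
// 33-47, 58-64, 91-96 and 123-126, or any Unicode punctuation (category P).
func isBertPunct(r rune) bool {
	if (r >= 33 && r <= 47) || (r >= 58 && r <= 64) ||
		(r >= 91 && r <= 96) || (r >= 123 && r <= 126) {
		return true
	}
	return unicode.IsPunct(r)
}

func main() {
	// Compare the BERT-style rule with Go's Unicode punctuation check.
	for _, r := range []rune{'$', '^', ',', 'a'} {
		fmt.Printf("%q bert=%v unicode=%v\n", r, isBertPunct(r), unicode.IsPunct(r))
	}
}
```

'$' is a currency symbol (category Sc) in Unicode, so unicode.IsPunct rejects it, while the ASCII range check accepts it.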

Index

Constants

const (
	ClassToken        = "[CLS]"
	SeparatorToken    = "[SEP]"
	SequenceSeparator = " ||| "
)
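A rough sketch of how these three constants might fit together, assuming SequenceSeparator marks the boundary between two input sequences in a single text (as " ||| " does in the reference feature-extraction script). The frame helper is hypothetical, and strings.Fields stands in for a real tokenizer.

```go
package main

import (
	"fmt"
	"strings"
)

// Values copied from the constants documented above.
const (
	ClassToken        = "[CLS]"
	SeparatorToken    = "[SEP]"
	SequenceSeparator = " ||| "
)

// frame is a hypothetical helper. It splits text on SequenceSeparator,
// tokenizes each part with a whitespace split, and frames the result with
// the class token up front and a separator token after each sequence.
func frame(text string) []string {
	toks := []string{ClassToken}
	for _, part := range strings.Split(text, SequenceSeparator) {
		toks = append(toks, strings.Fields(part)...)
		toks = append(toks, SeparatorToken)
	}
	return toks
}

func main() {
	fmt.Println(frame("the cat sat ||| on the mat"))
	// prints: [[CLS] the cat sat [SEP] on the mat [SEP]]
}
```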

Static tokens

const DefaultMaxWordChars = 200

DefaultMaxWordChars is the maximum length of a word that will be tokenized; longer words are marked as unknown.

const DefaultUnknownToken = "[UNK]"

DefaultUnknownToken is the token used to signify an unknown token

Variables

This section is empty.

Functions

This section is empty.

Types

type Basic

type Basic struct {
	// Lower will apply a lower case filter to input
	Lower bool
}

Basic is a BasicTokenizer that runs basic tokenization (punctuation splitting, lowercasing, etc.).

func NewBasic

func NewBasic() Basic

NewBasic returns a basic tokenizer. It is supplied to match the constructors of the other tokenizers.

func (Basic) Tokenize

func (bt Basic) Tokenize(text string) (toks []string)

Tokenize will segment a text into individual tokens. It follows the algorithm from the ref-impl: Clean, PadChinese, Whitespace Split, Lower (optional), SplitPunc, Whitespace Split.
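The pipeline listed above can be sketched with the standard library alone. This is an approximation under stated assumptions, not this package's code: basicTokenize, splitPunct and isPunct are hypothetical names, unicode.Han stands in for the reference implementation's explicit CJK code point ranges, and accent stripping is left out because it needs Unicode normalization from outside the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isPunct approximates BERT's punctuation rule: non-alphanumeric printable
// ASCII plus Unicode category P.
func isPunct(r rune) bool {
	if (r >= 33 && r <= 47) || (r >= 58 && r <= 64) ||
		(r >= 91 && r <= 96) || (r >= 123 && r <= 126) {
		return true
	}
	return unicode.IsPunct(r)
}

// splitPunct breaks a word so that every punctuation rune is its own token.
func splitPunct(word string) []string {
	var out []string
	start := -1 // byte offset where the current non-punctuation run began
	for i, r := range word {
		if isPunct(r) {
			if start >= 0 {
				out = append(out, word[start:i])
				start = -1
			}
			out = append(out, string(r))
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		out = append(out, word[start:])
	}
	return out
}

// basicTokenize runs clean, pad-Chinese, whitespace split, optional
// lowercasing and punctuation splitting, in that order.
func basicTokenize(text string, lower bool) []string {
	var b strings.Builder
	for _, r := range text {
		switch {
		case r == unicode.ReplacementChar || (unicode.IsControl(r) && !unicode.IsSpace(r)):
			continue // clean: drop control and invalid characters
		case unicode.IsSpace(r):
			b.WriteByte(' ') // clean: normalize all whitespace to a space
		case unicode.Is(unicode.Han, r):
			b.WriteByte(' ') // pad each CJK character so it becomes a token
			b.WriteRune(r)
			b.WriteByte(' ')
		default:
			b.WriteRune(r)
		}
	}
	var toks []string
	for _, word := range strings.Fields(b.String()) {
		if lower {
			word = strings.ToLower(word)
		}
		toks = append(toks, splitPunct(word)...)
	}
	return toks
}

func main() {
	fmt.Println(basicTokenize("Hello, World!", true)) // prints: [hello , world !]
}
```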

type Feature

type Feature struct {
	ID       int32
	Text     string
	Tokens   []string
	TokenIDs []int32
	Mask     []int32 // short?
	TypeIDs  []int32 // sequence ids, short?
}

Feature is an input feature for a BERT model. It maps to extract_features.InputFeature in the ref-impl.
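To make the Mask and TypeIDs fields concrete, here is a sketch of how they could be filled, assuming the layout used by the reference feature-extraction script: the mask is 1 for real tokens and 0 for padding, and type IDs record which sequence a token belongs to. The feature type, the build helper and its naive truncation are hypothetical and may differ from what FeatureFactory does.

```go
package main

import "fmt"

// feature mirrors a subset of the Feature fields, for illustration only.
type feature struct {
	Tokens  []string
	Mask    []int32
	TypeIDs []int32
}

// build frames already-tokenized sequences with [CLS] and [SEP] and pads
// Mask and TypeIDs with zeros up to seqLen.
func build(seqLen int, seqs ...[]string) feature {
	f := feature{
		Mask:    make([]int32, seqLen),
		TypeIDs: make([]int32, seqLen),
	}
	add := func(tok string, typ int32) {
		if len(f.Tokens) == seqLen {
			return // naive truncation; a real strategy may differ
		}
		i := len(f.Tokens)
		f.Tokens = append(f.Tokens, tok)
		f.Mask[i] = 1
		f.TypeIDs[i] = typ
	}
	add("[CLS]", 0)
	for s, seq := range seqs {
		for _, tok := range seq {
			add(tok, int32(s))
		}
		add("[SEP]", int32(s))
	}
	return f
}

func main() {
	f := build(8, []string{"hello", "world"}, []string{"hi"})
	fmt.Println(f.Tokens)  // prints: [[CLS] hello world [SEP] hi [SEP]]
	fmt.Println(f.Mask)    // prints: [1 1 1 1 1 1 0 0]
	fmt.Println(f.TypeIDs) // prints: [0 0 0 0 1 1 0 0]
}
```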

func (Feature) Count

func (f Feature) Count() int

Count will return the number of tokens in the feature by counting the mask bits
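Counting by the mask could look like the sketch below, assuming the mask holds 1 for each real token and 0 for padding. The free function count is a hypothetical stand-in for the method.

```go
package main

import "fmt"

// count sums the mask entries. With a 0/1 mask this equals the number of
// real (non-padding) tokens.
func count(mask []int32) int {
	n := 0
	for _, m := range mask {
		n += int(m)
	}
	return n
}

func main() {
	mask := []int32{1, 1, 1, 1, 0, 0}
	fmt.Println(count(mask)) // prints: 4
}
```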

type FeatureFactory

type FeatureFactory struct {
	Tokenizer VocabTokenizer
	SeqLen    int32
	// contains filtered or unexported fields
}

FeatureFactory will create features with the supplied tokenizer and sequence length

func (*FeatureFactory) Feature

func (ff *FeatureFactory) Feature(text string) Feature

Feature will create a single feature from the factory. ID creation is thread safe and incremental.
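One common way to get thread-safe, incremental IDs in Go is an atomic counter. The sketch below shows that technique; it is an assumption about how such a guarantee can be met, not a description of this package's unexported fields. The idGen type, its methods and the starting value of 0 are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// idGen hands out increasing int32 IDs, starting at 0.
type idGen struct{ next int32 }

// nextID returns a unique ID. atomic.AddInt32 makes the increment safe for
// concurrent callers without a mutex.
func (g *idGen) nextID() int32 {
	return atomic.AddInt32(&g.next, 1) - 1
}

// issueConcurrently calls nextID from n goroutines and waits for all of them.
func issueConcurrently(g *idGen, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.nextID()
		}()
	}
	wg.Wait()
}

func main() {
	var g idGen
	issueConcurrently(&g, 100)
	fmt.Println(g.nextID()) // prints: 100
}
```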

func (*FeatureFactory) Features

func (ff *FeatureFactory) Features(texts ...string) []Feature

Features will create multiple features with incremental IDs

type Full

type Full struct {
	Basic     Basic
	Wordpiece Wordpiece
}

Full is a FullTokenizer, which comprises a Basic and a Wordpiece tokenizer.

func (Full) Tokenize

func (f Full) Tokenize(text string) []string

Tokenize will tokenize the input text. First Basic is applied, then Wordpiece is applied to the tokens from Basic.

func (Full) Vocab

func (f Full) Vocab() vocab.Dict

Vocab returns the vocab used for this tokenizer

type Option

type Option func(tkz Full) Full

Option alters the behavior of the tokenizer. TODO: add tests for these behavior changes.
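Option has the shape of a functional option that takes a value and returns a modified copy. A minimal sketch of that pattern follows, using a stand-in config type instead of Full. All names here are hypothetical, and the default for lower is an assumption; only the 200 and [UNK] defaults come from the constants documented above.

```go
package main

import "fmt"

// config is a stand-in for the tokenizer being configured.
type config struct {
	lower    bool
	maxChars int
	unknown  string
}

// option mirrors the value-in, value-out shape of Option.
type option func(config) config

func withLower(lower bool) option {
	return func(c config) config {
		c.lower = lower
		return c
	}
}

func withMaxChars(n int) option {
	return func(c config) config {
		c.maxChars = n
		return c
	}
}

func withUnknownToken(unk string) option {
	return func(c config) config {
		c.unknown = unk
		return c
	}
}

// newConfig starts from defaults and applies each option in order, so later
// options override earlier ones.
func newConfig(opts ...option) config {
	c := config{lower: true, maxChars: 200, unknown: "[UNK]"} // lower default is assumed
	for _, opt := range opts {
		c = opt(c)
	}
	return c
}

func main() {
	c := newConfig(withLower(false), withMaxChars(50))
	fmt.Printf("%+v\n", c) // prints: {lower:false maxChars:50 unknown:[UNK]}
}
```

Because each option receives and returns a value rather than a pointer, options cannot mutate a tokenizer that is already in use.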

func WithLower

func WithLower(lower bool) Option

WithLower will lowercase all input if set to true, or skip lowering if false. NOTE: a quirk inherited from the reference implementation is that lowering also strips accents.

func WithMaxChars

func WithMaxChars(wc int) Option

WithMaxChars sets the maximum length of a token to be tokenized; longer tokens are labeled as unknown.

func WithUnknownToken

func WithUnknownToken(unk string) Option

WithUnknownToken will change the unknown token from the default [UNK].

type Tokenizer

type Tokenizer interface {
	Tokenize(text string) (tokens []string)
}

Tokenizer is an interface for chunking a string into its tokens as per the BERT implementation.

type VocabTokenizer

type VocabTokenizer interface {
	Tokenizer
	vocab.Provider
}

VocabTokenizer comprises a Tokenizer and a vocab.Provider.

func NewTokenizer

func NewTokenizer(voc vocab.Dict, buf *bytebufferpool.ByteBuffer, opts ...Option) VocabTokenizer

NewTokenizer returns a new FullTokenizer. Use the Option arguments to modify the default behavior.

type Wordpiece

type Wordpiece struct {
	// contains filtered or unexported fields
}

Wordpiece is a tokenizer that breaks tokens into sub-word units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf, Section 4.1, for details.
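For orientation, here is a sketch of the greedy longest-match-first strategy used by the reference BERT WordPiece tokenizer. It is not this package's implementation. The wordpiece function, the map-based vocabulary, the "##" continuation prefix and the rune-based length check are assumptions made for illustration.

```go
package main

import "fmt"

// wordpiece splits one word into sub-word pieces found in vocab. At each
// position it takes the longest matching piece; pieces after the first get
// a "##" prefix. If any position has no match, or the word is longer than
// maxChars runes, the whole word maps to unk.
func wordpiece(word string, vocab map[string]bool, maxChars int, unk string) []string {
	runes := []rune(word)
	if len(runes) > maxChars {
		return []string{unk}
	}
	var out []string
	for start := 0; start < len(runes); {
		end := len(runes)
		piece := ""
		for ; end > start; end-- {
			cand := string(runes[start:end])
			if start > 0 {
				cand = "##" + cand
			}
			if vocab[cand] {
				piece = cand
				break
			}
		}
		if piece == "" {
			return []string{unk} // no piece matches at this position
		}
		out = append(out, piece)
		start = end
	}
	return out
}

func main() {
	vocab := map[string]bool{"un": true, "##aff": true, "##able": true}
	fmt.Println(wordpiece("unaffable", vocab, 200, "[UNK]")) // prints: [un ##aff ##able]
	fmt.Println(wordpiece("xyz", vocab, 200, "[UNK]"))       // prints: [[UNK]]
}
```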

func NewWordpiece

func NewWordpiece(voc vocab.Dict, buffer *bytebufferpool.ByteBuffer) Wordpiece

NewWordpiece returns a WordpieceTokenizer with the default settings. Generally it should be used in a FullTokenizer.

func (Wordpiece) CharLoop added in v0.2.1

func (wp Wordpiece) CharLoop(text string)

CharLoop simplifies the logic and avoids a slice memory leak.

func (Wordpiece) CheckIsLargeThanMaxWordChars added in v0.2.1

func (wp Wordpiece) CheckIsLargeThanMaxWordChars(text string) bool

CheckIsLargeThanMaxWordChars checks whether text is larger than wp.maxWordChars.

func (Wordpiece) SetMaxWordChars

func (wp Wordpiece) SetMaxWordChars(c int)

SetMaxWordChars will set the max chars for a word to be tokenized. Generally this should be configured through the FullTokenizer.

func (Wordpiece) SetUnknownToken

func (wp Wordpiece) SetUnknownToken(tok string)

SetUnknownToken will set the unknown token. Generally this should be configured through the FullTokenizer.

func (Wordpiece) SubTokenize added in v0.2.0

func (wp Wordpiece) SubTokenize(text string) bool

SubTokenize is the implementation of the old method.

func (Wordpiece) Tokenize

func (wp Wordpiece) Tokenize(text string) []string

Tokenize will segment the text into sub-word tokens from the supplied vocabulary. NOTE: This implementation does not exactly match the ref-impl and behaves slightly differently. See https://github.com/google-research/bert/issues/763
