tokenizer

package module
v1.0.0
Published: Sep 20, 2024 License: Apache-2.0 Imports: 21 Imported by: 0

README


Overview

tokenizer is a pure Go package that makes it easy to train, test, and run inference with Natural Language Processing (NLP) models in Go.

It is heavily inspired by and based on the popular HuggingFace Tokenizers.

tokenizer is part of an ambitious goal (together with transformer and gotch) to bring more AI/deep-learning tools to Gophers so that they can stick to the language they love and build faster software in production.

Features

tokenizer is built from modules located in sub-packages.

  1. Normalizer
  2. Pretokenizer
  3. Tokenizer
  4. Post-processing

It implements various tokenizer models:

  • Word level model
  • Wordpiece model
  • Byte Pair Encoding (BPE)

It can be used both for training new models from scratch and for fine-tuning existing models. See the examples for details.

Basic example

This tokenizer package can load pretrained models from HuggingFace. Some of them can be loaded via the pretrained subpackage.

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer"
	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	// Download and cache a pretrained tokenizer. Here `bert-base-uncased` from HuggingFace;
	// it can be any model with a `tokenizer.json` available, e.g. `tiiuae/falcon-7b`.
	configFile, err := tokenizer.CachedPath("bert-base-uncased", "tokenizer.json")
	if err != nil {
		panic(err)
	}

	tk, err := pretrained.FromFile(configFile)
	if err != nil {
		panic(err)
	}

	sentence := `The Gophers craft code using [MASK] language.`
	en, err := tk.EncodeSingle(sentence)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("tokens: %q\n", en.Tokens)
	fmt.Printf("offsets: %v\n", en.Offsets)

	// Output
	// tokens: ["the" "go" "##pher" "##s" "craft" "code" "using" "[MASK]" "language" "."]
	// offsets: [[0 3] [4 6] [6 10] [10 11] [12 17] [18 22] [23 28] [29 35] [36 44] [44 45]]
}

All models can also be loaded manually from files. See pkg.go.dev for the detailed APIs.
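For example, a tokenizer can be built straight from a local `tokenizer.json` with the pretrained subpackage; a minimal sketch (the file path is a placeholder):

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	// Load a tokenizer from a local HuggingFace-style tokenizer.json file.
	// "path/to/tokenizer.json" is a placeholder; point it at a real file on disk.
	tk, err := pretrained.FromFile("path/to/tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("vocab size (with added tokens): %d\n", tk.GetVocabSize(true))
}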

Getting Started

License

tokenizer is Apache 2.0 licensed.

Acknowledgement

Documentation

Overview

Package tokenizer represents a tokenization pipeline.

Index

Examples

Constants

View Source
const (
	WeightName    = "pytorch_model.gt"
	ConfigName    = "config.json"
	TokenizerName = "tokenizer.json"

	// NOTE. URL form := `$HFpath/ModelName/resolve/main/WeightName`
	HFpath = "https://huggingface.co"
)
View Source
const (
	RawInput = iota
	PretokenizedInput
	PretokenizedOwnedInput
	PretokenizedCowInput
)
View Source
const (
	SecondSequenceNotProvided = "Truncation error: Second sequence not provided"
	SequenceTooShort          = "Truncation error: Sequence to truncate too short to respect the provided max_length"
)

Variables

View Source
var (
	CachedDir string = "NOT_SETTING"
)
View Source
var (
	DUMMY_INPUT [][]int64 = [][]int64{
		{7, 6, 0, 0, 1},
		{1, 2, 3, 0, 0},
		{0, 0, 0, 4, 5},
	}
)

Functions

func CachedPath

func CachedPath(modelNameOrPath, fileName string) (resolvedPath string, err error)

CachedPath resolves and caches data based on the input string, then returns the full path to the cached data.

Parameters:
  - `modelNameOrPath`: model name, e.g. "bert-base-uncased", or path to a directory containing model/config files.
  - `fileName`: model or config file name, e.g. "pytorch_model.gt", "config.json".

CachedPath does the following steps in order:
  1. Resolves the input string to a full-path cached-file candidate.
  2. Checks for the candidate in the cache; if it exists, returns it. If not,
  3. Retrieves the data, caches it in `CachedDir`, and returns the path to the cached data.

NOTE: the default `CachedDir` is "{$HOME}/.cache/transformer". A custom `CachedDir` can be set via the `GO_TRANSFORMER` environment variable.
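A minimal sketch of the steps above, assuming network access to HuggingFace and using `bert-base-uncased` purely as an example:

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer"
)

func main() {
	// Resolve (and download on first use) tokenizer.json for a model hosted on HuggingFace.
	// Set the GO_TRANSFORMER environment variable before running to use a custom cache directory.
	file, err := tokenizer.CachedPath("bert-base-uncased", "tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("cached at:", file)
}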

func CleanCache

func CleanCache() error

CleanCache removes all files cached in transformer cache directory `CachedDir`.

NOTE: a custom `CachedDir` can be set via the `GO_TRANSFORMER` environment variable.

Types

type ATOption

type ATOption func(at *AddedToken)

func WithLStrip

func WithLStrip(lstrip bool) ATOption

WithLStrip specifies whether this token should include all the whitespace on its left, in order to strip it out.

func WithNormalized

func WithNormalized(normalized bool) ATOption

WithNormalized specifies whether this token should be normalized and match against its normalized version in the input text.

func WithRStrip

func WithRStrip(rstrip bool) ATOption

WithRStrip specifies whether this token should include all the whitespace on its right, in order to strip it out.

func WithSingleWord

func WithSingleWord(singleWord bool) ATOption

WithSingleWord specifies whether this token should only match on whole single words, and never part of a word.

type AddedToken

type AddedToken struct {
	// Content is the content of added token
	Content string
	// whether this token is single word or break words
	SingleWord bool
	// Whether this token should strip whitespace on its left
	LStrip bool
	// Whether this token should strip whitespace on its right
	RStrip bool
	// Whether this token should be normalized
	Normalized bool
}

AddedToken represents a token added by the user on top of the existing model vocabulary.

An AddedToken can be configured to specify the behaviour it should have in various situations, i.e.:
  - whether it should only match single words
  - whether to include any whitespace on its left or right

func DefaultAddedToken

func DefaultAddedToken() (retVal AddedToken)

DefaultAddedToken initiates a default AddedToken

func NewAddedToken

func NewAddedToken(s string, special bool, opts ...ATOption) (retVal AddedToken)

NewAddedToken builds an AddedToken from the given content, specifying whether it is intended to be a special token. NOTE: special tokens are not normalized by default.
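A minimal sketch of building an AddedToken with these options and registering it on an existing Tokenizer (the token content `[CUSTOM]` and the helper name are purely illustrative):

import (
	"fmt"

	"github.com/sugarme/tokenizer"
)

// addCustomToken registers an illustrative special token on an existing tokenizer.
func addCustomToken(tk *tokenizer.Tokenizer) {
	// Match whole words only and skip normalization for this token.
	tok := tokenizer.NewAddedToken("[CUSTOM]", true,
		tokenizer.WithSingleWord(true),
		tokenizer.WithNormalized(false),
	)

	// AddSpecialTokens returns the number of tokens actually added.
	n := tk.AddSpecialTokens([]tokenizer.AddedToken{tok})
	fmt.Println("special tokens added:", n)
}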

func (AddedToken) GetPattern

func (at AddedToken) GetPattern(n normalizer.Normalizer) (retVal string)

GetPattern retrieves the pattern built for this token, according to all the specified parameters.

NOTE. normalizer input is optional

func (AddedToken) SetLStrip

func (at AddedToken) SetLStrip(lstrip bool) (retVal AddedToken)

Specify whether this token should include all the whitespaces on its left, in order to strip them out.

func (AddedToken) SetNormalized

func (at AddedToken) SetNormalized(normalized bool) (retVal AddedToken)

Specify whether this token should be normalized and match against its normalized version in the input text.

func (AddedToken) SetRStrip

func (at AddedToken) SetRStrip(rstrip bool) (retVal AddedToken)

Specify whether this token should include all the whitespaces on its right, in order to strip them out.

func (AddedToken) SetSingleWord

func (at AddedToken) SetSingleWord(singleWord bool) (retVal AddedToken)

Specify whether this token should only match on whole single words, and never part of a word.

type AddedTokenWithId

type AddedTokenWithId struct {
	Id      int        // Id assigned to this token
	Special bool       // whether this is a special token
	Token   AddedToken // the target AddedToken
}

type AddedVocabulary

type AddedVocabulary struct {
	// contains filtered or unexported fields
}

AddedVocabulary is a vocabulary built on top of the Model

This provides a way to add new vocabulary to a Tokenizer that has already been trained, in a previous process, maybe by someone else. This is especially interesting in the case of fine-tunings, where we want to finetune a model while adding some new functionalities using some new special tokens, or maybe add some tokens in the case of unknown tokens, etc.

One of the reasons we need to handle these tokens outside of the model is simply that for many models, it is not possible to add new tokens after the training process. For example, using BPE, the training process generates merges pairs along the vocabulary, and any token in the vocabulary can be decomposed in other tokens, down to the original alphabet. If we were to add new tokens after this training process, we couldn't make sure the merges pairs exist as required.

func NewAddedVocabulary

func NewAddedVocabulary() (retVal AddedVocabulary)

func (*AddedVocabulary) AddSpecialTokens

func (av *AddedVocabulary) AddSpecialTokens(tokens []AddedToken, model Model, normalizer normalizer.Normalizer) (retVal int)

AddSpecialTokens adds some special tokens to the vocabulary. It returns the number of added tokens.

func (*AddedVocabulary) AddTokens

func (av *AddedVocabulary) AddTokens(tokens []AddedToken, model Model, normalizer normalizer.Normalizer) (retVal int)

AddTokens adds some tokens to the vocabulary. It returns the number of added tokens.

func (*AddedVocabulary) ExtractAndNormalize

func (av *AddedVocabulary) ExtractAndNormalize(sequence string, n normalizer.Normalizer) *PreTokenizedString

ExtractAndNormalize extracts the additional vocabulary from the given sentence, normalizing it along the way.

Some tokens should match against their normalized representation, as well as the non-normalized one. For example, when we expect to extract the token `yesterday` in the input sentence `I read a book Yesterday`, if the normalizer is supposed to lowercase everything, we expect a match.

func (*AddedVocabulary) GetVocab

func (av *AddedVocabulary) GetVocab() (retVal map[string]int)

GetVocab gets the additional vocabulary

func (*AddedVocabulary) IdToToken

func (av *AddedVocabulary) IdToToken(id int, model Model) (retVal string, ok bool)

Get the token matching the given id if it exists

func (*AddedVocabulary) IsSpecialToken

func (av *AddedVocabulary) IsSpecialToken(token string) bool

Check if a token is a special token

func (*AddedVocabulary) Len

func (av *AddedVocabulary) Len() int

Len returns size of the additional vocabulary

func (*AddedVocabulary) TokenToId

func (av *AddedVocabulary) TokenToId(token string, model Model) (retVal int, ok bool)

TokenToId gets the id matching one of our tokens, if it exists.

type BytesToCharOffsetConverter

type BytesToCharOffsetConverter struct {
	// contains filtered or unexported fields
}

func NewBytesToCharOffsetConverter

func NewBytesToCharOffsetConverter(sequence string) *BytesToCharOffsetConverter

func (*BytesToCharOffsetConverter) Convert

func (c *BytesToCharOffsetConverter) Convert(offsets []int) ([]int, error)

Convert converts byte-indexed offsets to character-index offsets.
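A short sketch of converting a byte-offset pair to character offsets, assuming Convert takes a `[start, end)` byte-offset pair (the sample string and expected result are illustrative):

import (
	"fmt"

	"github.com/sugarme/tokenizer"
)

func main() {
	// "café" is 5 bytes but 4 characters ("é" takes 2 bytes in UTF-8).
	seq := "café au lait"
	conv := tokenizer.NewBytesToCharOffsetConverter(seq)

	// Byte offsets [0, 5) cover "café"; in character terms this should become [0, 4).
	charOffsets, err := conv.Convert([]int{0, 5})
	if err != nil {
		panic(err)
	}
	fmt.Println(charOffsets)
}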

type Config

type Config struct {
	Version       string                 `json:"version"`
	Truncation    map[string]interface{} `json:"truncation"`
	Padding       map[string]interface{} `json:"padding"`
	AddedTokens   []TokenConfig          `json:"added_tokens"`
	Normalizer    map[string]interface{} `json:"normalizer"`
	PreTokenizer  map[string]interface{} `json:"pre_tokenizer"`
	PostProcessor map[string]interface{} `json:"post_processor"`
	Decoder       map[string]interface{} `json:"decoder"`
	Model         map[string]interface{} `json:"model"`
}

Config construct configuration for creating Tokenizer.

Example
tokFile, err := CachedPath("hf-internal-testing/llama-tokenizer", "tokenizer.json")
if err != nil {
	panic(err)
}

f, err := os.Open(tokFile)
if err != nil {
	panic(err)
}

dec := json.NewDecoder(f)

var config *Config

err = dec.Decode(&config)
if err != nil {
	panic(err)
}

modelConfig := util.NewParams(config.Model)

modelType := modelConfig.Get("type", "").(string)
fmt.Println(modelType)
Output:

BPE

func ConfigFromFile

func ConfigFromFile(file string) (*Config, error)

ConfigFromFile loads config from file.

type Decoder

type Decoder interface {
	Decode(tokens []string) string
	DecodeChain(tokens []string) []string
}

Decoder takes care of (merges) the given slice of tokens to string

type DecoderConfig

type DecoderConfig struct {
	Type     string                   `json:"type"`
	Decoders []map[string]interface{} `json:"decoders"`
}

type Dual

type Dual struct {
	Sentence InputSequence
	Pair     InputSequence
}

type EncodeInput

type EncodeInput interface {
	// contains filtered or unexported methods
}

func NewDualEncodeInput

func NewDualEncodeInput(sentence, pairSentence InputSequence) (retVal EncodeInput)

func NewSingleEncodeInput

func NewSingleEncodeInput(sentence InputSequence) (retVal EncodeInput)

type Encoding

type Encoding struct {
	Ids              []int         // ID produced by the `tokenizer`
	TypeIds          []int         // Type of the ID
	Tokens           []string      // Tokens associated with each ID
	Offsets          [][]int       // Offsets of the token/ID from the NormalizedString
	SpecialTokenMask []int         // Mask identifying special tokens
	AttentionMask    []int         // Mask identifying padding tokens for the attention mechanism
	Overflowing      []Encoding    // A list of overflowing generated when being truncated
	Words            []int         // Optional - Indexes of the word associated with each token/ID. None value = -1
	SequenceRanges   map[int]Range // Range of tokens covered by each sequence. If empty -> only one sequence and covers the entire range.
}

Encoding represents the output of tokenizer

func DefaultEncoding

func DefaultEncoding() *Encoding

DefaultEncoding creates an encoding with default values.

func DefaultProcess

func DefaultProcess(encoding, pairEncoding *Encoding, addSpecialTokens bool) *Encoding

DefaultProcess is a helper for PostProcessor's Process method. It fast-tracks processing by simply merging the encoding and its pair.

func MergeEncodings

func MergeEncodings(encodings []Encoding, growingOffsets bool) *Encoding

MergeEncodings merges slice of encodings together.

func NewEncoding

func NewEncoding(ids []int, typeIds []int, tokens []string, offsets [][]int, specialTokenMask []int, attentionMask []int, overflowing []Encoding, opts ...EncodingOpt) *Encoding

NewEncoding initiates a new encoding from input data.

func NewEncodingFromTokens

func NewEncodingFromTokens(tokens []Token, typeId int) (retVal *Encoding)

NewEncodingFromTokens initiates an Encoding from input tokens.

func NewEncodingWithCapacity

func NewEncodingWithCapacity(l int) (retVal *Encoding)

func PadEncodings

func PadEncodings(encodings []Encoding, params PaddingParams) []Encoding

func PrepareEncodings

func PrepareEncodings(encoding, pairEncoding *Encoding) (out []Encoding)

PrepareEncodings prepares encoding and pairEncoding if any before `ProcessEncodings` call.

func TruncateEncodings

func TruncateEncodings(encoding, pairEncoding *Encoding, params *TruncationParams) (tEncoding, tPairEncoding *Encoding)

func (*Encoding) Char2Token

func (e *Encoding) Char2Token(pos int) (retVal int, ok bool)

Char2Token returns a token index that contains the given `char` index

func (*Encoding) Char2Word

func (e *Encoding) Char2Word(pos int) (retVal int, ok bool)

Char2Word gets the word index that contains the given `char` index.
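A small sketch of mapping a character position back to its token and word, assuming `en` is an *Encoding produced by one of the Encode methods:

import (
	"fmt"

	"github.com/sugarme/tokenizer"
)

// inspectPosition reports which token and word cover the given character index.
func inspectPosition(en *tokenizer.Encoding, pos int) {
	if tokIdx, ok := en.Char2Token(pos); ok {
		fmt.Printf("char %d -> token %d (%q)\n", pos, tokIdx, en.GetTokens()[tokIdx])
	}
	if wordIdx, ok := en.Char2Word(pos); ok {
		fmt.Printf("char %d -> word %d\n", pos, wordIdx)
	}
}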

func (*Encoding) Clone

func (e *Encoding) Clone() *Encoding

func (*Encoding) GetAttentionMask

func (e *Encoding) GetAttentionMask() []int

GetAttentionMask returns attentionMask from encoding

func (*Encoding) GetIds

func (e *Encoding) GetIds() []int

GetIds returns Ids from encoding

func (*Encoding) GetOffsets

func (e *Encoding) GetOffsets() [][]int

GetOffsets returns offsets from encoding

func (*Encoding) GetOverflowing

func (e *Encoding) GetOverflowing() []Encoding

GetOverflowing returns overflowing from encoding

func (*Encoding) GetSequenceIds

func (e *Encoding) GetSequenceIds() []int

func (*Encoding) GetSpecialTokenMask

func (e *Encoding) GetSpecialTokenMask() []int

GetSpecialTokenMask returns specialTokenMask from encoding

func (*Encoding) GetTokens

func (e *Encoding) GetTokens() []string

GetTokens returns the tokens from the encoding.

func (*Encoding) GetTypeIds

func (e *Encoding) GetTypeIds() []int

GetTypeIds returns type Ids from encoding

func (*Encoding) GetWords

func (e *Encoding) GetWords() []int

GetWords returns word indexes on normalized string

func (*Encoding) IsEmpty

func (e *Encoding) IsEmpty() (retVal bool)

IsEmpty returns whether Encoding is empty

func (*Encoding) Len

func (e *Encoding) Len() (retVal int)

Len returns number of encoding tokens

func (*Encoding) Merge

func (e *Encoding) Merge(encodings []Encoding, growingOffsets bool) (retVal *Encoding)

Merge merges all Encodings together

func (*Encoding) MergeWith

func (e *Encoding) MergeWith(pair *Encoding, growingOffsets bool) (retVal *Encoding)

MergeWith merges the current encoding with other (pair) encoding

func (*Encoding) NSequences

func (e *Encoding) NSequences() int

NSequences returns number of sequences combined in this encoding.

func (*Encoding) Pad

func (e *Encoding) Pad(targetLength, padId, padTypeId int, padToken string, direction PaddingDirection) *Encoding

Pad pads the current encoding to the given length with the given values, in either the Left or Right direction.

func (*Encoding) SequenceRange

func (e *Encoding) SequenceRange(sequencId int) (Range, error)

SequenceRange returns the range to target to retrieve something (word id, offsets, ...) related to the given sequence id.

func (*Encoding) SetOverflowing

func (e *Encoding) SetOverflowing(overflowing []Encoding)

SetOverflowing set overflowing.

func (*Encoding) SetSequenceIds

func (e *Encoding) SetSequenceIds(sequenceId int)

SetSequenceIds sets the given sequence id for the whole range of tokens contained in this Encoding.

func (*Encoding) SetTypeIds

func (e *Encoding) SetTypeIds(typeIds []int)

func (*Encoding) SetWord

func (e *Encoding) SetWord(index int, val int)

SetWord sets the word index value at the given index in the e.Words slice.

func (*Encoding) TakeOverflowing

func (e *Encoding) TakeOverflowing() []Encoding

TakeOverflowing returns the overflowing encodings and resets them to empty on the encoding.

func (*Encoding) Token2Chars

func (e *Encoding) Token2Chars(tokenIdx int) (retVal []int, ok bool)

Token2Chars gets the offsets of the token at the given index.

func (*Encoding) Token2Sequence

func (e *Encoding) Token2Sequence(token int) (int, bool)

Token2Sequence returns the index of the sequence containing the given token.

func (*Encoding) Token2Word

func (e *Encoding) Token2Word(tokenIdx int) (retVal int, ok bool)

Token2Word gets the word index of the corresponding token, if it exists.

func (*Encoding) Truncate

func (e *Encoding) Truncate(maxLen int, stride int) (retVal *Encoding, err error)

Truncate truncates the current encoding

func (*Encoding) Word2Chars

func (e *Encoding) Word2Chars(word int) (retVal []int, ok bool)

Word2Chars gets the offsets of the word at a given index in the input sequence.

func (*Encoding) Word2Tokens

func (e *Encoding) Word2Tokens(word int) (startTok, endTok int, ok bool)

Word2Tokens gets the encoded tokens corresponding to the word at the given index in the input sequence, in the form `(startToken, endToken + 1)`.

NOTE: e.Words is optional; there may therefore be no result, in which case `ok` will be false.

type EncodingOpt

type EncodingOpt func(o *EncodingOpts)

func WithSequenceRangeEncodingOpt

func WithSequenceRangeEncodingOpt(v map[int]Range) EncodingOpt

func WithWordsEncodingOpt

func WithWordsEncodingOpt(v []int) EncodingOpt

type EncodingOpts

type EncodingOpts struct {
	Words         []int
	SequenceRange map[int]Range
}

func DefaultEncodingOpts

func DefaultEncodingOpts() *EncodingOpts

type InputSequence

type InputSequence struct {
	// contains filtered or unexported fields
}

func NewInputSequence

func NewInputSequence(input interface{}) (retVal InputSequence)

NewInputSequence creates a new InputSequence from the input. A valid input can be a string (RawInput) or a slice of strings (PretokenizedInput).
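A minimal sketch of building raw input sequences and feeding them to Encode as a pair (the helper name is illustrative):

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer"
)

// encodePairOfSequences builds a pair input from two raw strings and encodes it.
func encodePairOfSequences(tk *tokenizer.Tokenizer, a, b string) {
	seqA := tokenizer.NewInputSequence(a) // RawInput: a plain string
	seqB := tokenizer.NewInputSequence(b) // a []string would instead produce a PretokenizedInput sequence

	input := tokenizer.NewDualEncodeInput(seqA, seqB)

	en, err := tk.Encode(input, true) // true: add special tokens
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(en.GetTokens())
}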

type InputType

type InputType int

type Model

type Model interface {
	// Tokenize tokenizes the given sequence into multiple underlying `Token`
	// The `offsets` on the `Token` are expected to be relative to the given
	// sequence
	Tokenize(sequence string) ([]Token, error)
	// TokenToId finds the ID associated with a string token
	TokenToId(token string) (id int, ok bool)
	// IdToToken find the string token associated with an ID
	IdToToken(id int) (token string, ok bool)
	// GetVocab retrieves the entire vocabulary mapping (token -> Id)
	GetVocab() map[string]int
	// GetVocabSize retrieves the size of the vocabulary
	GetVocabSize() int
	// Save saves the current `Model` in the given folder, using the
	// given `prefixOpt` for various files that need to be saved.
	Save(path string, prefixOpt ...string) error
}

Model represents a model used during tokenization (e.g., BPE, Word, or Unigram).
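To illustrate the contract only, a toy word-level Model backed by a fixed map might look as follows; it is a sketch, not a substitute for the provided BPE/Wordpiece/Word-level models:

import (
	"strings"

	"github.com/sugarme/tokenizer"
)

// toyModel is a toy word-level model backed by a fixed vocabulary map.
type toyModel struct {
	vocab map[string]int
	ids   map[int]string
}

// Tokenize splits on whitespace and looks each word up in the toy vocabulary.
func (m toyModel) Tokenize(sequence string) ([]tokenizer.Token, error) {
	var out []tokenizer.Token
	start := 0
	for _, w := range strings.Fields(sequence) {
		idx := strings.Index(sequence[start:], w) + start
		id, ok := m.vocab[w]
		if !ok {
			id = -1 // unknown word in this toy example
		}
		out = append(out, tokenizer.NewToken(id, w, []int{idx, idx + len(w)}))
		start = idx + len(w)
	}
	return out, nil
}

func (m toyModel) TokenToId(token string) (int, bool) { id, ok := m.vocab[token]; return id, ok }
func (m toyModel) IdToToken(id int) (string, bool)    { tok, ok := m.ids[id]; return tok, ok }
func (m toyModel) GetVocab() map[string]int           { return m.vocab }
func (m toyModel) GetVocabSize() int                  { return len(m.vocab) }

// Save is a no-op here; a real model would persist its vocab/merges files.
func (m toyModel) Save(path string, prefixOpt ...string) error { return nil }

Such a toy model can then be passed to NewTokenizer (e.g. tk := tokenizer.NewTokenizer(toyModel{...})) to obtain a full pipeline.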

type ModelConfig

type ModelConfig struct {
	Type                    string         `json:"type"`
	Dropout                 interface{}    `json:"dropout"`
	UnkToken                string         `json:"unk_token"`
	ContinuingSubwordPrefix interface{}    `json:"continuing_subword_prefix"`
	EndOfWordSuffix         interface{}    `json:"end_of_word_suffix"`
	FuseUnk                 bool           `json:"fuse_unk"`
	ByteFallback            bool           `json:"byte_fallback"`
	Vocab                   map[string]int `json:"vocab"`
	Merges                  []string       `json:"merges"`
	MaxInputCharsPerWord    interface{}    `json:"max_input_chars_per_word"`
}

type NormalizerConfig

type NormalizerConfig struct {
	Type        string                   `json:"type"`
	Normalizers []map[string]interface{} `json:"normalizers"`
}

type OffsetConverter

type OffsetConverter interface {
	Convert(offsets []int) ([]int, error)
}

type OffsetType

type OffsetType int

OffsetType is an enum-like type for the possible types of offsets.

const (
	Byte OffsetType = iota
	Char
)

type PaddingDirection

type PaddingDirection int

const (
	Left PaddingDirection = iota
	Right
)

type PaddingParams

type PaddingParams struct {
	Strategy  PaddingStrategy
	Direction PaddingDirection
	PadId     int
	PadTypeId int
	PadToken  string
}

type PaddingStrategy

type PaddingStrategy struct {
	Value interface{}
	Name  string
}

PaddingStrategy is an enum of either:
  - the string `BatchLongest`
  - or a func type `Fixed(uint)` which returns a uint

Example:

func main() {
	ps := NewPaddingStrategy(WithFixed(3))
	fmt.Println(ps.Value)
}

func NewPaddingStrategy

func NewPaddingStrategy(opts ...PaddingStrategyOption) *PaddingStrategy

type PaddingStrategyOption

type PaddingStrategyOption func(*PaddingStrategy)

func WithBatchLongest

func WithBatchLongest() PaddingStrategyOption

func WithFixed

func WithFixed(size int) PaddingStrategyOption

type PostProcessor

type PostProcessor interface {
	// AddedTokens returns the number of tokens that will be added during the processing step
	AddedTokens(isPair bool) int
	// Process processes both encodings and returns a new merged one
	// NOTE: pairEncoding is optional
	Process(encoding, pairEncoding *Encoding, addSpecialTokens bool) *Encoding
}

PostProcessor is in charge of post-processing an encoded output of the `Tokenizer`. It adds any special tokens that a language model would require.

type PostProcessorConfig

type PostProcessorConfig struct {
	Type          string                   `json:"type"`
	Single        []map[string]interface{} `json:"single"`
	Pair          []map[string]interface{} `json:"pair"`
	SpecialTokens map[string]interface{}   `json:"speical_tokens"`
}

type PreToken

type PreToken struct {
	Value   string
	Offsets []int
	Tokens  []Token // optional
}

type PreTokenizedString

type PreTokenizedString struct {
	// contains filtered or unexported fields
}

The `PreTokenizedString` is in charge of splitting an underlying string, making sure everything is fine while doing so, and providing ways to normalize and tokenize these splits.

Once everything has been normalized and tokenized, the `PreTokenizedString` is able to build an `Encoding` with all the relevant offsets and word ids, relative to the original string.

func NewPreTokenizedString

func NewPreTokenizedString(s string) *PreTokenizedString

NewPreTokenizedString create a new PreTokenizedString from input string

func NewPreTokenizedStringFromNS

func NewPreTokenizedStringFromNS(n *normalizer.NormalizedString) *PreTokenizedString

NewPreTokenizedStringFromNS creates a PreTokenizedString from an input NormalizedString.

func (*PreTokenizedString) GetSplits

func (pt *PreTokenizedString) GetSplits(offsetRef normalizer.IndexOn, offsetType OffsetType) []PreToken

GetSplits returns a list of splits, each of them being a slice of the normalized string, the associated offsets in either the original or the normalized referential, as well as the potential tokens.

func (*PreTokenizedString) IntoEncoding

func (pt *PreTokenizedString) IntoEncoding(typeId int, wordIdx int, offsetType OffsetType) (*Encoding, error)

IntoEncoding transforms the current `PreTokenizedString` into an `Encoding`.

If a `wordIdx` is provided, every word in the generated `Encoding` will be set to this value. This is generally used with pre-tokenized input that does not need the `PreTokenizedString` to generate word ids.

This method will fail if some splits do not have associated `Token`.

func (*PreTokenizedString) Normalize

Normalize normalizes all the splits that do not have attached `Tokens`, using the provided `normalize` function.

func (*PreTokenizedString) Split

func (pt *PreTokenizedString) Split(splitFn SplitFn) *PreTokenizedString

Split splits the `PreTokenizedString` using the provided `SplitFn`, which is in charge of splitting each substring (`NormalizedString`) into multiple parts.

func (*PreTokenizedString) Tokenize

Tokenize tokenizes all the splits that do not have attached `Tokens`, using the provided `tokenize` function

type PreTokenizer

type PreTokenizer interface {
	PreTokenize(*PreTokenizedString) (*PreTokenizedString, error)
}

PreTokenizer is in charge of doing the pre-segmentation step. It splits the given string in multiple substrings, keeping track of the offsets of said substrings from the `NormalizedString`. In some occasions, the `PreTokenizer` might need to modify the given `NormalizedString` to ensure we can entirely keep track of the offsets and the mapping with the original string.

type PreTokenizerConfig

type PreTokenizerConfig struct{}

type Range

type Range []int

func NewRange

func NewRange(start, end int) Range

func (Range) Contains

func (r Range) Contains(item int) bool

func (Range) IsEmpty

func (r Range) IsEmpty() bool

func (Range) Len

func (r Range) Len() int

type Single

type Single struct {
	Sentence InputSequence
}

type Split

type Split struct {
	// contains filtered or unexported fields
}

Split contains the underlying `NormalizedString` as well as its offsets in the original string. These offsets are in the `original` referential. It also contains any `Token` associated to the current split

func NewSplit

func NewSplit(normalized *normalizer.NormalizedString, tokens []Token) Split

NewSplit creates a new Split from an input NormalizedString.

type SplitFn

type SplitFn func(int, *normalizer.NormalizedString) []SplitIdx

SplitFn takes a `NormalizedString` and returns an iterator over the produced `NormalizedString`.

NOTE: SplitFn is free to modify these `NormalizedString`s as long as the produced `NormalizedString`s, if combined back together, have the same `original` string as the one originally given to `SplitFn`. This means that for offset tracking to work as expected, `SplitFn` must produce "splits" of the ORIGINAL string.

type SplitIdx

type SplitIdx struct {
	Normalized *normalizer.NormalizedString
	Tokens     []Token
}

type Token

type Token struct {
	Id      int
	Value   string
	Offsets []int
}

func NewToken

func NewToken(id int, value string, offsets []int) Token

NewToken generates a new Token from the input data.

type TokenConfig

type TokenConfig struct {
	Id         int64  `json:"id"`
	Content    string `json:"content"`
	SingleWord bool   `json:"single_word"`
	Lstrip     bool   `json:"lstrip"`
	Rstrip     bool   `json:"rstrip"`
	Normalized bool   `json:"normalized"`
	Special    bool   `json:"special"`
}

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer represents a tokenization pipeline. It can implement any encoding or decoding of any text.

func NewTokenizer

func NewTokenizer(model Model) *Tokenizer

NewTokenizer instantiates a new Tokenizer with the given model.

func NewTokenizerFromFile

func NewTokenizerFromFile(file string) (retVal *Tokenizer)

NewTokenizerFromFile instantiates a new Tokenizer from the given file

func (*Tokenizer) AddSpecialTokens

func (t *Tokenizer) AddSpecialTokens(tokens []AddedToken) (retVal int)

AddSpecialTokens registers the given tokens as special tokens. This is especially useful for removing these special tokens while decoding

func (*Tokenizer) AddTokens

func (t *Tokenizer) AddTokens(tokens []AddedToken) (retVal int)

AddTokens adds the given tokens to the added vocabulary

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(ids []int, skipSpecialTokens bool) (retVal string)

Decode decodes the given ids, back to a String

func (*Tokenizer) DecodeBatch

func (t *Tokenizer) DecodeBatch(sentences [][]int, skipSpecialTokens bool) []string

DecodeBatch decodes all sentences concurrently.

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(input EncodeInput, addSpecialTokens bool) (retVal *Encoding, err error)

Encode the given input. This method accepts both single sequences, as well as pair sequences. Also, a sequence can be a string, or already pre-tokenized input directly:

Example
package main

import (
	"fmt"
	"log"

	"github.com/danmolitor/tokenizer/pretrained"
)

func main() {

	tk := pretrained.BertBaseUncased()
	sentence := `Yesterday I saw a [MASK] far away`

	en, err := tk.EncodeSingle(sentence)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("tokens: %v\n", en.GetTokens())
	fmt.Printf("offsets: %v\n", en.GetOffsets())

}
Output:

tokens: [yesterday i saw a [MASK] far away]
offsets: [[0 9] [10 11] [12 15] [16 17] [18 24] [25 28] [29 33]]

func (*Tokenizer) EncodeBatch

func (t *Tokenizer) EncodeBatch(inputs []EncodeInput, addSpecialTokens bool) (retVal []Encoding, err error)

EncodeBatch encodes all sentences concurrently.

func (*Tokenizer) EncodeCharOffsets

func (t *Tokenizer) EncodeCharOffsets(input EncodeInput, addSpecialTokens bool) (*Encoding, error)

EncodeCharOffsets encodes the given input, using offsets relative to chars instead of bytes. This method accepts both single sequences, as well as pair sequences. Also, a sequence can be a string, or already pre-tokenized input directly:

func (*Tokenizer) EncodePair

func (t *Tokenizer) EncodePair(input, pair string, addSpecialTokensOpt ...bool) (*Encoding, error)

EncodePair encodes a pair of string sequences.

Params:
  - input: the sequence string to be tokenized
  - pair: the pair sequence string to be tokenized with
  - addSpecialTokensOpt: optional (default = false); whether to add special tokens, e.g. `[CLS]`, `[UNK]` or `[SEP]` in the BERT model
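A brief sketch using the pretrained BERT tokenizer (the question/context sentences are illustrative):

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	tk := pretrained.BertBaseUncased()

	// Encode a question/context pair; `true` asks the post-processor to add special tokens.
	en, err := tk.EncodePair("Where do Gophers live?", "Gophers live in burrows.", true)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(en.GetTokens())
	fmt.Println(en.GetTypeIds()) // 0s for the first sequence, 1s for the pair
}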

func (*Tokenizer) EncodeSingle

func (t *Tokenizer) EncodeSingle(input string, addSpecialTokensOpt ...bool) (*Encoding, error)

EncodeSingle encodes a single input string.

Params:
  - input: the input string to be tokenized
  - addSpecialTokensOpt: optional (default = false); whether to add special tokens, e.g. `[CLS]`, `[UNK]` or `[SEP]` in the BERT model

func (*Tokenizer) EncodeSingleSequence

func (t *Tokenizer) EncodeSingleSequence(sequence InputSequence, typeId int, offsetType OffsetType) (*Encoding, error)

EncodeSingleSequence encodes a single sequence

func (*Tokenizer) GetDecoder

func (t *Tokenizer) GetDecoder() Decoder

func (*Tokenizer) GetModel

func (t *Tokenizer) GetModel() Model

func (*Tokenizer) GetNormalizer

func (t *Tokenizer) GetNormalizer() normalizer.Normalizer

func (*Tokenizer) GetPadding

func (t *Tokenizer) GetPadding() (retVal *PaddingParams)

func (*Tokenizer) GetPostProcessor

func (t *Tokenizer) GetPostProcessor() PostProcessor

func (*Tokenizer) GetPreTokenizer

func (t *Tokenizer) GetPreTokenizer() PreTokenizer

func (*Tokenizer) GetSpecialTokens

func (t *Tokenizer) GetSpecialTokens() []string

GetSpecialTokens returns a slice of special tokens.

func (*Tokenizer) GetTruncation

func (t *Tokenizer) GetTruncation() *TruncationParams

func (*Tokenizer) GetVocab

func (t *Tokenizer) GetVocab(withAddedTokens bool) map[string]int

GetVocab gets the vocabulary.

func (*Tokenizer) GetVocabSize

func (t *Tokenizer) GetVocabSize(withAddedTokens bool) int

GetVocabSize gets the size of the vocabulary.

func (*Tokenizer) IdToToken

func (t *Tokenizer) IdToToken(id int) (token string, ok bool)

IdToToken converts an Id to a corresponding token

func (*Tokenizer) PostProcess

func (t *Tokenizer) PostProcess(encoding, pairEncoding *Encoding, addSpecialTokens bool) (retVal *Encoding)

PostProcess does post-processing logic, handling the case where there is no PostProcessor set

func (*Tokenizer) Save

func (t *Tokenizer) Save(path string, pretty bool) (err error)

Save saves the current tokenizer at the given path

func (*Tokenizer) Serialize

func (t *Tokenizer) Serialize(pretty bool) (retVal string)

Serialize serializes current Tokenizer to string

func (*Tokenizer) TokenToId

func (t *Tokenizer) TokenToId(token string) (id int, ok bool)

TokenToId converts a token to a corresponding id

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(input string, addSpecialTokensOpt ...bool) ([]string, error)

Tokenize slices input string into tokens.

Params:
  - input: the input string to be tokenized
  - addSpecialTokensOpt: optional (default = false); whether to add special tokens, e.g. `[CLS]`, `[UNK]` or `[SEP]` in the BERT model

func (*Tokenizer) Train

func (t *Tokenizer) Train(trainer Trainer, files []string) error

Train trains a model and replaces the current model using the given trainer. The tokenizer performs the following steps:

  1. Concurrently reads training data (text) from files, normalizes the text using the specified normalizer, and generates a slice of words with their frequency (count).
  2. Trains the tokenizer model with the specified configuration on the word counts from the previous step to create the `vocab` and `merges` data (files).
  3. Updates the current tokenizer with the newly generated model (`vocab` and `merges` data).

func (*Tokenizer) TrainAndReplace

func (t *Tokenizer) TrainAndReplace(trainer Model, files []string) (err error)

TrainAndReplace trains a model and replaces the current Model, using the given Trainer.

func (*Tokenizer) WithDecoder

func (t *Tokenizer) WithDecoder(decoder Decoder)

func (*Tokenizer) WithModel

func (t *Tokenizer) WithModel(model Model)

func (*Tokenizer) WithNormalizer

func (t *Tokenizer) WithNormalizer(n normalizer.Normalizer)

func (*Tokenizer) WithPadding

func (t *Tokenizer) WithPadding(padding *PaddingParams)

func (*Tokenizer) WithPostProcessor

func (t *Tokenizer) WithPostProcessor(postProcessor PostProcessor)

func (*Tokenizer) WithPreTokenizer

func (t *Tokenizer) WithPreTokenizer(preTokenizer PreTokenizer)

func (*Tokenizer) WithTruncation

func (t *Tokenizer) WithTruncation(trunc *TruncationParams)
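A minimal sketch of configuring padding and truncation together; the pad token and ids below assume a BERT-style vocabulary and should be adjusted for other models:

import "github.com/sugarme/tokenizer"

// configurePadding sets fixed-length padding and longest-first truncation on an existing tokenizer.
func configurePadding(tk *tokenizer.Tokenizer) {
	tk.WithPadding(&tokenizer.PaddingParams{
		Strategy:  *tokenizer.NewPaddingStrategy(tokenizer.WithFixed(128)),
		Direction: tokenizer.Right,
		PadId:     0, // assumed [PAD] id; model-dependent
		PadTypeId: 0,
		PadToken:  "[PAD]",
	})

	tk.WithTruncation(&tokenizer.TruncationParams{
		MaxLength: 128,
		Strategy:  tokenizer.LongestFirst,
		Stride:    0,
	})
}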

type Trainer

type Trainer interface {
	// Whether showing progress bar or not
	WithProgressBar() bool
	// Actual training method. It will return a trained model and
	// a list of `special tokens` to be added directly to the tokenizer
	// along with the model
	Train(words map[string]int) (Model, []AddedToken)
	// ProcessTokens processes a bunch of tokens and counts them as relevant
	ProcessTokens(words map[string]int, tokens []string)
}

Trainer is responsible for training a model. It takes lines/sentences and returns a tokenizer `Model` when done.
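To illustrate the interface only, a toy Trainer that builds a word-level vocabulary from word counts might look like this; it reuses the toy word-level Model sketched earlier under the Model type (and shares its imports), while real training is provided by the model sub-packages:

// toyTrainer is a toy Trainer that turns word counts into a word-level vocabulary.
type toyTrainer struct{}

func (t toyTrainer) WithProgressBar() bool { return false }

// ProcessTokens counts the given tokens into the shared word-count map.
func (t toyTrainer) ProcessTokens(words map[string]int, tokens []string) {
	for _, tok := range tokens {
		words[tok]++
	}
}

// Train assigns an id to every seen word and returns a toy word-level model
// together with no extra special tokens.
func (t toyTrainer) Train(words map[string]int) (tokenizer.Model, []tokenizer.AddedToken) {
	vocab := make(map[string]int)
	ids := make(map[int]string)
	i := 0
	for w := range words {
		vocab[w] = i
		ids[i] = w
		i++
	}
	return toyModel{vocab: vocab, ids: ids}, nil
}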

type TruncationParams

type TruncationParams struct {
	MaxLength int
	Strategy  TruncationStrategy
	Stride    int
}

type TruncationStrategy

type TruncationStrategy int

TruncationStrategy is an int-based enum that represents the truncation strategy.

const (
	LongestFirst TruncationStrategy = iota
	OnlyFirst
	OnlySecond
)

Directories

Path	Synopsis
example
bpe
bpe	Basic text preprocessing tasks are: 1.
slice	utils slice manipulation Ref.
