tokenizers

package module
v0.0.0-...-1de48c6
Published: Aug 24, 2024 License: MIT Imports: 21 Imported by: 0

README

Tokenizers for Go

Under Construction

Not functional yet, but for Gemma/Gemini/T5 and other Google models, see https://github.com/eliben/go-sentencepiece/.

About

Tokenizers for Language Models - Go API for HuggingFace Tokenizers

Highlights

[!IMPORTANT]
TODO: nothing implemented yet.

  • Allow customization for various LLMs, exposing most of the functionality of the HuggingFace Tokenizers library.
  • Provide a from_pretrained API that downloads parameters for various known models -- leveraging HuggingFace Hub.

Installation

This library is a wrapper around the Rust implementation by HuggingFace, and it requires the compiled Rust code to be available as libgomlx_tokenizers.a.

To make that easy, the project provides prebuilt libgomlx_tokenizers.a files in the git repository (for the popular platforms), so for many users nothing extra is needed (other than having CGO enabled -- for cross-compilation set CGO_ENABLED=1), and the package can be included like any other Go library.

If you want to build the underlying Rust wrapper and its dependencies yourself for any reason (including adding support for a different platform), the project uses the Mage build system -- a Makefile-like build tool written in Go.

If you create a new rule for a different platform, please consider contributing it back 😄

[!IMPORTANT]
TODO

Thank You

Questions

Why fork instead of collaborating with an already existing tokenizers project?

I plan to revamp how the library is organized and its "ergonomics", to be more aligned with the GoMLX APIs, and to add documentation. I will also expand the functionality to match (as much as I am able to) HuggingFace's library. All this will completely break the API of the original repositories, which I felt was too much to ask of the original authors.

Documentation

Overview

Package tokenizers provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

It is currently a wrapper around the Rust implementation in https://github.com/huggingface/tokenizers/tree/main/tokenizers.

For now, it only provides the encoding and decoding functionality -- not training new tokenizers. It includes reading from [HuggingFace's pretrained tokenizers](https://huggingface.co/docs/tokenizers/index) using `FromPretrainedWith`.
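
For illustration, a minimal end-to-end sketch. The import path is assumed here, and "bert-base-uncased" is just an example model name:

package main

import (
	"fmt"
	"log"

	"github.com/gomlx/tokenizers" // import path assumed for illustration
)

func main() {
	// Download (or load from the cache) a pretrained tokenizer.
	tok, err := tokenizers.FromPretrainedWith("bert-base-uncased").Done()
	if err != nil {
		log.Fatal(err)
	}
	defer tok.Finalize() // Optional: releases the Rust-side memory early.

	// Encode a sentence to token ids, then decode it back.
	encoding, err := tok.Encode("Hello world!")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(encoding.TokenIds)
	fmt.Println(tok.Decode(encoding.TokenIds, true)) // skipSpecialTokens=true
}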

Index

Constants

const (
	HeaderXRepoCommit = "X-Repo-Commit"
	HeaderXLinkedETag = "X-Linked-Etag"
	HeaderXLinkedSize = "X-Linked-Size"
)
const RepoIdSeparator = "--"

RepoIdSeparator is used to separate repository/model names parts when mapping to file names. Likely only for internal use.

Variables

var (
	// DefaultDirCreationPerm is used when creating new cache subdirectories.
	DefaultDirCreationPerm = os.FileMode(0755)

	// DefaultFileCreationPerm is used when creating files inside the cache subdirectories.
	DefaultFileCreationPerm = os.FileMode(0644)
)
var (
	RepoTypesUrlPrefixes = map[string]string{
		"dataset": "datasets/",
		"space":   "spaces/",
	}

	DefaultRevision = "main"

	HuggingFaceUrlTemplate = template.Must(template.New("hf_url").Parse(
		"https://huggingface.co/{{.RepoId}}/resolve/{{.Revision}}/{{.Filename}}"))
)
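
For illustration, a sketch of rendering the template directly; the struct field names follow the placeholders in the template text above, and the repository name is made up:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gomlx/tokenizers" // import path assumed for illustration
)

func main() {
	// Field names match the {{.RepoId}}, {{.Revision}} and {{.Filename}}
	// placeholders of HuggingFaceUrlTemplate.
	data := struct{ RepoId, Revision, Filename string }{
		RepoId:   "google/gemma-2b", // illustrative repository name
		Revision: tokenizers.DefaultRevision,
		Filename: "tokenizer.json",
	}
	var buf strings.Builder
	if err := tokenizers.HuggingFaceUrlTemplate.Execute(&buf, data); err != nil {
		log.Fatal(err)
	}
	fmt.Println(buf.String())
	// https://huggingface.co/google/gemma-2b/resolve/main/tokenizer.json
}
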
var SessionId string

Functions

func DefaultCacheDir

func DefaultCacheDir() string

DefaultCacheDir returns the default cache directory for HuggingFace Hub, the same as used by the Python library.

Its prefix is either `${XDG_CACHE_HOME}` if set, or `~/.cache` otherwise, followed by `/huggingface/hub/`. So typically: `~/.cache/huggingface/hub/`.
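
A minimal sketch of that lookup logic, not the actual implementation (assumes `os` and `path/filepath` are imported; the fallback to "." is an assumption):

// Sketch of the resolution described above -- not the actual implementation.
func defaultCacheDir() string {
	base := os.Getenv("XDG_CACHE_HOME")
	if base == "" {
		home, err := os.UserHomeDir()
		if err != nil {
			home = "." // fall back to the current directory (assumption)
		}
		base = filepath.Join(home, ".cache")
	}
	return filepath.Join(base, "huggingface", "hub")
}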

func Download

func Download(ctx context.Context, client *http.Client,
	repoId, repoType, revision, fileName, cacheDir, token string,
	forceDownload, forceLocal bool, progressFn ProgressFn) (filePath, commitHash string, err error)

Download returns a file either from the cache or by downloading it from HuggingFace Hub.

Args:

  • `ctx` for the requests. There may be more than one request, the first being a `HEAD` HTTP request.
  • `client` used to make the HTTP requests. It can be created with `&http.Client{}`.
  • `repoId` and `fileName`: define the file and repository (model) name to download.
  • `repoType`: usually "model".
  • `revision`: defaults to "main", but a commitHash can be given.
  • `cacheDir`: directory in which to store the downloaded files, or from which to reuse them if previously downloaded. Consider using the output of `DefaultCacheDir()` if in doubt.
  • `token`: used for authentication. TODO: not implemented yet.
  • `forceDownload`: if set to true, it will download the contents of the file even if there is a local copy.
  • `forceLocal`: does not use the network, not even for reading the metadata.
  • `progressFn`: called during the download of a file. It is called synchronously and is expected to be fast (near instantaneous). If the UI may block, arrange for it to be handled on a separate goroutine.

On success it returns the `filePath` to the downloaded file and its `commitHash`. Otherwise it returns an error.
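
A hedged sketch of a call, following the argument list above; the repository and file names are illustrative, and passing `nil` for `progressFn` is an assumption (assumes `context`, `net/http`, `fmt` and `log` are imported):

filePath, commitHash, err := Download(
	context.Background(), // ctx
	&http.Client{},       // client
	"google/gemma-2b",    // repoId (illustrative)
	"model",              // repoType
	"main",               // revision
	"tokenizer.json",     // fileName
	DefaultCacheDir(),    // cacheDir
	"",                   // token: authentication not implemented yet
	false,                // forceDownload
	false,                // forceLocal
	nil,                  // progressFn (assumed to accept nil)
)
if err != nil {
	log.Fatal(err)
}
fmt.Println(filePath, commitHash)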

func FileExists

func FileExists(path string) bool

FileExists returns true if the file or directory exists.

func GetHeaders

func GetHeaders(userAgent, token string) map[string]string

GetHeaders is based on the `build_hf_headers` function defined in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library. TODO: add support for authentication token.

func GetUrl

func GetUrl(repoId, fileName, repoType, revision string) string

GetUrl is based on the `hf_hub_url` function defined in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library.

func HttpUserAgent

func HttpUserAgent() string

HttpUserAgent returns a user agent to use with HuggingFace Hub API. Loosely based on https://github.com/huggingface/transformers/blob/main/src/transformers/utils/hub.py#L198.

func RepoFolderName

func RepoFolderName(repoId, repoType string) string

RepoFolderName returns a serialized version of a hf.co repo name and type, safe for disk storage as a single non-nested folder.

Based on github.com/huggingface/huggingface_hub repo_folder_name.
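
For example, following huggingface_hub's convention (the exact output shown is an assumption), the "/" in the repo id is replaced by RepoIdSeparator and the repo type is prefixed:

fmt.Println(RepoFolderName("google/gemma-2b", "model"))
// Expected, assuming huggingface_hub's convention:
// models--google--gemma-2b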

Types

type Direction

type Direction uint8

Direction is used in truncation and padding configuration.

const (
	Left  Direction = 0
	Right Direction = 1
)

func (Direction) String

func (i Direction) String() string

type Encoding

type Encoding = rs.Encoding

Encoding is the result of a Tokenizer.Encode.

Only TokenIds is always present; all other fields are set only if configured in the Tokenizer.

The SpecialTokensMask indicates which tokens are special tokens (e.g., padding, CLS, SEP).

The AttentionMask indicates which tokens are padding and should be ignored.

type HFFileMetadata

type HFFileMetadata struct {
	CommitHash, ETag, Location string
	Size                       int
}

HFFileMetadata holds file metadata used by HuggingFace Hub.

type OffsetsCharMode

type OffsetsCharMode uint8

OffsetsCharMode defines how to encode the offset positions when encoding:

- `OffsetsCharModeByte`: Offsets are calculated on a byte basis.
- `OffsetsCharModeUnicode` (default): Offsets are calculated on a Unicode code point basis.

const (
	OffsetsCharModeByte    OffsetsCharMode = 0
	OffsetsCharModeUnicode OffsetsCharMode = 1
)

func (OffsetsCharMode) String

func (i OffsetsCharMode) String() string

type PaddingStrategy

type PaddingStrategy uint8 // Values must match the underlying Rust library.

PaddingStrategy is usually defined by the preloaded tokenization model (since it should match the LLM model), but it can be manipulated.

It can be set to PadLongest, which pads the tokenization to the longest sequence in the batch, or PadFixed, which pads to a fixed length.

const (
	PadLongest PaddingStrategy = iota
	PadFixed
)

func (PaddingStrategy) String

func (i PaddingStrategy) String() string

type PretrainedConfig

type PretrainedConfig struct {
	// contains filtered or unexported fields
}

PretrainedConfig configures how to download (or load from disk) a pretrained Tokenizer. It can be configured in different ways (see the methods below); when finished configuring, call Done to actually download (or load from disk) the pretrained tokenizer.

func FromPretrainedWith

func FromPretrainedWith(name string) *PretrainedConfig

FromPretrainedWith creates a new Tokenizer by downloading the pretrained tokenizer corresponding to the name.

There are several options that can be configured; after that, call Done, which returns the Tokenizer object, or an error if anything goes wrong.
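
A sketch of the resulting builder pattern: every option returns the *PretrainedConfig itself, so calls can be chained before Done (the model name is illustrative; assumes `log` is imported):

tok, err := FromPretrainedWith("bert-base-uncased"). // illustrative model name
	CacheDir(DefaultCacheDir()). // where downloads are cached
	ProgressBar().               // display download progress
	Done()                       // download (or load from disk)
if err != nil {
	log.Fatal(err)
}
defer tok.Finalize() // optional early release of the Rust-side memory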

func (*PretrainedConfig) AuthToken

func (pt *PretrainedConfig) AuthToken(token string) *PretrainedConfig

AuthToken sets the authentication token to use. The default is to use no token, which works for simply downloading most tokenizers. TODO: not implemented yet; setting a token will lead to an error when calling Done.

func (*PretrainedConfig) CacheDir

func (pt *PretrainedConfig) CacheDir(cacheDir string) *PretrainedConfig

CacheDir configures cacheDir as the directory in which to store a cache of the downloaded files. If the tokenizer has already been downloaded to the directory, it is read from disk instead of the network.

The default value is `~/.cache/huggingface/hub/`, the same used by the original Transformers library. The cache home is overridden by `$XDG_CACHE_HOME` if it is set.

func (*PretrainedConfig) Context

func (pt *PretrainedConfig) Context(ctx context.Context) *PretrainedConfig

Context configures the given context to be used when downloading content from the internet. The default is to use `context.Background()` with no timeout.

func (*PretrainedConfig) Done

func (pt *PretrainedConfig) Done() (*Tokenizer, error)

Done concludes the configuration of FromPretrainedWith and actually downloads (or loads from disk) the tokenizer.

func (*PretrainedConfig) ForceDownload

func (pt *PretrainedConfig) ForceDownload() *PretrainedConfig

ForceDownload will ignore previous files in cache and force (re-)download of contents.

func (*PretrainedConfig) ForceLocal

func (pt *PretrainedConfig) ForceLocal() *PretrainedConfig

ForceLocal won't use the internet, and will only read from the local disk. Notice this prevents even fetching the metadata.

func (*PretrainedConfig) HttpClient

func (pt *PretrainedConfig) HttpClient(client *http.Client) *PretrainedConfig

HttpClient configures an http.Client to use to connect to HuggingFace Hub. The default is `nil`, in which case one will be created for the requests.

func (*PretrainedConfig) NoCache

func (pt *PretrainedConfig) NoCache() *PretrainedConfig

NoCache configures that no cache is to be used: no copy of the downloaded tokenizer is kept.

func (*PretrainedConfig) ProgressBar

func (pt *PretrainedConfig) ProgressBar() *PretrainedConfig

ProgressBar will display a progress bar when downloading files from the network. Only displayed if not reading from cache.

type ProgressFn

type ProgressFn func(progress, downloaded, total int, eof bool)

ProgressFn is a function called while downloading a file. It is called with `progress=0` and `downloaded=0` on the first call, when the download starts.
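
A minimal sketch of a ProgressFn that prints the running percentage; the exact meaning of the `progress` argument is not documented above, so it is ignored here (assumes `fmt` is imported):

var showProgress ProgressFn = func(progress, downloaded, total int, eof bool) {
	// Must be fast: it is called synchronously during the download.
	if total > 0 {
		fmt.Printf("\rdownloaded %d of %d bytes (%d%%)", downloaded, total, 100*downloaded/total)
	}
	if eof {
		fmt.Println(" done.")
	}
}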

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer represents an initialized Tokenizer, including various configurations for truncation, padding, and how to encode.

It can be used to encode (`Encode` and `EncodeBatch`) strings to token ids and other optional fields, and to decode (`Decode` and `DecodeBatch`) token ids back to strings.

To build a new Tokenizer from a JSON configuration, see `FromFile` or `FromBytes`. To automatically load the JSON configuration from HuggingFace, use `FromPretrainedWith`.

func FromBytes

func FromBytes(data []byte) (*Tokenizer, error)

FromBytes is the same as FromFile, but takes the JSON `data` directly and returns a Tokenizer, or an error. It uses the same format as [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers).

func FromFile

func FromFile(filePath string) (*Tokenizer, error)

FromFile creates a Tokenizer from the tokenizer model stored as JSON in filePath. It uses the same format as [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers).
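
For example, loading a tokenizer from a local `tokenizer.json` (file name illustrative; assumes `fmt` and `log` are imported):

tok, err := FromFile("tokenizer.json")
if err != nil {
	log.Fatal(err)
}
defer tok.Finalize() // optional: releases the Rust-side memory early
fmt.Printf("vocabulary size: %d\n", tok.VocabSize())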

func (*Tokenizer) AddSpecialTokens

func (t *Tokenizer) AddSpecialTokens(value bool) *Tokenizer

AddSpecialTokens sets whether Encode (and EncodeBatch) should add the special tokens (start and end of sentence, etc.). Default is false.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(tokenIds []uint32, skipSpecialTokens bool) string

Decode is the reverse of Encode: it converts a list of token ids back to a "sentence" (string).

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(sentence string) (*Encoding, error)

Encode encodes the given sentence.

The returned Encoding object will have its fields filled in according to which fields the Tokenizer is configured to return.

func (*Tokenizer) EncodeBatch

func (t *Tokenizer) EncodeBatch(sentences []string) ([]Encoding, error)

EncodeBatch encodes a list of strings.

The returned Encoding objects will have their fields filled in according to which fields the Tokenizer is configured to return.
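
A sketch combining batch encoding with padding, so all returned encodings share one length (assumes `tok` is a configured *Tokenizer and `fmt` and `log` are imported):

// Pad to the longest sequence in the batch, then encode.
encodings, err := tok.WithPadToLongest().EncodeBatch([]string{
	"a short sentence",
	"a noticeably longer sentence that sets the padded length",
})
if err != nil {
	log.Fatal(err)
}
for _, encoding := range encodings {
	fmt.Println(encoding.TokenIds) // all slices share the same (longest) length
}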

func (*Tokenizer) Finalize

func (t *Tokenizer) Finalize()

Finalize is optional; it immediately releases the memory associated with the Tokenizer, without waiting for garbage collection. After calling this function, the Tokenizer is no longer valid, and any calls to it will panic.

func (*Tokenizer) ReturnAttentionMask

func (t *Tokenizer) ReturnAttentionMask(value bool) *Tokenizer

ReturnAttentionMask sets whether Encode (and EncodeBatch) should also return an attention mask. The attention mask is a binary vector indicating which tokens are actual input (1) and which are padding (0); it is used in transformer models to prevent the model from attending to padding tokens. Default is false.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

func (*Tokenizer) ReturnOffsets

func (t *Tokenizer) ReturnOffsets(value bool) *Tokenizer

ReturnOffsets sets whether Encode (and EncodeBatch) should also return the offsets of the tokens in the original text (in bytes or Unicode code points; see WithOffsetsCharMode). Default is false.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

func (*Tokenizer) ReturnSpecialTokensMask

func (t *Tokenizer) ReturnSpecialTokensMask(value bool) *Tokenizer

ReturnSpecialTokensMask sets whether Encode (and EncodeBatch) should also return a special tokens mask. The special tokens mask is a binary vector indicating whether each token is a special token (e.g., padding, CLS, SEP). Default is false.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

func (*Tokenizer) ReturnTokens

func (t *Tokenizer) ReturnTokens(value bool) *Tokenizer

ReturnTokens sets whether Encode (and EncodeBatch) should also return the individual textual tokens. Default is true.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

func (*Tokenizer) ReturnTypeIds

func (t *Tokenizer) ReturnTypeIds(value bool) *Tokenizer

ReturnTypeIds sets whether Encode (and EncodeBatch) should also return the type ids (used, e.g., to distinguish the sequences in a sentence pair). Default is false.

It returns itself (the Tokenizer), to allow cascaded configuration calls.
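
The Return* setters above can be cascaded; a sketch enabling several optional Encoding fields at once (assumes `tok` is a *Tokenizer):

// After this, Encode/EncodeBatch fill tokens, attention mask,
// special-tokens mask and offsets in the returned Encoding.
tok.ReturnTokens(true).
	ReturnAttentionMask(true).
	ReturnSpecialTokensMask(true).
	ReturnOffsets(true)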

func (*Tokenizer) String

func (t *Tokenizer) String() string

String implements fmt.Stringer.

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() uint32

VocabSize returns the number of known tokens.

func (*Tokenizer) WithNoPadding

func (t *Tokenizer) WithNoPadding() *Tokenizer

WithNoPadding disables padding and resets all padding parameters.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (negative length, etc.).

func (*Tokenizer) WithNoTruncation

func (t *Tokenizer) WithNoTruncation() *Tokenizer

WithNoTruncation disables truncation and resets all truncation parameters.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

func (*Tokenizer) WithOffsetsCharMode

func (t *Tokenizer) WithOffsetsCharMode(value OffsetsCharMode) *Tokenizer

WithOffsetsCharMode sets the character-level offset mode for the token offsets. The possible values are:

- `OffsetsCharModeByte`: Offsets are calculated on a byte basis.
- `OffsetsCharModeUnicode` (default): Offsets are calculated on a Unicode code point basis.

Notice that to enable returning the offsets you need to configure `t.ReturnOffsets(true)`.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

func (*Tokenizer) WithPadId

func (t *Tokenizer) WithPadId(id uint32) *Tokenizer

WithPadId enables padding (if not already) and sets the id of the token to use for padding.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (e.g., if padding length == 0).

func (*Tokenizer) WithPadToLength

func (t *Tokenizer) WithPadToLength(length uint32) *Tokenizer

WithPadToLength enables padding (if not already) and sets the padding to the fixed given length.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (e.g., if padding length == 0).

func (*Tokenizer) WithPadToLongest

func (t *Tokenizer) WithPadToLongest() *Tokenizer

WithPadToLongest enables padding (if not already) and sets the padding to the longest sequence in the batch.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (e.g., if padding length <= 0).

func (*Tokenizer) WithPadToken

func (t *Tokenizer) WithPadToken(token string) *Tokenizer

WithPadToken enables padding (if not already) and sets the token to use for padding.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (e.g., if padding length == 0).

func (*Tokenizer) WithPadTypeId

func (t *Tokenizer) WithPadTypeId(typeId uint32) *Tokenizer

WithPadTypeId enables padding (if not already) and sets the type id of the token to use for padding.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (e.g., if padding length == 0).

func (*Tokenizer) WithPaddingDirection

func (t *Tokenizer) WithPaddingDirection(direction Direction) *Tokenizer

WithPaddingDirection enables padding (if not already) and sets the padding to happen in the given direction.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (negative length, etc.).

func (*Tokenizer) WithPaddingToMultipleOf

func (t *Tokenizer) WithPaddingToMultipleOf(multiple uint32) *Tokenizer

WithPaddingToMultipleOf enables padding (if not already) and sets the multiple-of value: if set, the padding length always snaps to the next multiple of the given value. For example, if we were going to pad to a length of 250 but pad_to_multiple_of=8, then we pad to 256.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (e.g., if padding length == 0).
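
A sketch of the padding setters combined, matching the 250 -> 256 example above; the pad token "[PAD]" is illustrative (assumes `tok` is a *Tokenizer):

tok.WithPadToLength(250).           // pad to fixed length 250...
	WithPaddingToMultipleOf(8). // ...snapped up to a multiple of 8 => 256
	WithPadToken("[PAD]").      // illustrative pad token
	WithPaddingDirection(Right) // pad on the right side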

func (*Tokenizer) WithTruncation

func (t *Tokenizer) WithTruncation(length int) *Tokenizer

WithTruncation enables truncation and changes the truncation to the given length.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (negative length, etc.).

func (*Tokenizer) WithTruncationDirection

func (t *Tokenizer) WithTruncationDirection(direction Direction) *Tokenizer

WithTruncationDirection enables truncation (if not already) and sets the truncation to happen in the given direction.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (negative length, etc.).

func (*Tokenizer) WithTruncationStrategy

func (t *Tokenizer) WithTruncationStrategy(strategy TruncationStrategy) *Tokenizer

WithTruncationStrategy enables truncation (if not already) and sets the truncation strategy. This affects how truncation behaves when encoding sentence pairs, and is usually defined by the tokenization model that is loaded, and not directly by the user.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (negative length, etc.).

func (*Tokenizer) WithTruncationStride

func (t *Tokenizer) WithTruncationStride(stride int) *Tokenizer

WithTruncationStride enables truncation (if not already) and sets the truncation stride. From HuggingFace: "The length of the previous first sequence to be included in the overflowing sequence", but I'm not sure what they mean by that.

This is usually defined by the tokenization model that is loaded, and not directly by the user.

It returns itself (the Tokenizer), to allow cascaded configuration calls.

It may panic if an invalid value is used (negative length, etc.).
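
A sketch of the truncation setters combined; the length 512 is illustrative (assumes `tok` is a *Tokenizer):

tok.WithTruncation(512).                           // truncate to at most 512 tokens
	WithTruncationDirection(Right).            // drop tokens from the right
	WithTruncationStrategy(TruncateOnlySecond) // in pairs, truncate only the second sequence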

type TruncationStrategy

type TruncationStrategy uint8 // Values must match the underlying Rust library.

TruncationStrategy affects how truncation is applied when the inputs are pairs of sentences. It is highly dependent on the model used, and is usually set by the preloaded tokenization model.

const (
	TruncateLongestFirst TruncationStrategy = iota
	TruncateOnlyFirst
	TruncateOnlySecond
)

func (TruncationStrategy) String

func (i TruncationStrategy) String() string

Directories

Path            Synopsis
internal/rs     Package rs wraps the Rust tokenizer.
lib
