Documentation ¶
Overview ¶
Package tokenizers provides an implementation of today's most widely used tokenizers, with a focus on performance and versatility.
It is currently a wrapper around the Rust implementation in https://github.com/huggingface/tokenizers/tree/main/tokenizers.
For now, it only provides encoding and decoding functionality -- not training of new tokenizers. It includes loading [HuggingFace's pretrained tokenizers](https://huggingface.co/docs/tokenizers/index) using `FromPretrainedWith`.
Index ¶
- Constants
- Variables
- func DefaultCacheDir() string
- func Download(ctx context.Context, client *http.Client, ...) (filePath, commitHash string, err error)
- func FileExists(path string) bool
- func GetHeaders(userAgent, token string) map[string]string
- func GetUrl(repoId, fileName, repoType, revision string) string
- func HttpUserAgent() string
- func RepoFolderName(repoId, repoType string) string
- type Direction
- type Encoding
- type HFFileMetadata
- type OffsetsCharMode
- type PaddingStrategy
- type PretrainedConfig
- func (pt *PretrainedConfig) AuthToken(token string) *PretrainedConfig
- func (pt *PretrainedConfig) CacheDir(cacheDir string) *PretrainedConfig
- func (pt *PretrainedConfig) Context(ctx context.Context) *PretrainedConfig
- func (pt *PretrainedConfig) Done() (*Tokenizer, error)
- func (pt *PretrainedConfig) ForceDownload() *PretrainedConfig
- func (pt *PretrainedConfig) ForceLocal() *PretrainedConfig
- func (pt *PretrainedConfig) HttpClient(client *http.Client) *PretrainedConfig
- func (pt *PretrainedConfig) NoCache() *PretrainedConfig
- func (pt *PretrainedConfig) ProgressBar() *PretrainedConfig
- type ProgressFn
- type Tokenizer
- func (t *Tokenizer) AddSpecialTokens(value bool) *Tokenizer
- func (t *Tokenizer) Decode(tokenIds []uint32, skipSpecialTokens bool) string
- func (t *Tokenizer) Encode(sentence string) (*Encoding, error)
- func (t *Tokenizer) EncodeBatch(sentences []string) ([]Encoding, error)
- func (t *Tokenizer) Finalize()
- func (t *Tokenizer) ReturnAttentionMask(value bool) *Tokenizer
- func (t *Tokenizer) ReturnOffsets(value bool) *Tokenizer
- func (t *Tokenizer) ReturnSpecialTokensMask(value bool) *Tokenizer
- func (t *Tokenizer) ReturnTokens(value bool) *Tokenizer
- func (t *Tokenizer) ReturnTypeIds(value bool) *Tokenizer
- func (t *Tokenizer) String() string
- func (t *Tokenizer) VocabSize() uint32
- func (t *Tokenizer) WithNoPadding() *Tokenizer
- func (t *Tokenizer) WithNoTruncation() *Tokenizer
- func (t *Tokenizer) WithOffsetsCharMode(value OffsetsCharMode) *Tokenizer
- func (t *Tokenizer) WithPadId(id uint32) *Tokenizer
- func (t *Tokenizer) WithPadToLength(length uint32) *Tokenizer
- func (t *Tokenizer) WithPadToLongest() *Tokenizer
- func (t *Tokenizer) WithPadToken(token string) *Tokenizer
- func (t *Tokenizer) WithPadTypeId(typeId uint32) *Tokenizer
- func (t *Tokenizer) WithPaddingDirection(direction Direction) *Tokenizer
- func (t *Tokenizer) WithPaddingToMultipleOf(multiple uint32) *Tokenizer
- func (t *Tokenizer) WithTruncation(length int) *Tokenizer
- func (t *Tokenizer) WithTruncationDirection(direction Direction) *Tokenizer
- func (t *Tokenizer) WithTruncationStrategy(strategy TruncationStrategy) *Tokenizer
- func (t *Tokenizer) WithTruncationStride(stride int) *Tokenizer
- type TruncationStrategy
Constants ¶
const (
    HeaderXRepoCommit = "X-Repo-Commit"
    HeaderXLinkedETag = "X-Linked-Etag"
    HeaderXLinkedSize = "X-Linked-Size"
)
const RepoIdSeparator = "--"
RepoIdSeparator is used to separate the parts of a repository/model name when mapping to file names. Likely only for internal use.
Variables ¶
var (
    // DefaultDirCreationPerm is used when creating new cache subdirectories.
    DefaultDirCreationPerm = os.FileMode(0755)

    // DefaultFileCreationPerm is used when creating files inside the cache subdirectories.
    DefaultFileCreationPerm = os.FileMode(0644)
)
var (
    RepoTypesUrlPrefixes = map[string]string{
        "dataset": "datasets/",
        "space":   "spaces/",
    }

    DefaultRevision = "main"

    HuggingFaceUrlTemplate = template.Must(template.New("hf_url").Parse(
        "https://huggingface.co/{{.RepoId}}/resolve/{{.Revision}}/{{.Filename}}"))
)
var SessionId string
Functions ¶
func DefaultCacheDir ¶
func DefaultCacheDir() string
DefaultCacheDir returns the default cache directory for HuggingFace Hub, the same one used by the Python library.
Its prefix is `${XDG_CACHE_HOME}` if set, or `~/.cache` otherwise, followed by `/huggingface/hub/`. So typically: `~/.cache/huggingface/hub/`.
func Download ¶
func Download(ctx context.Context, client *http.Client, repoId, repoType, revision, fileName, cacheDir, token string, forceDownload, forceLocal bool, progressFn ProgressFn) (filePath, commitHash string, err error)
Download returns a file either from the cache or by downloading it from HuggingFace Hub.
Args:
- `ctx` for the requests. There may be more than one request, the first being an HTTP `HEAD` request.
- `client` used to make HTTP requests. It can be created with `&http.Client{}`.
- `repoId` and `fileName`: define the file and repository (model) name to download.
- `repoType`: usually "model".
- `revision`: default is "main", but a commitHash can be given.
- `cacheDir`: directory where to store the downloaded files, or reuse if previously downloaded. Consider using the output from `DefaultCacheDir()` if in doubt.
- `token`: used for authentication. TODO: not implemented yet.
- `forceDownload`: if set to true, it will download the contents of the file even if there is a local copy.
- `forceLocal`: does not use the network, not even for reading the metadata.
- `progressFn`: called during the download of a file. It is called synchronously and is expected to be fast, near instantaneous. If the UI may block, arrange for it to be handled on a separate goroutine.
On success it returns the `filePath` to the downloaded file, and its `commitHash`. Otherwise it returns an error.
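For illustration, here is a minimal sketch of fetching a tokenizer file from the Hub. The repository name and the import path are assumptions of the example:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"github.com/gomlx/tokenizers" // import path assumed for illustration
)

func main() {
	filePath, commitHash, err := tokenizers.Download(
		context.Background(),
		&http.Client{},
		"bert-base-uncased",          // repoId (illustrative)
		"model",                      // repoType
		"main",                       // revision
		"tokenizer.json",             // fileName
		tokenizers.DefaultCacheDir(), // cacheDir
		"",                           // token: authentication not implemented yet
		false,                        // forceDownload
		false,                        // forceLocal
		nil,                          // progressFn: no progress reporting
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("cached at", filePath, "commit", commitHash)
}
```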
func FileExists ¶
FileExists returns true if the file or directory exists.
func GetHeaders ¶
GetHeaders is based on the `build_hf_headers` function defined in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library. TODO: add support for authentication token.
func GetUrl ¶
GetUrl is based on the `hf_hub_url` function defined in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library.
func HttpUserAgent ¶
func HttpUserAgent() string
HttpUserAgent returns a user agent to use with HuggingFace Hub API. Loosely based on https://github.com/huggingface/transformers/blob/main/src/transformers/utils/hub.py#L198.
func RepoFolderName ¶
RepoFolderName returns a serialized version of a hf.co repo name and type, safe for disk storage as a single non-nested folder.
Based on github.com/huggingface/huggingface_hub repo_folder_name.
Types ¶
type Encoding ¶
Encoding is the result of a Tokenizer.Encode.
Only TokenIds is always present; all other fields are set only if configured in the Tokenizer.
The SpecialTokensMask indicates which tokens are special tokens (e.g., padding, CLS, SEP).
The AttentionMask indicates which tokens are padding and should be ignored.
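A short sketch of requesting and inspecting these optional fields; it assumes `tok` is an initialized `*Tokenizer` and uses the field names described above:

```go
enc, err := tok.ReturnAttentionMask(true).
	ReturnSpecialTokensMask(true).
	Encode("Hello world!")
if err != nil {
	log.Fatal(err)
}
fmt.Println(enc.TokenIds)          // always present
fmt.Println(enc.AttentionMask)     // set because of ReturnAttentionMask(true)
fmt.Println(enc.SpecialTokensMask) // set because of ReturnSpecialTokensMask(true)
```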
type HFFileMetadata ¶
HFFileMetadata describes file metadata used by HuggingFace Hub.
type OffsetsCharMode ¶
type OffsetsCharMode uint8
OffsetsCharMode defines how to encode the offset positions when encoding.

- `OffsetsCharModeByte`: Offsets are calculated on a byte basis.
- `OffsetsCharModeUnicode` (default): Offsets are calculated on a Unicode code point basis.
const (
    OffsetsCharModeByte    OffsetsCharMode = 0
    OffsetsCharModeUnicode OffsetsCharMode = 1
)
func (OffsetsCharMode) String ¶
func (i OffsetsCharMode) String() string
type PaddingStrategy ¶
type PaddingStrategy uint8 // Values must match the underlying Rust library.
PaddingStrategy is usually defined by the preloaded tokenization model (since it should match the LLM model), but it can be overridden.
It can be set to PadLongest, which pads the tokenization to the longest sequence in the batch, or PadFixed when it pads to a fixed length.
const (
    PadLongest PaddingStrategy = iota
    PadFixed
)
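These values are normally set indirectly through the Tokenizer configuration methods; a sketch (assuming `tok` is an initialized `*Tokenizer`):

```go
tok.WithPadToLongest()   // selects PadLongest: pad to the longest sequence in the batch
tok.WithPadToLength(128) // selects PadFixed: pad every sequence to length 128
```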
func (PaddingStrategy) String ¶
func (i PaddingStrategy) String() string
type PretrainedConfig ¶
type PretrainedConfig struct {
// contains filtered or unexported fields
}
PretrainedConfig for how to download (or load from disk) a pretrained Tokenizer. It can be configured in different ways (see methods below), and when finished configuring, call Done to actually download (or load from disk) the pretrained tokenizer.
func FromPretrainedWith ¶
func FromPretrainedWith(name string) *PretrainedConfig
FromPretrainedWith creates a new Tokenizer by downloading the pretrained tokenizer corresponding to the name.
There are several options that can be configured with the methods below. After configuring, call Done, which returns the Tokenizer object, or an error if anything goes wrong.
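A typical configuration chain might look like the following sketch (the model name and the import path are illustrative):

```go
package main

import (
	"log"

	"github.com/gomlx/tokenizers" // import path assumed for illustration
)

func main() {
	tok, err := tokenizers.FromPretrainedWith("bert-base-uncased").
		CacheDir(tokenizers.DefaultCacheDir()).
		ProgressBar().
		Done()
	if err != nil {
		log.Fatal(err)
	}
	defer tok.Finalize()
	_ = tok // use the tokenizer ...
}
```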
func (*PretrainedConfig) AuthToken ¶
func (pt *PretrainedConfig) AuthToken(token string) *PretrainedConfig
AuthToken sets the authentication token to use. The default is to use no token, which works for simply downloading most tokenizers. TODO: not implemented yet, it will lead to an error when calling Done.
func (*PretrainedConfig) CacheDir ¶
func (pt *PretrainedConfig) CacheDir(cacheDir string) *PretrainedConfig
CacheDir configures cacheDir as directory to store a cache of the downloaded files. If the tokenizer has already been downloaded in the directory, it will be read from disk instead of the network.
The default value is `~/.cache/huggingface/hub/`, the same used by the original Transformers library. The cache home is overridden by `$XDG_CACHE_HOME` if it is set.
func (*PretrainedConfig) Context ¶
func (pt *PretrainedConfig) Context(ctx context.Context) *PretrainedConfig
Context configures the given context to be used when downloading content from the internet. The default is to use `context.Background()` with no timeout.
func (*PretrainedConfig) Done ¶
func (pt *PretrainedConfig) Done() (*Tokenizer, error)
Done concludes the configuration of FromPretrainedWith and actually downloads (or loads from disk) the tokenizer.
func (*PretrainedConfig) ForceDownload ¶
func (pt *PretrainedConfig) ForceDownload() *PretrainedConfig
ForceDownload will ignore previous files in cache and force (re-)download of contents.
func (*PretrainedConfig) ForceLocal ¶
func (pt *PretrainedConfig) ForceLocal() *PretrainedConfig
ForceLocal won't use the internet, and will only read from the local disk. Notice this prevents even reaching out for the metadata.
func (*PretrainedConfig) HttpClient ¶
func (pt *PretrainedConfig) HttpClient(client *http.Client) *PretrainedConfig
HttpClient configures an http.Client to use to connect to HuggingFace Hub. The default is `nil`, in which case one will be created for the requests.
func (*PretrainedConfig) NoCache ¶
func (pt *PretrainedConfig) NoCache() *PretrainedConfig
NoCache disables caching: no local copy of the downloaded tokenizer is kept.
func (*PretrainedConfig) ProgressBar ¶
func (pt *PretrainedConfig) ProgressBar() *PretrainedConfig
ProgressBar will display a progress bar when downloading files from the network. Only displayed if not reading from cache.
type ProgressFn ¶
ProgressFn is a function called while downloading a file. It will be called with `progress=0` and `downloaded=0` at the first call, when download starts.
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer represents an initialized Tokenizer, including various configurations for truncation, padding, and how to encode.
It can be used to encode (`Encode` and `EncodeBatch`) strings to token ids and other optional fields, and to decode (`Decode`) token ids back to strings.
To build a new Tokenizer from a JSON configuration, see `FromFile` or `FromBytes`. To automatically load the JSON configuration from HuggingFace, use `FromPretrainedWith`.
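A minimal encode/decode round trip, as a sketch: it assumes `FromFile` returns `(*Tokenizer, error)` and that `Encoding.TokenIds` is a `[]uint32` (matching the `Decode` signature):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gomlx/tokenizers" // import path assumed for illustration
)

func main() {
	tok, err := tokenizers.FromFile("tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}
	defer tok.Finalize()

	enc, err := tok.AddSpecialTokens(true).Encode("Hello, world!")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(enc.TokenIds)

	// Convert the token ids back to text, skipping special tokens.
	fmt.Println(tok.Decode(enc.TokenIds, true))
}
```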
func FromBytes ¶
FromBytes is the same as FromFile, but instead takes the JSON `data` and returns a Tokenizer, or an error. It is the same format as [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers).
func FromFile ¶
FromFile creates a Tokenizer from the tokenizer model stored as JSON in `filePath`. It is the same format as [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers).
func (*Tokenizer) AddSpecialTokens ¶
AddSpecialTokens sets whether Encode (and EncodeBatch) should add the special tokens (start and end of sentence, etc.). Default is false.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
func (*Tokenizer) Decode ¶
Decode is the reverse of Encode, converting a list of token ids back to a "sentence" (string).
func (*Tokenizer) Encode ¶
Encode encodes the given sentence.
The returned Encoding object will have fields filled according to Tokenizer fields configured to be returned.
func (*Tokenizer) EncodeBatch ¶
EncodeBatch encodes a batch (list) of sentences.
The returned Encoding objects will have fields filled according to the Tokenizer fields configured to be returned.
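A sketch of batch encoding with padding to the longest sequence (assuming `tok` is an initialized `*Tokenizer`):

```go
encs, err := tok.WithPadToLongest().
	ReturnAttentionMask(true).
	EncodeBatch([]string{
		"short sentence",
		"a considerably longer sentence that sets the batch length",
	})
if err != nil {
	log.Fatal(err)
}
for _, enc := range encs {
	fmt.Println(enc.TokenIds, enc.AttentionMask)
}
```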
func (*Tokenizer) Finalize ¶
func (t *Tokenizer) Finalize()
Finalize is optional; it immediately releases the memory associated with the Tokenizer, without waiting for garbage collection. After calling this function, the Tokenizer is no longer valid, and any calls to it will panic.
func (*Tokenizer) ReturnAttentionMask ¶
ReturnAttentionMask sets whether Encode (and EncodeBatch) should also return an attention mask. The attention mask is a binary vector indicating which tokens are actual input, as opposed to padding. It is used in transformer models to prevent the model from attending to padding tokens. Default is false.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
func (*Tokenizer) ReturnOffsets ¶
ReturnOffsets sets whether Encode (and EncodeBatch) should also return the offsets of the tokens in the original text (counted in bytes or Unicode code points, see WithOffsetsCharMode). Default is false.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
func (*Tokenizer) ReturnSpecialTokensMask ¶
ReturnSpecialTokensMask sets whether Encode (and EncodeBatch) should also return a special tokens mask. The special tokens mask is a binary vector indicating whether each token is a special token (e.g., padding, CLS, SEP). Default is false.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
func (*Tokenizer) ReturnTokens ¶
ReturnTokens sets whether Encode (and EncodeBatch) should also return the individual textual tokens. Default is true.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
func (*Tokenizer) ReturnTypeIds ¶
ReturnTypeIds sets whether Encode (and EncodeBatch) should also return the type ids (used to distinguish the sentences of a pair). Default is false.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
func (*Tokenizer) WithNoPadding ¶
WithNoPadding disables padding and resets all padding parameters.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (negative length, etc.).
func (*Tokenizer) WithNoTruncation ¶
WithNoTruncation disables truncation and resets all truncation parameters.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
func (*Tokenizer) WithOffsetsCharMode ¶
func (t *Tokenizer) WithOffsetsCharMode(value OffsetsCharMode) *Tokenizer
WithOffsetsCharMode sets the character-level offset mode for the token offsets. The possible values are:
- `OffsetsCharModeByte`: Offsets are calculated on a byte basis.
- `OffsetsCharModeUnicode` (default): Offsets are calculated on a Unicode code point basis.
Note that to enable returning the offsets you need to configure `t.ReturnOffsets(true)`.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
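A sketch combining both settings (the `Offsets` field name on Encoding is an assumption of this example):

```go
enc, err := tok.ReturnOffsets(true).
	WithOffsetsCharMode(tokenizers.OffsetsCharModeUnicode).
	Encode("héllo wörld")
if err != nil {
	log.Fatal(err)
}
fmt.Println(enc.Offsets) // positions counted in Unicode code points, not bytes
```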
func (*Tokenizer) WithPadId ¶
WithPadId enables padding (if not already) and sets the id of the token to use for padding.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (e.g., if padding length == 0).
func (*Tokenizer) WithPadToLength ¶
WithPadToLength enables padding (if not already) and sets the padding to the fixed given length.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (e.g., if padding length == 0).
func (*Tokenizer) WithPadToLongest ¶
WithPadToLongest enables padding (if not already) and sets the padding to the longest sequence in the batch.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (e.g., if padding length <= 0).
func (*Tokenizer) WithPadToken ¶
WithPadToken enables padding (if not already) and sets the token to use for padding.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (e.g., if padding length == 0).
func (*Tokenizer) WithPadTypeId ¶
WithPadTypeId enables padding (if not already) and sets the type id of the token to use for padding.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (e.g., if padding length == 0).
func (*Tokenizer) WithPaddingDirection ¶
WithPaddingDirection enables padding (if not already) and sets the padding to happen in the given direction.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (negative length, etc.).
func (*Tokenizer) WithPaddingToMultipleOf ¶
WithPaddingToMultipleOf enables padding (if not already) and sets the multiple of value. If specified, the padding length should always snap to the next multiple of the given value. For example, if we were going to pad with a length of 250 but pad_to_multiple_of=8 then we will pad to 256.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (e.g., if padding length == 0).
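For example, combining a fixed length with the multiple-of constraint (a sketch; `tok` is an initialized `*Tokenizer`):

```go
// Pad to length 250, snapped up to the next multiple of 8, i.e., 256.
tok.WithPadToLength(250).WithPaddingToMultipleOf(8)
```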
func (*Tokenizer) WithTruncation ¶
WithTruncation enables truncation and sets the truncation length.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (negative length, etc.).
func (*Tokenizer) WithTruncationDirection ¶
WithTruncationDirection enables truncation (if not already) and sets the truncation to happen in the given direction.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (negative length, etc.).
func (*Tokenizer) WithTruncationStrategy ¶
func (t *Tokenizer) WithTruncationStrategy(strategy TruncationStrategy) *Tokenizer
WithTruncationStrategy enables truncation (if not already) and sets the truncation strategy. This affects how truncation behaves when encoding sentence pairs, and is usually defined by the tokenization model that is loaded, and not directly by the user.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic if an invalid value is used (negative length, etc.).
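A sketch of configuring truncation together with a strategy (assuming `tok` is an initialized `*Tokenizer`):

```go
// Truncate to at most 512 tokens; for sentence pairs, trim the longest sequence first.
tok.WithTruncation(512).
	WithTruncationStrategy(tokenizers.TruncateLongestFirst)
```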
func (*Tokenizer) WithTruncationStride ¶
WithTruncationStride enables truncation (if not already) and sets the truncation stride. From HuggingFace: "The length of the previous first sequence to be included in the overflowing sequence", but I'm not sure what they mean by that.
This is usually defined by the tokenization model that is loaded, and not directly by the user.
It returns itself (the Tokenizer), to allow cascaded configuration calls.
It may panic is an invalid value is used (negative length, etc.).
type TruncationStrategy ¶
type TruncationStrategy uint8 // Values must match the underlying Rust library.
TruncationStrategy affects how truncation is applied when the inputs are pairs of sentences. It is very dependent on the tokenization model used, and is usually set by the preloaded tokenization model.
const (
    TruncateLongestFirst TruncationStrategy = iota
    TruncateOnlyFirst
    TruncateOnlySecond
)
func (TruncationStrategy) String ¶
func (i TruncationStrategy) String() string