Documentation ¶
Overview ¶
Package gotoken provides an OpenAI-compatible tokenization library similar to tiktoken. Its primary export is the Tokenizer interface, featuring Encode and Decode methods for converting strings to/from []int.
Tokenizer encodings, such as r50k_base or cl100k_base, are provided by separate packages (r50kbase, cl100kbase, and so on) that implement the Tokenizer interface. This design mirrors the way image/png and image/jpeg integrate with the standard library's image package. Encoding packages self-register with gotoken when imported.
Encoding packages include built-in token dictionaries, which removes the need for external downloads or local file caches. However, these packages are relatively large (a few MB) and should only be imported when needed. At least one encoding package must be imported for gotoken to be able to tokenize text.
Example of importing gotoken and a tokenizer encoding:
import (
	"github.com/peterheb/gotoken"
	_ "github.com/peterheb/gotoken/cl100kbase"
)
The blank identifier _ imports cl100kbase for its side effect of registering the encoding, even though the package is never referenced directly in your code. Encoding packages have no public functions or types, but they do contain public constants defining special tokens.
Example ¶
// This example demonstrates encoding and decoding a sample string using the
// cl100k_base tokenizer.
package main

import (
	"fmt"
	"log"

	"github.com/peterheb/gotoken"
	_ "github.com/peterheb/gotoken/cl100kbase"
)

func main() {
	// Instantiate the tokenizer by name. The _ import above registers the
	// tokenizer with the encoding "cl100k_base". Consult your model's
	// documentation for information on which tokenizer to use with which model.
	tok, err := gotoken.GetTokenizer("cl100k_base")
	if err != nil {
		log.Fatal(err)
	}

	// Encode some text
	input := "Salutations, world! 😄"
	encoded, err := tok.Encode(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("input: %#v\n", input)
	fmt.Printf("encoded: %#v\n", encoded)

	// Decode the encoded text
	decoded, err := tok.Decode(encoded)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("decoded: %#v\n", decoded)
}
Output:
Index ¶
Examples ¶
Constants ¶
This section is empty.
Variables ¶
var (
	ErrUnknownEncoding = errors.New("unknown tokenizer encoding")
	ErrInvalidToken    = errors.New("invalid token")
	ErrSpecialToken    = errors.New("unexpected special token found")
)
These errors can be returned by functions in this library. Returned errors are wrapped with fmt.Errorf; use errors.Is or errors.As to check for the underlying error value.
Functions ¶
func ListTokenizers ¶
func ListTokenizers() []string
ListTokenizers returns a list of all registered tokenizer encodings. These are the valid inputs to GetTokenizer.
func RegisterTokenizer ¶
RegisterTokenizer registers a tokenizer with the given name. This is typically called by the init function of a specific tokenizer's package.
func WithSpecialTokens ¶
func WithSpecialTokens(tokens ...string) func(*tokenizerOptions)
WithSpecialTokens is a functional option for GetTokenizer that configures the tokenizer to encode special tokens to their special token values. This should only be used when a Tokenizer is encoding trusted input.
func WithSpecialTokensAsText ¶
func WithSpecialTokensAsText() func(*tokenizerOptions)
WithSpecialTokensAsText is a functional option for GetTokenizer that configures the tokenizer to treat special tokens as text. This allows strings like "<|endoftext|>" to be encoded as text tokens, rather than causing an encoding error (which is the default behavior).
Types ¶
type Option ¶
type Option func(*tokenizerOptions)
Option is a functional option for a tokenizer, such as WithSpecialTokens or WithSpecialTokensAsText.
type Tokenizer ¶
type Tokenizer interface {
	Count(input string) int
	Encode(input string) ([]int, error)
	Decode(input []int) (string, error)
	Allowed(input string) error
}
Tokenizer is the primary public interface provided by gotoken. It is implemented by encoding packages, like github.com/peterheb/gotoken/r50kbase. A Tokenizer is created using GetTokenizer.
Tokenizer supports four methods:
- Count returns the number of tokens in an input string, or 0 on error.
- Encode tokenizes an input string to an []int.
- Decode un-tokenizes an []int back to its string representation.
- Allowed returns an error if the input string contains any sequences corresponding to special tokens that are not allowed by this tokenizer.
func GetTokenizer ¶
GetTokenizer returns a tokenizer by its encoding name. If no matching registered encoding is found, an error is returned that wraps ErrUnknownEncoding.
GetTokenizer supports functional options to configure the returned Tokenizer. The default configuration, if no options are specified, disallows special tokens in the input.
If special tokens are not applicable, using WithSpecialTokensAsText will allow the tokenizer to process any input string without raising an error. If special tokens should be supported by the Tokenizer, list the specific ones to allow using the option WithSpecialTokens.
The following encoding names are supported:
- "cl100k_base" in github.com/peterheb/gotoken/cl100kbase
- "p50k_base" and "p50k_edit" in github.com/peterheb/gotoken/p50kbase
- "r50k_base" in github.com/peterheb/gotoken/r50kbase
Directories ¶
Path | Synopsis
---|---
cl100kbase | Package cl100kbase registers the "cl100k_base" tokenizer with gotoken.
examples |
examples/basic | The basic example is an introduction to using gotoken.
examples/bench | The bench example is a synthetic benchmark that tokenizes every line in a test file.
gen | Package gen generates the data.go files for gotoken's encoding sub-packages.
internal | Package internal contains non-exported implementation details of gotoken.
p50kbase | Package p50kbase registers the "p50k_base" tokenizer with gotoken.
r50kbase | Package r50kbase registers the "r50k_base" tokenizer with gotoken.