tiktoken

Version: v0.0.9 (latest) Published: May 20, 2024 License: MIT Imports: 13 Imported by: 3

Repository

github.com/hupe1980/go-tiktoken

README ¶

✂️ go-tiktoken

OpenAI's tiktoken tokenizer written in Go. The vocabularies are embedded and do not need to be downloaded at runtime.

Installation

go get github.com/hupe1980/go-tiktoken

How to use

package main

import (
	"fmt"
	"log"

	"github.com/hupe1980/go-tiktoken"
)

func main() {
	encoding, err := tiktoken.NewEncodingForModel("gpt-3.5-turbo")
	if err != nil {
		log.Fatal(err)
	}

	ids, tokens, err := encoding.Encode("Hello World", nil, nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("IDs:", ids)
	fmt.Println("Tokens:", tokens)
}

Output:

IDs: [9906 4435]
Tokens: [Hello  World]

For more usage examples, see the _examples directory.

Supported Encodings

✅ o200k_base
✅ cl100k_base
✅ p50k_base
✅ p50k_edit
✅ r50k_base
✅ gpt2
✅ claude

License

MIT

Documentation ¶

Overview ¶

Package tiktoken provides functions for tokenizing and encoding text using the tiktoken algorithm.

Index ¶

Constants
Variables
func ConvertToMergeableBPERanks(bpe io.Reader) (map[string]uint, error)
func CovertVocabBPEAndEncoderJSONToMergeableBPERanks(vocabBPE io.Reader, encoderJSON io.Reader) (map[string]uint, error)
type Codec
type Encoding

Constants ¶

const (
	StartOfText string = "<|startoftext|>"
	EndOfText   string = "<|endoftext|>"
	FimPrefix   string = "<|fim_prefix|>"
	FimMiddle   string = "<|fim_middle|>"
	FimSuffix   string = "<|fim_suffix|>"
	EndOfPrompt string = "<|endofprompt|>"
)

Constants for special tokens.

const (
	O200kBase  string = "o200k_base"
	CL100kBase string = "cl100k_base"
	P50kBase   string = "p50k_base"
	P50kEdit   string = "p50k_edit"
	R50kBase   string = "r50k_base"
	GPT2       string = "gpt2"
)

Constants for different encodings.

Variables ¶

var AllSpecial = []string{"all"}

var ModelPrefixToEncoding = map[string]string{

	"gpt-4o-":        O200kBase,
	"gpt-4-":         CL100kBase,
	"gpt-3.5-turbo-": CL100kBase,
	"gpt-3.5":        CL100kBase,
	"gpt-35-turbo":   CL100kBase,

	"ft:gpt-4":         CL100kBase,
	"ft:gpt-3.5-turbo": CL100kBase,
	"ft:davinci-002":   CL100kBase,
	"ft:babbage-002":   CL100kBase,
}

ModelPrefixToEncoding maps model prefixes to encodings.

var ModelToEncoding = map[string]string{

	"gpt-4o":        O200kBase,
	"gpt-4":         CL100kBase,
	"gpt-3.5-turbo": CL100kBase,
	"gpt-35-turbo":  CL100kBase,

	"text-davinci-003": P50kBase,
	"text-davinci-002": P50kBase,
	"text-davinci-001": R50kBase,
	"text-curie-001":   R50kBase,
	"text-babbage-001": R50kBase,
	"text-ada-001":     R50kBase,
	"davinci":          R50kBase,
	"curie":            R50kBase,
	"babbage":          R50kBase,
	"ada":              R50kBase,

	"code-davinci-002": P50kBase,
	"code-davinci-001": P50kBase,
	"code-cushman-002": P50kBase,
	"code-cushman-001": P50kBase,
	"davinci-codex":    P50kBase,
	"cushman-codex":    P50kBase,

	"text-davinci-edit-001": P50kEdit,
	"code-davinci-edit-001": P50kEdit,

	"text-embedding-ada-002": CL100kBase,
	"text-embedding-3-small": CL100kBase,
	"text-embedding-3-large": CL100kBase,

	"text-similarity-davinci-001":  R50kBase,
	"text-similarity-curie-001":    R50kBase,
	"text-similarity-babbage-001":  R50kBase,
	"text-similarity-ada-001":      R50kBase,
	"text-search-davinci-doc-001":  R50kBase,
	"text-search-curie-doc-001":    R50kBase,
	"text-search-babbage-doc-001":  R50kBase,
	"text-search-ada-doc-001":      R50kBase,
	"code-search-babbage-code-001": R50kBase,
	"code-search-ada-code-001":     R50kBase,

	"gpt2":  GPT2,
	"gpt-2": GPT2,
}

ModelToEncoding maps models to encodings.

Functions ¶

func ConvertToMergeableBPERanks ¶

func ConvertToMergeableBPERanks(bpe io.Reader) (map[string]uint, error)

ConvertToMergeableBPERanks converts the BPE file to mergeable BPE ranks.
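As a rough illustration of what such a conversion involves, the sketch below parses rank data in the line format used by OpenAI's published .tiktoken files, where each line holds a base64-encoded token followed by its integer rank. That this function expects exactly this format is an assumption, and parseRanks and parseRanksFromString are hypothetical helpers, not part of this package.

```go
package main

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// parseRanks is a hypothetical stand-in that mirrors the documented signature.
// It assumes one "<base64 token> <rank>" pair per line.
func parseRanks(r io.Reader) (map[string]uint, error) {
	ranks := make(map[string]uint)
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue // skip blank lines
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		token, err := base64.StdEncoding.DecodeString(fields[0])
		if err != nil {
			return nil, fmt.Errorf("decode token: %w", err)
		}
		rank, err := strconv.ParseUint(fields[1], 10, strconv.IntSize)
		if err != nil {
			return nil, fmt.Errorf("parse rank: %w", err)
		}
		// Tokens are raw bytes; a Go string can hold them as a map key.
		ranks[string(token)] = uint(rank)
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return ranks, nil
}

// parseRanksFromString is a convenience wrapper for in-memory input.
func parseRanksFromString(s string) (map[string]uint, error) {
	return parseRanks(strings.NewReader(s))
}

func main() {
	// "SGVsbG8=" is base64 for "Hello"; "IFdvcmxk" is base64 for " World".
	ranks, err := parseRanksFromString("SGVsbG8= 0\nIFdvcmxk 1\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(ranks), ranks["Hello"], ranks[" World"])
}
```

Keying the map by string rather than []byte matches the map[string]uint return type shown in the signature and lets arbitrary token bytes be looked up directly.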

func CovertVocabBPEAndEncoderJSONToMergeableBPERanks ¶

func CovertVocabBPEAndEncoderJSONToMergeableBPERanks(vocabBPE io.Reader, encoderJSON io.Reader) (map[string]uint, error)

CovertVocabBPEAndEncoderJSONToMergeableBPERanks converts the vocabulary BPE and encoder JSON to mergeable BPE ranks.

Types ¶

type Codec ¶

type Codec struct {
	Name           string          `json:"name"`
	ExplicitNVocab int             `json:"explicit_n_vocab"`
	PatStr         string          `json:"pat_str"`
	MergeableRanks map[string]uint `json:"mergeable_ranks"`
	SpecialTokens  map[string]uint `json:"special_tokens"`
}

Codec represents a token encoding codec.

func NewCL100kBase ¶

func NewCL100kBase() (*Codec, error)

NewCL100kBase creates a new Codec instance for the cl100k_base tokenization scheme. It loads the mergeable ranks from the embedded cl100kBase resource. The function returns a pointer to the Codec, or an error if one occurs.

func NewClaude ¶ added in v0.0.5

func NewClaude() (*Codec, error)

NewClaude creates a new Codec instance for the claude tokenization scheme. It loads the mergeable ranks from the embedded claude resource. The function returns a pointer to the Codec, or an error if one occurs.

func NewGPT2 ¶

func NewGPT2() (*Codec, error)

NewGPT2 creates a new Codec instance for the GPT-2 tokenization scheme. It loads the mergeable ranks from the embedded gpt2Vocab and gpt2Encode resources. The function returns a pointer to the Codec, or an error if one occurs.

func NewO200KBase ¶ added in v0.0.7

func NewO200KBase() (*Codec, error)

NewO200KBase creates a new Codec instance for the o200k_base tokenization scheme. It loads the mergeable ranks from the embedded o200kBase resource. The function returns a pointer to the Codec, or an error if one occurs.

func NewP50kBase ¶

func NewP50kBase() (*Codec, error)

NewP50kBase creates a new Codec instance for the p50k_base tokenization scheme. It loads the mergeable ranks from the embedded p50kBase resource. The function returns a pointer to the Codec, or an error if one occurs.

func NewP50kEdit ¶

func NewP50kEdit() (*Codec, error)

NewP50kEdit creates a new Codec instance for the p50k_edit tokenization scheme. It loads the mergeable ranks from the embedded p50kBase resource. The function returns a pointer to the Codec, or an error if one occurs.

func NewR50kBase ¶

func NewR50kBase() (*Codec, error)

NewR50kBase creates a new Codec instance for the r50k_base tokenization scheme. It loads the mergeable ranks from the embedded r50kBase resource. The function returns a pointer to the Codec, or an error if one occurs.

type Encoding ¶

type Encoding struct {
	// contains filtered or unexported fields
}

Encoding represents a text encoding scheme.

func NewEncoding ¶

func NewEncoding(codec *Codec) (*Encoding, error)

NewEncoding creates a new Encoding instance based on the provided Codec.

func NewEncodingByName ¶

func NewEncodingByName(encoding string) (*Encoding, error)

NewEncodingByName creates a new Encoding instance based on the given encoding name.

func NewEncodingForModel ¶

func NewEncodingForModel(model string) (*Encoding, error)

NewEncodingForModel returns a new Encoding based on the given model. It checks the ModelToEncoding map and ModelPrefixToEncoding map to find a matching encoding.
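The two-step lookup described here can be sketched with the standard library alone. The example below uses small excerpts of the two tables; resolveEncoding is a hypothetical helper, and its choice of the longest matching prefix is an assumption made for determinism, not a statement about how the package orders its prefix checks.

```go
package main

import (
	"fmt"
	"strings"
)

// Excerpts of the documented tables; the real maps contain more entries.
var modelToEncoding = map[string]string{
	"gpt-4o": "o200k_base",
	"gpt-4":  "cl100k_base",
	"gpt2":   "gpt2",
}

var modelPrefixToEncoding = map[string]string{
	"gpt-4o-": "o200k_base",
	"gpt-4-":  "cl100k_base",
	"gpt-3.5": "cl100k_base",
}

// resolveEncoding is a hypothetical helper, not part of the package.
// It tries an exact model match first, then the longest matching prefix.
func resolveEncoding(model string) (string, bool) {
	if name, ok := modelToEncoding[model]; ok {
		return name, true
	}
	best, found := "", false
	for prefix := range modelPrefixToEncoding {
		if strings.HasPrefix(model, prefix) && len(prefix) > len(best) {
			best, found = prefix, true
		}
	}
	if !found {
		return "", false
	}
	return modelPrefixToEncoding[best], true
}

func main() {
	models := []string{"gpt-4o", "gpt-4-0613", "gpt-3.5-turbo-0125", "unknown-model"}
	for _, m := range models {
		name, ok := resolveEncoding(m)
		fmt.Printf("%s -> %q (found: %v)\n", m, name, ok)
	}
}
```

Picking the longest prefix avoids depending on Go's randomized map iteration order when more than one prefix matches.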

func (*Encoding) Decode ¶

func (enc *Encoding) Decode(tokens []uint) []byte

Decode decodes the given tokens using the Encoding's core BPE.
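Conceptually, decoding BPE tokens amounts to mapping each ID back to its byte sequence and concatenating the results. The sketch below shows that idea with a two-entry rank table taken from the README output; decodeTokens is a hypothetical helper and is not how the package is necessarily implemented.

```go
package main

import "fmt"

// decodeTokens is a hypothetical helper that inverts a rank table
// and concatenates the bytes of each token ID in order.
func decodeTokens(tokens []uint, ranks map[string]uint) []byte {
	byID := make(map[uint]string, len(ranks))
	for tok, id := range ranks {
		byID[id] = tok
	}
	var out []byte
	for _, id := range tokens {
		// In this sketch, unknown IDs contribute no bytes.
		out = append(out, byID[id]...)
	}
	return out
}

func main() {
	// IDs taken from the README example for "Hello World".
	ranks := map[string]uint{"Hello": 9906, " World": 4435}
	fmt.Printf("%s\n", decodeTokens([]uint{9906, 4435}, ranks))
}
```

Returning []byte rather than string reflects that a token sequence does not have to decode to valid UTF-8.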

func (*Encoding) Encode ¶

func (enc *Encoding) Encode(text string, allowedSpecial, disallowedSpecial []string) ([]uint, []string, error)

Encode encodes the given text with the specified allowed and disallowed special tokens.

func (*Encoding) EncodeOrdinary ¶

func (enc *Encoding) EncodeOrdinary(text string) ([]uint, []string)

EncodeOrdinary encodes the given text using the Encoding's core BPE.
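For readers unfamiliar with byte-pair encoding, the following is a minimal sketch of the merge loop at its core: start from single bytes and repeatedly merge the adjacent pair with the lowest rank until no pair is in the table. bpeEncode and the toy rank table are hypothetical. Real tiktoken encodings also split the text with a regular expression (compare the PatStr field of Codec) before merging, which this sketch omits.

```go
package main

import "fmt"

// bpeEncode is a hypothetical, simplified byte-pair merge loop.
// It assumes every single byte of the input has an entry in ranks.
func bpeEncode(piece string, ranks map[string]uint) ([]uint, []string) {
	// Start with one part per byte.
	parts := make([]string, 0, len(piece))
	for i := 0; i < len(piece); i++ {
		parts = append(parts, piece[i:i+1])
	}
	for len(parts) > 1 {
		// Find the adjacent pair whose concatenation has the lowest rank.
		best, bestRank, found := 0, uint(0), false
		for i := 0; i+1 < len(parts); i++ {
			r, ok := ranks[parts[i]+parts[i+1]]
			if ok && (!found || r < bestRank) {
				best, bestRank, found = i, r, true
			}
		}
		if !found {
			break // no mergeable pair remains
		}
		// Merge parts[best] and parts[best+1] into a single part.
		merged := parts[best] + parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
		parts[best] = merged
	}
	// Map each final part to its rank, which serves as the token ID.
	ids := make([]uint, len(parts))
	for i, p := range parts {
		ids[i] = ranks[p]
	}
	return ids, parts
}

func main() {
	ranks := map[string]uint{"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
	ids, parts := bpeEncode("abcab", ranks)
	fmt.Println(ids, parts) // prints: [4 3] [abc ab]
}
```

Like EncodeOrdinary, the sketch returns both the token IDs and the token strings, so the two slices can be inspected side by side.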

func (*Encoding) Name ¶

func (enc *Encoding) Name() string

Name returns the name of the Encoding.

Directories ¶

Path	Synopsis
_examples
decode
encode
