Documentation
Overview
Package tiktoken tokenizes and encodes text using the tiktoken byte-pair encoding (BPE) algorithm. It provides codecs for several encodings (for example cl100k_base and o200k_base) and constructors that select an encoding by name or by model.
Constants
const (
	StartOfText string = "<|startoftext|>"
	EndOfText   string = "<|endoftext|>"
	FimPrefix   string = "<|fim_prefix|>"
	FimMiddle   string = "<|fim_middle|>"
	FimSuffix   string = "<|fim_suffix|>"
	EndOfPrompt string = "<|endofprompt|>"
)
Constants for special tokens.
const (
	O200kBase  string = "o200k_base"
	CL100kBase string = "cl100k_base"
	P50kBase   string = "p50k_base"
	P50kEdit   string = "p50k_edit"
	R50kBase   string = "r50k_base"
	GPT2       string = "gpt2"
)
Constants for different encodings.
Variables
var AllSpecial = []string{"all"}
var ModelPrefixToEncoding = map[string]string{
	"gpt-4o-": O200kBase,
	"gpt-4-": CL100kBase,
	"gpt-3.5-turbo-": CL100kBase,
	"gpt-3.5": CL100kBase,
	"gpt-35-turbo": CL100kBase,
	"ft:gpt-4": CL100kBase,
	"ft:gpt-3.5-turbo": CL100kBase,
	"ft:davinci-002": CL100kBase,
	"ft:babbage-002": CL100kBase,
}
ModelPrefixToEncoding maps model prefixes to encodings.
var ModelToEncoding = map[string]string{
	"gpt-4o": O200kBase,
	"gpt-4": CL100kBase,
	"gpt-3.5-turbo": CL100kBase,
	"gpt-35-turbo": CL100kBase,
	"text-davinci-003": P50kBase,
	"text-davinci-002": P50kBase,
	"text-davinci-001": R50kBase,
	"text-curie-001": R50kBase,
	"text-babbage-001": R50kBase,
	"text-ada-001": R50kBase,
	"davinci": R50kBase,
	"curie": R50kBase,
	"babbage": R50kBase,
	"ada": R50kBase,
	"code-davinci-002": P50kBase,
	"code-davinci-001": P50kBase,
	"code-cushman-002": P50kBase,
	"code-cushman-001": P50kBase,
	"davinci-codex": P50kBase,
	"cushman-codex": P50kBase,
	"text-davinci-edit-001": P50kEdit,
	"code-davinci-edit-001": P50kEdit,
	"text-embedding-ada-002": CL100kBase,
	"text-embedding-3-small": CL100kBase,
	"text-embedding-3-large": CL100kBase,
	"text-similarity-davinci-001": R50kBase,
	"text-similarity-curie-001": R50kBase,
	"text-similarity-babbage-001": R50kBase,
	"text-similarity-ada-001": R50kBase,
	"text-search-davinci-doc-001": R50kBase,
	"text-search-curie-doc-001": R50kBase,
	"text-search-babbage-doc-001": R50kBase,
	"text-search-ada-doc-001": R50kBase,
	"code-search-babbage-code-001": R50kBase,
	"code-search-ada-code-001": R50kBase,
	"gpt2": GPT2,
	"gpt-2": GPT2,
}
ModelToEncoding maps models to encodings.
Functions
func ConvertToMergeableBPERanks
ConvertToMergeableBPERanks converts the BPE file to mergeable BPE ranks.
Types
type Codec
type Codec struct {
	Name           string          `json:"name"`
	ExplicitNVocab int             `json:"explicit_n_vocab"`
	PatStr         string          `json:"pat_str"`
	MergeableRanks map[string]uint `json:"mergeable_ranks"`
	SpecialTokens  map[string]uint `json:"special_tokens"`
}
Codec represents a token encoding codec.
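The JSON struct tags on Codec suggest that a codec can be serialized with encoding/json. The sketch below round-trips a value through JSON using a local mirror type named codec, so that it compiles without importing this package. The mirror type, the roundTrip helper, and the toy vocabulary values are all illustrative assumptions, not real codec data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// codec mirrors the exported fields and JSON tags of Codec.
// It is a local stand-in used only for this example.
type codec struct {
	Name           string          `json:"name"`
	ExplicitNVocab int             `json:"explicit_n_vocab"`
	PatStr         string          `json:"pat_str"`
	MergeableRanks map[string]uint `json:"mergeable_ranks"`
	SpecialTokens  map[string]uint `json:"special_tokens"`
}

// roundTrip marshals c to JSON and decodes it back into a new value.
func roundTrip(c codec) (codec, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return codec{}, fmt.Errorf("marshal: %w", err)
	}
	var out codec
	if err := json.Unmarshal(raw, &out); err != nil {
		return codec{}, fmt.Errorf("unmarshal: %w", err)
	}
	return out, nil
}

func main() {
	in := codec{
		Name:           "toy", // made-up name, not a real encoding
		ExplicitNVocab: 3,
		PatStr:         `\S+|\s+`,
		MergeableRanks: map[string]uint{"a": 0, "b": 1},
		SpecialTokens:  map[string]uint{"<|endoftext|>": 2},
	}
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Name, out.ExplicitNVocab, out.SpecialTokens["<|endoftext|>"])
}
```

One caveat with this approach: encoding/json coerces strings to valid UTF-8, so rank keys that hold arbitrary byte sequences may not round-trip unchanged.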
func NewCL100kBase
NewCL100kBase creates a new Codec instance for the cl100k_base tokenization scheme. It loads the mergeable ranks from the embedded cl100kBase resource and returns a pointer to the Codec, or an error if one occurs.
func NewClaude (added in v0.0.5)
NewClaude creates a new Codec instance for the claude tokenization scheme. It loads the mergeable ranks from the embedded claude resource and returns a pointer to the Codec, or an error if one occurs.
func NewGPT2
NewGPT2 creates a new Codec instance for the GPT-2 tokenization scheme. It loads the mergeable ranks from the embedded gpt2Vocab and gpt2Encode resources and returns a pointer to the Codec, or an error if one occurs.
func NewO200KBase (added in v0.0.7)
NewO200KBase creates a new Codec instance for the o200k_base tokenization scheme. It loads the mergeable ranks from the embedded o200kBase resource and returns a pointer to the Codec, or an error if one occurs.
func NewP50kBase
NewP50kBase creates a new Codec instance for the p50k_base tokenization scheme. It loads the mergeable ranks from the embedded p50kBase resource and returns a pointer to the Codec, or an error if one occurs.
func NewP50kEdit
NewP50kEdit creates a new Codec instance for the p50k_edit tokenization scheme. It loads the mergeable ranks from the embedded p50kBase resource and returns a pointer to the Codec, or an error if one occurs.
func NewR50kBase
NewR50kBase creates a new Codec instance for the r50k_base tokenization scheme. It loads the mergeable ranks from the embedded r50kBase resource and returns a pointer to the Codec, or an error if one occurs.
type Encoding
type Encoding struct {
// contains filtered or unexported fields
}
Encoding represents a text encoding scheme.
func NewEncoding
NewEncoding creates a new Encoding instance based on the provided Codec.
func NewEncodingByName
NewEncodingByName creates a new Encoding instance based on the given encoding name.
func NewEncodingForModel
NewEncodingForModel returns a new Encoding based on the given model. It checks the ModelToEncoding map and ModelPrefixToEncoding map to find a matching encoding.
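The two-map lookup described above can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes an exact match in the model map is tried first and the prefix map second, and it picks the longest matching prefix so the result does not depend on Go's randomized map iteration order. The helper encodingNameForModel and the lowercase map names are hypothetical, and the maps are short excerpts of the documented ones.

```go
package main

import (
	"fmt"
	"strings"
)

// Excerpts of ModelToEncoding and ModelPrefixToEncoding, with the
// encoding constants written out as their string values.
var modelToEncoding = map[string]string{
	"gpt-4o":           "o200k_base",
	"gpt-4":            "cl100k_base",
	"text-davinci-003": "p50k_base",
}

var modelPrefixToEncoding = map[string]string{
	"gpt-4o-":        "o200k_base",
	"gpt-4-":         "cl100k_base",
	"gpt-3.5-turbo-": "cl100k_base",
	"gpt-3.5":        "cl100k_base",
}

// encodingNameForModel is a hypothetical helper. It tries an exact model
// match first, then falls back to the longest matching prefix.
func encodingNameForModel(model string) (string, error) {
	if name, ok := modelToEncoding[model]; ok {
		return name, nil
	}
	best := ""
	for prefix := range modelPrefixToEncoding {
		if strings.HasPrefix(model, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	if best == "" {
		return "", fmt.Errorf("no encoding found for model %q", model)
	}
	return modelPrefixToEncoding[best], nil
}

func main() {
	for _, m := range []string{"gpt-4o", "gpt-4o-mini", "gpt-4-0613", "unknown"} {
		name, err := encodingNameForModel(m)
		fmt.Println(m, name, err)
	}
}
```

The returned name could then be passed to a by-name constructor such as NewEncodingByName.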
func (*Encoding) Encode
func (enc *Encoding) Encode(text string, allowedSpecial, disallowedSpecial []string) ([]uint, []string, error)
Encode encodes the given text with the specified allowed and disallowed special tokens.
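How the allowedSpecial and disallowedSpecial arguments interact is not spelled out here. The sketch below shows one plausible resolution, modeled on the reference Python tiktoken library: the sentinel []string{"all"} (the value of AllSpecial) expands to every special token, and a disallowed list of "all" means every special token that is not allowed. Whether this package behaves the same way is an assumption, and the helpers isAll, resolveDisallowed, and checkSpecial are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// isAll reports whether list is the sentinel []string{"all"}.
func isAll(list []string) bool {
	return len(list) == 1 && list[0] == "all"
}

// resolveDisallowed returns the special tokens that should be rejected.
// When disallowed is the "all" sentinel, the result is every special
// token that is not allowed; otherwise disallowed is used as given.
func resolveDisallowed(special, allowed, disallowed []string) []string {
	if !isAll(disallowed) {
		return disallowed
	}
	if isAll(allowed) {
		allowed = special
	}
	allowedSet := make(map[string]bool, len(allowed))
	for _, t := range allowed {
		allowedSet[t] = true
	}
	var rejected []string
	for _, t := range special {
		if !allowedSet[t] {
			rejected = append(rejected, t)
		}
	}
	return rejected
}

// checkSpecial returns an error if text contains any rejected token.
func checkSpecial(text string, rejected []string) error {
	for _, t := range rejected {
		if strings.Contains(text, t) {
			return fmt.Errorf("text contains disallowed special token %q", t)
		}
	}
	return nil
}

func main() {
	special := []string{"<|endoftext|>", "<|fim_prefix|>"}
	rejected := resolveDisallowed(special, []string{"<|endoftext|>"}, []string{"all"})
	fmt.Println(checkSpecial("hello <|endoftext|>", rejected))
	fmt.Println(checkSpecial("hello <|fim_prefix|>", rejected))
}
```

Rejecting unexpected special tokens by default is a common safeguard, since user-supplied text that contains a control token such as <|endoftext|> would otherwise be encoded as that control token.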
func (*Encoding) EncodeOrdinary
EncodeOrdinary encodes the given text using the Encoding's core BPE.
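For readers unfamiliar with the core BPE step, here is a minimal sketch of byte-pair merging over a single piece of text. It is the textbook quadratic version, not necessarily how this package implements it: the piece is split into single bytes, and the adjacent pair whose concatenation has the lowest rank is merged repeatedly until no adjacent pair appears in the rank table. A tiktoken-style encoder typically first splits the input with a regular expression (such as the PatStr of a Codec) and applies this step to each piece. The function bpeEncode and the toy rank table are hypothetical.

```go
package main

import "fmt"

// bpeEncode is a hypothetical illustration of byte-pair merging.
// ranks maps byte sequences to their rank, which doubles as the token ID.
func bpeEncode(piece string, ranks map[string]uint) ([]uint, error) {
	// Start with one part per byte.
	parts := make([]string, 0, len(piece))
	for i := 0; i < len(piece); i++ {
		parts = append(parts, piece[i:i+1])
	}
	for len(parts) > 1 {
		// Find the adjacent pair with the lowest rank (leftmost on ties).
		bestIdx := -1
		var bestRank uint
		for i := 0; i+1 < len(parts); i++ {
			r, ok := ranks[parts[i]+parts[i+1]]
			if ok && (bestIdx == -1 || r < bestRank) {
				bestIdx, bestRank = i, r
			}
		}
		if bestIdx == -1 {
			break // no mergeable pair left
		}
		// Replace the pair at bestIdx with its concatenation.
		merged := parts[bestIdx] + parts[bestIdx+1]
		parts = append(parts[:bestIdx+1], parts[bestIdx+2:]...)
		parts[bestIdx] = merged
	}
	// Look up the token ID of each remaining part.
	tokens := make([]uint, 0, len(parts))
	for _, p := range parts {
		id, ok := ranks[p]
		if !ok {
			return nil, fmt.Errorf("no rank for %q", p)
		}
		tokens = append(tokens, id)
	}
	return tokens, nil
}

func main() {
	// Toy rank table for illustration only.
	ranks := map[string]uint{"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
	for _, s := range []string{"abc", "cab"} {
		tokens, err := bpeEncode(s, ranks)
		fmt.Println(s, tokens, err)
	}
}
```

With this toy table, "abc" collapses to a single token because "ab" and then "abc" are both mergeable, while "cab" stops at two tokens because "cab" has no rank.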