tiktoken

Version: v0.0.0-...-3c69eb2
Published: May 5, 2023
License: MIT
Imports: 16
Imported by: 0

Repository

github.com/aavaz-ai/tiktoken-go

README ¶

tiktoken-go

OpenAI's tiktoken in Go, ported for use at Enterpret.

Tiktoken is a fast BPE tokeniser for use with OpenAI's models.

This is a port of the original tiktoken.

Usage

Install

go get github.com/aavaz-ai/tiktoken-go

Cache

Tiktoken-go has the same cache mechanism as the original Tiktoken library.

You can set the cache directory by using the environment variable TIKTOKEN_CACHE_DIR.

Once this variable is set, tiktoken-go will use this directory to cache the token dictionary.

If this variable is not set, nothing is cached on disk, and tiktoken-go downloads the dictionary every time a process initializes an encoding for the first time.

Example

Get tokens by encoding

package main

import (
	"fmt"

	"github.com/aavaz-ai/tiktoken-go"
)

func main() {
	text := "Hello, world!"
	encoding := "r50k_base"

	// Load the encoding by name.
	tke, err := tiktoken.GetEncoding(encoding)
	if err != nil {
		fmt.Printf("GetEncoding: %v\n", err)
		return
	}

	// Encode the text into token IDs.
	tokens := tke.Encode(text, nil, nil)

	// The token count is the length of the returned slice.
	fmt.Println(len(tokens))
}

Get tokens by model

package main

import (
	"fmt"

	"github.com/aavaz-ai/tiktoken-go"
)

func main() {
	text := "Hello, world!"
	model := "davinci"

	// Look up the encoding used by the model.
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		fmt.Printf("EncodingForModel: %v\n", err)
		return
	}

	// Encode the text into token IDs.
	tokens := tkm.Encode(text, nil, nil)

	// The token count is the length of the returned slice.
	fmt.Println(len(tokens))
}

Counting tokens for chat API calls

Below is an example function for counting tokens for messages passed to gpt-3.5-turbo-0301 or gpt-4-0314.

The following code was written by @nasa1024 based on openai-cookbook examples.

Note that the way message tokens are counted may change at any time, so this code may not remain accurate.

For an exact count, refer to the official documentation.

If you find that this code no longer applies, feel free to submit a PR or open an issue.

package main

import (
	"fmt"

	"github.com/aavaz-ai/tiktoken-go"
	"github.com/sashabaranov/go-openai"
)

// NumTokensFromMessages estimates the number of prompt tokens used by messages.
func NumTokensFromMessages(messages []openai.ChatCompletionMessage, model string) (numTokens int) {
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		fmt.Printf("EncodingForModel: %v\n", err)
		return
	}

	var tokensPerMessage int
	var tokensPerName int
	if model == "gpt-3.5-turbo-0301" || model == "gpt-3.5-turbo" {
		tokensPerMessage = 4
		tokensPerName = -1 // if a name is present, the role is omitted
	} else if model == "gpt-4-0314" || model == "gpt-4" {
		tokensPerMessage = 3
		tokensPerName = 1
	} else {
		fmt.Println("Warning: model not recognized. Using gpt-4 per-message token counts.")
		tokensPerMessage = 3
		tokensPerName = 1
	}

	for _, message := range messages {
		numTokens += tokensPerMessage
		numTokens += len(tkm.Encode(message.Content, nil, nil))
		numTokens += len(tkm.Encode(message.Role, nil, nil))
		if message.Name != "" {
			// As in the openai-cookbook example, the name itself is counted too.
			numTokens += len(tkm.Encode(message.Name, nil, nil))
			numTokens += tokensPerName
		}
	}
	// Every reply is primed with three extra tokens.
	numTokens += 3
	return numTokens
}
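
The fixed overhead in the function above can be checked by hand. The sketch below isolates that arithmetic; chatOverhead is a hypothetical helper, not part of tiktoken-go, and it deliberately excludes the tokens of the message text itself (content, role, name), which depend on the encoding.

```go
package main

import "fmt"

// chatOverhead returns the fixed token overhead of a chat request:
// a per-message cost, a per-name adjustment, and a constant 3 added
// once after all messages. It does not count the tokens of the text.
func chatOverhead(numMessages, numNamed, tokensPerMessage, tokensPerName int) int {
	const replyPriming = 3
	return numMessages*tokensPerMessage + numNamed*tokensPerName + replyPriming
}

func main() {
	// gpt-4 values (3, 1): two messages, none named: 2*3 + 0*1 + 3
	fmt.Println(chatOverhead(2, 0, 3, 1)) // 9

	// gpt-3.5-turbo-0301 values (4, -1): two messages, one named: 2*4 + 1*(-1) + 3
	fmt.Println(chatOverhead(2, 1, 4, -1)) // 10
}
```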

Available encodings

Encoding name	OpenAI models
`cl100k_base`	`gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`
`p50k_base`	Codex models, `text-davinci-002`, `text-davinci-003`
`r50k_base` (or `gpt2`)	GPT-3 models like `davinci`

Available models

Model name	Encoding name
gpt-4	cl100k_base
gpt-4-*	cl100k_base
gpt-3.5-turbo	cl100k_base
gpt-3.5-turbo-*	cl100k_base
text-davinci-003	p50k_base
text-davinci-002	p50k_base
text-davinci-001	r50k_base
text-curie-001	r50k_base
text-babbage-001	r50k_base
text-ada-001	r50k_base
davinci	r50k_base
curie	r50k_base
babbage	r50k_base
ada	r50k_base
code-davinci-002	p50k_base
code-davinci-001	p50k_base
code-cushman-002	p50k_base
code-cushman-001	p50k_base
davinci-codex	p50k_base
cushman-codex	p50k_base
text-davinci-edit-001	p50k_edit
code-davinci-edit-001	p50k_edit
text-embedding-ada-002	cl100k_base
text-similarity-davinci-001	r50k_base
text-similarity-curie-001	r50k_base
text-similarity-babbage-001	r50k_base
text-similarity-ada-001	r50k_base
text-search-davinci-doc-001	r50k_base
text-search-curie-doc-001	r50k_base
text-search-babbage-doc-001	r50k_base
text-search-ada-doc-001	r50k_base
code-search-babbage-code-001	r50k_base
code-search-ada-code-001	r50k_base
gpt2	gpt2

Documentation ¶

Index ¶

Constants
Variables
type CoreBPE
- func NewCoreBPE(encoder map[string]int, specialTokensEncoder map[string]int, pattern string) (*CoreBPE, error)
type Encoding
type Tiktoken
- func EncodingForModel(modelName string) (*Tiktoken, error)
- func GetEncoding(encodingName string) (*Tiktoken, error)

Constants ¶

const ENDOFPROMPT string = "<|endofprompt|>"

const ENDOFTEXT string = "<|endoftext|>"

const FIM_MIDDLE string = "<|fim_middle|>"

const FIM_PREFIX string = "<|fim_prefix|>"

const FIM_SUFFIX string = "<|fim_suffix|>"

Variables ¶

var ENCODING_MAP = map[string]*Encoding{}

var MODEL_PREFIX_TO_ENCODING = map[string]string{

	"gpt-4-":         "cl100k_base",
	"gpt-3.5-turbo-": "cl100k_base",
}

var MODEL_TO_ENCODING = map[string]string{

	"gpt-4":         "cl100k_base",
	"gpt-3.5-turbo": "cl100k_base",

	"text-davinci-003": "p50k_base",
	"text-davinci-002": "p50k_base",
	"text-davinci-001": "r50k_base",
	"text-curie-001":   "r50k_base",
	"text-babbage-001": "r50k_base",
	"text-ada-001":     "r50k_base",
	"davinci":          "r50k_base",
	"curie":            "r50k_base",
	"babbage":          "r50k_base",
	"ada":              "r50k_base",

	"code-davinci-002": "p50k_base",
	"code-davinci-001": "p50k_base",
	"code-cushman-002": "p50k_base",
	"code-cushman-001": "p50k_base",
	"davinci-codex":    "p50k_base",
	"cushman-codex":    "p50k_base",

	"text-davinci-edit-001": "p50k_edit",
	"code-davinci-edit-001": "p50k_edit",

	"text-embedding-ada-002": "cl100k_base",

	"text-similarity-davinci-001":  "r50k_base",
	"text-similarity-curie-001":    "r50k_base",
	"text-similarity-babbage-001":  "r50k_base",
	"text-similarity-ada-001":      "r50k_base",
	"text-search-davinci-doc-001":  "r50k_base",
	"text-search-curie-doc-001":    "r50k_base",
	"text-search-babbage-doc-001":  "r50k_base",
	"text-search-ada-doc-001":      "r50k_base",
	"code-search-babbage-code-001": "r50k_base",
	"code-search-ada-code-001":     "r50k_base",

	"gpt2": "gpt2",
}
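
The two maps suggest a two-step lookup: an exact model name first, then a known prefix such as gpt-4-. The source of EncodingForModel is not shown on this page, so the following is only a sketch of that idea under this assumption. The helper encodingNameFor and the lower-case map names are hypothetical, and the maps are small excerpts of the ones listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// Excerpts of MODEL_TO_ENCODING and MODEL_PREFIX_TO_ENCODING, for illustration.
var modelToEncoding = map[string]string{
	"gpt-4":            "cl100k_base",
	"gpt-3.5-turbo":    "cl100k_base",
	"text-davinci-003": "p50k_base",
	"davinci":          "r50k_base",
}

var modelPrefixToEncoding = map[string]string{
	"gpt-4-":         "cl100k_base",
	"gpt-3.5-turbo-": "cl100k_base",
}

// encodingNameFor is a hypothetical helper: it tries an exact match,
// then falls back to a prefix match, and otherwise returns an error.
func encodingNameFor(model string) (string, error) {
	if name, ok := modelToEncoding[model]; ok {
		return name, nil
	}
	for prefix, name := range modelPrefixToEncoding {
		if strings.HasPrefix(model, prefix) {
			return name, nil
		}
	}
	return "", fmt.Errorf("no encoding found for model %q", model)
}

func main() {
	for _, m := range []string{"davinci", "gpt-4-0314", "unknown-model"} {
		name, err := encodingNameFor(m)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(m, "->", name)
	}
}
```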

Functions ¶

This section is empty.

Types ¶

type CoreBPE ¶

type CoreBPE struct {
	// contains filtered or unexported fields
}

func NewCoreBPE ¶

func NewCoreBPE(encoder map[string]int, specialTokensEncoder map[string]int, pattern string) (*CoreBPE, error)

type Encoding ¶

type Encoding struct {
	Name           string
	PatStr         string
	MergeableRanks map[string]int
	SpecialTokens  map[string]int
	ExplicitNVocab int
}

type Tiktoken ¶

type Tiktoken struct {
	// contains filtered or unexported fields
}

func EncodingForModel ¶

func EncodingForModel(modelName string) (*Tiktoken, error)

func GetEncoding ¶

func GetEncoding(encodingName string) (*Tiktoken, error)

func (*Tiktoken) Decode ¶

func (t *Tiktoken) Decode(tokens []int) string

func (*Tiktoken) Encode ¶

func (t *Tiktoken) Encode(text string, allowedSpecial []string, disallowedSpecial []string) []int

func (*Tiktoken) SpecialTokenRegex ¶

func (t *Tiktoken) SpecialTokenRegex(disallowedSpecialSet map[string]any) *regexp2.Regexp

Directories ¶

Path	Synopsis
test
