tokenizer

package module
v0.2.1
Published: Nov 22, 2024 License: MIT Imports: 3 Imported by: 40

README

Tokenizer

This is a pure Go port of OpenAI's tokenizer.

Usage

package main

import (
    "fmt"
    "github.com/tiktoken-go/tokenizer"
)

func main() {
    enc, err := tokenizer.Get(tokenizer.Cl100kBase)
    if err != nil {
        panic(err)
    }

    // Encode returns the token ids, the token strings, and an error.
    // This should print a list of token ids.
    ids, _, err := enc.Encode("supercalifragilistic")
    if err != nil {
        panic(err)
    }
    fmt.Println(ids)

    // Decode should print the original string back.
    text, err := enc.Decode(ids)
    if err != nil {
        panic(err)
    }
    fmt.Println(text)
}

Alternatively, you can use the included command-line tool:

> tokenizer -h

Usage of tokenizer:
  -decode string
        tokens to decode
  -encode string
        text to encode
  -token string
        text to calculate token

> tokenizer -encode supercalifragilistic

Todo

  • ✅ port code
  • ✅ o200k_base encoding
  • ✅ cl100k_base encoding
  • ✅ r50k_base encoding
  • ✅ p50k_base encoding
  • ✅ p50k_edit encoding
  • ✅ tests
  • ❌ handle special tokens
  • ❌ gpt-2 model

Caveats

This library embeds OpenAI's vocabularies, which are not small (~4 MB), as Go maps. This differs from the Python version of tiktoken, which downloads the dictionaries and stores them in a cache folder.

However, since the dictionaries are compiled in during the Go build, performance and start-up times should be better than downloading and loading them at runtime.

Alternatives

Here is a list of other libraries that do something similar.

Documentation

Constants

This section is empty.

Variables

var (
	ErrModelNotSupported    = errors.New("model not supported")
	ErrEncodingNotSupported = errors.New("encoding not supported")
)
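
Since both values are exported sentinels, callers can presumably test for them with errors.Is rather than by comparing message strings. The sketch below shows that pattern under stated assumptions: the two variables are redeclared locally so the program compiles without the module, encodingFor is a hypothetical stand-in for ForModel, and its two-entry model table is illustrative only. Whether the real ForModel returns the sentinel directly or wraps it is not stated here; errors.Is handles both cases.

```go
package main

import (
	"errors"
	"fmt"
)

// Local mirrors of the sentinel errors listed above, redeclared so this
// sketch compiles on its own.
var (
	ErrModelNotSupported    = errors.New("model not supported")
	ErrEncodingNotSupported = errors.New("encoding not supported")
)

// encodingFor is a hypothetical stand-in for ForModel. It maps a model
// name to an encoding name and wraps the sentinel for unknown models.
func encodingFor(model string) (string, error) {
	known := map[string]string{ // illustrative table, not the library's
		"gpt-4o": "o200k_base",
		"gpt-4":  "cl100k_base",
	}
	enc, ok := known[model]
	if !ok {
		return "", fmt.Errorf("%q: %w", model, ErrModelNotSupported)
	}
	return enc, nil
}

// encodingOrDefault falls back to a default encoding when the model is
// unknown, and propagates every other error unchanged.
func encodingOrDefault(model, fallback string) (string, error) {
	enc, err := encodingFor(model)
	switch {
	case err == nil:
		return enc, nil
	case errors.Is(err, ErrModelNotSupported):
		return fallback, nil
	default:
		return "", err
	}
}

func main() {
	enc, err := encodingOrDefault("my-custom-model", "cl100k_base")
	if err != nil {
		panic(err)
	}
	fmt.Println(enc) // prints cl100k_base
}
```

With the real package, the same switch would compare against tokenizer.ErrModelNotSupported.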

Functions

This section is empty.

Types

type Codec

type Codec interface {
	GetName() string
	Encode(string) ([]uint, []string, error)
	Decode([]uint) (string, error)
}
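
Because Codec is an interface, code that only needs token counts can depend on it and be exercised with a stub. The following is a minimal sketch of that idea, not part of the package: the interface is redeclared locally so the program compiles without the module, CountTokens is a hypothetical helper, and wordCodec is a hypothetical whitespace splitter that stands in for a real BPE codec.

```go
package main

import (
	"fmt"
	"strings"
)

// Codec mirrors the interface documented above, redeclared locally so
// this sketch compiles on its own.
type Codec interface {
	GetName() string
	Encode(string) ([]uint, []string, error)
	Decode([]uint) (string, error)
}

// wordCodec is a hypothetical stub: it assigns an id to each distinct
// whitespace-separated word. It is not a BPE tokenizer.
type wordCodec struct {
	ids   map[string]uint
	words []string
}

func newWordCodec() *wordCodec { return &wordCodec{ids: map[string]uint{}} }

func (w *wordCodec) GetName() string { return "word_stub" }

func (w *wordCodec) Encode(s string) ([]uint, []string, error) {
	toks := strings.Fields(s)
	out := make([]uint, 0, len(toks))
	for _, t := range toks {
		id, ok := w.ids[t]
		if !ok {
			// First time this word is seen: give it the next free id.
			id = uint(len(w.words))
			w.ids[t] = id
			w.words = append(w.words, t)
		}
		out = append(out, id)
	}
	return out, toks, nil
}

func (w *wordCodec) Decode(ids []uint) (string, error) {
	parts := make([]string, 0, len(ids))
	for _, id := range ids {
		if int(id) >= len(w.words) {
			return "", fmt.Errorf("unknown token id %d", id)
		}
		parts = append(parts, w.words[id])
	}
	return strings.Join(parts, " "), nil
}

// CountTokens is a hypothetical helper that works with any Codec.
func CountTokens(c Codec, text string) (int, error) {
	ids, _, err := c.Encode(text)
	if err != nil {
		return 0, fmt.Errorf("encode with %s: %w", c.GetName(), err)
	}
	return len(ids), nil
}

func main() {
	n, err := CountTokens(newWordCodec(), "the quick brown the")
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 4
}
```

In real use, the value passed to CountTokens would come from Get or ForModel instead of the stub.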

func ForModel

func ForModel(model Model) (Codec, error)

ForModel returns a new instance of a Codec implementation based on the specified OpenAI model. If the specified model is not supported, an error is returned.

func Get

func Get(encoding Encoding) (Codec, error)

Get returns a new instance of a Codec implementation based on the specified encoding format. The returned Codec instance can be used to encode (tokenize) and decode (reassemble) text. If the specified encoding is not supported, an error is returned.

type Encoding

type Encoding string
const (
	GPT2Enc    Encoding = "gpt2"
	R50kBase   Encoding = "r50k_base"
	P50kBase   Encoding = "p50k_base"
	P50kEdit   Encoding = "p50k_edit"
	Cl100kBase Encoding = "cl100k_base"
	O200kBase  Encoding = "o200k_base"
)
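
Since Encoding is a string type, a value read from a flag or config file can be converted directly, but an unchecked conversion would only fail later, when passed to Get. Here is a small sketch of validating such input up front. The type and constants are redeclared locally so the program compiles without the module, and ParseEncoding is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// Encoding and its constants mirror the declarations above, redeclared
// locally so this sketch compiles on its own.
type Encoding string

const (
	GPT2Enc    Encoding = "gpt2"
	R50kBase   Encoding = "r50k_base"
	P50kBase   Encoding = "p50k_base"
	P50kEdit   Encoding = "p50k_edit"
	Cl100kBase Encoding = "cl100k_base"
	O200kBase  Encoding = "o200k_base"
)

// knownEncodings lists every constant declared above.
var knownEncodings = []Encoding{
	GPT2Enc, R50kBase, P50kBase, P50kEdit, Cl100kBase, O200kBase,
}

// ParseEncoding is a hypothetical helper: it trims and lower-cases the
// input, then accepts it only if it matches a known constant.
func ParseEncoding(s string) (Encoding, error) {
	want := Encoding(strings.ToLower(strings.TrimSpace(s)))
	for _, e := range knownEncodings {
		if e == want {
			return e, nil
		}
	}
	return "", fmt.Errorf("unknown encoding %q", s)
}

func main() {
	enc, err := ParseEncoding(" CL100K_BASE ")
	if err != nil {
		panic(err)
	}
	fmt.Println(enc) // prints cl100k_base
}
```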

type Model

type Model string
const (
	O1Preview                Model = "o1-preview"
	O1Mini                   Model = "o1-mini"
	GPT4o                    Model = "gpt-4o"
	GPT4                     Model = "gpt-4"
	GPT35Turbo               Model = "gpt-3.5-turbo"
	GPT35                    Model = "gpt-3.5"
	TextEmbeddingAda002      Model = "text-embedding-ada-002"
	TextDavinci003           Model = "text-davinci-003"
	TextDavinci002           Model = "text-davinci-002"
	CodeDavinci002           Model = "code-davinci-002"
	CodeDavinci001           Model = "code-davinci-001"
	CodeCushman002           Model = "code-cushman-002"
	CodeCushman001           Model = "code-cushman-001"
	DavinciCodex             Model = "davinci-codex"
	CushmanCodex             Model = "cushman-codex"
	TextDavinci001           Model = "text-davinci-001"
	TextCurie001             Model = "text-curie-001"
	TextBabbage001           Model = "text-babbage-001"
	TextAda001               Model = "text-ada-001"
	Davinci                  Model = "davinci"
	Curie                    Model = "curie"
	Babbage                  Model = "babbage"
	Ada                      Model = "ada"
	TextSimilarityDavinci001 Model = "text-similarity-davinci-001"
	TextSimilarityCurie001   Model = "text-similarity-curie-001"
	TextSimilarityBabbage001 Model = "text-similarity-babbage-001"
	TextSimilarityAda001     Model = "text-similarity-ada-001"
	TextSearchDavinciDoc001  Model = "text-search-davinci-doc-001"
	TextSearchCurieDoc001    Model = "text-search-curie-doc-001"
	TextSearchAdaDoc001      Model = "text-search-ada-doc-001"
	TextSearchBabbageDoc001  Model = "text-search-babbage-doc-001"
	CodeSearchBabbageCode001 Model = "code-search-babbage-code-001"
	CodeSearchAdaCode001     Model = "code-search-ada-code-001"
	TextDavinciEdit001       Model = "text-davinci-edit-001"
	CodeDavinciEdit001       Model = "code-davinci-edit-001"
	GPT2                     Model = "gpt2"
)

Directories

Path            Synopsis
cmd
internal/cmd
