tiktoken

Version: v0.0.9
Published: May 20, 2024 | License: MIT | Imports: 13 | Imported by: 3

README

✂️ go-tiktoken

OpenAI's tiktoken tokenizer written in Go. The vocabularies are embedded and do not need to be downloaded at runtime.

Installation

go get github.com/hupe1980/go-tiktoken

How to use

package main

import (
	"fmt"
	"log"

	"github.com/hupe1980/go-tiktoken"
)

func main() {
	encoding, err := tiktoken.NewEncodingForModel("gpt-3.5-turbo")
	if err != nil {
		log.Fatal(err)
	}

	ids, tokens, err := encoding.Encode("Hello World", nil, nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("IDs:", ids)
	fmt.Println("Tokens:", tokens)
}

Output:

IDs: [9906 4435]
Tokens: [Hello  World]

For more usage examples, see the _examples directory.

Supported Encodings

  • ✅ o200k_base
  • ✅ cl100k_base
  • ✅ p50k_base
  • ✅ p50k_edit
  • ✅ r50k_base
  • ✅ gpt2
  • ✅ claude

License

MIT

Documentation

Overview

Package tiktoken tokenizes text with OpenAI's tiktoken byte pair encoding (BPE) scheme. It encodes text into token IDs and decodes token IDs back into bytes.

Index

Constants

const (
	StartOfText string = "<|startoftext|>"
	EndOfText   string = "<|endoftext|>"
	FimPrefix   string = "<|fim_prefix|>"
	FimMiddle   string = "<|fim_middle|>"
	FimSuffix   string = "<|fim_suffix|>"
	EndOfPrompt string = "<|endofprompt|>"
)

Constants for special tokens.

const (
	O200kBase  string = "o200k_base"
	CL100kBase string = "cl100k_base"
	P50kBase   string = "p50k_base"
	P50kEdit   string = "p50k_edit"
	R50kBase   string = "r50k_base"
	GPT2       string = "gpt2"
)

Constants for different encodings.

Variables

var AllSpecial = []string{"all"}
var ModelPrefixToEncoding = map[string]string{

	"gpt-4o-":        O200kBase,
	"gpt-4-":         CL100kBase,
	"gpt-3.5-turbo-": CL100kBase,
	"gpt-3.5":        CL100kBase,
	"gpt-35-turbo":   CL100kBase,

	"ft:gpt-4":         CL100kBase,
	"ft:gpt-3.5-turbo": CL100kBase,
	"ft:davinci-002":   CL100kBase,
	"ft:babbage-002":   CL100kBase,
}

ModelPrefixToEncoding maps model prefixes to encodings.

var ModelToEncoding = map[string]string{

	"gpt-4o":        O200kBase,
	"gpt-4":         CL100kBase,
	"gpt-3.5-turbo": CL100kBase,
	"gpt-35-turbo":  CL100kBase,

	"text-davinci-003": P50kBase,
	"text-davinci-002": P50kBase,
	"text-davinci-001": R50kBase,
	"text-curie-001":   R50kBase,
	"text-babbage-001": R50kBase,
	"text-ada-001":     R50kBase,
	"davinci":          R50kBase,
	"curie":            R50kBase,
	"babbage":          R50kBase,
	"ada":              R50kBase,

	"code-davinci-002": P50kBase,
	"code-davinci-001": P50kBase,
	"code-cushman-002": P50kBase,
	"code-cushman-001": P50kBase,
	"davinci-codex":    P50kBase,
	"cushman-codex":    P50kBase,

	"text-davinci-edit-001": P50kEdit,
	"code-davinci-edit-001": P50kEdit,

	"text-embedding-ada-002": CL100kBase,
	"text-embedding-3-small": CL100kBase,
	"text-embedding-3-large": CL100kBase,

	"text-similarity-davinci-001":  R50kBase,
	"text-similarity-curie-001":    R50kBase,
	"text-similarity-babbage-001":  R50kBase,
	"text-similarity-ada-001":      R50kBase,
	"text-search-davinci-doc-001":  R50kBase,
	"text-search-curie-doc-001":    R50kBase,
	"text-search-babbage-doc-001":  R50kBase,
	"text-search-ada-doc-001":      R50kBase,
	"code-search-babbage-code-001": R50kBase,
	"code-search-ada-code-001":     R50kBase,

	"gpt2":  GPT2,
	"gpt-2": GPT2,
}

ModelToEncoding maps models to encodings.

Functions

func ConvertToMergeableBPERanks

func ConvertToMergeableBPERanks(bpe io.Reader) (map[string]uint, error)

ConvertToMergeableBPERanks converts the BPE file to mergeable BPE ranks.

func CovertVocabBPEAndEncoderJSONToMergeableBPERanks

func CovertVocabBPEAndEncoderJSONToMergeableBPERanks(vocabBPE io.Reader, encoderJSON io.Reader) (map[string]uint, error)

CovertVocabBPEAndEncoderJSONToMergeableBPERanks converts the vocabulary BPE and encoder JSON to mergeable BPE ranks.

Types

type Codec

type Codec struct {
	Name           string          `json:"name"`
	ExplicitNVocab int             `json:"explicit_n_vocab"`
	PatStr         string          `json:"pat_str"`
	MergeableRanks map[string]uint `json:"mergeable_ranks"`
	SpecialTokens  map[string]uint `json:"special_tokens"`
}

Codec represents a token encoding codec.

func NewCL100kBase

func NewCL100kBase() (*Codec, error)

NewCL100kBase creates a new Codec instance for the cl100k_base tokenization scheme. It loads the mergeable ranks from the embedded cl100kBase resource. The function returns a pointer to the Codec or an error if any.

func NewClaude (added in v0.0.5)

func NewClaude() (*Codec, error)

NewClaude creates a new Codec instance for the claude tokenization scheme. It loads the mergeable ranks from the embedded claude resource. The function returns a pointer to the Codec or an error if any.

func NewGPT2

func NewGPT2() (*Codec, error)

NewGPT2 creates a new Codec instance for the GPT-2 tokenization scheme. It loads the mergeable ranks from the embedded gpt2Vocab and gpt2Encode resources. The function returns a pointer to the Codec or an error if any.

func NewO200KBase (added in v0.0.7)

func NewO200KBase() (*Codec, error)

NewO200KBase creates a new Codec instance for the o200k_base tokenization scheme. It loads the mergeable ranks from the embedded o200kBase resource. The function returns a pointer to the Codec or an error if any.

func NewP50kBase

func NewP50kBase() (*Codec, error)

NewP50kBase creates a new Codec instance for the p50k_base tokenization scheme. It loads the mergeable ranks from the embedded p50kBase resource. The function returns a pointer to the Codec or an error if any.

func NewP50kEdit

func NewP50kEdit() (*Codec, error)

NewP50kEdit creates a new Codec instance for the p50k_edit tokenization scheme. It loads the mergeable ranks from the embedded p50kBase resource. The function returns a pointer to the Codec or an error if any.

func NewR50kBase

func NewR50kBase() (*Codec, error)

NewR50kBase creates a new Codec instance for the r50k_base tokenization scheme. It loads the mergeable ranks from the embedded r50kBase resource. The function returns a pointer to the Codec or an error if any.

type Encoding

type Encoding struct {
	// contains filtered or unexported fields
}

Encoding represents a text encoding scheme.

func NewEncoding

func NewEncoding(codec *Codec) (*Encoding, error)

NewEncoding creates a new Encoding instance based on the provided Codec.

func NewEncodingByName

func NewEncodingByName(encoding string) (*Encoding, error)

NewEncodingByName creates a new Encoding instance based on the given encoding name.

func NewEncodingForModel

func NewEncodingForModel(model string) (*Encoding, error)

NewEncodingForModel returns a new Encoding based on the given model. It checks the ModelToEncoding map and ModelPrefixToEncoding map to find a matching encoding.
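The two-map lookup can be pictured with the sketch below. It uses small excerpts of ModelToEncoding and ModelPrefixToEncoding and assumes an exact match is tried first, followed by the longest matching prefix. That ordering and the helper name encodingNameForModel are assumptions made for illustration. The documentation only states that both maps are consulted.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Excerpts of the documented maps, copied here so the sketch is self-contained.
var modelToEncoding = map[string]string{
	"gpt-4o":           "o200k_base",
	"gpt-4":            "cl100k_base",
	"gpt-3.5-turbo":    "cl100k_base",
	"text-davinci-003": "p50k_base",
}

var modelPrefixToEncoding = map[string]string{
	"gpt-4o-":        "o200k_base",
	"gpt-4-":         "cl100k_base",
	"gpt-3.5-turbo-": "cl100k_base",
}

// encodingNameForModel is a hypothetical helper: it tries an exact match,
// then the longest prefix that matches the model name.
func encodingNameForModel(model string) (string, error) {
	if name, ok := modelToEncoding[model]; ok {
		return name, nil
	}
	// Go map iteration order is random, so sort prefixes by length
	// (longest first) to make the result deterministic.
	prefixes := make([]string, 0, len(modelPrefixToEncoding))
	for p := range modelPrefixToEncoding {
		prefixes = append(prefixes, p)
	}
	sort.Slice(prefixes, func(i, j int) bool {
		return len(prefixes[i]) > len(prefixes[j])
	})
	for _, p := range prefixes {
		if strings.HasPrefix(model, p) {
			return modelPrefixToEncoding[p], nil
		}
	}
	return "", fmt.Errorf("no encoding known for model %q", model)
}

func main() {
	for _, m := range []string{"gpt-4", "gpt-4o-mini", "gpt-4-0613", "unknown-model"} {
		name, err := encodingNameForModel(m)
		if err != nil {
			fmt.Println(m, "->", err)
			continue
		}
		fmt.Println(m, "->", name)
	}
}
```

Sorting the prefixes matters only when two prefixes that match the same model map to different encodings. It also keeps the result independent of map iteration order.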

func (*Encoding) Decode

func (enc *Encoding) Decode(tokens []uint) []byte

Decode decodes the given tokens using the Encoding's core BPE.
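Conceptually, BPE decoding inverts the rank table and concatenates the bytes of each token. The sketch below shows that idea on a tiny made-up vocabulary. The helper decodeTokens is hypothetical and, unlike the documented Decode method, returns an error for unknown IDs. How the package itself handles unknown IDs is not stated here.

```go
package main

import "fmt"

// decodeTokens is a hypothetical helper, not part of the package.
// It maps each token ID back to its bytes and concatenates them.
func decodeTokens(tokens []uint, ranks map[string]uint) ([]byte, error) {
	// Invert the token -> rank table into rank -> token.
	byRank := make(map[uint]string, len(ranks))
	for token, rank := range ranks {
		byRank[rank] = token
	}
	var out []byte
	for _, id := range tokens {
		token, ok := byRank[id]
		if !ok {
			return nil, fmt.Errorf("unknown token id %d", id)
		}
		out = append(out, token...)
	}
	return out, nil
}

func main() {
	// Made-up vocabulary for illustration only.
	ranks := map[string]uint{"Hello": 0, " World": 1}
	text, err := decodeTokens([]uint{0, 1}, ranks)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(text))
}
```

Returning []byte rather than string is consistent with the documented signature: a single token may end in the middle of a multi-byte UTF-8 sequence, so a partial decode is not always valid text.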

func (*Encoding) Encode

func (enc *Encoding) Encode(text string, allowedSpecial, disallowedSpecial []string) ([]uint, []string, error)

Encode encodes the given text with the specified allowed and disallowed special tokens.
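The documentation does not spell out how allowedSpecial and disallowedSpecial interact, so the following is only a sketch modeled loosely on the reference Python tiktoken behavior: a special token found in the text causes an error when it is disallowed and not explicitly allowed, and the sentinel list {"all"} (compare AllSpecial) stands for every special token. The helper names and the treatment of nil lists are assumptions, not confirmed behavior of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// allSpecial mirrors the documented AllSpecial sentinel.
var allSpecial = []string{"all"}

// specialTokens is a small excerpt of the documented special-token constants.
var specialTokens = []string{"<|endoftext|>", "<|fim_prefix|>"}

// isAll reports whether list is the "all" sentinel.
func isAll(list []string) bool {
	return len(list) == 1 && list[0] == "all"
}

// contains reports whether list holds s.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// checkSpecial is a hypothetical pre-encoding check. It returns an error if
// text contains a special token that is disallowed and not allowed. In this
// sketch, a nil list matches nothing.
func checkSpecial(text string, allowed, disallowed []string) error {
	for _, tok := range specialTokens {
		if !strings.Contains(text, tok) {
			continue
		}
		if isAll(allowed) || contains(allowed, tok) {
			continue // explicitly allowed
		}
		if isAll(disallowed) || contains(disallowed, tok) {
			return fmt.Errorf("text contains disallowed special token %q", tok)
		}
	}
	return nil
}

func main() {
	text := "Hello<|endoftext|>"
	fmt.Println(checkSpecial(text, nil, allSpecial))
	fmt.Println(checkSpecial(text, []string{"<|endoftext|>"}, allSpecial))
}
```

In this sketch an allowed token takes precedence over a disallowed one. Whether the package resolves that conflict the same way should be checked against its source.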

func (*Encoding) EncodeOrdinary

func (enc *Encoding) EncodeOrdinary(text string) ([]uint, []string)

EncodeOrdinary encodes the given text using the Encoding's core BPE.
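For readers unfamiliar with the core BPE step, here is a simplified sketch of the merge loop: a piece of text starts as single bytes, and the adjacent pair whose concatenation has the lowest rank is merged repeatedly until no mergeable pair remains. The vocabulary is made up and the function bpeMerge is hypothetical. A real encoder first splits the text with the codec's regular expression (PatStr) and uses a faster merge strategy than this quadratic loop.

```go
package main

import "fmt"

// bpeMerge is a hypothetical, simplified byte pair merge. It splits piece
// into single bytes, then repeatedly merges the adjacent pair whose
// concatenation has the lowest rank in ranks.
func bpeMerge(piece string, ranks map[string]uint) []string {
	parts := make([]string, 0, len(piece))
	for i := 0; i < len(piece); i++ {
		parts = append(parts, piece[i:i+1])
	}
	for len(parts) > 1 {
		best := -1
		var bestRank uint
		for i := 0; i+1 < len(parts); i++ {
			rank, ok := ranks[parts[i]+parts[i+1]]
			if ok && (best == -1 || rank < bestRank) {
				best, bestRank = i, rank
			}
		}
		if best == -1 {
			break // no adjacent pair is in the vocabulary
		}
		// Replace the pair at best and best+1 with the merged token.
		parts[best] = parts[best] + parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
	}
	return parts
}

func main() {
	// Made-up vocabulary for illustration only.
	ranks := map[string]uint{"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
	for _, part := range bpeMerge("abcab", ranks) {
		fmt.Printf("%q -> %d\n", part, ranks[part])
	}
}
```

With this vocabulary, "abcab" first merges both "ab" pairs (rank 3) and then merges "ab" and "c" into "abc" (rank 4), leaving the two tokens "abc" and "ab".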

func (*Encoding) Name

func (enc *Encoding) Name() string

Name returns the name of the Encoding.

Directories

Path Synopsis
_examples
