tokenizer

v1.1.2 (latest) Published: Feb 7, 2023 License: Apache-2.0 Imports: 15 Imported by: 3

Repository

github.com/cohere-ai/tokenizer

README ¶

Tokenizers

Cohere's tokenizers library provides an interface to encode and decode text given a computed vocabulary, and includes pre-computed tokenizers that are used to train Cohere's models.

We also plan to eventually open-source the tools for creating new tokenizers.

Example using Go

Choose a tokenizer from the vocab folder; each one consists of an encoder.json file and a vocab.bpe file. Then create an encoder as shown below. This example uses the coheretext-50k tokenizer.

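import (
  ...
  "log"

  "github.com/cohere-ai/tokenizer"
)

// NewFromPrebuilt returns (*Encoder, error), so check the error before use.
encoder, err := tokenizer.NewFromPrebuilt("coheretext-50k")
if err != nil {
  log.Fatal(err)
}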

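To encode a string of text, use the Encode method. Encode returns the token IDs as a slice of int64s, plus a slice of strings as its second return value.

// The second return value ([]string) is not needed here, so it is discarded.
encoded, _ := encoder.Encode("this is a string to be encoded")
fmt.Printf("%v", encoded)
// [6372 329 258 3852 288 345 37754]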

To decode a slice of int64s, use the Decode method. Decode returns a string.

// Println avoids passing decoded text to Printf as a format string.
fmt.Println(encoder.Decode(encoded))
// this is a string to be encoded

Speed

Using a 2.5GHz CPU, encoding 1000 tokens takes approximately 6.5 milliseconds, and decoding 1000 tokens takes approximately 0.2 milliseconds.
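
For a rough sense of scale, the timings above can be converted to throughput. The snippet below is only arithmetic on the quoted figures; tokensPerSecond is a hypothetical helper and not part of the package. It works out to roughly 150,000 tokens per second for encoding and about 5 million tokens per second for decoding.

```go
package main

import "fmt"

// tokensPerSecond is a hypothetical helper (not part of the tokenizer package)
// that converts "n tokens in m milliseconds" into tokens per second.
func tokensPerSecond(tokens int, millis float64) float64 {
	return float64(tokens) / (millis / 1000.0)
}

func main() {
	// Figures taken from the Speed section: 1000 tokens in 6.5 ms and 0.2 ms.
	fmt.Printf("encode: ~%.0f tokens/s\n", tokensPerSecond(1000, 6.5))
	fmt.Printf("decode: ~%.0f tokens/s\n", tokensPerSecond(1000, 0.2))
}
```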

Documentation ¶

Index ¶

func CountReader(reader io.Reader) (map[string]int64, error)
func CountString(s string) map[string]int64
func MergeCounts(a map[string]int64, b map[string]int64)
func WordSplit(s string) []string
type Encoder
type Merge
- func BPE(freq map[string]int64, numSymbols, minFrequency int64) (map[string]int64, []*Merge, error)
type WordCount

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CountReader ¶ added in v1.0.0

func CountReader(reader io.Reader) (map[string]int64, error)

func CountString ¶ added in v1.0.0

func CountString(s string) map[string]int64

func MergeCounts ¶ added in v1.0.0

func MergeCounts(a map[string]int64, b map[string]int64)

func WordSplit ¶ added in v1.0.2

func WordSplit(s string) []string
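
These four functions are documented only by their signatures, which suggest a word-frequency pipeline that feeds BPE training. The sketch below shows what such a pipeline might look like using the standard library alone. The lower-case names are stand-ins, the whitespace splitting via strings.Fields is an assumption (WordSplit may well split differently), and the in-place merge semantics are guessed from MergeCounts having no return value.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countString is a stand-in for CountString: it counts whitespace-separated
// words. The real function may split text differently.
func countString(s string) map[string]int64 {
	counts := make(map[string]int64)
	for _, w := range strings.Fields(s) {
		counts[w]++
	}
	return counts
}

// mergeCounts is a stand-in for MergeCounts: it adds every count in src to dst.
// Mutating the first argument is an assumption based on the signature.
func mergeCounts(dst, src map[string]int64) {
	for w, c := range src {
		dst[w] += c
	}
}

// countReader is a stand-in for CountReader: it counts words line by line.
func countReader(r io.Reader) (map[string]int64, error) {
	total := make(map[string]int64)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		mergeCounts(total, countString(sc.Text()))
	}
	return total, sc.Err()
}

func main() {
	counts, err := countReader(strings.NewReader("the cat sat\non the mat\n"))
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(counts["the"], counts["cat"]) // 2 1
}
```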

Types ¶

type Encoder ¶

type Encoder struct {
	Encoder   map[string]int64
	Decoder   map[int64]string
	BPERanks  map[[2]string]int64
	Cache     map[string]string
	VocabSize int64
}
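
The field types hint at a conventional byte-pair-encoding layout: BPERanks orders symbol pairs by merge priority, Encoder maps each resulting piece to an ID, and Decoder maps IDs back. The sketch below illustrates how tables of this shape are typically used. It is an assumption and not the package's implementation; applyRanks and the tiny tables are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// applyRanks is a hypothetical helper: it splits a word into characters and
// repeatedly merges the adjacent pair with the lowest rank until no ranked
// pair remains.
func applyRanks(word string, ranks map[[2]string]int64) []string {
	pieces := strings.Split(word, "")
	for len(pieces) > 1 {
		bestIdx := -1
		var bestRank int64
		for i := 0; i+1 < len(pieces); i++ {
			r, ok := ranks[[2]string{pieces[i], pieces[i+1]}]
			if ok && (bestIdx == -1 || r < bestRank) {
				bestIdx, bestRank = i, r
			}
		}
		if bestIdx == -1 {
			break // no adjacent pair has a rank
		}
		merged := pieces[bestIdx] + pieces[bestIdx+1]
		tail := append([]string{merged}, pieces[bestIdx+2:]...)
		pieces = append(pieces[:bestIdx], tail...)
	}
	return pieces
}

func main() {
	// Toy tables shaped like the BPERanks and Encoder fields.
	ranks := map[[2]string]int64{{"l", "o"}: 0, {"lo", "w"}: 1, {"e", "r"}: 2}
	vocab := map[string]int64{"low": 0, "er": 1}

	pieces := applyRanks("lower", ranks)
	ids := make([]int64, 0, len(pieces))
	for _, p := range pieces {
		ids = append(ids, vocab[p])
	}
	fmt.Println(pieces, ids) // [low er] [0 1]
}
```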

func New ¶

func New(encoder map[string]int64, bpeMerges [][2]string) (*Encoder, error)

func NewFromPrebuilt ¶

func NewFromPrebuilt(name string) (*Encoder, error)

func NewFromReaders ¶

func NewFromReaders(encoderReader, vocabReader io.Reader) (*Encoder, error)

func (*Encoder) Decode ¶

func (e *Encoder) Decode(tokens []int64) string

func (*Encoder) Encode ¶

func (e *Encoder) Encode(text string) ([]int64, []string)

func (*Encoder) EncodeWords ¶ added in v1.0.4

func (e *Encoder) EncodeWords(words []string) ([]int64, []string)

type Merge ¶ added in v1.0.0

type Merge struct {
	Merge [2]string
	Count int64
}

func BPE ¶ added in v1.0.0

func BPE(freq map[string]int64, numSymbols, minFrequency int64) (map[string]int64, []*Merge, error)
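
BPE's behavior is not described beyond its signature. For readers unfamiliar with the technique, here is a minimal sketch of classic byte-pair-encoding training over a word-frequency map. It is not the package's implementation: learnMerges, its numMerges parameter (a merge count, which may differ from what numSymbols means), and the lexicographic tie-break are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// word is one distinct word split into symbols, with its corpus frequency.
type word struct {
	pieces []string
	count  int64
}

// pairLess orders pairs lexicographically so ties resolve deterministically.
func pairLess(a, b [2]string) bool {
	return a[0] < b[0] || (a[0] == b[0] && a[1] < b[1])
}

// mergePair replaces every adjacent occurrence of pair with the joined symbol.
func mergePair(pieces []string, pair [2]string) []string {
	out := make([]string, 0, len(pieces))
	for i := 0; i < len(pieces); i++ {
		if i+1 < len(pieces) && pieces[i] == pair[0] && pieces[i+1] == pair[1] {
			out = append(out, pair[0]+pair[1])
			i++ // skip the second half of the merged pair
		} else {
			out = append(out, pieces[i])
		}
	}
	return out
}

// learnMerges is a hypothetical helper: it greedily learns up to numMerges
// merges, stopping once the best pair occurs fewer than minFrequency times.
func learnMerges(freq map[string]int64, numMerges int, minFrequency int64) [][2]string {
	words := make([]word, 0, len(freq))
	for w, c := range freq {
		words = append(words, word{pieces: strings.Split(w, ""), count: c})
	}
	var merges [][2]string
	for i := 0; i < numMerges; i++ {
		// Count adjacent symbol pairs, weighted by word frequency.
		pairs := make(map[[2]string]int64)
		for _, w := range words {
			for j := 0; j+1 < len(w.pieces); j++ {
				pairs[[2]string{w.pieces[j], w.pieces[j+1]}] += w.count
			}
		}
		var best [2]string
		var bestCount int64
		for p, c := range pairs {
			if c > bestCount || (c == bestCount && pairLess(p, best)) {
				best, bestCount = p, c
			}
		}
		if bestCount == 0 || bestCount < minFrequency {
			break
		}
		merges = append(merges, best)
		for k := range words {
			words[k].pieces = mergePair(words[k].pieces, best)
		}
	}
	return merges
}

func main() {
	freq := map[string]int64{"low": 5, "lower": 2, "newest": 6, "widest": 3}
	fmt.Println(learnMerges(freq, 3, 2)) // [[e s] [es t] [l o]]
}
```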

type WordCount ¶ added in v1.0.0

type WordCount struct {
	Pieces []string `json:"pieces"`
	Count  int64    `json:"count"`
}
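
The struct tags indicate how a WordCount serializes to JSON. The sketch below uses a local copy of the struct, so the example does not depend on the package. The toJSON helper and the sample values are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wordCount mirrors the field names and JSON tags of WordCount shown above.
// It is a local copy, not the package's type.
type wordCount struct {
	Pieces []string `json:"pieces"`
	Count  int64    `json:"count"`
}

// toJSON is a hypothetical helper that marshals a wordCount to a JSON string.
func toJSON(wc wordCount) (string, error) {
	b, err := json.Marshal(wc)
	return string(b), err
}

func main() {
	s, err := toJSON(wordCount{Pieces: []string{"lo", "w"}, Count: 5})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(s) // {"pieces":["lo","w"],"count":5}
}
```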
