package sentencepiece

v0.13.3
Published: Dec 20, 2024 · License: Apache-2.0 · Imports: 10 · Imported by: 0

README

go-sentencepiece


This is a pure Go implementation of encoding and decoding text with the SentencePiece tokenizer.

"Encoding" is the operation used to split text into tokens, using a trained tokenizer model. "Decoding" is the reverse process - converting a list of tokens into the original text.

SentencePiece is a general family of tokenizers that is configured by a protobuf configuration file. This repository currently focuses on implementing just the functionality required to reproduce the tokenization of Gemma models (the same tokenizer is used for Google's proprietary Gemini family of models). Specifically, it only implements BPE tokenization since this is what Gemma uses.
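For example, here is a minimal sketch of a round trip through the tokenizer; the configuration file path is a placeholder (see the Tokenizer configuration section below for how to obtain the real file):

package main

import (
	"fmt"
	"log"

	"github.com/eliben/go-sentencepiece"
)

func main() {
	// Assumed path to the Gemma tokenizer configuration protobuf; see
	// the "Tokenizer configuration" section for how to obtain it.
	proc, err := sentencepiece.NewProcessorFromPath("/path/to/tokenizer.model")
	if err != nil {
		log.Fatal(err)
	}

	// Encode splits text into tokens; DecodeTokens reverses the process.
	tokens := proc.Encode("hello world")
	fmt.Println(proc.DecodeTokens(tokens)) // hello world
}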

Current status

This package should be ready to use for encoding text into tokens with the Gemma tokenizer; it has been reasonably optimized and extensively tested against the SentencePiece Python bindings (see system_test.go in this repository).

If you find any problems or discrepancies, please open an issue.

Tokenizer configuration

The configuration file for the tokenizer is a protobuf (structured data, serialized in the protocol buffer format) that describes a trained tokenizer model; it includes the complete learned vocabulary used for tokenization, as well as other configuration information.

This file is not part of this repository; please fetch it from the official Gemma implementation repository. The NewProcessor* constructors expect to read this file.
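For instance, a sketch of loading the configuration and creating a processor, assuming the MODELPATH env var points at a local copy of the file (the same convention the tests use, as described under Developing):

package main

import (
	"log"
	"os"

	"github.com/eliben/go-sentencepiece"
)

func main() {
	// MODELPATH is assumed to hold the path to a local copy of the
	// tokenizer configuration file, fetched as described above.
	f, err := os.Open(os.Getenv("MODELPATH"))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// NewProcessor reads the protobuf data from any io.Reader.
	proc, err := sentencepiece.NewProcessor(f)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("vocabulary size:", proc.VocabularySize())
}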

Developing

The tokenizer is configured by a protobuf. Its structure is described by the internal/model/sentencepiece_model.proto file, which is vendored from https://github.com/google/sentencepiece.

To re-generate the *.pb.go file from it:

$ cd internal/model
$ ./gen.sh

The configuration protobuf itself is obtained as described in the Tokenizer configuration section. All tests require the MODELPATH env var to point to a local copy of the tokenizer configuration file.
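For example (the model path shown is a placeholder):

$ MODELPATH=/path/to/tokenizer.model go test ./...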

Online demo

To see an in-browser demo of this tokenizer in action, visit https://eliben.github.io/go-sentencepiece/

The Go code is compiled to WebAssembly and loaded by a small JS program that allows interactive encoding of text.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

Processor represents a SentencePiece processor (tokenizer). A Processor converts input text into the sequence of tokens that LLMs use, and back. The mapping between token IDs and the text they represent is read from the model proto (provided to the constructor); it is the same across all calls to the Encode method.

The term "processor" comes from the original C++ SentencePiece library and its Python bindings.

func NewProcessor

func NewProcessor(protoReader io.Reader) (*Processor, error)

NewProcessor creates a new Processor from a reader with the protobuf data.

func NewProcessorFromPath

func NewProcessorFromPath(protoFile string) (*Processor, error)

NewProcessorFromPath creates a new Processor from a file path to the protobuf data.

func (*Processor) Decode

func (proc *Processor) Decode(ids []int) string

Decode translates a list of IDs produced by [Encode] back into the string it represents.

func (*Processor) DecodeTokens

func (proc *Processor) DecodeTokens(tokens []Token) string

DecodeTokens is a convenience wrapper around [Decode], accepting a list of tokens as returned by [Encode]. It only uses the ID fields of tokens to decode the text.

func (*Processor) Encode

func (proc *Processor) Encode(text string) []Token

Encode tokenizes the input text and returns a list of Tokens.
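A minimal sketch of inspecting the result, assuming proc was created with one of the constructors above:

tokens := proc.Encode("Hello, world!")
for _, t := range tokens {
	// Each Token carries the model's numeric ID and the text piece it covers.
	fmt.Printf("%d\t%q\n", t.ID, t.Text)
}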

func (*Processor) VocabularySize

func (proc *Processor) VocabularySize() int

VocabularySize returns the vocabulary size from the proto model.

type Token

type Token struct {
	ID   int
	Text string
}

Token represents a single token from the input text. ID is a unique token identifier that the model uses in its internal representation. Text is the piece of text this token represents.

func (Token) String

func (t Token) String() string

Directories

Path	Synopsis
internal/priorityqueue	Package priorityqueue provides a generic priority queue with Insert and PopMax operations.
