package sentencepiece

v0.13.3
Published: Dec 20, 2024 · License: Apache-2.0 · Imports: 10 · Imported by: 0

README

go-sentencepiece


This is a pure Go implementation of encoding and decoding text with the SentencePiece tokenizer.

"Encoding" is the operation used to split text into tokens, using a trained tokenizer model. "Decoding" is the reverse process - converting a list of tokens into the original text.

SentencePiece is a general family of tokenizers that is configured by a protobuf configuration file. This repository currently focuses on implementing just the functionality required to reproduce the tokenization of Gemma models (the same tokenizer is used for Google's proprietary Gemini family of models). Specifically, it only implements BPE tokenization since this is what Gemma uses.
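For example, here is a minimal sketch of a round trip through the tokenizer; the configuration file path is a placeholder (see the Tokenizer configuration section below for how to obtain the real file):

package main

import (
	"fmt"
	"log"

	"github.com/eliben/go-sentencepiece"
)

func main() {
	// Assumed path to the Gemma tokenizer configuration protobuf; see
	// the "Tokenizer configuration" section for how to obtain it.
	proc, err := sentencepiece.NewProcessorFromPath("/path/to/tokenizer.model")
	if err != nil {
		log.Fatal(err)
	}

	// Encode splits text into tokens; DecodeTokens reverses the process.
	tokens := proc.Encode("hello world")
	fmt.Println(proc.DecodeTokens(tokens)) // hello world
}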

Current status

This package should be ready to use for encoding text into tokens with the Gemma tokenizer; it has been reasonably optimized and extensively tested against the SentencePiece Python bindings (see system_test.go in this repository).

If you find any problems or discrepancies, please open an issue.

Tokenizer configuration

The configuration file for the tokenizer is a protobuf (structured data, serialized in the protocol buffer format) that describes a trained tokenizer model; it includes the complete learned vocabulary used for tokenization, as well as other configuration information.

This file is not part of this repository; please fetch it from the official Gemma implementation repository. The NewProcessor* constructors expect to read this file.
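For instance, a sketch of loading the configuration and creating a processor, assuming the MODELPATH env var points at a local copy of the file (the same convention the tests use, as described under Developing):

package main

import (
	"log"
	"os"

	"github.com/eliben/go-sentencepiece"
)

func main() {
	// MODELPATH is assumed to hold the path to a local copy of the
	// tokenizer configuration file, fetched as described above.
	f, err := os.Open(os.Getenv("MODELPATH"))
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// NewProcessor reads the protobuf data from any io.Reader.
	proc, err := sentencepiece.NewProcessor(f)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("vocabulary size:", proc.VocabularySize())
}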

Developing

The tokenizer is configured by a protobuf. Its structure is described by the internal/model/sentencepiece_model.proto file, which is vendored from https://github.com/google/sentencepiece.

To re-generate the *.pb.go file from it:

$ cd internal/model
$ ./gen.sh

The configuration protobuf itself is obtained as described in the Tokenizer configuration section. All tests require the MODELPATH env var to point to a local copy of the tokenizer configuration file.
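For example (the model path shown is a placeholder):

$ MODELPATH=/path/to/tokenizer.model go test ./...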

Online demo

To see an in-browser demo of this tokenizer in action, visit https://eliben.github.io/go-sentencepiece/

The Go code is compiled to WebAssembly and loaded by a small JS program that allows interactive encoding of text.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

Processor represents a SentencePiece processor (tokenizer). A Processor converts input text into the sequence of tokens that LLMs use, and back. The mapping between token IDs and the text they represent is read from the model proto (provided to the constructor); it is the same across all calls to the Encode method.

The term "processor" comes from the original C++ SentencePiece library and its Python bindings.

func NewProcessor

func NewProcessor(protoReader io.Reader) (*Processor, error)

NewProcessor creates a new Processor from a reader with the protobuf data.

func NewProcessorFromPath

func NewProcessorFromPath(protoFile string) (*Processor, error)

NewProcessorFromPath creates a new Processor from a file path to the protobuf data.

func (*Processor) Decode

func (proc *Processor) Decode(ids []int) string

Decode translates a list of IDs produced by [Encode] back into the string it represents.

func (*Processor) DecodeTokens

func (proc *Processor) DecodeTokens(tokens []Token) string

DecodeTokens is a convenience wrapper around [Decode], accepting a list of tokens as returned by [Encode]. It only uses the ID fields of tokens to decode the text.

func (*Processor) Encode

func (proc *Processor) Encode(text string) []Token

Encode tokenizes the input text and returns a list of Tokens.
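A minimal sketch of inspecting the result, assuming proc was created with one of the constructors above:

tokens := proc.Encode("Hello, world!")
for _, t := range tokens {
	// Each Token carries the model's numeric ID and the text piece it covers.
	fmt.Printf("%d\t%q\n", t.ID, t.Text)
}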

func (*Processor) VocabularySize

func (proc *Processor) VocabularySize() int

VocabularySize returns the vocabulary size from the proto model.

type Token

type Token struct {
	ID   int
	Text string
}

Token represents a single token from the input text. ID is a unique token identifier that the model uses in its internal representation. Text is the piece of text this token represents.

func (Token) String

func (t Token) String() string

Directories

Path	Synopsis
internal/priorityqueue	Package priorityqueue provides a generic priority queue with Insert and PopMax operations.
