Documentation ¶
Overview ¶
Package internal contains non-exported implementation details of gotoken.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GPT2Splitter ¶
GPT2Splitter implements the splitter function used by r50k_base and p50k_base to split text before byte-pair encoding. It is located in the internal package since it is shared between multiple encodings.
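As a sketch of how the splitter fits into encoding (the signature here is an assumption, inferred from the shape of the BPEParams.Splitter field below, not taken from this documentation):

// Assumed signature: func GPT2Splitter([]byte) [][]byte, matching BPEParams.Splitter.
parts := GPT2Splitter([]byte("Bonjour tout le monde"))
for _, part := range parts {
	_ = part // each part is byte-pair encoded independently; merges never cross a split
}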
func TrieLookup ¶
TrieLookup returns the index of the given input in a serialized trie. It returns -1 if the input is not present, or the input's token number otherwise. It is exported for use in gen.go.
Types ¶
type BPEParams ¶
type BPEParams struct {
	Name           string
	Splitter       func([]byte) [][]byte
	ByteEncoder    []byte         // token values for each byte 0-255
	EncoderTrie    serializedTrie // pseudo-map[string]int for strings->tokens
	DecoderMap     []string       // strings for each token int
	SpecialTokens  map[string]int // map of all defined special tokens
	BytePairLookup []int          // lookup table for byte pairs, 256*256 entries
}
BPEParams contains the parameters defining the encoding used by a BPETokenizer. These are the data structures generated by gen.go.
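The fields are easiest to read against the "Bonjour" walkthrough under BPETokenizer below. The following sketch (a hypothetical helper, not part of the package) shows how each field is consulted, assuming params holds the generated r50k_base data:

// readFields is a hypothetical illustration, not part of the package.
func readFields(params *BPEParams) {
	b := params.ByteEncoder['B'] // 33 in r50k_base: the token value for the byte 'B'
	s := params.DecoderMap[33]   // "B": the string spelled by token 33
	parts := params.Splitter([]byte("Bonjour tout le monde")) // pre-BPE split
	fmt.Println(params.Name, b, s, len(parts), len(params.SpecialTokens))
}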
type BPETokenizer ¶
type BPETokenizer struct {
// contains filtered or unexported fields
}
BPETokenizer implements the gotoken.Tokenizer interface. All tokenizers returned by gotoken.GetTokenizer are BPETokenizers.
Internally, BPETokenizer uses a combination of a splitting function and a BPE dictionary. The splitting function divides the text to tokenize into semantic parts based on character classes (typically keeping similar character classes together and breaking on whitespace or punctuation in some consistent way). The BPE dictionary then encodes frequently encountered sequences of bytes into their own tokens.
To create the canonical tokenization of a given sequence of bytes, the following steps are performed:
- Convert the source bytes into their corresponding tokens.
- Determine which pairs of adjacent tokens are encodable as a higher-numbered token.
- Of those, choose the lowest-ranked pair (the one with the lowest resulting token number), and combine those two tokens into a new token.
- Repeat this process until no more pairs can be combined.
Working this out for "Bonjour" in r50k_base results in the following steps:
- "Bonjour" == []byte{66, 111, 110, 106, 111, 117, 114}
- Equivalent tokens == []int{33, 78, 77, 73, 78, 84, 81}
- {33, 78, 77, 73, 78, 84, 81} == {"B", "o", "n", "j", "o", "u", "r"}
- {33, 261, 73, 78, 84, 81} == {"B", "on", "j", "o", "u", "r"}
- {33, 261, 73, 280, 81} == {"B", "on", "j", "ou", "r"}
- {33, 261, 73, 454} == {"B", "on", "j", "our"}
- {20682, 73, 454} == {"Bon", "j", "our"}
Combining tokens in a different order (for example, strictly left-to-right) could result in a different tokenization of the same word. For example, a valid tokenization of "Bonjour" is:
- {20682, 7639, 333} == {"Bon", "jo", "ur"}
Calling Decode() on any of the above token arrays returns "Bonjour". However, non-canonical tokenizations may not be the same length as the ones generated by OpenAI's APIs.
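The canonical procedure above can be sketched independently of the package's data structures. In the following simplified illustration (not the package's implementation), rank stands in for the BPE dictionary: it returns the token number produced by merging two adjacent tokens, or -1 if that pair has no merge.

// mergeLowest repeatedly combines the adjacent pair whose merged token has the
// lowest number, mirroring the steps listed above.
func mergeLowest(tokens []int, rank func(a, b int) int) []int {
	for {
		best, bestAt := -1, -1
		for i := 0; i+1 < len(tokens); i++ {
			if r := rank(tokens[i], tokens[i+1]); r >= 0 && (best < 0 || r < best) {
				best, bestAt = r, i
			}
		}
		if bestAt < 0 {
			return tokens // no adjacent pair can be combined any further
		}
		// Replace the winning pair with its merged token and repeat.
		merged := append([]int{}, tokens[:bestAt]...)
		merged = append(merged, best)
		tokens = append(merged, tokens[bestAt+2:]...)
	}
}

Run on the "Bonjour" tokens above with an r50k_base-style rank, this produces the same intermediate states as the walkthrough, ending at {20682, 73, 454}.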
func NewBPETokenizer ¶
func NewBPETokenizer(params *BPEParams, allowSpecialAsText bool, allowedSpecialTokens []string) (*BPETokenizer, error)
NewBPETokenizer creates a new BPETokenizer from the given BPEParams and using the specified special token encoding settings.
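A hedged usage sketch, assuming params holds data generated by gen.go and that "<|endoftext|>" is a special token defined by that encoding:

// params and the special token name are illustrative assumptions.
tok, err := NewBPETokenizer(params, false, []string{"<|endoftext|>"})
if err != nil {
	log.Fatal(err)
}
_ = tok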
func (*BPETokenizer) Allowed ¶
func (tt *BPETokenizer) Allowed(input string) error
Allowed performs the special token safety check on an input string according to the configuration of this Tokenizer. An error wrapping gotoken.ErrSpecialToken is returned if the input contains a special token defined by this encoding that has not been explicitly allowed.
If a Tokenizer instance was created with the gotoken.AllowSpecialAsText option, this method always returns nil.
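For example, a sketch assuming tok was created without allowing "<|endoftext|>" and without gotoken.AllowSpecialAsText:

if err := tok.Allowed("before <|endoftext|> after"); errors.Is(err, gotoken.ErrSpecialToken) {
	// The input contains a special token that was not explicitly allowed.
}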
func (*BPETokenizer) Count ¶
func (tt *BPETokenizer) Count(input string) int
Count returns the number of tokens in an input string, without returning the actual tokens. It returns 0 if the input string is empty, or if the input cannot be encoded.
func (*BPETokenizer) Decode ¶
func (tt *BPETokenizer) Decode(tokens []int) (string, error)
Decode converts a slice of ints (tokens) into a string. It may return an error that wraps gotoken.ErrInvalidToken if any of the provided tokens are not valid in this encoding.
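Continuing the "Bonjour" example, even the non-canonical tokenization decodes to the original string (a sketch, assuming tok is an r50k_base tokenizer):

s, err := tok.Decode([]int{20682, 7639, 333}) // {"Bon", "jo", "ur"}
if err != nil {
	// An error wrapping gotoken.ErrInvalidToken would mean a token outside the encoding.
}
fmt.Println(s) // Bonjour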
func (*BPETokenizer) Encode ¶
func (tt *BPETokenizer) Encode(s string) ([]int, error)
Encode converts a string into a slice of ints (tokens). A wrapped gotoken.ErrSpecialToken error will be returned if a special token appears in the input without being explicitly allowed.
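A sketch based on the walkthrough above, again assuming tok is an r50k_base tokenizer created without allowing "<|endoftext|>":

tokens, _ := tok.Encode("Bonjour")
// tokens == []int{20682, 73, 454}, the canonical tokenization shown above.

if _, err := tok.Encode("before <|endoftext|> after"); errors.Is(err, gotoken.ErrSpecialToken) {
	// The disallowed special token in the input caused the error.
}
_ = tokens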
type TestPairReader ¶
type TestPairReader struct {
// contains filtered or unexported fields
}
TestPairReader reads a pair of text files that contain tokenization test cases. The lines of the two files are read one at a time.
func NewTestPairReader ¶
func NewTestPairReader(inputFile, expectedFile string) (*TestPairReader, error)
NewTestPairReader returns a new TestPairReader that reads from the given input and expected files. The parameter inputFile must point to a text file that contains one test case per line, and expectedFile should contain one JSON-encoded array of integers per line.
func (*TestPairReader) CaseName ¶
func (tpr *TestPairReader) CaseName() string
CaseName returns "{inputFilename}#{line}" to identify the current test case.
func (*TestPairReader) Close ¶
func (fp *TestPairReader) Close()
Close closes the input and expected files. It is safe to call even on a closed reader.
func (*TestPairReader) Line ¶
func (tpr *TestPairReader) Line() int
Line returns the 1-based line number of the lines most recently read from the input and expected files.
func (*TestPairReader) Next ¶
func (tpr *TestPairReader) Next() (*TestPair, error)
Next returns the next test case from the input and expected files. If the end of the files has been reached, Next() returns (nil, nil) indefinitely.
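A typical read loop might look like the following sketch (the file names are illustrative, and handling of each returned *TestPair is elided):

tpr, err := NewTestPairReader("cases.txt", "cases_tokens.txt")
if err != nil {
	log.Fatal(err)
}
defer tpr.Close()
for {
	pair, err := tpr.Next()
	if err != nil {
		log.Fatalf("%s: %v", tpr.CaseName(), err)
	}
	if pair == nil {
		break // end of both files reached
	}
	_ = pair // compare the tokenizer's output against the expected tokens here
}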