package internal

v0.9.1
Published: Apr 19, 2023 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package internal contains non-exported implementation details of gotoken.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GPT2Splitter

func GPT2Splitter(input []byte) [][]byte

GPT2Splitter implements the splitter function used by r50k_base and p50k_base to split text before byte-pair encoding. It is located in the internal package since it is shared between multiple encodings.
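
A minimal sketch of calling the splitter directly. The import path is hypothetical, since an internal package is only importable from within the gotoken module, and the exact split shown in the comment is a plausible result of the GPT-2 pattern rather than verified output:

	import "example.com/gotoken/internal" // hypothetical path

	// Split a sentence into pre-BPE parts. With the GPT-2 pattern, a
	// leading space typically stays attached to the word that follows it.
	parts := internal.GPT2Splitter([]byte("Hello, world!"))
	for _, p := range parts {
		fmt.Printf("%q ", p) // plausibly "Hello" "," " world" "!"
	}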

func TrieLookup

func TrieLookup(trie []uint32, input []byte) int

TrieLookup returns the index of the given input in a serialized trie. It returns the input's token number if present, or -1 otherwise. This is exported for use in gen.go.
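
A hedged sketch from inside the package (or gen.go), where a serialized trie is in scope as a []uint32; the token number in the comment follows the r50k_base worked example below:

	tok := TrieLookup(trie, []byte("our"))
	if tok == -1 {
		// "our" is not a complete token in this encoding
	} else {
		fmt.Println(tok) // 454 in r50k_base, per the worked example below
	}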

Types

type BPEParams

type BPEParams struct {
	Name           string
	Splitter       func([]byte) [][]byte
	ByteEncoder    []byte         // token values for each byte 0-255
	EncoderTrie    serializedTrie // pseudo-map[string]int for strings->tokens
	DecoderMap     []string       // strings for each token int
	SpecialTokens  map[string]int // map of all defined special tokens
	BytePairLookup []int          // lookup table for byte pairs, 256*256 entries
}

BPEParams contains the parameters defining the encoding used by a BPETokenizer. These are the data structures generated by gen.go.
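
For orientation, a sketch of how the fields relate, using r50k_base values from the worked example below; the special-token number is the well-known r50k_base value, but treat all of these as illustrative rather than verified against the generated data:

	params.ByteEncoder['B']               // 33: the single-byte token for 'B' (0x42)
	params.DecoderMap[261]                // "on": token 261 decodes back to "on"
	params.SpecialTokens["<|endoftext|>"] // 50256 in r50k_base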

type BPETokenizer

type BPETokenizer struct {
	// contains filtered or unexported fields
}

BPETokenizer implements the gotoken.Tokenizer interface. All tokenizers returned by gotoken.GetTokenizer are BPETokenizers.

Internally, BPETokenizer uses a combination of a splitting function and a BPE dictionary. The splitting function divides the text to tokenize into semantic parts based on character classes (typically keeping similar character classes together and breaking on white space or punctuation in some consistent way). The BPE dictionary then encodes frequently encountered sequences of bytes into their own tokens.

To create the canonical tokenization of a given sequence of bytes, the following steps are performed:

  • Convert the source bytes into their corresponding tokens.
  • Determine which pairs of adjacent tokens are encodable as a higher-numbered token.
  • Of those, choose the lowest-ranked pair (the one with the lowest resulting token number), and combine those two tokens into a new token.
  • Repeat this process until no more pairs can be combined.
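
A hedged sketch of that loop; rank stands in for the real pair lookup (presumably BytePairLookup plus the encoder trie) and is not part of the package's API:

	// tokenize repeatedly merges the encodable adjacent pair with the
	// lowest resulting token number until no pair can be combined.
	// rank returns the combined token for a pair, or -1 if not encodable.
	func tokenize(tokens []int, rank func(a, b int) int) []int {
		for {
			best, at := -1, -1
			for i := 0; i+1 < len(tokens); i++ {
				if r := rank(tokens[i], tokens[i+1]); r >= 0 && (best == -1 || r < best) {
					best, at = r, i
				}
			}
			if at == -1 {
				return tokens // no more pairs can be combined
			}
			tokens = append(tokens[:at+1], tokens[at+2:]...) // drop the right half of the pair
			tokens[at] = best                                // replace the left half with the merged token
		}
	}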

Working this out for "Bonjour" in r50k_base results in the following steps:

  • "Bonjour" == []byte{66, 111, 110, 106, 111, 117, 114}
  • Equivalent tokens == []int{33, 78, 77, 73, 78, 84, 81}
  • {33, 78, 77, 73, 78, 84, 81} == {"B", "o", "n", "j", "o", "u", "r"}
  • {33, 261, 73, 78, 84, 81} == {"B", "on", "j", "o", "u", "r"}
  • {33, 261, 73, 280, 81} == {"B", "on", "j", "ou", "r"}
  • {33, 261, 73, 454} == {"B", "on", "j", "our"}
  • {20682, 73, 454} == {"Bon", "j", "our"}

Combining tokens in a different order (for example, strictly left-to-right) could result in a different tokenization of the same word. For example, a valid tokenization of "Bonjour" is:

  • {20682, 7639, 333} == {"Bon", "jo", "ur"}

Calling Decode() on any of the above token arrays returns "Bonjour". However, non-canonical tokenizations may not be the same length as the ones generated by OpenAI's APIs.
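
Both tokenizations above round-trip through Decode; tok here is assumed to be a BPETokenizer built from the generated r50k_base parameters:

	s1, _ := tok.Decode([]int{20682, 73, 454})   // canonical
	s2, _ := tok.Decode([]int{20682, 7639, 333}) // non-canonical
	fmt.Println(s1, s1 == s2) // Bonjour true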

func NewBPETokenizer

func NewBPETokenizer(params *BPEParams, allowSpecialAsText bool, allowedSpecialTokens []string) (*BPETokenizer, error)

NewBPETokenizer creates a new BPETokenizer from the given BPEParams and using the specified special token encoding settings.
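
A hedged construction sketch; r50kParams stands for the generated r50k_base BPEParams and is a hypothetical name:

	// Reject special tokens in text, but allow "<|endoftext|>" explicitly.
	tok, err := internal.NewBPETokenizer(r50kParams, false, []string{"<|endoftext|>"})
	if err != nil {
		log.Fatal(err)
	}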

func (*BPETokenizer) Allowed

func (tt *BPETokenizer) Allowed(input string) error

Allowed performs the special token safety check on an input string according to the configuration of this Tokenizer. A wrapped gotoken.ErrSpecialToken is returned if the input contains a special token defined by this encoding that has not been explicitly allowed.

If a Tokenizer instance was created with the gotoken.AllowSpecialAsText option, this method always returns nil.
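
For example, with a tokenizer created with no allowed special tokens (r50k_base defines "<|endoftext|>"):

	if err := tok.Allowed("plain text"); err != nil {
		// not reached: no special tokens in the input
	}
	if err := tok.Allowed("done<|endoftext|>"); errors.Is(err, gotoken.ErrSpecialToken) {
		// the input contains a special token that was not explicitly allowed
	}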

func (*BPETokenizer) Count

func (tt *BPETokenizer) Count(input string) int

Count returns the number of tokens in an input string, without returning the actual tokens. It returns 0 if the input string is empty, or if the input cannot be encoded.
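
A sketch, continuing the r50k_base example:

	fmt.Println(tok.Count("Bonjour")) // 3: {20682, 73, 454}, per the example above
	fmt.Println(tok.Count(""))        // 0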

func (*BPETokenizer) Decode

func (tt *BPETokenizer) Decode(tokens []int) (string, error)

Decode converts a slice of ints (tokens) into a string. It may return an error that wraps gotoken.ErrInvalidToken if any of the provided tokens are not valid in this encoding.
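
The error path, assuming r50k_base's vocabulary of roughly 50k tokens:

	if _, err := tok.Decode([]int{20682, 999999}); errors.Is(err, gotoken.ErrInvalidToken) {
		// 999999 is not a valid token in this encoding
	}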

func (*BPETokenizer) Encode

func (tt *BPETokenizer) Encode(s string) ([]int, error)

Encode converts a string into a slice of ints (tokens). A wrapped gotoken.ErrSpecialToken error will be returned if a special token appears in the input without being explicitly allowed.
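
The happy path, continuing the worked example above:

	ids, err := tok.Encode("Bonjour")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(ids) // [20682 73 454] in r50k_base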

type TestPair

type TestPair struct {
	Input    string
	Expected []int
}

TestPair contains the input and expected output for a tokenization test case.

type TestPairReader

type TestPairReader struct {
	// contains filtered or unexported fields
}

TestPairReader reads a pair of text files containing tokenization test cases. Rows are read from the two files one at a time.

func NewTestPairReader

func NewTestPairReader(inputFile, expectedFile string) (*TestPairReader, error)

NewTestPairReader returns a new TestPairReader that reads from the given input and expected files. The parameter inputFile must point to a text file that contains one test case per line, and expectedFile should contain one JSON-encoded array of integers per line.
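
As an illustration, corresponding lines of a hypothetical file pair might hold (token values from the r50k_base example above):

	input.txt:    Bonjour
	expected.txt: [20682, 73, 454]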

func (*TestPairReader) CaseName

func (tpr *TestPairReader) CaseName() string

CaseName returns "{inputFilename}#{line}" to identify the current test case.

func (*TestPairReader) Close

func (fp *TestPairReader) Close()

Close closes the input and expected files. It is safe to call even on a closed reader.

func (*TestPairReader) Line

func (tpr *TestPairReader) Line() int

Line returns the 1-based line number of the lines most recently read from the input and expected files.

func (*TestPairReader) Next

func (tpr *TestPairReader) Next() (*TestPair, error)

Next returns the next test case from the input and expected files. If the end of the files has been reached, Next() returns (nil, nil) indefinitely.
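
A typical read loop; the file names are placeholders and the import path is hypothetical, as above:

	tpr, err := internal.NewTestPairReader("input.txt", "expected.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer tpr.Close()
	for {
		pair, err := tpr.Next()
		if err != nil {
			log.Fatal(err) // malformed line, I/O error, etc.
		}
		if pair == nil {
			break // both files exhausted
		}
		fmt.Printf("%s: %q -> %v\n", tpr.CaseName(), pair.Input, pair.Expected)
	}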
