Documentation ¶
Overview ¶
Package internal contains non-exported implementation details of gotoken.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GPT2Splitter ¶
GPT2Splitter implements the splitter function used by r50k_base and p50k_base to split text before byte-pair encoding. It is located in the internal package since it is shared between multiple encodings.
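As a sketch of how the splitter fits into encoding (the signature here is an assumption, inferred from the shape of the BPEParams.Splitter field below, not taken from this documentation):

// Assumed signature: func GPT2Splitter([]byte) [][]byte, matching BPEParams.Splitter.
parts := GPT2Splitter([]byte("Bonjour tout le monde"))
for _, part := range parts {
	_ = part // each part is byte-pair encoded independently; merges never cross a split
}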
func TrieLookup ¶
TrieLookup returns the index of the given input in a serialized trie. It returns -1 if the input is not present, or the input's token number otherwise. It is exported for use in gen.go.
Types ¶
type BPEParams ¶
type BPEParams struct {
	Name           string
	Splitter       func([]byte) [][]byte
	ByteEncoder    []byte         // token values for each byte 0-255
	EncoderTrie    serializedTrie // pseudo-map[string]int for strings->tokens
	DecoderMap     []string       // strings for each token int
	SpecialTokens  map[string]int // map of all defined special tokens
	BytePairLookup []int          // lookup table for byte pairs, 256*256 entries
}
BPEParams contains the parameters defining the encoding used by a BPETokenizer. These are the data structures generated by gen.go.
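The fields are easiest to read against the "Bonjour" walkthrough under BPETokenizer below. The following sketch (a hypothetical helper, not part of the package) shows how each field is consulted, assuming params holds the generated r50k_base data:

// readFields is a hypothetical illustration, not part of the package.
func readFields(params *BPEParams) {
	b := params.ByteEncoder['B'] // 33 in r50k_base: the token value for the byte 'B'
	s := params.DecoderMap[33]   // "B": the string spelled by token 33
	parts := params.Splitter([]byte("Bonjour tout le monde")) // pre-BPE split
	fmt.Println(params.Name, b, s, len(parts), len(params.SpecialTokens))
}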
type BPETokenizer ¶
type BPETokenizer struct {
// contains filtered or unexported fields
}
BPETokenizer implements the gotoken.Tokenizer interface. All tokenizers returned by gotoken.GetTokenizer are BPETokenizers.
Internally, BPETokenizer uses a combination of a splitting function and a BPE dictionary. The splitting function divides the text to tokenize into semantic parts based on character classes (typically keeping similar character classes together and breaking on whitespace or punctuation in some consistent way). The BPE dictionary then encodes frequently encountered sequences of bytes into their own tokens.
To create the canonical tokenization of a given sequence of bytes, the following steps are performed:
- Convert the source bytes into their corresponding tokens.
- Determine which pairs of adjacent tokens are encodable as a higher-numbered token.
- Of those, choose the lowest-ranked pair (the one with the lowest resulting token number), and combine those two tokens into a new token.
- Repeat this process until no more pairs can be combined.
Working this out for "Bonjour" in r50k_base results in the following steps:
- "Bonjour" == []byte{66, 111, 110, 106, 111, 117, 114}
- Equivalent tokens == []int{33, 78, 77, 73, 78, 84, 81}
- {33, 78, 77, 73, 78, 84, 81} == {"B", "o", "n", "j", "o", "u", "r"}
- {33, 261, 73, 78, 84, 81} == {"B", "on", "j", "o", "u", "r"}
- {33, 261, 73, 280, 81} == {"B", "on", "j", "ou", "r"}
- {33, 261, 73, 454} == {"B", "on", "j", "our"}
- {20682, 73, 454} == {"Bon", "j", "our"}
Combining tokens in a different order (for example, strictly left-to-right) could result in a different tokenization of the same word. For example, a valid tokenization of "Bonjour" is:
- {20682, 7639, 333} == {"Bon", "jo", "ur"}
Calling Decode() on any of the above token arrays returns "Bonjour". However, non-canonical tokenizations may not be the same length as the ones generated by OpenAI's APIs.
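The canonical procedure above can be sketched independently of the package's data structures. In the following simplified illustration (not the package's implementation), rank stands in for the BPE dictionary: it returns the token number produced by merging two adjacent tokens, or -1 if that pair has no merge.

// mergeLowest repeatedly combines the adjacent pair whose merged token has the
// lowest number, mirroring the steps listed above.
func mergeLowest(tokens []int, rank func(a, b int) int) []int {
	for {
		best, bestAt := -1, -1
		for i := 0; i+1 < len(tokens); i++ {
			if r := rank(tokens[i], tokens[i+1]); r >= 0 && (best < 0 || r < best) {
				best, bestAt = r, i
			}
		}
		if bestAt < 0 {
			return tokens // no adjacent pair can be combined any further
		}
		// Replace the winning pair with its merged token and repeat.
		merged := append([]int{}, tokens[:bestAt]...)
		merged = append(merged, best)
		tokens = append(merged, tokens[bestAt+2:]...)
	}
}

Run on the "Bonjour" tokens above with an r50k_base-style rank, this produces the same intermediate states as the walkthrough, ending at {20682, 73, 454}.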
func NewBPETokenizer ¶
func NewBPETokenizer(params *BPEParams, allowSpecialAsText bool, allowedSpecialTokens []string) (*BPETokenizer, error)
NewBPETokenizer creates a new BPETokenizer from the given BPEParams and using the specified special token encoding settings.
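A hedged usage sketch, assuming params holds data generated by gen.go and that "<|endoftext|>" is a special token defined by that encoding:

// params and the special token name are illustrative assumptions.
tok, err := NewBPETokenizer(params, false, []string{"<|endoftext|>"})
if err != nil {
	log.Fatal(err)
}
_ = tok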
func (*BPETokenizer) Allowed ¶
func (tt *BPETokenizer) Allowed(input string) error
Allowed performs the special token safety check on an input string according to the configuration of this Tokenizer. An error wrapping gotoken.ErrSpecialToken is returned if the input contains a special token defined by this encoding that has not been explicitly allowed.
If a Tokenizer instance was created with the gotoken.AllowSpecialAsText option, this method always returns nil.
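For example, a sketch assuming tok was created without allowing "<|endoftext|>" and without gotoken.AllowSpecialAsText:

if err := tok.Allowed("before <|endoftext|> after"); errors.Is(err, gotoken.ErrSpecialToken) {
	// The input contains a special token that was not explicitly allowed.
}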
func (*BPETokenizer) Count ¶
func (tt *BPETokenizer) Count(input string) int
Count returns the number of tokens in an input string, without returning the actual tokens. It returns 0 if the input string is empty, or if the input cannot be encoded.
func (*BPETokenizer) Decode ¶
func (tt *BPETokenizer) Decode(tokens []int) (string, error)
Decode converts a slice of ints (tokens) into a string. It may return an error that wraps gotoken.ErrInvalidToken if any of the provided tokens are not valid in this encoding.
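Continuing the "Bonjour" example, even the non-canonical tokenization decodes to the original string (a sketch, assuming tok is an r50k_base tokenizer):

s, err := tok.Decode([]int{20682, 7639, 333}) // {"Bon", "jo", "ur"}
if err != nil {
	// An error wrapping gotoken.ErrInvalidToken would mean a token outside the encoding.
}
fmt.Println(s) // Bonjour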
func (*BPETokenizer) Encode ¶
func (tt *BPETokenizer) Encode(s string) ([]int, error)
Encode converts a string into a slice of ints (tokens). A wrapped gotoken.ErrSpecialToken error will be returned if a special token appears in the input without being explicitly allowed.
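A sketch based on the walkthrough above, again assuming tok is an r50k_base tokenizer created without allowing "<|endoftext|>":

tokens, _ := tok.Encode("Bonjour")
// tokens == []int{20682, 73, 454}, the canonical tokenization shown above.

if _, err := tok.Encode("before <|endoftext|> after"); errors.Is(err, gotoken.ErrSpecialToken) {
	// The disallowed special token in the input caused the error.
}
_ = tokens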
type TestPairReader ¶
type TestPairReader struct {
// contains filtered or unexported fields
}
TestPairReader reads a pair of text files that contain tokenization test cases. The lines of the two files are read one at a time.
func NewTestPairReader ¶
func NewTestPairReader(inputFile, expectedFile string) (*TestPairReader, error)
NewTestPairReader returns a new TestPairReader that reads from the given input and expected files. The parameter inputFile must point to a text file that contains one test case per line, and expectedFile should contain one JSON-encoded array of integers per line.
func (*TestPairReader) CaseName ¶
func (tpr *TestPairReader) CaseName() string
CaseName returns "{inputFilename}#{line}" to identify the current test case.
func (*TestPairReader) Close ¶
func (fp *TestPairReader) Close()
Close closes the input and expected files. It is safe to call even on a closed reader.
func (*TestPairReader) Line ¶
func (tpr *TestPairReader) Line() int
Line returns the 1-based line number of the lines most recently read from the input and expected files.
func (*TestPairReader) Next ¶
func (tpr *TestPairReader) Next() (*TestPair, error)
Next returns the next test case from the input and expected files. If the end of the files has been reached, Next() returns (nil, nil) indefinitely.
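A typical read loop might look like the following sketch (the file names are illustrative, and handling of each returned *TestPair is elided):

tpr, err := NewTestPairReader("cases.txt", "cases_tokens.txt")
if err != nil {
	log.Fatal(err)
}
defer tpr.Close()
for {
	pair, err := tpr.Next()
	if err != nil {
		log.Fatalf("%s: %v", tpr.CaseName(), err)
	}
	if pair == nil {
		break // end of both files reached
	}
	_ = pair // compare the tokenizer's output against the expected tokens here
}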