Documentation ¶
Overview ¶
Package rs wraps the Rust tokenizer.
The two parts used by this wrapper are:
- The linked Rust tokenizer is in huggingface/tokenizers/tokenizers (https://github.com/huggingface/tokenizers/tree/main/tokenizers).
- The Rust wrapper with a C signature (`extern "C"`) is implemented in the subdirectory github.com/gomlx/tokenizers/rs (https://github.com/gomlx/tokenizers/tree/main/rs).
End users should use the public library in github.com/gomlx/tokenizers (https://github.com/gomlx/tokenizers) instead.
Index ¶
- Variables
- type EncodeParams
- type Encoding
- type Offset
- type Tokenizer
- func (t *Tokenizer) Decode(tokenIDs []uint32, skipSpecialTokens bool) string
- func (t *Tokenizer) Encode(str string, encParams EncodeParams) (*Encoding, error)
- func (t *Tokenizer) EncodeBatch(strArr []string, encParams EncodeParams) ([]Encoding, error)
- func (t *Tokenizer) Finalize()
- func (t *Tokenizer) GetPadding() (isSet bool, strategy uint32, direction uint8, ...)
- func (t *Tokenizer) GetTruncation() (isSet bool, direction uint8, maxLength uint32, strategy uint8, stride uint32)
- func (t *Tokenizer) SetNoPadding()
- func (t *Tokenizer) SetNoTruncation() error
- func (t *Tokenizer) SetPadding(strategy uint32, direction uint8, padToMultipleOf, padId, padTypeId uint32, ...)
- func (t *Tokenizer) SetTruncation(direction uint8, maxLength uint32, strategy uint8, stride uint32) error
- func (t *Tokenizer) VocabSize() uint32
- type TruncationDirection
Constants ¶
This section is empty.
Variables ¶
var CountTokenizerAllocs = atomic.Int64{}
CountTokenizerAllocs counts the number of Tokenizer allocations. This is used to test for memory leaks.
Functions ¶
This section is empty.
Types ¶
type EncodeParams ¶
type EncodeParams struct {
AddSpecialTokens, ReturnTokens, ReturnTypeIds, ReturnSpecialTokensMask, ReturnAttentionMask, ReturnOffsets, WithOffsetsCharMode bool
}
EncodeParams are passed to `Encode` or `EncodeBatch` calls.
It is a copy of the underlying C.EncodeParams.
func ReturnAll ¶
func ReturnAll(addSpecialTokens, withCharMode bool) EncodeParams
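ReturnAll has no description here, so the following is only a guess at its behavior based on its name and signature: it presumably enables every Return* flag and forwards its two arguments. The sketch uses a local stand-in struct and a hypothetical helper named returnAllSketch; neither is the package's actual code.

```go
package main

import "fmt"

// EncodeParams is a local stand-in that mirrors the documented fields.
type EncodeParams struct {
	AddSpecialTokens, ReturnTokens, ReturnTypeIds, ReturnSpecialTokensMask,
	ReturnAttentionMask, ReturnOffsets, WithOffsetsCharMode bool
}

// returnAllSketch is a hypothetical equivalent of ReturnAll. It assumes that
// every Return* flag is switched on and the two arguments are passed through.
func returnAllSketch(addSpecialTokens, withCharMode bool) EncodeParams {
	return EncodeParams{
		AddSpecialTokens:        addSpecialTokens,
		ReturnTokens:            true,
		ReturnTypeIds:           true,
		ReturnSpecialTokensMask: true,
		ReturnAttentionMask:     true,
		ReturnOffsets:           true,
		WithOffsetsCharMode:     withCharMode,
	}
}

func main() {
	fmt.Printf("%+v\n", returnAllSketch(true, false))
}
```

Setting only the flags that are needed, by writing the struct literal directly, avoids computing fields that the caller never reads.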
type Encoding ¶
type Encoding struct {
	TokenIds          []uint32
	TypeIds           []uint32
	SpecialTokensMask []uint32
	AttentionMask     []uint32
	Tokens            []string
	Offsets           []Offset
}
Encoding is the result of a Tokenizer.Encode.
Only TokenIds is always present; all other fields are set only if requested through EncodeParams.
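Because the optional fields may be missing, code that reads an Encoding should check their lengths before indexing. The sketch below uses local stand-ins for Encoding and Offset and a hypothetical helper named describe. It assumes that fields which were not requested are left as nil or empty slices.

```go
package main

import "fmt"

// Offset and Encoding are local stand-ins that mirror the documented types.
type Offset struct {
	Start, End uint32
}

type Encoding struct {
	TokenIds          []uint32
	TypeIds           []uint32
	SpecialTokensMask []uint32
	AttentionMask     []uint32
	Tokens            []string
	Offsets           []Offset
}

// describe builds one line per token ID and appends the optional fields
// only when they are populated (assumed nil or empty when not requested).
func describe(enc Encoding) []string {
	lines := make([]string, 0, len(enc.TokenIds))
	for i, id := range enc.TokenIds {
		line := fmt.Sprintf("id=%d", id)
		if i < len(enc.Tokens) {
			line += fmt.Sprintf(" token=%q", enc.Tokens[i])
		}
		if i < len(enc.Offsets) {
			line += fmt.Sprintf(" span=[%d,%d)", enc.Offsets[i].Start, enc.Offsets[i].End)
		}
		lines = append(lines, line)
	}
	return lines
}

func main() {
	// Illustrative values only; they do not come from a real tokenizer.
	enc := Encoding{TokenIds: []uint32{101, 7592}, Tokens: []string{"[CLS]", "hello"}}
	for _, l := range describe(enc) {
		fmt.Println(l)
	}
}
```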
type Offset ¶
type Offset struct {
Start, End uint32
}
Offset holds the range (Start and End) of the matching token in the original sentence. The units depend on the WithOffsetsCharMode setting in EncodeParams: the values count either bytes or UTF-8 characters.
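The two offset units call for different slicing of the original string. The sketch below uses a local stand-in for Offset and two hypothetical helpers. It assumes that End is exclusive and that character mode counts Unicode code points; neither detail is stated on this page.

```go
package main

import "fmt"

// Offset is a local stand-in that mirrors the documented type.
type Offset struct {
	Start, End uint32
}

// sliceByBytes treats the offset as a byte range (End assumed exclusive).
func sliceByBytes(s string, o Offset) string {
	return s[o.Start:o.End]
}

// sliceByChars treats the offset as a range of code points (End assumed exclusive).
func sliceByChars(s string, o Offset) string {
	r := []rune(s)
	return string(r[o.Start:o.End])
}

func main() {
	s := "héllo wörld" // 11 code points, 13 bytes
	// The same word is addressed by different numbers in each mode.
	fmt.Println(sliceByBytes(s, Offset{Start: 7, End: 13}))
	fmt.Println(sliceByChars(s, Offset{Start: 6, End: 11}))
}
```

Mixing the two units silently returns the wrong substring, or panics on out-of-range indices, for any text that contains multi-byte characters.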
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
func (*Tokenizer) Encode ¶
func (t *Tokenizer) Encode(str string, encParams EncodeParams) (*Encoding, error)
func (*Tokenizer) EncodeBatch ¶
func (t *Tokenizer) EncodeBatch(strArr []string, encParams EncodeParams) ([]Encoding, error)
func (*Tokenizer) Finalize ¶
func (t *Tokenizer) Finalize()
Finalize frees the associated Rust tokenizer. It is called automatically at garbage collection, but you can also call it ahead of time. After Finalize is called, the tokenizer is invalid and must not be used.
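The general pattern behind an explicit Finalize combined with a garbage-collection hook and an allocation counter can be sketched as follows. Everything here is hypothetical (the handle type, newHandle, and liveHandles); it illustrates the pattern and is not the package's implementation. In particular, making Finalize safe to call twice is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// liveHandles plays the role of an allocation counter used in leak tests.
var liveHandles atomic.Int64

// handle is a hypothetical wrapper around a native resource.
type handle struct {
	name  string // stands in for the pointer to the native object
	freed atomic.Bool
}

// newHandle counts the allocation and registers a finalizer so the
// resource is released even if the caller forgets to call Finalize.
func newHandle(name string) *handle {
	h := &handle{name: name}
	liveHandles.Add(1)
	runtime.SetFinalizer(h, func(h *handle) { h.Finalize() })
	return h
}

// Finalize releases the resource at most once; later calls do nothing.
func (h *handle) Finalize() {
	if h.freed.CompareAndSwap(false, true) {
		// A real wrapper would call into the native library here.
		liveHandles.Add(-1)
	}
}

func main() {
	h := newHandle("demo")
	fmt.Println("live after create:", liveHandles.Load())
	h.Finalize()
	fmt.Println("live after Finalize:", liveHandles.Load())
}
```

A leak test can then create and release objects in a loop and check that the counter returns to its starting value.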
func (*Tokenizer) GetPadding ¶
func (t *Tokenizer) GetPadding() (isSet bool, strategy uint32, direction uint8, padToMultipleOf, padId, padTypeId uint32, padToken string)
GetPadding returns the current padding parameters of the Tokenizer. If there are no parameters set, `isSet` is false, and the other values should be ignored. Otherwise, `isSet` is true, and the other values are returned appropriately.
func (*Tokenizer) GetTruncation ¶
func (t *Tokenizer) GetTruncation() (isSet bool, direction uint8, maxLength uint32, strategy uint8, stride uint32)
GetTruncation returns the current truncation parameters of the Tokenizer. If there are no parameters set, `isSet` is false, and the other values should be ignored. Otherwise, `isSet` is true, and the other values are returned appropriately.
func (*Tokenizer) SetNoPadding ¶
func (t *Tokenizer) SetNoPadding()
SetNoPadding changes the tokenizer not to use padding.
func (*Tokenizer) SetNoTruncation ¶
func (t *Tokenizer) SetNoTruncation() error
SetNoTruncation changes the tokenizer not to use truncation.
func (*Tokenizer) SetPadding ¶
func (t *Tokenizer) SetPadding(strategy uint32, direction uint8, padToMultipleOf, padId, padTypeId uint32, padToken string)
SetPadding changes the tokenizer padding configuration.
- strategy: 0 -> BatchLongest, >0 -> Fixed to the given value.
- direction: 0 -> Left (*); 1 -> Right.
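To make the strategy encoding concrete, the sketch below computes the length that a batch would be padded to. The helper paddedLength is hypothetical. It assumes that padToMultipleOf == 0 means "no rounding" and that rounding goes up to the next multiple, as the upstream Rust library appears to do; check both against the library before relying on them.

```go
package main

import "fmt"

// paddedLength returns the assumed target length for a batch.
//   - strategy == 0: BatchLongest, use the longest sequence in the batch.
//   - strategy > 0:  Fixed, use the strategy value itself.
//
// The result is then rounded up to a multiple of padToMultipleOf, where
// 0 is assumed to disable rounding.
func paddedLength(strategy, padToMultipleOf uint32, lengths []uint32) uint32 {
	target := strategy
	if strategy == 0 {
		for _, n := range lengths {
			if n > target {
				target = n
			}
		}
	}
	if padToMultipleOf > 0 && target%padToMultipleOf != 0 {
		target += padToMultipleOf - target%padToMultipleOf
	}
	return target
}

func main() {
	lengths := []uint32{3, 5, 4}
	fmt.Println(paddedLength(0, 0, lengths))  // BatchLongest, no rounding
	fmt.Println(paddedLength(0, 4, lengths))  // BatchLongest, rounded up to a multiple of 4
	fmt.Println(paddedLength(16, 0, lengths)) // Fixed to 16
}
```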
func (*Tokenizer) SetTruncation ¶
func (t *Tokenizer) SetTruncation(direction uint8, maxLength uint32, strategy uint8, stride uint32) error
SetTruncation changes the tokenizer truncation configuration.
- direction: 0 -> Left (*); 1 -> Right.
- strategy: 0 -> LongestFirst (*), 1 -> OnlyFirst, 2 -> OnlySecond.
It may return an error if `stride` is too high relative to `maxLength` and the `post_processor.added_tokens()`.
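The relationship between `stride`, `maxLength`, and the tokens added by the post-processor can be checked before calling SetTruncation. The helper below is hypothetical, and its exact rule (stride must be strictly smaller than maxLength minus the added tokens) is an assumption about the upstream check, not something this page confirms.

```go
package main

import (
	"errors"
	"fmt"
)

// checkStride is a hypothetical pre-check. addedTokens stands for the
// number of special tokens the post-processor adds to each sequence.
// Assumed rule: stride < maxLength - addedTokens.
func checkStride(maxLength, stride, addedTokens uint32) error {
	if addedTokens >= maxLength {
		return errors.New("maxLength leaves no room for content tokens")
	}
	effective := maxLength - addedTokens
	if stride >= effective {
		return fmt.Errorf("stride %d must be smaller than the effective max length %d", stride, effective)
	}
	return nil
}

func main() {
	fmt.Println(checkStride(512, 128, 2)) // accepted under the assumed rule
	fmt.Println(checkStride(8, 7, 2))     // rejected: 7 >= 8 - 2
}
```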
type TruncationDirection ¶
type TruncationDirection int
const (
	TruncationDirectionLeft TruncationDirection = iota
	TruncationDirectionRight
)