rs

package
v0.0.0-...-1de48c6
Published: Aug 24, 2024 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package rs wraps the Rust tokenizer.

The two parts used by this wrapper are:

End users should use the public library github.com/gomlx/tokenizers (https://github.com/gomlx/tokenizers) instead.
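For orientation, a minimal usage sketch of this wrapper. The import path is an assumption (the page does not state it; as an internal package it likely lives under github.com/gomlx/tokenizers), and `tokenizer.json` is a placeholder path for a HuggingFace tokenizer file:

```go
package main

import (
	"fmt"
	"log"

	// Import path assumed; end users should prefer the public
	// github.com/gomlx/tokenizers library instead.
	"github.com/gomlx/tokenizers/internal/rs"
)

func main() {
	// Load a HuggingFace tokenizer.json (placeholder path).
	tok, err := rs.FromFile("tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}
	defer tok.Finalize() // Frees the underlying Rust tokenizer.

	// Encode, requesting all optional outputs.
	enc, err := tok.Encode("Hello world", rs.ReturnAll(true, true))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(enc.TokenIds, enc.Tokens)

	// Decode back to text, skipping special tokens.
	fmt.Println(tok.Decode(enc.TokenIds, true))
}
```

This cannot run without the cgo-backed package and a tokenizer file, so treat it as a shape of the API rather than a verified program.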

Index

Constants

This section is empty.

Variables

View Source
var CountTokenizerAllocs = atomic.Int64{}

CountTokenizerAllocs counts the number of Tokenizer allocations. This is used to test for memory leaks.

Functions

This section is empty.

Types

type EncodeParams

type EncodeParams struct {
	AddSpecialTokens        bool
	ReturnTokens            bool
	ReturnTypeIds           bool
	ReturnSpecialTokensMask bool
	ReturnAttentionMask     bool
	ReturnOffsets           bool
	WithOffsetsCharMode     bool
}

EncodeParams are passed to `Encode` or `EncodeBatch` calls.

It's a copy of the underlying C.EncodeParams.

func ReturnAll

func ReturnAll(addSpecialTokens, withCharMode bool) EncodeParams

type Encoding

type Encoding struct {
	TokenIds          []uint32
	TypeIds           []uint32
	SpecialTokensMask []uint32
	AttentionMask     []uint32
	Tokens            []string
	Offsets           []Offset
}

Encoding is the result of a Tokenizer.Encode.

Only TokenIds is always present, all other fields are only set if requested.

type Offset

type Offset struct {
	Start, End uint32
}

Offset with the range (Start and End) of the matching token in the original sentence. Values depend on CharMode configuration (bytes or UTF-8 character).

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

func FromBytes

func FromBytes(data []byte) (*Tokenizer, error)

func FromFile

func FromFile(path string) (*Tokenizer, error)

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(tokenIDs []uint32, skipSpecialTokens bool) string

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(str string, encParams EncodeParams) (*Encoding, error)

func (*Tokenizer) EncodeBatch

func (t *Tokenizer) EncodeBatch(strArr []string, encParams EncodeParams) ([]Encoding, error)

func (*Tokenizer) Finalize

func (t *Tokenizer) Finalize()

Finalize frees the associated Rust tokenizer. It is called automatically at garbage collection, but it can be called ahead of time. Once called, the tokenizer becomes invalid.

func (*Tokenizer) GetPadding

func (t *Tokenizer) GetPadding() (isSet bool, strategy uint32, direction uint8, padToMultipleOf, padId, padTypeId uint32, padToken string)

GetPadding returns the current padding parameters of the Tokenizer. If there are no parameters set, `isSet` is false, and the other values should be ignored. Otherwise, `isSet` is true, and the other values are returned appropriately.

func (*Tokenizer) GetTruncation

func (t *Tokenizer) GetTruncation() (isSet bool, direction uint8, maxLength uint32, strategy uint8, stride uint32)

GetTruncation returns the current truncation parameters of the Tokenizer. If there are no parameters set, `isSet` is false, and the other values should be ignored. Otherwise, `isSet` is true, and the other values are returned appropriately.
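Both getters follow the same `isSet` pattern: check it before trusting the other return values. A hedged usage sketch (import path assumed, `tokenizer.json` a placeholder; requires the cgo-backed package to actually run):

```go
package main

import (
	"fmt"
	"log"

	// Import path assumed; adjust to where the rs package actually lives.
	"github.com/gomlx/tokenizers/internal/rs"
)

func main() {
	tok, err := rs.FromFile("tokenizer.json") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer tok.Finalize()

	// Only read the padding values if padding is configured at all.
	if isSet, strategy, direction, padToMultipleOf, padId, padTypeId, padToken := tok.GetPadding(); isSet {
		fmt.Println("padding:", strategy, direction, padToMultipleOf, padId, padTypeId, padToken)
	}
	// Same pattern for truncation.
	if isSet, direction, maxLength, strategy, stride := tok.GetTruncation(); isSet {
		fmt.Println("truncation:", direction, maxLength, strategy, stride)
	}
}
```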

func (*Tokenizer) SetNoPadding

func (t *Tokenizer) SetNoPadding()

SetNoPadding configures the tokenizer to not use padding.

func (*Tokenizer) SetNoTruncation

func (t *Tokenizer) SetNoTruncation() error

SetNoTruncation configures the tokenizer to not use truncation.

func (*Tokenizer) SetPadding

func (t *Tokenizer) SetPadding(
	strategy uint32, direction uint8, padToMultipleOf, padId, padTypeId uint32, padToken string)

SetPadding changes the tokenizer padding configuration.

- strategy: 0 -> BatchLongest; >0 -> Fixed to the given value.
- direction: 0 -> Left (*); 1 -> Right.

func (*Tokenizer) SetTruncation

func (t *Tokenizer) SetTruncation(
	direction uint8, maxLength uint32, strategy uint8, stride uint32) error

SetTruncation changes the tokenizer truncation configuration.

- direction: 0 -> Left (*); 1 -> Right.
- strategy: 0 -> LongestFirst (*); 1 -> OnlyFirst; 2 -> OnlySecond.

It may return an error if `stride` is too high relative to `maxLength` and the `post_processor.added_tokens()`.

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() uint32

VocabSize returns the size of the tokenizer's vocabulary.

type TruncationDirection

type TruncationDirection int

const (
	TruncationDirectionLeft TruncationDirection = iota
	TruncationDirectionRight
)
