bpe

v0.0.0-...-ec25de7 Published: Dec 9, 2024 License: Apache-2.0 Imports: 8 Imported by: 0

Repository

github.com/gmohmad/go-YouTokenToMe

README

go-YouTokenToMe

go-YouTokenToMe is a Go port of YouTokenToMe, a computationally efficient implementation of Byte Pair Encoding [Sennrich et al.]. Only inference is supported; training is not.

Usage example

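// Open the binary model dump.
file, err := os.Open("data/yttm.model")
if err != nil {
    fmt.Println(err)
    return
}
defer file.Close()

// *os.File implements io.Reader, so it can be passed to ReadModel directly.
m, err := bpe.ReadModel(file)
if err != nil {
    fmt.Println(err)
    return
}

// The arguments are bos, eos, reverse: no BOS/EOS tokens, no reversal.
config := bpe.NewEncodingConfig(false, false, false)
ids, err := m.EncodeSentence("мама мыла раму", *config)
if err != nil {
    fmt.Println(err)
    return
}
fmt.Println(ids)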

Documentation

Index

func DecodeToken(token EncodedString, id2char map[TokenID]rune) (string, error)
type EncodedString
type EncodingConfig
- func NewEncodingConfig(bos, eos, reverse bool) *EncodingConfig
type Model
- func ReadModel(reader io.Reader) (*Model, error)
type TokenID
type TokenIDPair

Functions

func DecodeToken

func DecodeToken(token EncodedString, id2char map[TokenID]rune) (string, error)

DecodeToken converts a sequence of character IDs into the string formed by the corresponding characters.
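
The lookup that DecodeToken describes can be sketched with a standalone function. This is a minimal illustration, not the package's implementation: the helper name decodeChars and the choice to fail on the first unknown ID are assumptions, and the two types are redeclared locally so the example compiles on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenID and EncodedString mirror the documented type definitions.
type TokenID uint32
type EncodedString []TokenID

// decodeChars is a hypothetical stand-in for DecodeToken. It looks up each
// ID in id2char and returns an error for the first ID that has no entry
// (that error behavior is an assumption).
func decodeChars(token EncodedString, id2char map[TokenID]rune) (string, error) {
	var sb strings.Builder
	for _, id := range token {
		ch, ok := id2char[id]
		if !ok {
			return "", fmt.Errorf("unknown character id %d", id)
		}
		sb.WriteRune(ch)
	}
	return sb.String(), nil
}

func main() {
	// A toy character table; real tables come from the loaded model.
	id2char := map[TokenID]rune{4: 'м', 5: 'а'}
	s, err := decodeChars(EncodedString{4, 5, 4, 5}, id2char)
	fmt.Println(s, err)
}
```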

Types

type EncodedString

type EncodedString []TokenID

EncodedString is a sequence of subword token identifiers

type EncodingConfig

type EncodingConfig struct {
	// contains filtered or unexported fields
}

EncodingConfig is a configuration for encoding of strings

func NewEncodingConfig

func NewEncodingConfig(bos, eos, reverse bool) *EncodingConfig

type Model

type Model struct {
	// contains filtered or unexported fields
}

Model is a byte-pair encoding model. It encodes text into sequences of the most frequent subword tokens and decodes such sequences back into text.
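
For readers new to BPE inference, the following toy sketch shows the general idea of applying learned merge rules to a single word. It is not this package's algorithm: the pair type, the ranks table, and encodeWord are hypothetical, tokens are plain strings rather than TokenIDs, and the word-start marker is omitted.

```go
package main

import "fmt"

// pair is a hypothetical key type: two adjacent tokens that a rule may merge.
type pair struct{ left, right string }

// encodeWord repeatedly merges the adjacent pair with the lowest rank
// (lower rank = rule learned earlier) until no rule matches.
func encodeWord(word string, ranks map[pair]int) []string {
	// Start from individual characters (runes, not bytes).
	var tokens []string
	for _, r := range word {
		tokens = append(tokens, string(r))
	}
	for {
		best, bestRank := -1, 0
		for i := 0; i+1 < len(tokens); i++ {
			rank, ok := ranks[pair{tokens[i], tokens[i+1]}]
			if ok && (best == -1 || rank < bestRank) {
				best, bestRank = i, rank
			}
		}
		if best == -1 {
			return tokens // no applicable rule is left
		}
		// Replace tokens[best] and tokens[best+1] with their concatenation.
		merged := tokens[best] + tokens[best+1]
		tokens = append(tokens[:best+1], tokens[best+2:]...)
		tokens[best] = merged
	}
}

func main() {
	ranks := map[pair]int{
		{"м", "а"}:   0,
		{"ма", "ма"}: 1,
	}
	fmt.Println(encodeWord("мама", ranks))
}
```

With these two rules the word collapses step by step into a single token; with only the first rule it would stop at two tokens.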

func ReadModel

func ReadModel(reader io.Reader) (*Model, error)

ReadModel loads the BPE model from the binary dump

func (Model) DecodeFromStream

func (m Model) DecodeFromStream(reader io.Reader) ([]string, error)

DecodeFromStream decodes a sequence of encoded sentences written in an input stream using Model.DecodeSentences

func (Model) DecodeSentence

func (m Model) DecodeSentence(encodedSentence EncodedString) (string, error)

DecodeSentence decodes a sequence of token IDs into a text sentence, that is, a string of words separated by spaces.

func (Model) DecodeSentences

func (m Model) DecodeSentences(encodedSentences []EncodedString) ([]string, error)

DecodeSentences decodes a sequence of encoded sentences - sequences of token ids - into a sequence of corresponding text sentences

func (Model) EncodeSentence

func (m Model) EncodeSentence(sentence string, encodingConfig EncodingConfig) (EncodedString, error)

EncodeSentence takes a string of space-separated words and tokenizes each word according to the BPE rules. Through encodingConfig one can state whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequence. EncodeSentence returns the numerical encoding of the sentence.
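
The effect of the three flags can be sketched on an already-encoded sequence. Everything here is illustrative: the special-token IDs (2 and 3), the helper name applyConfig, and the choice to reverse after BOS and EOS are added are assumptions, not details taken from this package.

```go
package main

import "fmt"

// TokenID and EncodedString mirror the documented type definitions.
type TokenID uint32
type EncodedString []TokenID

// Hypothetical special-token IDs; a real model stores its own values.
const (
	bosID TokenID = 2
	eosID TokenID = 3
)

// applyConfig is a hypothetical helper. It optionally wraps ids with BOS and
// EOS and then optionally reverses the whole sequence. The order of these
// steps is an assumption.
func applyConfig(ids EncodedString, bos, eos, reverse bool) EncodedString {
	out := make(EncodedString, 0, len(ids)+2)
	if bos {
		out = append(out, bosID)
	}
	out = append(out, ids...)
	if eos {
		out = append(out, eosID)
	}
	if reverse {
		for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
			out[i], out[j] = out[j], out[i]
		}
	}
	return out
}

func main() {
	ids := EncodedString{10, 11, 12}
	fmt.Println(applyConfig(ids, true, true, false))
	fmt.Println(applyConfig(ids, false, false, true))
}
```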

func (Model) EncodeSentences

func (m Model) EncodeSentences(sentences []string, encodingConfig EncodingConfig) ([]EncodedString, error)

EncodeSentences takes a sequence of strings which consist of space-separated words and tokenizes each word according to the BPE rules. Through encodingConfig one can state whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequences. EncodeSentences returns the numerical encodings of the sentences.

func (Model) EncodeStream

func (m Model) EncodeStream(reader io.Reader, encodingConfig EncodingConfig) ([]EncodedString, error)

EncodeStream reads a sequence of strings which consist of space-separated words from the given stream and tokenizes each word according to the BPE rules. Through encodingConfig one can state whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequences. EncodeStream returns the numerical encodings of the sentences.

func (Model) IDToToken

func (m Model) IDToToken(id TokenID, replaceSpace bool) (string, error)

IDToToken returns the string token corresponding to the given token ID. If replaceSpace is true, the special space token that marks the start of a word is replaced with an ordinary space.
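
What replaceSpace does to a token string can be sketched as below. The marker is assumed to be U+2581 ("▁"), as in the upstream YouTokenToMe; this page does not show the constant used by the port, and replaceMarker is a hypothetical helper rather than part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// spaceMarker is assumed to be U+2581, the word-start marker used by the
// upstream YouTokenToMe. The port's own constant is not documented here.
const spaceMarker = "\u2581"

// replaceMarker is a hypothetical helper that mimics the replaceSpace flag:
// when the flag is set, every marker becomes an ordinary space.
func replaceMarker(token string, replaceSpace bool) string {
	if !replaceSpace {
		return token
	}
	return strings.ReplaceAll(token, spaceMarker, " ")
}

func main() {
	fmt.Printf("%q\n", replaceMarker("▁мама", false))
	fmt.Printf("%q\n", replaceMarker("▁мама", true))
}
```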

type TokenID

type TokenID uint32

TokenID is a numerical identifier of the subword token

type TokenIDPair

type TokenIDPair uint64

TokenIDPair is a concatenation of two TokenIDs that is used as the key type in rule2id map.
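
One way to concatenate two 32-bit IDs into a single 64-bit map key is shown below. The bit layout (left ID in the high 32 bits) and the helper names newTokenIDPair and unpack are assumptions for illustration; the package's unexported code may pack the pair differently.

```go
package main

import "fmt"

// TokenID and TokenIDPair mirror the documented type definitions.
type TokenID uint32
type TokenIDPair uint64

// newTokenIDPair is a hypothetical helper: it places left in the high
// 32 bits and right in the low 32 bits. The real layout is an assumption.
func newTokenIDPair(left, right TokenID) TokenIDPair {
	return TokenIDPair(left)<<32 | TokenIDPair(right)
}

// unpack recovers the two IDs from a packed pair.
func unpack(p TokenIDPair) (left, right TokenID) {
	return TokenID(p >> 32), TokenID(p & 0xFFFFFFFF)
}

func main() {
	// A toy rule table keyed by packed pairs, in the spirit of rule2id.
	rule2id := map[TokenIDPair]TokenID{
		newTokenIDPair(5, 7): 42,
	}
	merged, ok := rule2id[newTokenIDPair(5, 7)]
	fmt.Println(merged, ok)

	left, right := unpack(newTokenIDPair(5, 7))
	fmt.Println(left, right)
}
```

Packing the pair into one integer keeps the map key a plain comparable value, and the order of the two IDs is preserved, so (5, 7) and (7, 5) map to different keys.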

Source Files

bpe.go
