bpe

Version: v0.0.0-...-ec25de7
Published: Dec 9, 2024 License: Apache-2.0 Imports: 8 Imported by: 0

README

go-YouTokenToMe

go-YouTokenToMe is a Go port of YouTokenToMe, a computationally efficient implementation of Byte Pair Encoding [Sennrich et al.]. Only inference is supported; training is not.

Usage example

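// Open the binary model dump.
file, err := os.Open("data/yttm.model")
if err != nil {
    fmt.Println(err)
    return
}
defer file.Close()

// *os.File already satisfies io.Reader, so it can be passed directly.
m, err := bpe.ReadModel(file)
if err != nil {
    fmt.Println(err)
    return
}

// The arguments are bos, eos, reverse: no BOS/EOS tokens, no reversal.
config := bpe.NewEncodingConfig(false, false, false)
ids, err := m.EncodeSentence("мама мыла раму", *config)
if err != nil {
    fmt.Println(err)
    return
}
fmt.Println(ids)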

Documentation

Functions

func DecodeToken

func DecodeToken(token EncodedString, id2char map[TokenID]rune) (string, error)

DecodeToken converts a sequence of character IDs into the corresponding string by looking up each ID in id2char.
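
To make the lookup concrete, here is a minimal self-contained sketch of what such a conversion might look like. It does not import the package: decodeToken is a hypothetical stand-in, the local types only mirror the documented ones, and returning an error for an ID missing from the map is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenID and EncodedString mirror the type definitions documented on this page.
type TokenID uint32
type EncodedString []TokenID

// decodeToken is a hypothetical stand-in for DecodeToken: it looks up every
// ID in id2char and concatenates the resulting runes. Failing on an unknown
// ID is an assumption made for this sketch.
func decodeToken(token EncodedString, id2char map[TokenID]rune) (string, error) {
	var sb strings.Builder
	for _, id := range token {
		r, ok := id2char[id]
		if !ok {
			return "", fmt.Errorf("unknown char id: %d", id)
		}
		sb.WriteRune(r)
	}
	return sb.String(), nil
}

func main() {
	// Illustrative mapping; a real model carries its own table.
	id2char := map[TokenID]rune{4: 'м', 5: 'а'}
	s, err := decodeToken(EncodedString{4, 5, 4, 5}, id2char)
	fmt.Println(s, err) // prints "мама <nil>"
}
```

Building the string with strings.Builder avoids repeated allocations when a token contains many characters.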

Types

type EncodedString

type EncodedString []TokenID

EncodedString is a sequence of subword token identifiers

type EncodingConfig

type EncodingConfig struct {
	// contains filtered or unexported fields
}

EncodingConfig holds the options that control how strings are encoded. Its fields are unexported, so values are created with NewEncodingConfig.

func NewEncodingConfig

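func NewEncodingConfig(bos, eos, reverse bool) *EncodingConfig

NewEncodingConfig returns a configuration that states whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequences.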

type Model

type Model struct {
	// contains filtered or unexported fields
}

Model is a Byte Pair Encoding model that encodes text into sequences of subword token IDs and decodes such sequences back into text.

func ReadModel

func ReadModel(reader io.Reader) (*Model, error)

ReadModel loads a BPE model from a binary dump supplied through reader.

func (Model) DecodeFromStream

func (m Model) DecodeFromStream(reader io.Reader) ([]string, error)

DecodeFromStream reads a sequence of encoded sentences from the input stream and decodes them using Model.DecodeSentences.

func (Model) DecodeSentence

func (m Model) DecodeSentence(encodedSentence EncodedString) (string, error)

DecodeSentence decodes a sequence of token IDs into a text sentence: a string of words separated by spaces.

func (Model) DecodeSentences

func (m Model) DecodeSentences(encodedSentences []EncodedString) ([]string, error)

DecodeSentences decodes a sequence of encoded sentences (each a sequence of token IDs) into the corresponding sequence of text sentences.

func (Model) EncodeSentence

func (m Model) EncodeSentence(sentence string, encodingConfig EncodingConfig) (EncodedString, error)

EncodeSentence takes a string of space-separated words and tokenizes each word according to the BPE rules. The encodingConfig argument states whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequence. EncodeSentence returns the numerical encoding of the sentence.
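
As an illustration of what "tokenizes each word according to the BPE rules" generally means, the sketch below applies ranked merge rules to a single word. It is a simplified, generic BPE merge loop, not this package's implementation: encodeWord and the ranks table are hypothetical, the output is subword strings rather than token IDs, and no word-start marker is used.

```go
package main

import "fmt"

// encodeWord is a hypothetical, simplified BPE merge loop. It starts from
// single characters and repeatedly merges the adjacent pair with the lowest
// rank (highest priority) until no known pair remains.
func encodeWord(word string, ranks map[[2]string]int) []string {
	var parts []string
	for _, r := range word {
		parts = append(parts, string(r))
	}
	for {
		best, bestRank := -1, 0
		for i := 0; i+1 < len(parts); i++ {
			rank, ok := ranks[[2]string{parts[i], parts[i+1]}]
			if ok && (best == -1 || rank < bestRank) {
				best, bestRank = i, rank
			}
		}
		if best == -1 {
			return parts // no applicable rule is left
		}
		merged := parts[best] + parts[best+1]
		// Replace the two merged parts with the single merged one.
		tail := append([]string{merged}, parts[best+2:]...)
		parts = append(parts[:best], tail...)
	}
}

func main() {
	// Illustrative rules only; a real model loads its rules from the dump.
	ranks := map[[2]string]int{
		{"l", "o"}:  0,
		{"lo", "w"}: 1,
		{"e", "r"}:  2,
	}
	fmt.Println(encodeWord("lower", ranks)) // prints "[low er]"
}
```

This loop rescans the word after every merge, which is quadratic in the word length; efficient implementations typically keep candidate pairs in a priority queue instead.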

func (Model) EncodeSentences

func (m Model) EncodeSentences(sentences []string, encodingConfig EncodingConfig) ([]EncodedString, error)

EncodeSentences takes a sequence of strings, each consisting of space-separated words, and tokenizes each word according to the BPE rules. The encodingConfig argument states whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequences. EncodeSentences returns the numerical encodings of the sentences.
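
The effect of the three flags on one encoded sentence can be sketched as follows. This is not the package's code: applyConfig is a hypothetical helper, the BOS and EOS IDs are placeholder values (a real model defines its own), and reversing after BOS and EOS are added is an assumption about ordering.

```go
package main

import "fmt"

// TokenID and EncodedString mirror the type definitions documented on this page.
type TokenID uint32
type EncodedString []TokenID

// Placeholder special-token IDs, assumed for this sketch only.
const (
	bosID TokenID = 2
	eosID TokenID = 3
)

// applyConfig is a hypothetical helper showing how bos, eos and reverse
// could shape an already tokenized sentence. It assumes the reversal is
// applied last, so BOS and EOS are reversed together with the other IDs.
func applyConfig(ids EncodedString, bos, eos, reverse bool) EncodedString {
	out := make(EncodedString, 0, len(ids)+2)
	if bos {
		out = append(out, bosID)
	}
	out = append(out, ids...)
	if eos {
		out = append(out, eosID)
	}
	if reverse {
		for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
			out[i], out[j] = out[j], out[i]
		}
	}
	return out
}

func main() {
	ids := EncodedString{10, 11, 12}
	fmt.Println(applyConfig(ids, true, true, false))  // prints "[2 10 11 12 3]"
	fmt.Println(applyConfig(ids, false, false, true)) // prints "[12 11 10]"
}
```

The helper copies its input, so the caller's slice is left unchanged even when reverse is set.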

func (Model) EncodeStream

func (m Model) EncodeStream(reader io.Reader, encodingConfig EncodingConfig) ([]EncodedString, error)

EncodeStream reads a sequence of strings, each consisting of space-separated words, from the given stream and tokenizes each word according to the BPE rules. The encodingConfig argument states whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequences. EncodeStream returns the numerical encodings of the sentences.

func (Model) IDToToken

func (m Model) IDToToken(id TokenID, replaceSpace bool) (string, error)

IDToToken returns the string token corresponding to the given token ID. If replaceSpace is true, the special space token that marks the start of a word is replaced with a regular space.
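
The replaceSpace behavior can be sketched without the package. The marker character is an assumption: the sketch uses U+2581 ("▁"), which the original YouTokenToMe uses for word starts, but the value used by this port is not stated on this page. renderToken is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// spaceMarker is assumed to be U+2581, the word-start marker of the
// original YouTokenToMe; the Go port may define it differently.
const spaceMarker = "\u2581"

// renderToken is a hypothetical helper that mimics the replaceSpace flag:
// when set, every marker in the token text becomes a regular space.
func renderToken(token string, replaceSpace bool) string {
	if replaceSpace {
		return strings.ReplaceAll(token, spaceMarker, " ")
	}
	return token
}

func main() {
	fmt.Printf("%q\n", renderToken("\u2581мама", true)) // prints "\" мама\"" style quoted text with a leading space
	fmt.Printf("%q\n", renderToken("\u2581мама", false))
}
```

Leaving the marker in place is useful when tokens are inspected individually; replacing it is useful when they are joined back into readable text.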

type TokenID

type TokenID uint32

TokenID is a numerical identifier of the subword token

type TokenIDPair

type TokenIDPair uint64

TokenIDPair is the concatenation of two TokenIDs; it is used as the key type of the rule2id map.
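
One way to concatenate two 32-bit IDs into a 64-bit key is shown below. The bit layout (left ID in the high 32 bits) is an assumption, and newTokenIDPair and split are hypothetical helpers that do not appear in the package's exported API.

```go
package main

import "fmt"

// TokenID and TokenIDPair mirror the type definitions documented on this page.
type TokenID uint32
type TokenIDPair uint64

// newTokenIDPair is a hypothetical helper. It assumes the left ID occupies
// the high 32 bits and the right ID the low 32 bits.
func newTokenIDPair(left, right TokenID) TokenIDPair {
	return TokenIDPair(left)<<32 | TokenIDPair(right)
}

// split recovers the two IDs under the same layout assumption.
func (p TokenIDPair) split() (TokenID, TokenID) {
	return TokenID(p >> 32), TokenID(p & 0xFFFFFFFF)
}

func main() {
	// Illustrative rule table: the pair (7, 9) merges into token 42.
	rule2id := map[TokenIDPair]TokenID{newTokenIDPair(7, 9): 42}
	fmt.Println(rule2id[newTokenIDPair(7, 9)]) // prints "42"

	_, found := rule2id[newTokenIDPair(9, 7)]
	fmt.Println(found) // prints "false": the order of the pair matters
}
```

Packing the pair into a single integer gives a compact, directly comparable map key.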
