Documentation ¶
Index ¶
- func DecodeToken(token EncodedString, id2char map[TokenID]rune) (string, error)
- type EncodedString
- type EncodingConfig
- func NewEncodingConfig(bos, eos, reverse bool) *EncodingConfig
- type Model
- func (m Model) DecodeFromStream(reader io.Reader) ([]string, error)
- func (m Model) DecodeSentence(encodedSentence EncodedString) (string, error)
- func (m Model) DecodeSentences(encodedSentences []EncodedString) ([]string, error)
- func (m Model) EncodeSentence(sentence string, encodingConfig EncodingConfig) (EncodedString, error)
- func (m Model) EncodeSentences(sentences []string, encodingConfig EncodingConfig) ([]EncodedString, error)
- func (m Model) EncodeStream(reader io.Reader, encodingConfig EncodingConfig) ([]EncodedString, error)
- func (m Model) IDToToken(id TokenID, replaceSpace bool) (string, error)
- type TokenID
- type TokenIDPair
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DecodeToken ¶
func DecodeToken(token EncodedString, id2char map[TokenID]rune) (string, error)
DecodeToken converts a sequence of character IDs into a string by mapping each ID to its character through id2char.
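The lookup that DecodeToken describes can be sketched as follows. This is a minimal illustration, not the package's implementation: decodeTokenSketch is a hypothetical name, the uint32 underlying type of TokenID is an assumption, and returning an error for an ID missing from id2char is a guess at how the error result is used.

```go
package main

import "fmt"

// TokenID and EncodedString mirror the documented types. The uint32
// underlying type of TokenID is an assumption made for this sketch.
type TokenID uint32
type EncodedString []TokenID

// decodeTokenSketch is a hypothetical stand-in for DecodeToken: it maps each
// character ID to its rune and fails on an ID that is missing from id2char.
func decodeTokenSketch(token EncodedString, id2char map[TokenID]rune) (string, error) {
	runes := make([]rune, 0, len(token))
	for _, id := range token {
		r, ok := id2char[id]
		if !ok {
			return "", fmt.Errorf("unknown character id %d", id)
		}
		runes = append(runes, r)
	}
	return string(runes), nil
}

func main() {
	id2char := map[TokenID]rune{1: 'c', 2: 'a', 3: 't'}
	word, err := decodeTokenSketch(EncodedString{1, 2, 3}, id2char)
	fmt.Println(word, err) // prints "cat <nil>"
}
```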
Types ¶
type EncodedString ¶
type EncodedString []TokenID
EncodedString is a sequence of subword token identifiers
type EncodingConfig ¶
type EncodingConfig struct {
// contains filtered or unexported fields
}
EncodingConfig is a configuration for encoding of strings
func NewEncodingConfig ¶
func NewEncodingConfig(bos, eos, reverse bool) *EncodingConfig
NewEncodingConfig creates an EncodingConfig. The bos and eos flags state whether to add BOS and EOS tokens, and reverse states whether to reverse the output sequences. It returns a pointer, while the Encode methods take an EncodingConfig by value, so dereference the result when passing it.
type Model ¶
type Model struct {
// contains filtered or unexported fields
}
Model is a byte-pair encoding (BPE) model. It encodes text into sequences of the most frequent subword tokens and decodes such sequences back into text.
func (Model) DecodeFromStream ¶
func (m Model) DecodeFromStream(reader io.Reader) ([]string, error)
DecodeFromStream reads a sequence of encoded sentences from the input stream and decodes them with Model.DecodeSentences.
func (Model) DecodeSentence ¶
func (m Model) DecodeSentence(encodedSentence EncodedString) (string, error)
DecodeSentence decodes a sequence of token IDs into a text sentence, that is, a string of words separated by spaces.
func (Model) DecodeSentences ¶
func (m Model) DecodeSentences(encodedSentences []EncodedString) ([]string, error)
DecodeSentences decodes a sequence of encoded sentences, each a sequence of token IDs, into a sequence of the corresponding text sentences.
func (Model) EncodeSentence ¶
func (m Model) EncodeSentence(sentence string, encodingConfig EncodingConfig) (EncodedString, error)
EncodeSentence takes a string of space-separated words and tokenizes each word according to the BPE rules. Through encodingConfig one can state whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequence. EncodeSentence returns the numerical encoding of the sentence.
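The effect of the three configuration flags on an already tokenized sentence can be illustrated with a small sketch. Everything here is an assumption for illustration: applyConfigSketch is a hypothetical function, the bosID and eosID values are made up, and the choice to reverse the tokens before wrapping them in BOS and EOS is a guess that the documentation does not confirm.

```go
package main

import "fmt"

// TokenID and EncodedString mirror the documented types. The uint32
// underlying type of TokenID is an assumption made for this sketch.
type TokenID uint32
type EncodedString []TokenID

// Hypothetical special-token IDs. A real model defines its own values.
const (
	bosID TokenID = 2
	eosID TokenID = 3
)

// applyConfigSketch shows one possible reading of the bos, eos and reverse
// flags: the tokens are optionally reversed, then wrapped in BOS and EOS.
// The ordering of these steps is an assumption.
func applyConfigSketch(tokens EncodedString, bos, eos, reverse bool) EncodedString {
	out := make(EncodedString, 0, len(tokens)+2)
	if bos {
		out = append(out, bosID)
	}
	if reverse {
		for i := len(tokens) - 1; i >= 0; i-- {
			out = append(out, tokens[i])
		}
	} else {
		out = append(out, tokens...)
	}
	if eos {
		out = append(out, eosID)
	}
	return out
}

func main() {
	tokens := EncodedString{10, 11, 12}
	fmt.Println(applyConfigSketch(tokens, true, true, false)) // prints "[2 10 11 12 3]"
	fmt.Println(applyConfigSketch(tokens, true, true, true))  // prints "[2 12 11 10 3]"
}
```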
func (Model) EncodeSentences ¶
func (m Model) EncodeSentences(sentences []string, encodingConfig EncodingConfig) ([]EncodedString, error)
EncodeSentences takes a sequence of strings which consist of space-separated words and tokenizes each word according to the BPE rules. Through encodingConfig one can state whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequences. EncodeSentences returns the numerical encodings of the sentences.
func (Model) EncodeStream ¶
func (m Model) EncodeStream(reader io.Reader, encodingConfig EncodingConfig) ([]EncodedString, error)
EncodeStream reads a sequence of strings which consist of space-separated words from the given stream and tokenizes each word according to the BPE rules. Through encodingConfig one can state whether to add BOS and EOS tokens (beginning and end of sentence) and whether to reverse the output sequences. EncodeStream returns the numerical encodings of the sentences.
type TokenIDPair ¶
type TokenIDPair uint64
TokenIDPair is a concatenation of two TokenIDs that is used as the key type in rule2id map.
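One way to concatenate two IDs into a single uint64 map key is to place one ID in the high 32 bits and the other in the low 32 bits. The sketch below assumes that TokenID fits in 32 bits and that the left token occupies the high half. Neither detail is confirmed by the documentation, and newPairSketch and splitPairSketch are hypothetical names.

```go
package main

import "fmt"

// TokenID is assumed to be a 32-bit value for this sketch.
type TokenID uint32

// TokenIDPair mirrors the documented type.
type TokenIDPair uint64

// newPairSketch packs two IDs into one key: left in the high 32 bits,
// right in the low 32 bits. The bit layout is an assumption.
func newPairSketch(left, right TokenID) TokenIDPair {
	return TokenIDPair(left)<<32 | TokenIDPair(right)
}

// splitPairSketch recovers the two IDs from a packed key.
func splitPairSketch(pair TokenIDPair) (TokenID, TokenID) {
	return TokenID(pair >> 32), TokenID(pair & 0xFFFFFFFF)
}

func main() {
	// A rule2id-style map: the pair (7, 9) merges into token 42.
	rule2id := map[TokenIDPair]TokenID{newPairSketch(7, 9): 42}

	merged, ok := rule2id[newPairSketch(7, 9)]
	fmt.Println(merged, ok) // prints "42 true"

	left, right := splitPairSketch(newPairSketch(7, 9))
	fmt.Println(left, right) // prints "7 9"
}
```

Packing the pair into one integer keeps the map key a plain comparable value, so a merge rule can be looked up with a single map access.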