Documentation ¶
Index ¶
- type NormalizedByteSplit
- type OriginalByteSplit
- type PreTokenizedString
- func (p *PreTokenizedString) GetNormalizedByteSplits() []NormalizedByteSplit
- func (p *PreTokenizedString) GetOriginalByteSplits() []OriginalByteSplit
- func (p *PreTokenizedString) IntoEncoding(wordIndex int, typeID int) (*encodings.Encoding, error)
- func (p *PreTokenizedString) Normalize(normalize func(ns *normalizedstring.NormalizedString) error) error
- func (p *PreTokenizedString) Split(splitFunc SplitFunc) error
- func (p *PreTokenizedString) Splits() []Split
- func (p *PreTokenizedString) Tokenize(tokenize func(ns *normalizedstring.NormalizedString) ([]models.Token, error)) error
- type Split
- type SplitFunc
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type NormalizedByteSplit ¶
type NormalizedByteSplit struct {
	// A slice of the normalized string
	String string
	// The associated byte offsets, in the normalized referential
	Offsets strutils.ByteOffsets
	// The potential tokens
	Tokens *[]models.Token
}
type OriginalByteSplit ¶
type OriginalByteSplit struct {
	// A slice of the normalized string
	String string
	// The associated byte offsets, in the original referential
	Offsets strutils.ByteOffsets
	// The potential tokens
	Tokens *[]models.Token
}
type PreTokenizedString ¶
type PreTokenizedString struct {
// contains filtered or unexported fields
}
PreTokenizedString is in charge of splitting an underlying string, making sure everything is fine while doing so, and providing ways to normalize and tokenize these splits.
Once everything has been normalized and tokenized, the PreTokenizedString is able to build an Encoding with all the relevant offsets and word ids, relative to the original string.
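To make the offset-tracking idea concrete, here is a stdlib-only sketch (not using this package) of what a PreTokenizedString does at its core: every split of the input keeps a byte-offset range pointing back into the original string. The `byteOffsets` type below is hypothetical and merely stands in for `strutils.ByteOffsets`.

```go
package main

import "fmt"

// byteOffsets stands in for strutils.ByteOffsets: a [start, end) byte
// range into the original string. The name is hypothetical.
type byteOffsets struct {
	Start, End int
}

// splitWithOffsets splits s on ASCII spaces, returning each piece
// together with its byte offsets in the original string -- the kind of
// bookkeeping a PreTokenizedString performs for every split.
func splitWithOffsets(s string) ([]string, []byteOffsets) {
	var pieces []string
	var offsets []byteOffsets
	for i := 0; i < len(s); {
		if s[i] == ' ' {
			i++
			continue
		}
		j := i
		for j < len(s) && s[j] != ' ' {
			j++
		}
		pieces = append(pieces, s[i:j])
		offsets = append(offsets, byteOffsets{i, j})
		i = j
	}
	return pieces, offsets
}

func main() {
	pieces, offsets := splitWithOffsets("Hello, world!")
	fmt.Println(pieces)  // [Hello, world!]
	fmt.Println(offsets) // [{0 6} {7 13}]
}
```

Because each split remembers where it came from, the final Encoding can report offsets relative to the original string even after normalization has changed the text.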
func FromNormalizedString ¶
func FromNormalizedString(ns *normalizedstring.NormalizedString) *PreTokenizedString
func FromString ¶
func FromString(s string) *PreTokenizedString
func (*PreTokenizedString) GetNormalizedByteSplits ¶
func (p *PreTokenizedString) GetNormalizedByteSplits() []NormalizedByteSplit
GetNormalizedByteSplits returns a list of NormalizedByteSplit.
func (*PreTokenizedString) GetOriginalByteSplits ¶
func (p *PreTokenizedString) GetOriginalByteSplits() []OriginalByteSplit
GetOriginalByteSplits returns a list of OriginalByteSplit.
func (*PreTokenizedString) IntoEncoding ¶ added in v0.2.0
func (p *PreTokenizedString) IntoEncoding(wordIndex int, typeID int) (*encodings.Encoding, error)
IntoEncoding transforms the current PreTokenizedString into an encodings.Encoding.
If a wordIndex is provided (i.e. >= 0), every word ID in the generated Encoding is set to this value. This is generally used with pre-tokenized input, which does not need the PreTokenizedString to generate word IDs.
This method fails if any split does not have associated Tokens.
Offset indices are based on bytes (not runes).
func (*PreTokenizedString) Normalize ¶
func (p *PreTokenizedString) Normalize(
	normalize func(ns *normalizedstring.NormalizedString) error,
) error
Normalize normalizes all the splits that do not have attached Split.Tokens, using the provided normalization function.
func (*PreTokenizedString) Split ¶
func (p *PreTokenizedString) Split(splitFunc SplitFunc) error
Split splits the PreTokenizedString by providing a SplitFunc in charge of splitting each substring (normalizedstring.NormalizedString) into multiple parts.
func (*PreTokenizedString) Splits ¶
func (p *PreTokenizedString) Splits() []Split
func (*PreTokenizedString) Tokenize ¶
func (p *PreTokenizedString) Tokenize(
	tokenize func(ns *normalizedstring.NormalizedString) ([]models.Token, error),
) error
Tokenize tokenizes all the splits that do not have attached Split.Tokens, using the provided tokenization function.
type Split ¶
type Split struct {
	// The underlying normalizedstring.NormalizedString.
	// Each SubString is represented by a normalizedstring.NormalizedString,
	// and in the end we might be carrying a lot of SubStrings representing
	// various parts of the original input string.
	NormalizedString *normalizedstring.NormalizedString
	// Optional Tokens associated to this Split.
	Tokens *[]models.Token
}
Split is a wrapper for a subpart of a NormalizedString.
This Split contains the underlying NormalizedString as well as its offsets in the original string. These offsets are in the "original" referential. It also contains any Token associated to the current split.
func SplitsFromNormalizedStrings ¶
func SplitsFromNormalizedStrings(nss []*normalizedstring.NormalizedString) []Split
SplitsFromNormalizedStrings transforms a slice of NormalizedStrings into a corresponding slice of Splits, with nil tokens.
type SplitFunc ¶
type SplitFunc func(
	index int,
	ns *normalizedstring.NormalizedString,
) ([]Split, error)
SplitFunc (used by PreTokenizedString.Split) takes an index and a normalizedstring.NormalizedString, and is in charge of returning the Splits produced from it.
SplitFunc is free to modify these NormalizedStrings as needed, as long as it respects the constraint stated below.
There is only one constraint that MUST be respected: the produced normalizedstring.NormalizedString, if combined back together, must have the same "original" string as the original one given to SplitFunc. This concretely means that, for the offset tracking to work as expected, SplitFunc must produce "splits" of the original string.
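A stdlib-only sketch of this constraint (not using the package itself): a splitter that keeps delimiter runs as their own pieces, so that concatenating all pieces reproduces the original string exactly. A SplitFunc must preserve the same invariant for offset tracking to work; how delimiters are retained here is just one illustrative choice.

```go
package main

import "fmt"

// splitKeepingDelimiters splits s around the delimiter byte, but keeps
// each delimiter as its own piece. The concatenation of all pieces is
// therefore exactly the original string -- the invariant a SplitFunc
// must respect.
func splitKeepingDelimiters(s string, delim byte) []string {
	var out []string
	start := 0
	for i := 0; i < len(s); i++ {
		if s[i] == delim {
			if i > start {
				out = append(out, s[start:i])
			}
			out = append(out, s[i:i+1])
			start = i + 1
		}
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	original := "the quick fox"
	parts := splitKeepingDelimiters(original, ' ')
	fmt.Printf("%q\n", parts) // ["the" " " "quick" " " "fox"]

	// Verify the constraint: the pieces combine back into the original.
	joined := ""
	for _, p := range parts {
		joined += p
	}
	fmt.Println(joined == original) // true
}
```

A splitter that silently dropped the spaces without accounting for them would break this invariant, and downstream offsets into the original string would no longer line up.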