Documentation ¶
Index ¶
- func BertBaseUncased() *tokenizer.Tokenizer
- func BertLargeCasedWholeWordMaskingSquad() *tokenizer.Tokenizer
- func CreateAddedTokens(data []tokenizer.TokenConfig) (specialToks, toks []tokenizer.AddedToken)
- func CreateDecoder(config map[string]interface{}) (tokenizer.Decoder, error)
- func CreateModel(config *tokenizer.Config) (tokenizer.Model, error)
- func CreateNormalizer(config map[string]interface{}) (normalizer.Normalizer, error)
- func CreatePaddingParams(config map[string]interface{}) (*tokenizer.PaddingParams, error)
- func CreatePostProcessor(config map[string]interface{}) (tokenizer.PostProcessor, error)
- func CreatePreTokenizer(config map[string]interface{}) (tokenizer.PreTokenizer, error)
- func CreateTruncationParams(config map[string]interface{}) (*tokenizer.TruncationParams, error)
- func FromFile(file string) (*tokenizer.Tokenizer, error)
- func FromReader(r io.Reader) (*tokenizer.Tokenizer, error)
- func GPT2(addPrefixSpace bool, trimOffsets bool) *tokenizer.Tokenizer
- func RobertaBase(addPrefixSpace, trimOffsets bool) *tokenizer.Tokenizer
- func RobertaBaseSquad2(addPrefixSpace, trimOffsets bool) *tokenizer.Tokenizer
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func BertBaseUncased ¶
func BertBaseUncased() *tokenizer.Tokenizer

BertBaseUncased loads a pretrained BERT base uncased tokenizer.

Special tokens:
- unknown token: "[UNK]"
- sep token: "[SEP]"
- cls token: "[CLS]"
- mask token: "[MASK]"

Its normalizer is configured to clean text, lower-case, handle Chinese characters, and strip accents.
Source: "https://cdn.huggingface.co/bert-base-uncased-vocab.txt"
func BertLargeCasedWholeWordMaskingSquad ¶
func BertLargeCasedWholeWordMaskingSquad() *tokenizer.Tokenizer

BertLargeCasedWholeWordMaskingSquad loads a pretrained BERT large cased whole-word-masking tokenizer fine-tuned on the SQuAD dataset.
Source: https://cdn.huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt
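A question-answering tokenizer is typically fed a question/context pair. A sketch, building on the imports from the example above and assuming *tokenizer.Tokenizer provides an EncodePair method (an assumption; the actual pair-encoding API may differ):

tk := pretrained.BertLargeCasedWholeWordMaskingSquad()

question := "Where do gophers live?"
context := "Gophers are burrowing rodents found in North and Central America."

// Encode question and context as one pair input; for BERT-style models
// the two sequences are joined with "[SEP]".
en, err := tk.EncodePair(question, context)
if err != nil {
	log.Fatal(err)
}
fmt.Println(en.Tokens)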
func CreateAddedTokens ¶
func CreateAddedTokens(data []tokenizer.TokenConfig) (specialToks, toks []tokenizer.AddedToken)
func CreateDecoder ¶
func CreateDecoder(config map[string]interface{}) (tokenizer.Decoder, error)
func CreateModel ¶
func CreateModel(config *tokenizer.Config) (tokenizer.Model, error)
func CreateNormalizer ¶
func CreateNormalizer(config map[string]interface{}) (normalizer.Normalizer, error)
CreateNormalizer creates a Normalizer from config data.
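The config map is the decoded JSON of a normalizer object, as found under the "normalizer" key of a HuggingFace tokenizer.json file (an assumption based on how FromReader consumes such files; the exact set of recognized keys may differ). A sketch, additionally assuming encoding/json is imported:

// Decode a normalizer definition as it would appear in tokenizer.json.
var config map[string]interface{}
raw := []byte(`{"type": "Lowercase"}`)
if err := json.Unmarshal(raw, &config); err != nil {
	log.Fatal(err)
}

n, err := pretrained.CreateNormalizer(config)
if err != nil {
	log.Fatal(err)
}
_ = n // use the normalizer when assembling a tokenizer by hand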
func CreatePaddingParams ¶
func CreatePaddingParams(config map[string]interface{}) (*tokenizer.PaddingParams, error)
func CreatePostProcessor ¶
func CreatePostProcessor(config map[string]interface{}) (tokenizer.PostProcessor, error)
func CreatePreTokenizer ¶
func CreatePreTokenizer(config map[string]interface{}) (tokenizer.PreTokenizer, error)
func CreateTruncationParams ¶
func CreateTruncationParams(config map[string]interface{}) (*tokenizer.TruncationParams, error)
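CreateTruncationParams and CreatePaddingParams follow the same pattern, taking the decoded "truncation" and "padding" objects from tokenizer.json. A sketch, assuming the keys mirror that schema (an assumption; unrecognized keys or value types may be rejected):

trunc, err := pretrained.CreateTruncationParams(map[string]interface{}{
	"max_length": float64(512), // JSON numbers decode to float64
	"strategy":   "LongestFirst",
	"stride":     float64(0),
})
if err != nil {
	log.Fatal(err)
}
_ = trunc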
func FromFile ¶
func FromFile(file string) (*tokenizer.Tokenizer, error)
func FromReader ¶
func FromReader(r io.Reader) (*tokenizer.Tokenizer, error)
FromReader constructs a new Tokenizer from JSON data read from r.
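A sketch of loading a tokenizer saved in the HuggingFace tokenizer.json format, first from a file path and then from any io.Reader (assuming os is imported; the file path is a placeholder):

// From a file on disk.
tk, err := pretrained.FromFile("path/to/tokenizer.json")
if err != nil {
	log.Fatal(err)
}

// Equivalent, from an io.Reader (e.g. an *os.File or HTTP response body).
f, err := os.Open("path/to/tokenizer.json")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

tk2, err := pretrained.FromReader(f)
if err != nil {
	log.Fatal(err)
}
_, _ = tk, tk2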
func GPT2 ¶
func GPT2(addPrefixSpace bool, trimOffsets bool) *tokenizer.Tokenizer

GPT2 loads the pretrained GPT-2 (small) tokenizer from its vocab and merges files.

Params:
- addPrefixSpace: whether to add a leading space before the first word, so that the leading word is treated like any other word.
- trimOffsets: whether the post-processing step should trim offsets to avoid including whitespace.

Special tokens:
- cls token: "<s>"
- sep token: "</s>"
- pad token: "<pad>"
- space token: "Ġ"

Source:
- merges: "https://cdn.huggingface.co/gpt2-merges.txt"
- vocab: "https://cdn.huggingface.co/gpt2-vocab.json"
func RobertaBase ¶
func RobertaBase(addPrefixSpace, trimOffsets bool) *tokenizer.Tokenizer

RobertaBase loads a pretrained RoBERTa tokenizer.

Params:
- addPrefixSpace: whether to add a leading space before the first word, so that the leading word is treated like any other word.
- trimOffsets: whether the post-processing step should trim offsets to avoid including whitespace.

Special tokens:
- cls token: "<s>"
- sep token: "</s>"
- pad token: "<pad>"
- space token: "Ġ"

Source:
- vocab: "https://cdn.huggingface.co/roberta-base-vocab.json"
- merges: "https://cdn.huggingface.co/roberta-base-merges.txt"
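RobertaBase takes the same two parameters as GPT2. A round-trip sketch, assuming a Decode(ids, skipSpecialTokens) method exists on *tokenizer.Tokenizer (an assumption; the decode API may differ):

tk := pretrained.RobertaBase(true, true)

en, err := tk.EncodeSingle("gophers tokenize text")
if err != nil {
	log.Fatal(err)
}

// Decode is assumed to map ids back to text, optionally dropping
// special tokens such as "<s>" and "</s>".
text := tk.Decode(en.Ids, true)
fmt.Println(text)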
func RobertaBaseSquad2 ¶
func RobertaBaseSquad2(addPrefixSpace, trimOffsets bool) *tokenizer.Tokenizer

RobertaBaseSquad2 loads a pretrained RoBERTa tokenizer fine-tuned on SQuAD2 for question answering.

Params:
- addPrefixSpace: whether to add a leading space before the first word, so that the leading word is treated like any other word.
- trimOffsets: whether the post-processing step should trim offsets to avoid including whitespace.

Special tokens:
- cls token: "<s>"
- sep token: "</s>"
- pad token: "<pad>"
- space token: "Ġ"

Source:
- vocab: "https://cdn.huggingface.co/deepset/roberta-base-squad2/vocab.json"
- merges: "https://cdn.huggingface.co/deepset/roberta-base-squad2/merges.txt"
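For question answering, offsets matter because predicted answer spans are mapped back into the original context string; trimOffsets controls whether surrounding whitespace is included in those spans. A sketch, reusing the assumed EncodePair helper and an assumed Offsets field on Encoding:

tk := pretrained.RobertaBaseSquad2(true, true)

question := "Where do gophers live?"
context := "Gophers are burrowing rodents found in North and Central America."

en, err := tk.EncodePair(question, context)
if err != nil {
	log.Fatal(err)
}
fmt.Println(en.Offsets) // per-token character spans into the input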
Types ¶
This section is empty.