Documentation ¶
Overview ¶
Package basetokenizer provides an implementation of a very simple tokenizer that splits by whitespace (and similar characters) and punctuation symbols. Please note that abbreviations, real numbers, apostrophes and other expressions are tokenized without any linguistic criteria, so the results on URLs, emails, and the like can be poor.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BaseTokenizer ¶
type BaseTokenizer struct {
// contains filtered or unexported fields
}
BaseTokenizer is a straightforward tokenizer implementation, which splits by whitespace and punctuation characters.
func (*BaseTokenizer) Tokenize ¶
func (t *BaseTokenizer) Tokenize(text string) []tokenizers.StringOffsetsPair
Tokenize converts the input text to a slice of tokens, where each token is a whitespace-separated word, a number, or a punctuation sign. The resulting tokens preserve the alignment with the portion of the original text they belong to.
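The splitting and offset-tracking behavior described above can be sketched in plain Go. This is an illustrative re-implementation, not the package's actual code: the StringOffsetsPair struct here only mirrors the shape implied by the signature, and the real tokenizers.StringOffsetsPair may differ.

```go
package main

import (
	"fmt"
	"unicode"
)

// StringOffsetsPair mimics the result type described above: a token
// string together with its start/end offsets in the original text.
// (Illustrative only; the real type lives in the tokenizers package.)
type StringOffsetsPair struct {
	String string
	Start  int
	End    int
}

// tokenize splits text by whitespace and punctuation, emitting each
// punctuation rune as its own token and recording rune offsets.
func tokenize(text string) []StringOffsetsPair {
	var tokens []StringOffsetsPair
	runes := []rune(text)
	start := -1 // start of the current word, or -1 if no word is open
	flush := func(end int) {
		if start >= 0 {
			tokens = append(tokens, StringOffsetsPair{string(runes[start:end]), start, end})
			start = -1
		}
	}
	for i, r := range runes {
		switch {
		case unicode.IsSpace(r):
			flush(i) // whitespace ends the current word and is discarded
		case unicode.IsPunct(r) || unicode.IsSymbol(r):
			flush(i) // punctuation ends the word and becomes its own token
			tokens = append(tokens, StringOffsetsPair{string(r), i, i + 1})
		default:
			if start < 0 {
				start = i
			}
		}
	}
	flush(len(runes))
	return tokens
}

func main() {
	// Note how the naive splitting mangles the URL-like "example.com".
	for _, t := range tokenize("Visit example.com, please!") {
		fmt.Printf("%q [%d:%d]\n", t.String, t.Start, t.End)
	}
}
```

Running the sketch on "Visit example.com, please!" splits "example.com" into three tokens, which illustrates the caveat in the package overview about URLs and similar expressions.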
type Option ¶
type Option func(*BaseTokenizer)
Option allows configuring a new BaseTokenizer to your specific needs.
func RegisterSpecialWords ¶
RegisterSpecialWords is an option to register a special word.
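Option follows Go's functional-options pattern. The sketch below shows how such an option could be built and applied; the specialWords field, the variadic parameter list, and the New constructor are assumptions for illustration, not the package's documented API.

```go
package main

import "fmt"

// BaseTokenizer stands in for the package's tokenizer. The specialWords
// field is a hypothetical detail used only to make the example concrete.
type BaseTokenizer struct {
	specialWords map[string]bool
}

// Option configures a BaseTokenizer, matching the signature above.
type Option func(*BaseTokenizer)

// RegisterSpecialWords sketches how the documented option might work
// (hypothetical parameter list; the real signature may differ).
func RegisterSpecialWords(words ...string) Option {
	return func(t *BaseTokenizer) {
		for _, w := range words {
			t.specialWords[w] = true
		}
	}
}

// New constructs a tokenizer and applies the given options in order.
// (A constructor like this is typical of the pattern; its name here
// is assumed.)
func New(opts ...Option) *BaseTokenizer {
	t := &BaseTokenizer{specialWords: map[string]bool{}}
	for _, opt := range opts {
		opt(t)
	}
	return t
}

func main() {
	t := New(RegisterSpecialWords("[CLS]", "[SEP]"))
	fmt.Println(t.specialWords["[CLS]"])
}
```

Because each Option is just a function over *BaseTokenizer, callers can combine any number of options at construction time without the constructor needing a parameter for every setting.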