Documentation ¶
Index ¶
- func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
- type RegexpTokenizer
- type Token
- type TokenTester
- type Tokenizer
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]int) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
- func WithoutSuffix() TokenizerOptFunc
- type TreebankWordTokenizer
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func NewIterTokenizer ¶
func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
NewIterTokenizer is a constructor for the default iterTokenizer, configured by the given options.
Types ¶
type RegexpTokenizer ¶
type RegexpTokenizer struct {
// contains filtered or unexported fields
}
RegexpTokenizer splits a string into substrings using a regular expression.
func NewBlanklineTokenizer ¶
func NewBlanklineTokenizer() *RegexpTokenizer
NewBlanklineTokenizer is a RegexpTokenizer constructor.
This tokenizer splits on any sequence of blank lines.
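The blank-line split can be sketched with the standard library's regexp package. The pattern below is an assumption based on the description above (two newlines with optional whitespace between them), not this package's actual pattern, and blanklineSplit is a hypothetical stand-in:

```go
package main

import (
	"fmt"
	"regexp"
)

// blanklineSplit splits text wherever a blank line occurs: a newline,
// optional whitespace, then another newline. (Assumed pattern.)
func blanklineSplit(text string) []string {
	re := regexp.MustCompile(`\n\s*\n`)
	return re.Split(text, -1)
}

func main() {
	parts := blanklineSplit("First paragraph.\n\nSecond paragraph.\n \nThird.")
	fmt.Println(len(parts), parts)
}
```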
func NewRegexpTokenizer ¶
func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer
NewRegexpTokenizer is a RegexpTokenizer constructor. It takes three arguments: a pattern on which to base the tokenizer, a boolean indicating whether the pattern matches the separators (gaps) between tokens rather than the tokens themselves, and a boolean indicating whether to discard empty tokens.
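The gaps/discard semantics can be sketched with the standard library's regexp package. The tokenize function here is a hypothetical stand-in for the behavior described above, not this package's implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenize sketches the assumed RegexpTokenizer semantics: with gaps=true
// the pattern matches the separators and the text between matches becomes
// the tokens; with gaps=false the pattern matches the tokens themselves.
// discard=true drops empty tokens from the result.
func tokenize(text, pattern string, gaps, discard bool) []string {
	re := regexp.MustCompile(pattern)
	var toks []string
	if gaps {
		toks = re.Split(text, -1)
	} else {
		toks = re.FindAllString(text, -1)
	}
	if discard {
		kept := toks[:0]
		for _, t := range toks {
			if t != "" {
				kept = append(kept, t)
			}
		}
		toks = kept
	}
	return toks
}

func main() {
	fmt.Println(tokenize("a  b c", `\s+`, true, true))   // split on whitespace gaps
	fmt.Println(tokenize("a, b, c", `\w+`, false, true)) // match word tokens directly
}
```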
func NewWordBoundaryTokenizer ¶
func NewWordBoundaryTokenizer() *RegexpTokenizer
NewWordBoundaryTokenizer is a RegexpTokenizer constructor.
This tokenizer splits text into a sequence of word-like tokens.
func NewWordPunctTokenizer ¶
func NewWordPunctTokenizer() *RegexpTokenizer
NewWordPunctTokenizer is a RegexpTokenizer constructor.
This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.
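The alphabetic/non-alphabetic split described above can be sketched as alternating runs of word characters and runs of punctuation. The pattern is an assumption based on that description (it is the classic word-punct pattern), and wordPunct is a hypothetical stand-in:

```go
package main

import (
	"fmt"
	"regexp"
)

// wordPunct splits text into runs of word characters or runs of
// non-word, non-space characters. (Assumed pattern, not necessarily
// the one this package compiles.)
func wordPunct(text string) []string {
	re := regexp.MustCompile(`\w+|[^\w\s]+`)
	return re.FindAllString(text, -1)
}

func main() {
	fmt.Println(wordPunct("Hello, world!"))
}
```

Note that a contraction like "It's" is split into three tokens ("It", "'", "s") under this scheme, unlike the Treebank tokenizer below.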
func (RegexpTokenizer) Tokenize ¶
func (r RegexpTokenizer) Tokenize(text string) []string
Tokenize splits text into a slice of tokens according to its regexp pattern.
type Token ¶
type Token struct {
Text string // The token's actual content.
}
A Token represents an individual token of text such as a word or punctuation symbol.
type TokenTester ¶
type TokenizerOptFunc ¶
type TokenizerOptFunc func(*iterTokenizer)
func UsingContractions ¶
func UsingContractions(x []string) TokenizerOptFunc
Use the provided contractions.
func UsingEmoticons ¶
func UsingEmoticons(x map[string]int) TokenizerOptFunc
Use the provided map of emoticons.
func UsingIsUnsplittable ¶
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets a function that tests whether a token is unsplittable.
func UsingPrefixes ¶
func UsingPrefixes(x []string) TokenizerOptFunc
Use the provided prefixes.
func UsingSanitizer ¶
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
Use the provided sanitizer.
func UsingSpecialRE ¶
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
Use the provided special regex for unsplittable tokens.
func UsingSplitCases ¶
func UsingSplitCases(x []string) TokenizerOptFunc
Use the provided splitCases.
func UsingSuffixes ¶
func UsingSuffixes(x []string) TokenizerOptFunc
Use the provided suffixes.
func WithoutSuffix ¶ added in v0.4.0
func WithoutSuffix() TokenizerOptFunc
type TreebankWordTokenizer ¶ added in v0.3.0
type TreebankWordTokenizer struct{}
TreebankWordTokenizer splits a sentence into words.
This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.
func NewTreebankWordTokenizer ¶ added in v0.3.0
func NewTreebankWordTokenizer() *TreebankWordTokenizer
NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.
func (TreebankWordTokenizer) Tokenize ¶ added in v0.3.0
func (t TreebankWordTokenizer) Tokenize(text string) []string
Tokenize splits a sentence into a slice of words.
This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.
NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
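Step (1) above, the contraction split, can be sketched with a single substitution before whitespace tokenization. This mimics only the "n't" rule from McIntyre's Sed script; splitContractions is a hypothetical stand-in, not this package's implementation, which applies many more rules:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nt separates an "n't" contraction from its stem by inserting a space,
// e.g. "don't" -> "do n't". (Covers only this one Treebank rule.)
var nt = regexp.MustCompile(`(?i)([a-z])(n't)\b`)

func splitContractions(sentence string) []string {
	return strings.Fields(nt.ReplaceAllString(sentence, "$1 $2"))
}

func main() {
	fmt.Println(splitContractions("I don't know"))
}
```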