Documentation
Overview
Package tokenize implements functions to split strings into slices of substrings.
Constants
This section is empty.
Variables
This section is empty.
Functions
func TextToWords
func TextToWords(text string) []string
TextToWords converts the string text into a slice of words.
It does so by tokenizing text into sentences (using a port of NLTK's punkt tokenizer; see https://github.com/neurosnap/sentences) and then tokenizing the sentences into words via TreebankWordTokenizer.
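A minimal usage sketch (the input string is borrowed from the examples below; the expected output is an assumption based on the Punkt and Treebank behavior described here, not verified package output):

words := TextToWords("They'll save and invest more. Thanks!")
fmt.Println(words)
Expected output: [They 'll save and invest more . Thanks !]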
Types
type PragmaticSegmenter
type PragmaticSegmenter struct {
// contains filtered or unexported fields
}
PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.
This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).
func NewPragmaticSegmenter
func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)
NewPragmaticSegmenter creates a new PragmaticSegmenter according to the specified language. If the given language is not supported, an error will be returned.
Languages are specified by their two-character ISO 639-1 code. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)
func (*PragmaticSegmenter) Tokenize
func (p *PragmaticSegmenter) Tokenize(text string) []string
Tokenize splits text into sentences.
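A usage sketch (the error handling is illustrative, and the expected output assumes sentence-level splitting comparable to PunktSentenceTokenizer below; treat it as an assumption):

seg, err := NewPragmaticSegmenter("en")
if err != nil {
	log.Fatal(err)
}
fmt.Println(seg.Tokenize("They'll save and invest more. Thanks!"))
Expected output: [They'll save and invest more. Thanks!]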
type ProseTokenizer
type ProseTokenizer interface {
	Tokenize(text string) []string
}
ProseTokenizer is the interface implemented by an object that takes a string and returns a slice of substrings.
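Each tokenizer in this package satisfies ProseTokenizer, so they can be used interchangeably. A sketch (the helper function tokenCount is hypothetical, not part of the package):

func tokenCount(t ProseTokenizer, text string) int {
	// Delegate to whichever tokenizer was supplied and count its tokens.
	return len(t.Tokenize(text))
}
Per the NewWordBoundaryTokenizer example below, tokenCount(NewWordBoundaryTokenizer(), "They'll save and invest more.") would return 5.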
type PunktSentenceTokenizer
type PunktSentenceTokenizer struct {
// contains filtered or unexported fields
}
PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).
func NewPunktSentenceTokenizer
func NewPunktSentenceTokenizer() *PunktSentenceTokenizer
NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.
func (PunktSentenceTokenizer) Tokenize
func (p PunktSentenceTokenizer) Tokenize(text string) []string
Tokenize splits text into sentences.
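A usage sketch (the expected output mirrors the PragmaticSegmenter sketch above and is an assumption, not verified package output):

t := NewPunktSentenceTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more. Thanks!"))
Expected output: [They'll save and invest more. Thanks!]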
type RegexpTokenizer
type RegexpTokenizer struct {
// contains filtered or unexported fields
}
RegexpTokenizer splits a string into substrings using a regular expression.
func NewBlanklineTokenizer
func NewBlanklineTokenizer() *RegexpTokenizer
NewBlanklineTokenizer is a RegexpTokenizer constructor.
This tokenizer splits on any sequence of blank lines.
Example
t := NewBlanklineTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more.\n\nThanks!"))
Output: [They'll save and invest more. Thanks!]
func NewRegexpTokenizer
func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer
NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern on which to base the tokenizer, a boolean indicating whether the pattern matches the separators between tokens (rather than the tokens themselves), and a boolean indicating whether to discard empty tokens.
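For illustration, a sketch of a custom comma tokenizer (the pattern and expected output are assumptions based on the semantics described above, not taken from the package's documentation):

t := NewRegexpTokenizer(`,\s*`, true, true)
fmt.Println(t.Tokenize("one, two, three"))
Expected output: [one two three]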
func NewWordBoundaryTokenizer
func NewWordBoundaryTokenizer() *RegexpTokenizer
NewWordBoundaryTokenizer is a RegexpTokenizer constructor.
This tokenizer splits text into a sequence of word-like tokens.
Example
t := NewWordBoundaryTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))
Output: [They'll save and invest more]
func NewWordPunctTokenizer
func NewWordPunctTokenizer() *RegexpTokenizer
NewWordPunctTokenizer is a RegexpTokenizer constructor.
This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.
Example
t := NewWordPunctTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))
Output: [They ' ll save and invest more .]
func (RegexpTokenizer) Tokenize
func (r RegexpTokenizer) Tokenize(text string) []string
Tokenize splits text into a slice of tokens according to its regexp pattern.
type TreebankWordTokenizer
type TreebankWordTokenizer struct{}
TreebankWordTokenizer splits a sentence into words.
This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.
func NewTreebankWordTokenizer
func NewTreebankWordTokenizer() *TreebankWordTokenizer
NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.
Example
t := NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("They'll save and invest more."))
Output: [They 'll save and invest more .]
func (TreebankWordTokenizer) Tokenize
func (t TreebankWordTokenizer) Tokenize(text string) []string
Tokenize splits a sentence into a slice of words.
This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.
NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
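A brief illustration of steps (1), (2), and (4) (the expected output is inferred from the documented behavior above, so treat the exact tokens as assumptions):

t := NewTreebankWordTokenizer()
fmt.Println(t.Tokenize("Don't stop, John."))
Expected output: [Do n't stop , John .]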