tokenize

package
v0.8.0

This package is not in the latest version of its module.
Published: Nov 7, 2023 License: MIT Imports: 5 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewIterTokenizer

func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer

NewIterTokenizer constructs a default iterTokenizer, optionally customized by the given TokenizerOptFunc options.

Types

type RegexpTokenizer

type RegexpTokenizer struct {
	// contains filtered or unexported fields
}

RegexpTokenizer splits a string into substrings using a regular expression.

func NewBlanklineTokenizer

func NewBlanklineTokenizer() *RegexpTokenizer

NewBlanklineTokenizer is a RegexpTokenizer constructor.

This tokenizer splits on any sequence of blank lines.

func NewRegexpTokenizer

func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer

NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern to base the tokenizer on, a boolean indicating whether the pattern matches the separators (gaps) between tokens rather than the tokens themselves, and a boolean indicating whether to discard empty tokens.

func NewWordBoundaryTokenizer

func NewWordBoundaryTokenizer() *RegexpTokenizer

NewWordBoundaryTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of word-like tokens.

func NewWordPunctTokenizer

func NewWordPunctTokenizer() *RegexpTokenizer

NewWordPunctTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.

func (RegexpTokenizer) Tokenize

func (r RegexpTokenizer) Tokenize(text string) []string

Tokenize splits text into a slice of tokens according to its regexp pattern.

type Token

type Token struct {
	Text string // The token's actual content.
}

A Token represents an individual token of text such as a word or punctuation symbol.

type TokenTester

type TokenTester func(string) bool

type Tokenizer

type Tokenizer interface {
	Tokenize(string) []*Token
}

type TokenizerOptFunc

type TokenizerOptFunc func(*iterTokenizer)

func UsingContractions

func UsingContractions(x []string) TokenizerOptFunc

Use the provided contractions.

func UsingEmoticons

func UsingEmoticons(x map[string]int) TokenizerOptFunc

Use the provided map of emoticons.

func UsingIsUnsplittable

func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc

UsingIsUnsplittable sets a function that tests whether a token is unsplittable.

func UsingPrefixes

func UsingPrefixes(x []string) TokenizerOptFunc

Use the provided prefixes.

func UsingSanitizer

func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc

Use the provided sanitizer.

func UsingSpecialRE

func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc

Use the provided special regex for unsplittable tokens.

func UsingSplitCases

func UsingSplitCases(x []string) TokenizerOptFunc

Use the provided splitCases.

func UsingSuffixes

func UsingSuffixes(x []string) TokenizerOptFunc

Use the provided suffixes.

func WithoutSuffix added in v0.4.0

func WithoutSuffix() TokenizerOptFunc

type TreebankWordTokenizer added in v0.3.0

type TreebankWordTokenizer struct {
}

TreebankWordTokenizer splits a sentence into words.

This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.

func NewTreebankWordTokenizer added in v0.3.0

func NewTreebankWordTokenizer() *TreebankWordTokenizer

NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.

func (TreebankWordTokenizer) Tokenize added in v0.3.0

func (t TreebankWordTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.

NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
