tokenize

package
v0.8.0

This package is not in the latest version of its module.
Published: Nov 7, 2023 License: MIT Imports: 5 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewIterTokenizer

func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer

NewIterTokenizer constructs a default iterTokenizer, optionally customized by the given TokenizerOptFunc options.

Types

type RegexpTokenizer

type RegexpTokenizer struct {
	// contains filtered or unexported fields
}

RegexpTokenizer splits a string into substrings using a regular expression.

func NewBlanklineTokenizer

func NewBlanklineTokenizer() *RegexpTokenizer

NewBlanklineTokenizer is a RegexpTokenizer constructor.

This tokenizer splits on any sequence of blank lines.

func NewRegexpTokenizer

func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer

NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern to base the tokenizer on, a boolean indicating whether the pattern matches the separators (gaps) between tokens rather than the tokens themselves, and a boolean indicating whether to discard empty tokens.

func NewWordBoundaryTokenizer

func NewWordBoundaryTokenizer() *RegexpTokenizer

NewWordBoundaryTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of word-like tokens.

func NewWordPunctTokenizer

func NewWordPunctTokenizer() *RegexpTokenizer

NewWordPunctTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.

func (RegexpTokenizer) Tokenize

func (r RegexpTokenizer) Tokenize(text string) []string

Tokenize splits text into a slice of tokens according to its regexp pattern.

type Token

type Token struct {
	Text string // The token's actual content.
}

A Token represents an individual token of text such as a word or punctuation symbol.

type TokenTester

type TokenTester func(string) bool

type Tokenizer

type Tokenizer interface {
	Tokenize(string) []*Token
}

type TokenizerOptFunc

type TokenizerOptFunc func(*iterTokenizer)

func UsingContractions

func UsingContractions(x []string) TokenizerOptFunc

Use the provided contractions.

func UsingEmoticons

func UsingEmoticons(x map[string]int) TokenizerOptFunc

Use the provided map of emoticons.

func UsingIsUnsplittable

func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc

UsingIsUnsplittable sets a function that tests whether a token is unsplittable.

func UsingPrefixes

func UsingPrefixes(x []string) TokenizerOptFunc

Use the provided prefixes.

func UsingSanitizer

func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc

Use the provided sanitizer.

func UsingSpecialRE

func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc

Use the provided special regex for unsplittable tokens.

func UsingSplitCases

func UsingSplitCases(x []string) TokenizerOptFunc

Use the provided splitCases.

func UsingSuffixes

func UsingSuffixes(x []string) TokenizerOptFunc

Use the provided suffixes.

func WithoutSuffix added in v0.4.0

func WithoutSuffix() TokenizerOptFunc

type TreebankWordTokenizer added in v0.3.0

type TreebankWordTokenizer struct {
}

TreebankWordTokenizer splits a sentence into words.

This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.

func NewTreebankWordTokenizer added in v0.3.0

func NewTreebankWordTokenizer() *TreebankWordTokenizer

NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.

func (TreebankWordTokenizer) Tokenize added in v0.3.0

func (t TreebankWordTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.

NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
