tokenizer

package
v1.1.7 Latest
Warning

This package is not in the latest version of its module.

Published: Aug 25, 2024 License: Apache-2.0 Imports: 7 Imported by: 4

Documentation


Constants

This section is empty.

Variables

var IdeographRegexp = regexp.MustCompile(`\p{Han}|\p{Hangul}|\p{Hiragana}|\p{Katakana}`)

Functions

func MakeToken

func MakeToken(input []byte) *analysis.Token

func MakeTokenStream

func MakeTokenStream(input []byte) analysis.TokenStream

Types

type CharacterTokenizer

type CharacterTokenizer struct {
	// contains filtered or unexported fields
}

func NewCharacterTokenizer

func NewCharacterTokenizer(f IsTokenRune) *CharacterTokenizer

func NewLetterTokenizer

func NewLetterTokenizer() *CharacterTokenizer

func NewWhitespaceTokenizer

func NewWhitespaceTokenizer() *CharacterTokenizer

func (*CharacterTokenizer) Tokenize

func (c *CharacterTokenizer) Tokenize(input []byte) analysis.TokenStream
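The page does not describe how CharacterTokenizer uses its IsTokenRune predicate. A minimal sketch, assuming it emits one token per maximal run of runes accepted by the predicate, could look like this. The token struct, tokenizeRuns, isLetter, and isNotSpace are hypothetical names for this example; the real analysis.Token and the real constructors may differ.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isTokenRune has the same shape as the documented IsTokenRune type.
type isTokenRune func(r rune) bool

// token is a simplified, hypothetical stand-in for analysis.Token.
type token struct {
	Term       string
	Start, End int // byte offsets into the input
}

// tokenizeRuns emits one token per maximal run of runes accepted by f.
// This is an assumed behavior, not the package's confirmed implementation.
func tokenizeRuns(input []byte, f isTokenRune) []token {
	var out []token
	start := -1 // -1 means we are not currently inside a token
	for i := 0; i < len(input); {
		r, size := utf8.DecodeRune(input[i:])
		if f(r) {
			if start < 0 {
				start = i
			}
		} else if start >= 0 {
			out = append(out, token{string(input[start:i]), start, i})
			start = -1
		}
		i += size
	}
	if start >= 0 {
		out = append(out, token{string(input[start:]), start, len(input)})
	}
	return out
}

// isLetter and isNotSpace are example predicates, loosely analogous to what
// a letter tokenizer and a whitespace tokenizer might use.
func isLetter(r rune) bool   { return unicode.IsLetter(r) }
func isNotSpace(r rune) bool { return !unicode.IsSpace(r) }

func main() {
	in := []byte("don't stop, 42!")
	fmt.Println(tokenizeRuns(in, isLetter))
	fmt.Println(tokenizeRuns(in, isNotSpace))
}
```

With the letter predicate, the apostrophe and digits act as separators. With the non-space predicate, punctuation stays attached to the neighboring word.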

type ExceptionsTokenizer

type ExceptionsTokenizer struct {
	// contains filtered or unexported fields
}

ExceptionsTokenizer implements a Tokenizer that extracts the pieces of the input matched by a regular expression, delegates the rest to another tokenizer, and then inserts the extracted pieces back into the token stream. Use it to preserve sequences that a regular tokenizer would alter or remove.

Its constructor takes the following arguments:

"exceptions" ([]string): one or more Go regular expressions matching the sequence to preserve. Multiple expressions are combined with "|".

"tokenizer" (string): the name of the tokenizer processing the data not matched by "exceptions".
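The extract, delegate, and reinsert flow described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: tokenizeFunc, tokenizeWithExceptions, lettersOnly, and emailPattern are hypothetical names, and plain strings stand in for analysis.Token values.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
)

// tokenizeFunc is a hypothetical stand-in for the delegate tokenizer.
type tokenizeFunc func(s string) []string

// tokenizeWithExceptions keeps every match of exception as a single token
// and passes the text between matches to rest, preserving input order.
func tokenizeWithExceptions(input string, exception *regexp.Regexp, rest tokenizeFunc) []string {
	var out []string
	last := 0
	for _, m := range exception.FindAllStringIndex(input, -1) {
		out = append(out, rest(input[last:m[0]])...) // text before the match
		out = append(out, input[m[0]:m[1]])          // the preserved sequence
		last = m[1]
	}
	return append(out, rest(input[last:])...) // text after the final match
}

// lettersOnly is an example delegate that splits on anything but letters.
func lettersOnly(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool { return !unicode.IsLetter(r) })
}

// emailPattern is a deliberately loose example exception pattern.
var emailPattern = regexp.MustCompile(`\S+@\S+`)

func main() {
	in := "mail bob@example.com or call 555"
	fmt.Println(lettersOnly(in))
	fmt.Println(tokenizeWithExceptions(in, emailPattern, lettersOnly))
}
```

On its own, the delegate breaks the address into "bob", "example", and "com". Routing the input through the exception pattern first keeps "bob@example.com" intact.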

func NewExceptionsTokenizer

func NewExceptionsTokenizer(exception *regexp.Regexp, remaining analysis.Tokenizer) *ExceptionsTokenizer

func NewWebTokenizer

func NewWebTokenizer() *ExceptionsTokenizer

func (*ExceptionsTokenizer) Tokenize

func (t *ExceptionsTokenizer) Tokenize(input []byte) analysis.TokenStream

type IsTokenRune

type IsTokenRune func(r rune) bool

type RegexpTokenizer

type RegexpTokenizer struct {
	// contains filtered or unexported fields
}

func NewRegexpTokenizer

func NewRegexpTokenizer(r *regexp.Regexp) *RegexpTokenizer

func (*RegexpTokenizer) Tokenize

func (rt *RegexpTokenizer) Tokenize(input []byte) analysis.TokenStream
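The page gives no description for RegexpTokenizer. A minimal sketch, assuming each non-empty match of the supplied expression becomes one token, is shown below. The span struct, tokenizeMatches, and wordPattern are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// span is a simplified, hypothetical stand-in for analysis.Token.
type span struct {
	Term       string
	Start, End int // byte offsets into the input
}

// tokenizeMatches turns every non-empty match of r into one token.
// This is an assumed behavior, not the package's confirmed implementation.
func tokenizeMatches(input []byte, r *regexp.Regexp) []span {
	var out []span
	for _, m := range r.FindAllIndex(input, -1) {
		if m[1] > m[0] { // skip empty matches
			out = append(out, span{string(input[m[0]:m[1]]), m[0], m[1]})
		}
	}
	return out
}

// wordPattern is an example expression: runs of ASCII word characters.
var wordPattern = regexp.MustCompile(`\w+`)

func main() {
	fmt.Println(tokenizeMatches([]byte("v1.2 rocks"), wordPattern))
}
```

In this model the expression describes the tokens themselves, not the separators between them, so anything the pattern does not match is dropped.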

type SingleTokenTokenizer

type SingleTokenTokenizer struct{}

func NewSingleTokenTokenizer

func NewSingleTokenTokenizer() *SingleTokenTokenizer

func (*SingleTokenTokenizer) Tokenize

func (t *SingleTokenTokenizer) Tokenize(input []byte) analysis.TokenStream

type UnicodeTokenizer

type UnicodeTokenizer struct{}

func NewUnicodeTokenizer

func NewUnicodeTokenizer() *UnicodeTokenizer

func (*UnicodeTokenizer) Tokenize

func (rt *UnicodeTokenizer) Tokenize(input []byte) analysis.TokenStream
