basetokenizer

Published: Aug 6, 2024 License: BSD-2-Clause Imports: 2 Imported by: 0

Documentation

Overview

Package basetokenizer provides an implementation of a very simple tokenizer that splits on whitespace (and similar characters) and punctuation symbols. Please note that abbreviations, real numbers, apostrophes, and other such expressions are tokenized without any linguistic criteria. As a result, it mangles URLs, e-mail addresses, and the like.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BaseTokenizer

type BaseTokenizer struct {
	// contains filtered or unexported fields
}

BaseTokenizer is a straightforward tokenizer implementation that splits on whitespace and punctuation characters.

func New

func New(opts ...Option) *BaseTokenizer

New returns a new base tokenizer ready to use.

func (*BaseTokenizer) Tokenize

func (t *BaseTokenizer) Tokenize(text string) []tokenizers.StringOffsetsPair

Tokenize converts the input text into a slice of tokens, where each token is a whitespace-separated word, a number, or a punctuation sign. The resulting tokens preserve their alignment with the portions of the original text they belong to.
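The splitting rule described above can be sketched as a small self-contained function. Note that `tokenize` and the `span` type below are illustrative stand-ins, not the package's actual implementation: the real method returns `[]tokenizers.StringOffsetsPair`, whose exact field layout is not shown here.

```go
package main

import (
	"fmt"
	"unicode"
)

// span pairs a token with its [start, end) rune offsets in the
// original text, mirroring the alignment that Tokenize preserves.
// (Hypothetical type standing in for tokenizers.StringOffsetsPair.)
type span struct {
	Token      string
	Start, End int
}

// tokenize splits text on whitespace and emits each punctuation
// rune as its own token, keeping offsets into the input.
func tokenize(text string) []span {
	var out []span
	runes := []rune(text)
	start := -1 // start of the current word, or -1 if none
	flush := func(end int) {
		if start >= 0 {
			out = append(out, span{string(runes[start:end]), start, end})
			start = -1
		}
	}
	for i, r := range runes {
		switch {
		case unicode.IsSpace(r):
			flush(i)
		case unicode.IsPunct(r):
			flush(i)
			out = append(out, span{string(r), i, i + 1})
		default:
			if start < 0 {
				start = i
			}
		}
	}
	flush(len(runes))
	return out
}

func main() {
	for _, s := range tokenize("Hello, world!") {
		fmt.Printf("%q [%d,%d)\n", s.Token, s.Start, s.End)
	}
}
```

This also illustrates the caveat in the overview: a URL such as "go.dev/doc" would be broken apart at every dot and slash, since the splitting is purely character-class based.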

type Option

type Option func(*BaseTokenizer)

Option allows a new BaseTokenizer to be configured according to your specific needs.

func RegisterSpecialWords

func RegisterSpecialWords(specialWords ...string) Option

RegisterSpecialWords is an option to register one or more special words.
