basetokenizer

Published: Aug 6, 2024 License: BSD-2-Clause Imports: 2 Imported by: 0

Documentation

Overview

Package basetokenizer provides an implementation of a very simple tokenizer that splits on whitespace (and similar characters) and punctuation symbols. Please note that abbreviations, real numbers, apostrophes, and other such expressions are tokenized without any linguistic criteria. As a result, it mangles URLs, e-mail addresses, and the like.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BaseTokenizer

type BaseTokenizer struct {
	// contains filtered or unexported fields
}

BaseTokenizer is a straightforward tokenizer implementation that splits on whitespace and punctuation characters.

func New

func New(opts ...Option) *BaseTokenizer

New returns a new base tokenizer ready to use.

func (*BaseTokenizer) Tokenize

func (t *BaseTokenizer) Tokenize(text string) []tokenizers.StringOffsetsPair

Tokenize converts the input text into a slice of tokens, where each token is a whitespace-separated word, a number, or a punctuation sign. The resulting tokens preserve their alignment with the portions of the original text they belong to.
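The splitting rule described above can be sketched as a small self-contained function. Note that `tokenize` and the `span` type below are illustrative stand-ins, not the package's actual implementation: the real method returns `[]tokenizers.StringOffsetsPair`, whose exact field layout is not shown here.

```go
package main

import (
	"fmt"
	"unicode"
)

// span pairs a token with its [start, end) rune offsets in the
// original text, mirroring the alignment that Tokenize preserves.
// (Hypothetical type standing in for tokenizers.StringOffsetsPair.)
type span struct {
	Token      string
	Start, End int
}

// tokenize splits text on whitespace and emits each punctuation
// rune as its own token, keeping offsets into the input.
func tokenize(text string) []span {
	var out []span
	runes := []rune(text)
	start := -1 // start of the current word, or -1 if none
	flush := func(end int) {
		if start >= 0 {
			out = append(out, span{string(runes[start:end]), start, end})
			start = -1
		}
	}
	for i, r := range runes {
		switch {
		case unicode.IsSpace(r):
			flush(i)
		case unicode.IsPunct(r):
			flush(i)
			out = append(out, span{string(r), i, i + 1})
		default:
			if start < 0 {
				start = i
			}
		}
	}
	flush(len(runes))
	return out
}

func main() {
	for _, s := range tokenize("Hello, world!") {
		fmt.Printf("%q [%d,%d)\n", s.Token, s.Start, s.End)
	}
}
```

This also illustrates the caveat in the overview: a URL such as "go.dev/doc" would be broken apart at every dot and slash, since the splitting is purely character-class based.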

type Option

type Option func(*BaseTokenizer)

Option allows a new BaseTokenizer to be configured according to your specific needs.

func RegisterSpecialWords

func RegisterSpecialWords(specialWords ...string) Option

RegisterSpecialWords is an option to register one or more special words.
