package tokenizer

v0.0.6 · Published: Jul 19, 2020 · License: Apache-2.0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DictTokenizer

type DictTokenizer struct {
	WDict map[string]dicttypes.Word
}

DictTokenizer tokenizes Chinese text using a dictionary.

func (DictTokenizer) Tokenize

func (tokenizer DictTokenizer) Tokenize(fragment string) []TextToken

Tokenize splits a Chinese text string into words and other terms found in the dictionary. Substrings that are not in the dictionary are returned as individual characters. The method compares the results of left-to-right and right-to-left greedy matching and returns the segmentation with the fewest tokens.
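The bidirectional greedy comparison can be sketched as follows. This is a minimal, self-contained illustration of the technique described above, not the package's actual implementation; greedyLTR and greedyRTL are hypothetical helpers, and a plain string set stands in for WDict.

package main

import "fmt"

// greedyLTR sketches left-to-right greedy matching: at each position it
// takes the longest dictionary term that matches, falling back to a
// single character when nothing matches.
func greedyLTR(text []rune, dict map[string]bool) []string {
	var tokens []string
	for i := 0; i < len(text); {
		match := string(text[i]) // fallback: single character
		for j := len(text); j > i; j-- {
			if dict[string(text[i:j])] {
				match = string(text[i:j])
				break
			}
		}
		tokens = append(tokens, match)
		i += len([]rune(match))
	}
	return tokens
}

// greedyRTL does the same from the end of the string backwards, taking
// the longest dictionary term that ends at the current position.
func greedyRTL(text []rune, dict map[string]bool) []string {
	var tokens []string
	for i := len(text); i > 0; {
		match := string(text[i-1]) // fallback: single character
		for j := 0; j < i; j++ {
			if dict[string(text[j:i])] {
				match = string(text[j:i])
				break
			}
		}
		tokens = append([]string{match}, tokens...)
		i -= len([]rune(match))
	}
	return tokens
}

func main() {
	dict := map[string]bool{"中国": true, "国人": true, "中": true}
	text := []rune("中国人")
	ltr := greedyLTR(text, dict)
	rtl := greedyRTL(text, dict)
	// Keep whichever direction produced fewer tokens.
	if len(rtl) < len(ltr) {
		fmt.Println(rtl)
	} else {
		fmt.Println(ltr)
	}
}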

type TextToken

type TextToken struct {
	Token     string
	DictEntry dicttypes.Word
	Senses    []dicttypes.WordSense
}

A TextToken contains the results of tokenizing a string: the matched token text, its dictionary entry, and the associated word senses.
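A minimal usage sketch follows. The import path github.com/alexamies/chinesenotes-go is an assumption (it is not stated on this page), and the zero-value Word entries are placeholders; in practice WDict would be populated from a full dictionary file.

package main

import (
	"fmt"

	"github.com/alexamies/chinesenotes-go/dicttypes"
	"github.com/alexamies/chinesenotes-go/tokenizer"
)

func main() {
	// Toy dictionary with zero-value Word entries, for illustration
	// only; a real WDict carries full dictionary data.
	wdict := map[string]dicttypes.Word{
		"中国": {},
		"人":  {},
	}
	t := tokenizer.DictTokenizer{WDict: wdict}
	for _, token := range t.Tokenize("中国人") {
		fmt.Printf("%s (%d senses)\n", token.Token, len(token.Senses))
	}
}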

type Tokenizer

type Tokenizer interface {
	Tokenize(fragment string) []TextToken
}

A Tokenizer tokenizes Chinese text.
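Because DictTokenizer satisfies this interface, downstream code can depend on Tokenizer rather than the concrete type. A hedged sketch, where countTokens is a hypothetical helper and not part of the package:

// countTokens depends only on the Tokenizer interface, so any
// implementation, including DictTokenizer, can be passed in.
func countTokens(t tokenizer.Tokenizer, text string) int {
	return len(t.Tokenize(text))
}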
