package tokenizer

v0.0.15
Warning: This package is not in the latest version of its module.

Published: Sep 7, 2020 License: Apache-2.0 Imports: 4 Imported by: 5

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DictTokenizer

type DictTokenizer struct {
	WDict map[string]dicttypes.Word
}

DictTokenizer tokenizes Chinese text using a dictionary.

func (DictTokenizer) Tokenize

func (tokenizer DictTokenizer) Tokenize(fragment string) []TextToken

Tokenizes a Chinese text string into words and other terms found in the dictionary. If a term is not found in the dictionary, its individual characters are returned instead. The method compares left-to-right and right-to-left greedy matching and returns the result with the fewest tokens.
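The greedy-comparison strategy described above can be sketched as follows. This is a simplified, self-contained illustration, not the package's actual implementation: the dictionary is reduced to a `map[string]bool` (the real `DictTokenizer` matches against `map[string]dicttypes.Word`), and `maxLen` is a hypothetical cap on candidate length.

```go
package main

import "fmt"

// greedyForward segments runes left to right, always taking the longest
// dictionary match and falling back to a single character.
func greedyForward(runes []rune, dict map[string]bool, maxLen int) []string {
	var tokens []string
	for i := 0; i < len(runes); {
		n := maxLen
		if i+n > len(runes) {
			n = len(runes) - i
		}
		// Shrink the candidate until it matches or one character remains.
		for ; n > 1; n-- {
			if dict[string(runes[i:i+n])] {
				break
			}
		}
		tokens = append(tokens, string(runes[i:i+n]))
		i += n
	}
	return tokens
}

// greedyBackward performs the same longest-match scan right to left.
func greedyBackward(runes []rune, dict map[string]bool, maxLen int) []string {
	var tokens []string
	for i := len(runes); i > 0; {
		n := maxLen
		if n > i {
			n = i
		}
		for ; n > 1; n-- {
			if dict[string(runes[i-n:i])] {
				break
			}
		}
		tokens = append([]string{string(runes[i-n : i])}, tokens...)
		i -= n
	}
	return tokens
}

// tokenize runs both greedy scans and keeps whichever yields fewer tokens.
func tokenize(fragment string, dict map[string]bool, maxLen int) []string {
	runes := []rune(fragment)
	fwd := greedyForward(runes, dict, maxLen)
	bwd := greedyBackward(runes, dict, maxLen)
	if len(bwd) < len(fwd) {
		return bwd
	}
	return fwd
}

func main() {
	dict := map[string]bool{"中国": true, "人": true, "中国人": true}
	fmt.Println(tokenize("中国人", dict, 4))
}
```

Comparing both scan directions helps because forward and backward greedy matching can segment ambiguous strings differently; preferring the segmentation with fewer tokens favors longer dictionary words.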

type TextToken

type TextToken struct {
	Token     string
	DictEntry dicttypes.Word
	Senses    []dicttypes.WordSense
}

A TextToken contains the results of tokenizing a string.

type Tokenizer

type Tokenizer interface {
	Tokenize(fragment string) []TextToken
}

Tokenizer is an interface for tokenizing Chinese text.
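Because Tokenizer is an interface, any type with a matching Tokenize method can stand in for DictTokenizer. The sketch below uses a trimmed-down TextToken (the real struct also carries DictEntry and Senses from dicttypes) and a hypothetical CharTokenizer that emits one token per character:

```go
package main

import "fmt"

// Simplified stand-in for the package's TextToken struct.
type TextToken struct {
	Token string
}

// Tokenizer mirrors the package's interface.
type Tokenizer interface {
	Tokenize(fragment string) []TextToken
}

// CharTokenizer is a hypothetical implementation that emits one token
// per character; it satisfies Tokenizer just as DictTokenizer does.
type CharTokenizer struct{}

func (CharTokenizer) Tokenize(fragment string) []TextToken {
	var tokens []TextToken
	for _, r := range fragment {
		tokens = append(tokens, TextToken{Token: string(r)})
	}
	return tokens
}

func main() {
	var t Tokenizer = CharTokenizer{}
	for _, tok := range t.Tokenize("你好") {
		fmt.Println(tok.Token)
	}
}
```

Accepting the interface rather than the concrete DictTokenizer lets callers swap in alternative segmentation strategies or test doubles.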
