tokenizer

package

Version: v0.0.134 (latest) | Published: Aug 6, 2022 | License: Apache-2.0 | Imports: 4 | Imported by: 5

Repository

github.com/alexamies/chinesenotes-go

Documentation ¶

Overview ¶

Package tokenizer splits Chinese text into multi-character terms and maps them to their corresponding English equivalents.

Index ¶

type DictTokenizer
- func NewDictTokenizer[V any](wDict map[string]V) *DictTokenizer[V]
- func (tokenizer DictTokenizer[V]) Tokenize(text string) []TextToken
type TextSegment
- func Segment(text string) []TextSegment
type TextToken
type Tokenizer

Examples ¶

Segment

Types ¶

type DictTokenizer ¶

type DictTokenizer[V any] struct {
	// contains filtered or unexported fields
}

DictTokenizer tokenizes Chinese text using a dictionary.

func NewDictTokenizer ¶ added in v0.0.101

func NewDictTokenizer[V any](wDict map[string]V) *DictTokenizer[V]

func (DictTokenizer[V]) Tokenize ¶

func (tokenizer DictTokenizer[V]) Tokenize(text string) []TextToken

Tokenize splits a Chinese text string into words and other terms found in the dictionary. Characters that are not part of any dictionary term are returned as individual tokens. It runs both a left-to-right and a right-to-left greedy match and keeps the result with the fewest tokens. Long text is handled by first breaking the string into segments delimited by punctuation or non-Chinese characters.
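The comparison of the two greedy scans can be illustrated with a minimal, self-contained sketch. It is not the package's implementation: the names greedyLTR, greedyRTL, and tokenizeSketch are hypothetical, the dictionary is reduced to a map[string]bool, the tie-breaking rule is an assumption, and the preliminary split into segments is left out.

```go
package main

import "fmt"

// greedyLTR scans left to right, always taking the longest dictionary
// match that starts at the current position. Characters with no match
// become single-character tokens.
func greedyLTR(dict map[string]bool, chars []rune) []string {
	var tokens []string
	for i := 0; i < len(chars); {
		end := i + 1 // fall back to a single character
		for j := len(chars); j > i+1; j-- {
			if dict[string(chars[i:j])] {
				end = j
				break
			}
		}
		tokens = append(tokens, string(chars[i:end]))
		i = end
	}
	return tokens
}

// greedyRTL scans right to left, always taking the longest dictionary
// match that ends at the current position.
func greedyRTL(dict map[string]bool, chars []rune) []string {
	var reversed []string
	for j := len(chars); j > 0; {
		start := j - 1 // fall back to a single character
		for i := 0; i < j-1; i++ {
			if dict[string(chars[i:j])] {
				start = i
				break
			}
		}
		reversed = append(reversed, string(chars[start:j]))
		j = start
	}
	// Tokens were collected back to front; restore reading order.
	tokens := make([]string, len(reversed))
	for k, t := range reversed {
		tokens[len(reversed)-1-k] = t
	}
	return tokens
}

// tokenizeSketch runs both scans and keeps the one with fewer tokens.
// Ties go to the left-to-right result, an arbitrary choice made here.
func tokenizeSketch(dict map[string]bool, text string) []string {
	chars := []rune(text)
	ltr := greedyLTR(dict, chars)
	rtl := greedyRTL(dict, chars)
	if len(rtl) < len(ltr) {
		return rtl
	}
	return ltr
}

func main() {
	// Hypothetical four-entry dictionary.
	dict := map[string]bool{"研究": true, "研究生": true, "生命科学": true, "科学": true}
	fmt.Println(tokenizeSketch(dict, "研究生命科学")) // prints [研究 生命科学]
}
```

In this sample the left-to-right scan commits early to 研究生 and ends up with three tokens (研究生, 命, 科学), while the right-to-left scan finds 生命科学 first and needs only two, so the second result is kept.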

type TextSegment ¶ added in v0.0.28

type TextSegment struct {

	// The text contained in the segment
	Text string

	// False if punctuation or non-Chinese text
	Chinese bool
}

TextSegment is a text segment that contains either Chinese or non-Chinese text.

func Segment ¶ added in v0.0.28

func Segment(text string) []TextSegment

Segment splits a text document into segments of Chinese text separated by either punctuation or non-Chinese text.

Example ¶

A basic example of the function Segment

segments := Segment("你好 means hello")
fmt.Printf("Text: %s, Chinese: %t\n", segments[0].Text, segments[0].Chinese)
fmt.Printf("Text: %s, Chinese: %t\n", strings.TrimSpace(segments[1].Text), segments[1].Chinese)

Output:

Text: 你好, Chinese: true
Text: means hello, Chinese: false
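For intuition about how such a split might work, here is a minimal sketch that groups consecutive runes into Han and non-Han runs. It is an assumption about the general approach, not the package's code: segment and segmentSketch are hypothetical names, and the real function may classify some characters differently.

```go
package main

import (
	"fmt"
	"unicode"
)

// segment is a local stand-in for the package's TextSegment type.
type segment struct {
	Text    string
	Chinese bool
}

// segmentSketch groups consecutive runes into runs of Han characters and
// runs of everything else (punctuation, spaces, Latin text, digits).
func segmentSketch(text string) []segment {
	var segments []segment
	var run []rune
	runIsHan := false
	for _, r := range text {
		isHan := unicode.Is(unicode.Han, r)
		// Close the current run when the character class changes.
		if len(run) > 0 && isHan != runIsHan {
			segments = append(segments, segment{Text: string(run), Chinese: runIsHan})
			run = run[:0]
		}
		run = append(run, r)
		runIsHan = isHan
	}
	if len(run) > 0 {
		segments = append(segments, segment{Text: string(run), Chinese: runIsHan})
	}
	return segments
}

func main() {
	for _, s := range segmentSketch("你好 means hello") {
		fmt.Printf("Text: %q, Chinese: %t\n", s.Text, s.Chinese)
	}
}
```

The sketch keeps the leading space in the second run, which is consistent with the example above calling strings.TrimSpace before printing.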

type TextToken ¶

type TextToken struct {
	Token     string
	DictEntry dicttypes.Word
	Senses    []dicttypes.WordSense
}

TextToken contains the results of tokenizing a string.

type Tokenizer ¶

type Tokenizer interface {
	Tokenize(fragment string) []TextToken
}

Tokenizer tokenizes Chinese text.
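Because Tokenizer is an interface, calling code can be written against it and given any implementation, for instance a stub in tests. The sketch below illustrates this with local stand-ins: TextToken is trimmed to its Token field, and charTokenizer and countTokens are hypothetical names that do not exist in the package.

```go
package main

import "fmt"

// TextToken is a trimmed local stand-in for the package's TextToken;
// the DictEntry and Senses fields are omitted to stay self-contained.
type TextToken struct {
	Token string
}

// Tokenizer mirrors the interface documented above.
type Tokenizer interface {
	Tokenize(fragment string) []TextToken
}

// charTokenizer is a hypothetical implementation that emits one token
// per character, which can serve as a simple stub.
type charTokenizer struct{}

func (charTokenizer) Tokenize(fragment string) []TextToken {
	var tokens []TextToken
	for _, r := range fragment {
		tokens = append(tokens, TextToken{Token: string(r)})
	}
	return tokens
}

// countTokens depends only on the interface, so any implementation works.
func countTokens(t Tokenizer, text string) int {
	return len(t.Tokenize(text))
}

func main() {
	fmt.Println(countTokens(charTokenizer{}, "你好世界")) // prints 4
}
```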
