Documentation
Overview
Package for tokenizing Chinese text into multi-character terms with corresponding English equivalents.
Index
Examples
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type DictTokenizer

type DictTokenizer[V any] struct {
	// contains filtered or unexported fields
}

Tokenizes Chinese text using a dictionary.
func NewDictTokenizer added in v0.0.101
func NewDictTokenizer[V any](wDict map[string]V) *DictTokenizer[V]
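The value type V is whatever the dictionary associates with each headword. A minimal sketch of constructing a tokenizer, assuming a hypothetical entry type with an English gloss:

type entry struct {
	// English gloss for the headword (illustrative field, not from the package)
	English string
}

wDict := map[string]entry{
	"你好": {English: "hello"},
	"世界": {English: "world"},
}
tokenizer := NewDictTokenizer(wDict)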
func (DictTokenizer[V]) Tokenize
func (tokenizer DictTokenizer[V]) Tokenize(text string) []TextToken
Tokenizes a Chinese text string into words and other terms found in the dictionary. If a term is not in the dictionary, its individual characters are returned instead. Both left-to-right and right-to-left greedy matching are compared, and the result with the fewest tokens is taken. Long text is handled by breaking the string into segments delimited by punctuation or non-Chinese characters.
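A sketch of Tokenize with an illustrative two-entry dictionary. With both two-character terms present, the greedy matching described above should yield two tokens:

wDict := map[string]string{
	"你好": "hello",
	"世界": "world",
}
tokenizer := NewDictTokenizer(wDict)
tokens := tokenizer.Tokenize("你好世界")
fmt.Printf("Number of tokens: %d\n", len(tokens))

Output:

Number of tokens: 2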
type TextSegment added in v0.0.28
type TextSegment struct {
	// The text contained in the segment
	Text string
	// False if punctuation or non-Chinese text
	Chinese bool
}

A text segment that contains either Chinese or non-Chinese text.
func Segment added in v0.0.28
func Segment(text string) []TextSegment
Segment a text document into segments of Chinese text separated by punctuation or non-Chinese text.
Example
A basic example of the function Segment
segments := Segment("你好 means hello")
fmt.Printf("Text: %s, Chinese: %t\n", segments[0].Text, segments[0].Chinese)
fmt.Printf("Text: %s, Chinese: %t\n", strings.TrimSpace(segments[1].Text), segments[1].Chinese)

Output:

Text: 你好, Chinese: true
Text: means hello, Chinese: false