Documentation
Overview
Package chinese provides utilities for dealing with Chinese text, including text segmentation.
Chinese text is commonly written without spaces between words. This package uses the Viterbi algorithm and word frequency information to find the best placement of word boundaries in each sentence.
It is designed to use very little memory. In my tests, loading the default dictionary uses 160MB of RAM. However, that memory is released as soon as loading finishes, so the dictionary of 589000 words and frequencies ends up occupying 1.1MB.
To use it, create a new text segmenter. By default, a model of word frequencies from the web is loaded. Then call Segment() passing in some text. The return value is the text split into strings containing individual words, unrecognized words, or spaces and punctuation. You can get back the original input by concatenating the results together.
Types
type Segmenter
type Segmenter struct {
// contains filtered or unexported fields
}
Segmenter breaks Chinese text into words, based on a single word frequency model that you provide.
func NewSegmenter
func NewSegmenter(args ...interface{}) *Segmenter
NewSegmenter returns a new text segmenter. When passed no arguments, it loads the default model from the web. To use a different model, create it and pass it in as the first argument.
Example
In this example, we load the default model (from the web) and use it to segment some text.
package main

import (
	"fmt"
	"strings"

	"github.com/smhanov/chinese"
)

func main() {
	segments := chinese.NewSegmenter().Segment("我儿子四岁。他的名字叫Zack。")
	fmt.Printf("%s\n", strings.Join(segments, " "))
}
Output: 我 儿子 四岁 。 他 的 名字 叫 Zack。
type WordModel
type WordModel struct {
// contains filtered or unexported fields
}
WordModel is a structure that can both find all words that are prefixes of a given string, and return the log frequencies of those words.
func LoadModel
LoadModel returns a model loaded from the given file. The model is a text file in which each line contains a word and its raw (not log) frequency, separated by a space. The format is inferred from the first line. If the file name ends in .bz2 or .gz, the file is decompressed. If the argument is a URL, it is fetched. If it is an io.Reader, the model is read from it.
func NewWordModel
func NewWordModel() *WordModel
NewWordModel returns a new word model. You must add words to this using AddWord() and then call Finish() before using it.
Example
This example creates a simple model with some Chinese words, then uses it to split a sentence.
package main

import (
	"fmt"
	"strings"

	"github.com/smhanov/chinese"
)

func main() {
	// Build a small model by hand. Words are added in sorted order.
	model := chinese.NewWordModel()
	model.AddWord("他", 1)
	model.AddWord("儿", 1)
	model.AddWord("儿子", 2)
	model.AddWord("叫", 1)
	model.AddWord("名", 1)
	model.AddWord("名字", 2)
	model.AddWord("四", 1)
	model.AddWord("子", 1)
	model.AddWord("字", 1)
	model.AddWord("岁", 1)
	model.AddWord("的", 1)
	model.Finish()

	// Segment a sentence with the hand-built model.
	segmenter := chinese.NewSegmenter(model)
	segments := segmenter.Segment("我儿子四岁。他的名字叫Zack。")
	fmt.Printf("%s", strings.Join(segments, " "))
}
Output: 我 儿子 四 岁 。 他 的 名字 叫 Zack。
func (*WordModel) AddWord
AddWord adds a word and its log frequency to the model. If the word frequencies are not known, use the length of each word as its frequency. This causes the segmenter to try to break the text into the fewest possible words.
Words must be added in alphabetical order and must not be repeated. Otherwise, AddWord panics.
func (*WordModel) FindAllPrefixesOf
FindAllPrefixesOf finds all prefixes of the input string that are words, and returns their log inverse probabilities.