chinese

Version: v0.0.0-...-100fa8a | Published: Oct 27, 2020 | License: MIT | Imports: 17 | Imported by: 0

Repository

github.com/smhanov/chinese

README ¶

chinese

Package chinese provides utilities for dealing with Chinese text, including text segmentation.

Download:

go get github.com/smhanov/chinese

Chinese text is commonly written without spaces between words. This package uses the Viterbi algorithm and word frequency information to find the best placement of word boundaries in each sentence.
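To make the idea concrete, here is a minimal sketch of Viterbi-style segmentation over a toy dictionary. It is not the package's implementation: the function segment, the probabilities, and the fixed penalty for unknown characters are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// segment splits text into the most probable sequence of dictionary words.
// logProb maps each known word to its log probability. A single rune that is
// not in the dictionary is accepted with the penalty unknownLogProb
// (an assumption of this sketch, not documented library behavior).
func segment(text string, logProb map[string]float64, unknownLogProb float64) []string {
	runes := []rune(text)
	n := len(runes)

	// best[i] is the highest total log probability of any segmentation of
	// runes[:i]; prev[i] is where the last word of that segmentation starts.
	best := make([]float64, n+1)
	prev := make([]int, n+1)
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
	}

	for i := 0; i < n; i++ {
		for j := i + 1; j <= n; j++ {
			lp, ok := logProb[string(runes[i:j])]
			if !ok {
				if j > i+1 {
					continue // unknown multi-rune strings are not candidates
				}
				lp = unknownLogProb
			}
			if score := best[i] + lp; score > best[j] {
				best[j] = score
				prev[j] = i
			}
		}
	}

	// Walk the back-pointers from the end, then reverse into reading order.
	var words []string
	for j := n; j > 0; j = prev[j] {
		words = append(words, string(runes[prev[j]:j]))
	}
	for l, r := 0, len(words)-1; l < r; l, r = l+1, r-1 {
		words[l], words[r] = words[r], words[l]
	}
	return words
}

func main() {
	// Made-up probabilities, for illustration only.
	logProb := map[string]float64{
		"他": math.Log(0.10), "的": math.Log(0.30),
		"儿": math.Log(0.05), "子": math.Log(0.05), "儿子": math.Log(0.20),
		"名": math.Log(0.05), "字": math.Log(0.05), "名字": math.Log(0.20),
	}
	words := segment("他的儿子的名字", logProb, math.Log(1e-6))
	fmt.Println(strings.Join(words, " ")) // 他 的 儿子 的 名字
}
```

Because scores are summed log probabilities, 儿子 as one word outscores 儿 followed by 子 whenever the two-character word is more probable than the product of its parts.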

It is designed to take up very little memory. In my tests, loading the default dictionary uses 160 MB of RAM. That memory is released immediately after loading, so the total memory consumed by the dictionary of 589,000 words and frequencies is 1.1 MB.

To use it, create a new text segmenter. By default, a model of word frequencies from the web is loaded. Then call Segment(), passing in some text. The return value is the text split into strings, each containing an individual word, an unrecognized word, or spaces and punctuation. You can recover the original input by concatenating the results.

Automatically generated by autoreadme on 2019.04.08

Documentation ¶

Index ¶

type Model
type Segmenter
- func NewSegmenter(args ...interface{}) *Segmenter
- func (s *Segmenter) Segment(inputStr string) []string
type WordFreq
type WordModel
- func LoadModel(args ...interface{}) (*WordModel, error)
- func NewWordModel() *WordModel

Types ¶

type Model ¶

type Model interface {
	FindAllPrefixesOf(input string) []WordFreq
}

Model is a dictionary that can find all words that are a prefix of the given string.
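As an illustration of what the interface requires, the sketch below implements it with a plain map. The Model and WordFreq declarations are repeated locally so the example compiles on its own; mapModel and its values are hypothetical, and whether NewSegmenter accepts an arbitrary Model (rather than only a *WordModel) is not stated in this documentation.

```go
package main

import "fmt"

// WordFreq and Model mirror the documented declarations so that this sketch
// is self-contained; real code would use chinese.WordFreq and chinese.Model.
type WordFreq struct {
	Word           string
	LogProbability float32
}

type Model interface {
	FindAllPrefixesOf(input string) []WordFreq
}

// mapModel is a hypothetical Model backed by a map from word to an
// illustrative log-probability value.
type mapModel map[string]float32

// FindAllPrefixesOf returns every dictionary word that input starts with,
// shortest first.
func (m mapModel) FindAllPrefixesOf(input string) []WordFreq {
	var found []WordFreq
	// Ranging over a string yields the byte offset of each rune, so every
	// offset i > 0 is a valid place to cut a prefix.
	for i := range input {
		if i == 0 {
			continue
		}
		if lp, ok := m[input[:i]]; ok {
			found = append(found, WordFreq{Word: input[:i], LogProbability: lp})
		}
	}
	// The whole input is also a prefix of itself.
	if lp, ok := m[input]; ok && input != "" {
		found = append(found, WordFreq{Word: input, LogProbability: lp})
	}
	return found
}

func main() {
	var model Model = mapModel{"名": -3.0, "名字": -1.5, "字": -3.0}
	// Finds the prefixes 名 and 名字, but not 字, which is not a prefix.
	for _, wf := range model.FindAllPrefixesOf("名字叫") {
		fmt.Println(wf.Word, wf.LogProbability)
	}
}
```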

type Segmenter ¶

type Segmenter struct {
	// contains filtered or unexported fields
}

Segmenter breaks Chinese text into words, based on a single word frequency model that you provide.

func NewSegmenter ¶

func NewSegmenter(args ...interface{}) *Segmenter

NewSegmenter returns a new text segmenter. When passed no arguments, it loads the default model from the web. Otherwise, you must create a model and pass it in as the first argument to use it.

Example ¶

In this example, we load the default model (from the web) and use it to segment some text.

package main

import (
	"fmt"
	"strings"

	"github.com/smhanov/chinese"
)

func main() {
	segments := chinese.NewSegmenter().Segment("我儿子四岁。他的名字叫Zack。")
	fmt.Printf("%s\n", strings.Join(segments, " "))
}

Output:

我 儿子 四岁 。 他 的 名字 叫 Zack。

func (*Segmenter) Segment ¶

func (s *Segmenter) Segment(inputStr string) []string

Segment breaks the input string into separate words. Whitespace or other characters will be returned as their own entry in the result, so the original input can be obtained as the concatenation of the strings in the result.
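The concatenation property is easy to check in a test. The helper below, roundTrips, is a hypothetical name introduced for this sketch; the segments are copied from the NewSegmenter example output rather than produced by calling the library.

```go
package main

import (
	"fmt"
	"strings"
)

// roundTrips reports whether concatenating segments reproduces input exactly.
func roundTrips(input string, segments []string) bool {
	return strings.Join(segments, "") == input
}

func main() {
	input := "我儿子四岁。他的名字叫Zack。"
	// Segments as printed in the NewSegmenter example output.
	segments := []string{"我", "儿子", "四岁", "。", "他", "的", "名字", "叫", "Zack。"}
	fmt.Println(roundTrips(input, segments)) // true
}
```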

type WordFreq ¶

type WordFreq struct {
	Word           string
	LogProbability float32
}

WordFreq represents a word and its frequency, as returned from a model.

type WordModel ¶

type WordModel struct {
	// contains filtered or unexported fields
}

WordModel is a structure that can both find all words that are prefixes of a given string, and return the log frequencies of those words.

func LoadModel ¶

func LoadModel(args ...interface{}) (*WordModel, error)

LoadModel returns a model read from the given source. The model is a text file in which each line holds a word and its raw (not log) frequency, separated by a space. The format is inferred from the first line. If the file name ends in .bz2 or .gz, the file is decompressed. If the argument is a URL, it is fetched. If it is an io.Reader, the model is read from it.
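For a sense of what the "word, space, raw frequency" layout looks like in practice, here is a small stand-alone parser. It is a sketch, not LoadModel itself: parseFrequencies and parseFrequencyText are hypothetical names, format inference and decompression are left out, and converting counts to the natural log of relative frequency is an assumption about how raw counts might be turned into scores.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"math"
	"strconv"
	"strings"
)

// parseFrequencies reads "word frequency" lines and returns the natural log
// of each word's relative frequency. Blank lines are skipped.
func parseFrequencies(r io.Reader) (map[string]float64, error) {
	counts := make(map[string]float64)
	total := 0.0

	scanner := bufio.NewScanner(r)
	for lineNo := 1; scanner.Scan(); lineNo++ {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("line %d: want 2 fields, got %d", lineNo, len(fields))
		}
		count, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, fmt.Errorf("line %d: %w", lineNo, err)
		}
		if count <= 0 {
			return nil, fmt.Errorf("line %d: frequency must be positive", lineNo)
		}
		counts[fields[0]] += count
		total += count
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}

	logProb := make(map[string]float64, len(counts))
	for word, count := range counts {
		logProb[word] = math.Log(count / total)
	}
	return logProb, nil
}

// parseFrequencyText is a convenience wrapper for in-memory data.
func parseFrequencyText(text string) (map[string]float64, error) {
	return parseFrequencies(strings.NewReader(text))
}

func main() {
	logProb, err := parseFrequencyText("的 60\n儿子 30\n名字 10\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, word := range []string{"的", "儿子", "名字"} {
		fmt.Printf("%s %.3f\n", word, logProb[word])
	}
}
```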

func NewWordModel ¶

func NewWordModel() *WordModel

NewWordModel returns a new word model. You must add words to this using AddWord() and then call Finish() before using it.

Example ¶

This example creates a simple model with a few Chinese words, then uses it to split a sentence.

package main

import (
	"fmt"
	"strings"

	"github.com/smhanov/chinese"
)

func main() {
	model := chinese.NewWordModel()
	model.AddWord("他", 1)
	model.AddWord("儿", 1)
	model.AddWord("儿子", 2)
	model.AddWord("叫", 1)
	model.AddWord("名", 1)
	model.AddWord("名字", 2)
	model.AddWord("四", 1)
	model.AddWord("子", 1)
	model.AddWord("字", 1)
	model.AddWord("岁", 1)
	model.AddWord("的", 1)
	model.Finish()

	segmenter := chinese.NewSegmenter(model)
	segments := segmenter.Segment("我儿子四岁。他的名字叫Zack。")

	fmt.Printf("%s", strings.Join(segments, " "))
}

Output:

我 儿子 四 岁 。 他 的 名字 叫 Zack。

func (*WordModel) AddWord ¶

func (m *WordModel) AddWord(word string, freqCount float32)

AddWord adds a word and log frequency to the model. If the frequencies of the words are not known, use the length of each word instead. This causes the segmenter to break the text into the fewest possible words.

Words must be added in alphabetical order and must not be repeated; otherwise AddWord panics.
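One way to satisfy the ordering rule is to sort and deduplicate the word list before adding anything. The sketch below assumes that "alphabetical" means ascending byte order, which matches the order used in the NewWordModel example but is not confirmed by the documentation; prepareWords is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// prepareWords returns the words in ascending byte order with duplicates
// removed, leaving the caller's slice untouched.
func prepareWords(words []string) []string {
	sorted := append([]string(nil), words...)
	sort.Strings(sorted)

	unique := make([]string, 0, len(sorted))
	for _, w := range sorted {
		if len(unique) == 0 || unique[len(unique)-1] != w {
			unique = append(unique, w)
		}
	}
	return unique
}

func main() {
	words := []string{"的", "名字", "儿子", "他", "字", "儿", "名", "儿子", "岁", "四", "子", "叫"}
	for _, w := range prepareWords(words) {
		// In real code, this is where model.AddWord(w, freq) would be called.
		fmt.Print(w, " ")
	}
	fmt.Println()
}
```

For the word list above, the resulting order is the same one the NewWordModel example uses: 他, 儿, 儿子, 叫, 名, 名字, 四, 子, 字, 岁, 的.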

func (*WordModel) FindAllPrefixesOf ¶

func (m *WordModel) FindAllPrefixesOf(input string) []WordFreq

FindAllPrefixesOf finds all prefixes of the input string that are words, and returns their log inverse probabilities.

func (*WordModel) Finish ¶

func (m *WordModel) Finish()

Finish signals that the word model is finished and ready to be used.
