README
Jiebago: the Go version of Jieba (结巴分词)
Jieba is a Chinese word segmentation component written in Python by @fxsjy; jiebago is its Go implementation.
Installation
go get github.com/Soontao/jiebago/...
Usage
package main

import (
    "fmt"

    "github.com/Soontao/jiebago"
)

var seg jiebago.Segmenter

func init() {
    // Load the main dictionary once at start-up.
    if err := seg.LoadDictionary("dict.txt"); err != nil {
        panic(err)
    }
}

// print drains a result channel and prints the words separated by " /".
func print(ch <-chan string) {
    for word := range ch {
        fmt.Printf(" %s /", word)
    }
    fmt.Println()
}

func main() {
    fmt.Print("【全模式】:")
    print(seg.CutAll("我来到北京清华大学"))

    fmt.Print("【精确模式】:")
    print(seg.Cut("我来到北京清华大学", false))

    fmt.Print("【新词识别】:")
    print(seg.Cut("他来到了网易杭研大厦", true))

    fmt.Print("【搜索引擎模式】:")
    print(seg.CutForSearch("小明硕士毕业于中国科学院计算所,后在日本京都大学深造", true))
}
Output:
【全模式】: 我 / 来到 / 北京 / 清华 / 清华大学 / 华大 / 大学 /
【精确模式】: 我 / 来到 / 北京 / 清华大学 /
【新词识别】: 他 / 来到 / 了 / 网易 / 杭研 / 大厦 /
【搜索引擎模式】: 小明 / 硕士 / 毕业 / 于 / 中国 / 科学 / 学院 / 科学院 / 中国科学院 / 计算 / 计算所 / , / 后 / 在 / 日本 / 京都 / 大学 / 日本京都大学 / 深造 /
See the documentation for more information.
Segmentation Speed
- 2MB / Second in Full Mode
- 700KB / Second in Default Mode
- Test environment: AMD Phenom(tm) II X6 1055T CPU @ 2.8GHz; test data: 《金庸全集》 (the complete works of Jin Yong)
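A rough sketch of how such a throughput figure could be measured; the corpus file name corpus.txt and the line-by-line timing harness are assumptions for illustration, not part of the package:

package main

import (
    "fmt"
    "os"
    "strings"
    "time"

    "github.com/Soontao/jiebago"
)

func main() {
    var seg jiebago.Segmenter
    if err := seg.LoadDictionary("dict.txt"); err != nil {
        panic(err)
    }

    // corpus.txt is a placeholder for any large UTF-8 Chinese text file.
    data, err := os.ReadFile("corpus.txt")
    if err != nil {
        panic(err)
    }
    lines := strings.Split(string(data), "\n")

    start := time.Now()
    for _, line := range lines {
        // Drain the channel; segmentation runs as the channel is consumed.
        for range seg.CutAll(line) {
        }
    }
    elapsed := time.Since(start)

    mb := float64(len(data)) / (1 << 20)
    fmt.Printf("full mode: %.2f MB/s\n", mb/elapsed.Seconds())
}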
License
Documentation
Overview
Package jiebago is the Golang implementation of [Jieba](https://github.com/fxsjy/jieba), the Python Chinese text segmentation module.
Example
Output:
【全模式】: 我 / 来到 / 北京 / 清华 / 清华大学 / 华大 / 大学 /
【精确模式】: 我 / 来到 / 北京 / 清华大学 /
【新词识别】: 他 / 来到 / 了 / 网易 / 杭研 / 大厦 /
【搜索引擎模式】: 小明 / 硕士 / 毕业 / 于 / 中国 / 科学 / 学院 / 科学院 / 中国科学院 / 计算 / 计算所 / , / 后 / 在 / 日本 / 京都 / 大学 / 日本京都大学 / 深造 /
Example (LoadUserDictionary)
Output:
Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
After: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
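A minimal sketch that would produce this kind of before/after change, assuming a user dictionary file named userdict.txt in Jieba's "word frequency [POS]" line format (both the file name and its contents are illustrative):

package main

import (
    "fmt"

    "github.com/Soontao/jiebago"
)

func print(ch <-chan string) {
    for word := range ch {
        fmt.Printf(" %s /", word)
    }
    fmt.Println()
}

func main() {
    var seg jiebago.Segmenter
    if err := seg.LoadDictionary("dict.txt"); err != nil {
        panic(err)
    }

    sentence := "李小福是创新办主任也是云计算方面的专家"

    fmt.Print("Before:")
    print(seg.Cut(sentence, true))

    // userdict.txt is assumed to contain lines such as:
    //   云计算 5
    //   创新办 3 i
    if err := seg.LoadUserDictionary("userdict.txt"); err != nil {
        panic(err)
    }

    fmt.Print("After:")
    print(seg.Cut(sentence, true))
}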
Example (SuggestFrequency)
Output:
Before: 超敏 / C / 反应 / 蛋白 / 是 / 什么 / ? /
超敏C反应蛋白 current frequency: 0.000000, suggest: 1.000000.
After: 超敏C反应蛋白 / 是 / 什么 / ? /
Before: 如果 / 放到 / post / 中将 / 出错 /
中将 current frequency: 763.000000, suggest: 494.000000.
After: 如果 / 放到 / post / 中 / 将 / 出错 /
Before: 今天天气 / 不错 /
今天天气 current frequency: 3.000000, suggest: 0.000000.
After: 今天 / 天气 / 不错 /
Index
- type Dictionary
- type Segmenter
- func (seg *Segmenter) AddWord(word string, frequency float64)
- func (seg *Segmenter) Cut(sentence string, hmm bool) <-chan string
- func (seg *Segmenter) CutAll(sentence string) <-chan string
- func (seg *Segmenter) CutForSearch(sentence string, hmm bool) <-chan string
- func (seg *Segmenter) DeleteWord(word string)
- func (seg *Segmenter) Frequency(word string) (float64, bool)
- func (seg *Segmenter) LoadDictionary(resource string) error
- func (seg *Segmenter) LoadUserDictionary(resource string) error
- func (seg *Segmenter) SuggestFrequency(words ...string) float64
Examples
- Example
- Example (LoadUserDictionary)
- Example (SuggestFrequency)
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Dictionary
A Dictionary represents a thread-safe dictionary used for word segmentation.
func (*Dictionary) AddToken
func (d *Dictionary) AddToken(token dictionary.Token)
AddToken adds one token to the dictionary.
func (*Dictionary) Frequency
func (d *Dictionary) Frequency(key string) (float64, bool)
Frequency returns the frequency and existence of the given word.
func (*Dictionary) Load
func (d *Dictionary) Load(ch <-chan dictionary.Token)
Load loads all tokens from the given channel.
type Segmenter
type Segmenter struct {
// contains filtered or unexported fields
}
Segmenter is a Chinese words segmentation struct.
func (*Segmenter) Cut
Cut cuts a sentence into words using accurate mode. Parameter hmm controls whether to use the Hidden Markov Model. Accurate mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
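A short sketch contrasting the two settings of the hmm parameter on a sentence containing the out-of-vocabulary word 杭研 (dictionary path as in the README example; exact output depends on the dictionary in use):

package main

import (
    "fmt"

    "github.com/Soontao/jiebago"
)

func main() {
    var seg jiebago.Segmenter
    if err := seg.LoadDictionary("dict.txt"); err != nil {
        panic(err)
    }

    sentence := "他来到了网易杭研大厦"

    // hmm = false: only dictionary words are produced; unknown words
    // typically break apart into single characters.
    for word := range seg.Cut(sentence, false) {
        fmt.Printf(" %s /", word)
    }
    fmt.Println()

    // hmm = true: the Hidden Markov Model can join unknown characters
    // into new words such as 杭研.
    for word := range seg.Cut(sentence, true) {
        fmt.Printf(" %s /", word)
    }
    fmt.Println()
}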
func (*Segmenter) CutAll
CutAll cuts a sentence into words using full mode. Full mode gets all the possible words from the sentence. Fast but not accurate.
func (*Segmenter) CutForSearch
CutForSearch cuts sentence into words using search engine mode. Search engine mode, based on the accurate mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
func (*Segmenter) DeleteWord
DeleteWord removes a word from the dictionary.
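A small sketch combining DeleteWord with Frequency (listed in the index above) to inspect the dictionary before removal; the word chosen here is arbitrary:

package main

import (
    "fmt"

    "github.com/Soontao/jiebago"
)

func main() {
    var seg jiebago.Segmenter
    if err := seg.LoadDictionary("dict.txt"); err != nil {
        panic(err)
    }

    // Frequency reports the current frequency and whether the word exists.
    if freq, ok := seg.Frequency("清华大学"); ok {
        fmt.Printf("before delete: %f\n", freq)
    }

    // Remove the word so it is no longer produced as a single token.
    seg.DeleteWord("清华大学")

    for word := range seg.Cut("我来到北京清华大学", false) {
        fmt.Printf(" %s /", word)
    }
    fmt.Println()
}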
func (*Segmenter) LoadDictionary
LoadDictionary loads a dictionary from the given file name. Every time LoadDictionary is called, the previously loaded dictionary is cleared.
func (*Segmenter) LoadUserDictionary
LoadUserDictionary loads a user-specified dictionary. It must be called after LoadDictionary; it does not clear the previously loaded dictionary, but overrides existing entries instead.
func (*Segmenter) SuggestFrequency
SuggestFrequency returns a suggested frequency for a word, or for a long word cut into several short words.
This method is useful when a word in a sentence is not cut out correctly.
If a word should not be cut further, for example the word "石墨烯" should not be cut into "石墨" and "烯", SuggestFrequency("石墨烯") will return the maximum frequency for this word.
If a word should be cut further, for example the word "今天天气" should be cut into the two words "今天" and "天气", SuggestFrequency("今天", "天气") will return the minimum frequency for the word "今天天气".
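A sketch of both cases, applying the suggestion with AddWord; the sentences mirror the examples above, and the exact frequencies depend on the dictionary in use:

package main

import (
    "fmt"

    "github.com/Soontao/jiebago"
)

func print(ch <-chan string) {
    for word := range ch {
        fmt.Printf(" %s /", word)
    }
    fmt.Println()
}

func main() {
    var seg jiebago.Segmenter
    if err := seg.LoadDictionary("dict.txt"); err != nil {
        panic(err)
    }

    // Case 1: "石墨烯" should stay as one word, so raise its frequency
    // to the suggested (maximum) value.
    seg.AddWord("石墨烯", seg.SuggestFrequency("石墨烯"))

    // Case 2: "今天天气" should be cut into "今天" and "天气", so lower the
    // compound word's frequency to the suggested (minimum) value.
    seg.AddWord("今天天气", seg.SuggestFrequency("今天", "天气"))

    print(seg.Cut("今天天气不错", false))
}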
Directories
Path | Synopsis
---|---
analyse | Package analyse is the Golang implementation of Jieba's analyse module.
dictionary | Package dictionary contains an interface and wraps all IO-related work.
finalseg | Package finalseg is the Golang implementation of Jieba's finalseg module.
posseg | Package posseg is the Golang implementation of Jieba's posseg module.
util | Package util contains some util functions used by jiebago.