Documentation
Overview

Package gse: Go efficient text segmentation (high-performance word segmentation in Go).

Index
- Constants
- Variables
- func DictPaths(dictDir, filePath string) (files []string)
- func GetVersion() string
- func IsJp(segText string) bool
- func Join(text []Text) string
- func ToSlice(segs []Segment, searchMode ...bool) (output []string)
- func ToString(segs []Segment, searchMode ...bool) (output string)
- type Dictionary
- type Prob
- type Segment
- type Segmenter
- func (seg *Segmenter) AddToken(text string, frequency int, pos ...string)
- func (seg *Segmenter) AddTokenForce(text string, frequency int, pos ...string)
- func (seg *Segmenter) CalcToken()
- func (seg *Segmenter) Cut(str string, hmm ...bool) []string
- func (seg *Segmenter) CutAll(str string) []string
- func (seg *Segmenter) CutSearch(str string, hmm ...bool) []string
- func (seg *Segmenter) Dictionary() *Dictionary
- func (seg *Segmenter) Find(str string) (int, bool)
- func (seg *Segmenter) HMMCut(str string) []string
- func (seg *Segmenter) HMMCutMod(str string, prob ...map[rune]float64) []string
- func (seg *Segmenter) LoadDict(files ...string) error
- func (seg *Segmenter) LoadModel(prob ...map[rune]float64)
- func (seg *Segmenter) ModeSegment(bytes []byte, searchMode ...bool) []Segment
- func (seg *Segmenter) Read(file string) error
- func (seg *Segmenter) Segment(bytes []byte) []Segment
- func (seg *Segmenter) Slice(bytes []byte, searchMode ...bool) []string
- func (seg *Segmenter) String(bytes []byte, searchMode ...bool) string
- type Text
- type Token
- type TokenJson
Constants

const (
	// RatioWord is the ratio of words to letters
	RatioWord float32 = 1.5
	// RatioWordFull is the full ratio of words to letters
	RatioWordFull float32 = 1
)
Variables

var (
	// LoadNoFreq loads dictionary words that have no frequency
	LoadNoFreq bool
	// MinTokenFreq is the minimum token frequency to load
	MinTokenFreq = 2
)
Functions

func ToSlice
ToSlice converts segmentation results to a string slice.

There are two output modes; take "山达尔星联邦共和国" as an example:

Normal mode (searchMode=false) outputs the single word "[山达尔星联邦共和国]". Search mode (searchMode=true) outputs a finer-grained split of the normal-mode result: "[山达尔星 联邦 共和 国 共和国 联邦共和国 山达尔星联邦共和国]".

The default is searchMode=false. Search mode is mainly used to provide search engines with as many keywords as possible; see the comments on the Token struct for details.
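As a rough illustration of what search mode adds, the sketch below enumerates every dictionary sub-word inside a segment. The `searchModeSplit` helper and its toy dictionary are hypothetical, not part of gse's API, and this is not gse's actual algorithm.

```go
package main

import "fmt"

// searchModeSplit enumerates every rune substring of segment that appears
// in dict, mimicking how search mode emits sub-words in addition to the
// full segment. A toy illustration only, not gse's implementation.
func searchModeSplit(segment string, dict map[string]bool) []string {
	runes := []rune(segment)
	var out []string
	for i := 0; i < len(runes); i++ {
		for j := i + 1; j <= len(runes); j++ {
			if w := string(runes[i:j]); dict[w] {
				out = append(out, w)
			}
		}
	}
	return out
}

func main() {
	dict := map[string]bool{"联邦": true, "共和": true, "共和国": true, "联邦共和国": true}
	fmt.Println(searchModeSplit("联邦共和国", dict))
	// → [联邦 联邦共和国 共和 共和国]
}
```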
Types

type Dictionary
type Dictionary struct {
// contains filtered or unexported fields
}
The Dictionary struct implements a string prefix trie; a word may end at a leaf node or at an internal (non-leaf) node.
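A minimal sketch of the idea (not gse's actual Dictionary implementation): a rune-keyed trie where an `isWord` flag marks nodes that terminate a word, so words can end at internal nodes as well as leaves. All names here are illustrative.

```go
package main

import "fmt"

// trieNode is a minimal rune-keyed prefix-trie node. isWord marks nodes
// that terminate a dictionary word, so a word can end at an internal node
// (e.g. "共和" inside "共和国"). A sketch only, not gse's Dictionary.
type trieNode struct {
	children map[rune]*trieNode
	isWord   bool
}

func newTrie() *trieNode { return &trieNode{children: map[rune]*trieNode{}} }

// insert adds word to the trie, creating nodes as needed.
func (t *trieNode) insert(word string) {
	n := t
	for _, r := range word {
		if n.children[r] == nil {
			n.children[r] = newTrie()
		}
		n = n.children[r]
	}
	n.isWord = true
}

// find reports whether word was inserted as a complete word.
func (t *trieNode) find(word string) bool {
	n := t
	for _, r := range word {
		if n = n.children[r]; n == nil {
			return false
		}
	}
	return n.isWord
}

func main() {
	trie := newTrie()
	trie.insert("共和")
	trie.insert("共和国")
	// "共和" ends at an internal node, "共和国" at a leaf; "共" is only a prefix.
	fmt.Println(trie.find("共和"), trie.find("共和国"), trie.find("共")) // → true true false
}
```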
type Segmenter
type Segmenter struct {
// contains filtered or unexported fields
}
Segmenter is the word-segmenter struct.
func (*Segmenter) AddTokenForce

AddTokenForce adds new text as a token and forces the addition.
func (*Segmenter) Cut

Cut cuts str into words using accurate mode. The hmm parameter controls whether to use the HMM (Hidden Markov Model) or the user's model.
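As a simplified picture of dictionary-driven cutting (gse's accurate mode is more sophisticated than this), here is a forward maximum-matching sketch; `fmmCut` is a hypothetical helper, not part of gse's API.

```go
package main

import "fmt"

// fmmCut splits text greedily by forward maximum matching against dict:
// at each position it takes the longest dictionary word (up to maxLen
// runes), falling back to a single rune. A simplified illustration only.
func fmmCut(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var words []string
	for i := 0; i < len(runes); {
		n := 1
		for l := maxLen; l > 1; l-- {
			if i+l <= len(runes) && dict[string(runes[i:i+l])] {
				n = l
				break
			}
		}
		words = append(words, string(runes[i:i+n]))
		i += n
	}
	return words
}

func main() {
	dict := map[string]bool{"世界": true, "人口": true, "七十亿": true}
	fmt.Println(fmmCut("世界有七十亿人口", dict, 3))
	// → [世界 有 七十亿 人口]
}
```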
func (*Segmenter) LoadDict

LoadDict loads the dictionary from one or more files.

The dictionary format is (one word per line):

word text, frequency, part of speech

Multiple dictionary files can be loaded, with the file names separated by ","; dictionaries listed first take priority when loading words, for example:

"user_dictionary.txt,common_dictionary.txt"

When a word appears in both the user dictionary and the common dictionary, the user dictionary takes priority.
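A sketch of parsing one dictionary line in the format above, assuming whitespace-separated fields; `parseDictLine` and `dictEntry` are hypothetical helpers, not part of gse's API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// dictEntry holds one parsed dictionary line: word text, frequency,
// and an optional part of speech. A hypothetical helper type.
type dictEntry struct {
	Text string
	Freq int
	Pos  string
}

// parseDictLine parses a whitespace-separated line "text freq [pos]".
func parseDictLine(line string) (dictEntry, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return dictEntry{}, fmt.Errorf("need at least text and frequency: %q", line)
	}
	freq, err := strconv.Atoi(fields[1])
	if err != nil {
		return dictEntry{}, fmt.Errorf("bad frequency %q: %w", fields[1], err)
	}
	e := dictEntry{Text: fields[0], Freq: freq}
	if len(fields) > 2 {
		e.Pos = fields[2]
	}
	return e, nil
}

func main() {
	e, _ := parseDictLine("世界 34387 n")
	fmt.Println(e.Text, e.Freq, e.Pos) // → 世界 34387 n
}
```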
func (*Segmenter) LoadModel

LoadModel loads the HMM model.
Use the user's model:
seg.LoadModel(B, E, M, S map[rune]float64)
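To illustrate how per-state emission maps like `B, E, M, S map[rune]float64` can drive HMM-based labeling, here is a toy Viterbi sketch over the B/E/M/S tag set (Begin, End, Middle, Single) with made-up transition probabilities; it is not gse's implementation, and all values are invented for the demo.

```go
package main

import (
	"fmt"
	"math"
)

// states of the BEMS labeling scheme: Begin, End, Middle, Single.
var states = []byte{'B', 'E', 'M', 'S'}

// viterbi labels each rune of text with B/E/M/S given per-state
// log-emission maps, using fixed toy transition log-probabilities.
// A toy sketch of HMM tagging, not gse's implementation.
func viterbi(text []rune, emit map[byte]map[rune]float64) string {
	const neg = -3.14e100 // effectively log(0): an impossible transition
	trans := map[byte]map[byte]float64{
		'B': {'E': -0.5, 'M': -0.9, 'B': neg, 'S': neg},
		'M': {'E': -0.7, 'M': -0.7, 'B': neg, 'S': neg},
		'E': {'B': -0.6, 'S': -0.8, 'E': neg, 'M': neg},
		'S': {'B': -0.6, 'S': -0.8, 'E': neg, 'M': neg},
	}
	start := map[byte]float64{'B': -0.6, 'S': -0.8, 'E': neg, 'M': neg}

	type cell struct {
		score float64
		prev  byte
	}
	v := make([]map[byte]cell, len(text))
	v[0] = map[byte]cell{}
	for _, s := range states {
		v[0][s] = cell{start[s] + emit[s][text[0]], 0}
	}
	for t := 1; t < len(text); t++ {
		v[t] = map[byte]cell{}
		for _, s := range states {
			best := cell{neg * 2, 'S'}
			for _, p := range states {
				if sc := v[t-1][p].score + trans[p][s]; sc > best.score {
					best = cell{sc, p}
				}
			}
			best.score += emit[s][text[t]]
			v[t][s] = best
		}
	}
	// Backtrack from the best final state.
	labels := make([]byte, len(text))
	bestS, bestScore := byte('S'), math.Inf(-1)
	for _, s := range states {
		if v[len(text)-1][s].score > bestScore {
			bestS, bestScore = s, v[len(text)-1][s].score
		}
	}
	for t := len(text) - 1; t >= 0; t-- {
		labels[t] = bestS
		bestS = v[t][bestS].prev
	}
	return string(labels)
}

// toyEmit builds invented log-emission maps that make "你好" label as BE.
func toyEmit() map[byte]map[rune]float64 {
	low := -2.0
	return map[byte]map[rune]float64{
		'B': {'你': -0.1, '好': low},
		'E': {'你': low, '好': -0.1},
		'M': {'你': low, '好': low},
		'S': {'你': low, '好': low},
	}
}

func main() {
	fmt.Println(viterbi([]rune("你好"), toyEmit())) // → BE
}
```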
func (*Segmenter) ModeSegment

ModeSegment segments the bytes, using search mode when searchMode is true.
type Text
type Text []byte
Text is a string type that can represent:

- a character, such as "世" or "界"; in English, one element is one word
- a word, such as "世界" or "人口"
- a piece of text, such as "世界有七十亿人口"
Source Files
Directories
Path | Synopsis
---|---
hmm | Package hmm is the Golang HMM cut module; the model data is from https://github.com/fxsjy/jieba