sego

Version: v0.0.0-...-d06fe1b | Published: Nov 1, 2015 | License: Apache-2.0 | Imports: 12 | Imported by: 0

Repository

github.com/plhwin/sego

README ¶

sego

Chinese word segmentation in Go

The dictionary is implemented as a double-array trie. The segmenter finds the shortest path based on word frequency, using dynamic programming.

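Two segmentation modes are supported, normal and search-engine. User dictionaries and part-of-speech tagging are also supported, and the segmenter can run as a JSON RPC service.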

Segmentation speed is 9 MB/s on a single thread and 42 MB/s with concurrent goroutines (8-core MacBook Pro).

Install / update

go get -u github.com/huichen/sego

Usage

package main

import (
	"fmt"
	"github.com/huichen/sego"
)

func main() {
	// Load the dictionary
	var segmenter sego.Segmenter
	segmenter.LoadDictionary("github.com/huichen/sego/data/dictionary.txt")

	// Segment the text
	text := []byte("中华人民共和国中央人民政府")
	segments := segmenter.Segment(text)

	// Process the segmentation result.
	// Both normal mode and search mode are supported; see the comments on SegmentsToString in the source.
	fmt.Println(sego.SegmentsToString(segments, false))
}

Documentation ¶

Overview ¶

Chinese word segmentation in Go

Index ¶

func SegmentsToSlice(segs []Segment, searchMode bool) (output []string)
func SegmentsToString(segs []Segment, searchMode bool) (output string)
type Dictionary
- func NewDictionary() *Dictionary
type Segment
type Segmenter
type Text
type Token

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func SegmentsToSlice ¶

func SegmentsToSlice(segs []Segment, searchMode bool) (output []string)

func SegmentsToString ¶

func SegmentsToString(segs []Segment, searchMode bool) (output string)

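Outputs the segmentation result as a string.

There are two output modes, illustrated here with "中华人民共和国":

Normal mode (searchMode=false) outputs a single segment: "中华人民共和国/ns "
Search mode (searchMode=true) also outputs the finer splits of the normal-mode result:
    "中华/nz 人民/n 共和/nz 共和国/ns 人民共和国/nt 中华人民共和国/ns "

Search mode is mainly meant to give a search engine as many keywords as possible; see the comments on the Token struct for details.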

Types ¶

type Dictionary ¶

type Dictionary struct {
	// contains filtered or unexported fields
}

Dictionary implements a string prefix tree. A token may end at a leaf node or at a non-leaf node.

func NewDictionary ¶

func NewDictionary() *Dictionary

func (*Dictionary) MaxTokenLength ¶

func (dict *Dictionary) MaxTokenLength() int

The length of the longest token in the dictionary.

func (*Dictionary) NumTokens ¶

func (dict *Dictionary) NumTokens() int

The number of tokens in the dictionary.

func (*Dictionary) TotalFrequency ¶

func (dict *Dictionary) TotalFrequency() int64

The sum of the frequencies of all tokens in the dictionary.

type Segment ¶

type Segment struct {
	// contains filtered or unexported fields
}

A single segment of the text.

func (*Segment) End ¶

func (s *Segment) End() int

Returns the byte position in the text where the segment ends (exclusive).

func (*Segment) Start ¶

func (s *Segment) Start() int

Returns the byte position in the text where the segment starts.
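Because Start and End are byte positions rather than character counts, slicing the original byte slice with them recovers the segment text. The sketch below illustrates this with a hypothetical `span` type standing in for a segment; the offsets are worked out by hand, not produced by sego.

```go
package main

import "fmt"

// span is a hypothetical stand-in for a segment's [start, end) byte range.
type span struct{ start, end int }

// spanText returns the part of text covered by s as a string.
func spanText(text []byte, s span) string {
	return string(text[s.start:s.end])
}

func main() {
	text := []byte("中华人民共和国中央人民政府")
	// Each of these characters takes 3 bytes in UTF-8, so the 7-character
	// first word covers bytes [0, 21) and the 6-character second word [21, 39).
	spans := []span{{0, 21}, {21, 39}}
	for _, s := range spans {
		fmt.Printf("[%d,%d) %s\n", s.start, s.end, spanText(text, s))
	}
}
```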

func (*Segment) Token ¶

func (s *Segment) Token() *Token

Returns the token information for the segment.

type Segmenter ¶

type Segmenter struct {
	// contains filtered or unexported fields
}

The segmenter struct.

func (*Segmenter) Dictionary ¶

func (seg *Segmenter) Dictionary() *Dictionary

Returns the dictionary used by the segmenter.

func (*Segmenter) LoadDictionary ¶

func (seg *Segmenter) LoadDictionary(files string)

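Loads the dictionary from files.

Multiple dictionary files can be loaded; separate the file names with ",". Dictionaries listed first are loaded first, for example

"user_dictionary.txt,general_dictionary.txt"

When a token appears in both the user dictionary and the general dictionary, the entry from the user dictionary takes precedence.

The dictionary format is (one token per line):

token_text frequency part_of_speech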

func (*Segmenter) Segment ¶

func (seg *Segmenter) Segment(bytes []byte) []Segment

Segments the text.

Input:

bytes	byte slice of UTF-8 text

Output:

[]Segment	the resulting segments

type Text ¶

type Text []byte

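A string type that can represent

a single character, such as "中" or "国"; in English, a single unit is a word
a token, such as "中国" or "人口"
a passage of text, such as "中国有十三亿人口"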

type Token ¶

type Token struct {
	// contains filtered or unexported fields
}

A token.

func (*Token) Frequency ¶

func (token *Token) Frequency() int

Returns the token's frequency in the corpus.

func (*Token) Pos ¶

func (token *Token) Pos() string

Returns the token's part-of-speech tag.

func (*Token) Segments ¶

func (token *Token) Segments() []*Segment

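Returns the finer segmentation of this token's text. For example, the token "中华人民共和国中央人民政府" has two sub-segments, "中华人民共和国" and "中央人民政府". Sub-segments can have sub-segments of their own, forming a tree; traversing the tree yields every finer segmentation of the token. This is mainly used by search engines for full-text search over a passage of text.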
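The tree traversal can be sketched with a small stand-in type. `tokenNode`, `writeToken`, and `render` are hypothetical names, and the tree shape in `main` is an assumption inferred from the search-mode output order documented for SegmentsToString; a post-order walk (children first, then the token itself) reproduces that order.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenNode is a hypothetical stand-in for a token and its finer sub-segments.
type tokenNode struct {
	text string
	pos  string
	subs []*tokenNode
}

// writeToken appends "text/pos " for n. In search mode it first emits all
// descendants, so finer splits appear before the token that contains them.
func writeToken(sb *strings.Builder, n *tokenNode, searchMode bool) {
	if searchMode {
		for _, sub := range n.subs {
			writeToken(sb, sub, searchMode)
		}
	}
	fmt.Fprintf(sb, "%s/%s ", n.text, n.pos)
}

// render returns the string form of a single token tree.
func render(n *tokenNode, searchMode bool) string {
	var sb strings.Builder
	writeToken(&sb, n, searchMode)
	return sb.String()
}

func main() {
	// Assumed tree for "中华人民共和国".
	root := &tokenNode{text: "中华人民共和国", pos: "ns", subs: []*tokenNode{
		{text: "中华", pos: "nz"},
		{text: "人民共和国", pos: "nt", subs: []*tokenNode{
			{text: "人民", pos: "n"},
			{text: "共和国", pos: "ns", subs: []*tokenNode{
				{text: "共和", pos: "nz"},
			}},
		}},
	}}
	fmt.Println(render(root, false)) // normal mode: only the top-level token
	fmt.Println(render(root, true))  // search mode: finer splits first
}
```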

func (*Token) Text ¶

func (token *Token) Text() string

Returns the token's text.

Directories ¶

Path	Synopsis
server
tools