sego

Version: v1.0.1 (latest) | Published: Mar 10, 2022 | License: Apache-2.0 | Imports: 12 | Imported by: 0

Repository

github.com/ChinaDagger/sego

README ¶

sego

Chinese word segmentation in Go

This project is derived from github.com/huichen/sego and is maintained for personal research and study. For commercial use, please use the original.

The dictionary is implemented as a double-array trie. The segmentation algorithm is a frequency-based shortest path computed with dynamic programming.
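To make the shortest-path idea concrete, here is a minimal sketch; it is not the library's implementation. It assumes a plain map in place of the double-array trie, made-up frequencies, a cost of -log(freq/total) per word, and a fixed penalty (`unknownCost`, a hypothetical parameter) for single characters missing from the dictionary. The cheapest path through the text is then the most probable split.

```go
package main

import (
	"fmt"
	"math"
)

// segment splits text into words by minimizing the summed cost
// -log(freq/total) with dynamic programming over rune positions.
// dict and unknownCost are illustrative and not part of sego.
func segment(text string, dict map[string]int, unknownCost float64) []string {
	runes := []rune(text)
	n := len(runes)
	total := 0
	for _, f := range dict {
		total += f
	}
	// best[i] is the minimal cost of segmenting runes[:i];
	// prev[i] is where the last word on that path starts.
	best := make([]float64, n+1)
	prev := make([]int, n+1)
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(1)
	}
	for i := 0; i < n; i++ {
		for j := i + 1; j <= n; j++ {
			cost := unknownCost
			if f, ok := dict[string(runes[i:j])]; ok && f > 0 {
				cost = -math.Log(float64(f) / float64(total))
			} else if j-i > 1 {
				continue // only single runes may be unknown words
			}
			if best[i]+cost < best[j] {
				best[j] = best[i] + cost
				prev[j] = i
			}
		}
	}
	// Walk the prev links backwards to recover the words in order.
	var words []string
	for j := n; j > 0; j = prev[j] {
		words = append([]string{string(runes[prev[j]:j])}, words...)
	}
	return words
}

func main() {
	dict := map[string]int{
		"中华": 10, "人民": 20, "共和国": 10, "中华人民共和国": 50,
		"中央": 10, "人民政府": 20, "中央人民政府": 30,
	}
	fmt.Println(segment("中华人民共和国中央人民政府", dict, 20))
}
```

With these frequencies the two long words win because one frequent word costs less than several shorter ones added together.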

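It supports two segmentation modes (normal and search engine), user dictionaries, and part-of-speech tagging, and it can run as a JSON RPC service.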

Segmentation speed is 9 MB/s on a single thread and 42 MB/s with concurrent goroutines (8-core MacBook Pro).
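The concurrent figure implies splitting the input across goroutines. A minimal sketch of one way to do that follows. It assumes the input is already split into lines and that the segmentation function is safe to call from several goroutines; `segmentAll` and `segmentFn` are hypothetical names, and `strings.Fields` merely stands in for a real segmenter call.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// segmentAll runs segmentFn over every line with a fixed number of
// workers and returns the results in input order. segmentFn is a
// stand-in for a real segmenter and must be safe for concurrent use.
func segmentAll(lines []string, workers int, segmentFn func(string) []string) [][]string {
	if workers < 1 {
		workers = 1
	}
	results := make([][]string, len(lines))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handled by exactly one worker,
				// so the writes never touch the same element.
				results[i] = segmentFn(lines[i])
			}
		}()
	}
	for i := range lines {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	lines := []string{"a b", "c d e"}
	fmt.Println(segmentAll(lines, 4, strings.Fields))
}
```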

Install/Update

go get -u github.com/huichen/sego

Usage

package main

import (
	"fmt"
	"github.com/huichen/sego"
)

func main() {
	// Load the dictionary
	var segmenter sego.Segmenter
	segmenter.LoadDictionary("github.com/huichen/sego/data/dictionary.txt")

	// Segment the text
	text := []byte("中华人民共和国中央人民政府")
	segments := segmenter.Segment(text)

	// Process the result.
	// Both normal and search modes are supported; see the comment on SegmentsToString.
	fmt.Println(sego.SegmentsToString(segments, false))
}

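Loading the dictionary from a custom source

package main

import (
	"fmt"
	"strings"

	"github.com/huichen/sego"
)

func main() {
	var segmenter sego.Segmenter

	// Fetch dictionary entries from wherever suits your setup,
	// for example a database or an API.
	// Each entry has the form "text frequency part-of-speech".
	dictArray := []string{"1号店 3 n", "4S店 3 n"}
	for _, dictItem := range dictArray {
		fields := strings.Fields(dictItem)
		if len(fields) != 3 {
			continue // skip malformed entries
		}
		segmenter.AddDictionary(fields[0], fields[1], fields[2])
	}
	segmenter.RefreshDictionary()

	// Segment the text
	text := []byte("中华人民共和国中央人民政府")
	segments := segmenter.Segment(text)

	// Process the result.
	// Both normal and search modes are supported; see the comment on SegmentsToString.
	fmt.Println(sego.SegmentsToString(segments, false))
}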

Documentation ¶

Overview ¶

Chinese word segmentation in Go

Index ¶

func Join(a []Text) string
func SegmentsToSlice(segs []Segment, searchMode bool) (output []string)
func SegmentsToString(segs []Segment, searchMode bool) (output string)
type Dictionary
- func NewDictionary() *Dictionary
type Segment
type Segmenter
type Text
type Token

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Join ¶

func Join(a []Text) string

func SegmentsToSlice ¶

func SegmentsToSlice(segs []Segment, searchMode bool) (output []string)

func SegmentsToString ¶

func SegmentsToString(segs []Segment, searchMode bool) (output string)

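Formats the segmentation result as a string.

There are two output modes. Taking "中华人民共和国" as an example:

Normal mode (searchMode=false) outputs a single token: "中华人民共和国/ns "
Search mode (searchMode=true) additionally outputs the finer splits of the normal-mode result:
    "中华/nz 人民/n 共和/nz 共和国/ns 人民共和国/nt 中华人民共和国/ns "

Search mode is mainly meant to give search engines as many keywords as possible; see the comments on the Token struct for details.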

Types ¶

type Dictionary ¶

type Dictionary struct {
	// contains filtered or unexported fields
}

The Dictionary struct implements a string prefix tree. A token may sit at a leaf node or at a non-leaf node.
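The point about non-leaf nodes can be illustrated with a small sketch. It uses a plain map-based trie for readability, whereas the README describes a double-array trie, and every name in it (`trieNode`, `insert`, `prefixes`) is hypothetical rather than part of this package.

```go
package main

import "fmt"

// trieNode is a plain map-based prefix-tree node used only to
// illustrate the idea; it is not how sego stores its dictionary.
type trieNode struct {
	children map[rune]*trieNode
	isToken  bool // true if the path from the root to here spells a token
}

func newNode() *trieNode {
	return &trieNode{children: map[rune]*trieNode{}}
}

// insert adds word to the trie rooted at root.
func insert(root *trieNode, word string) {
	n := root
	for _, r := range word {
		next, ok := n.children[r]
		if !ok {
			next = newNode()
			n.children[r] = next
		}
		n = next
	}
	n.isToken = true
}

// prefixes returns every token that is a prefix of text, shortest first.
func prefixes(root *trieNode, text string) []string {
	var out []string
	n := root
	runes := []rune(text)
	for i, r := range runes {
		next, ok := n.children[r]
		if !ok {
			break
		}
		n = next
		if n.isToken {
			out = append(out, string(runes[:i+1]))
		}
	}
	return out
}

func main() {
	root := newNode()
	insert(root, "中华")
	insert(root, "中华人民共和国")
	// "中华" ends at a non-leaf node because the longer token continues past it.
	fmt.Println(prefixes(root, "中华人民共和国中央"))
}
```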

func NewDictionary ¶

func NewDictionary() *Dictionary

func (*Dictionary) Close ¶

func (dict *Dictionary) Close()

Releases resources.

func (*Dictionary) MaxTokenLength ¶

func (dict *Dictionary) MaxTokenLength() int

Returns the length of the longest token in the dictionary.

func (*Dictionary) NumTokens ¶

func (dict *Dictionary) NumTokens() int

Returns the number of tokens in the dictionary.

func (*Dictionary) TotalFrequency ¶

func (dict *Dictionary) TotalFrequency() int64

Returns the sum of the frequencies of all tokens in the dictionary.

type Segment ¶

type Segment struct {
	// contains filtered or unexported fields
}

A single segment (word) in the text.

func (*Segment) End ¶

func (s *Segment) End() int

Returns the byte position in the text where the segment ends (exclusive).

func (*Segment) Start ¶

func (s *Segment) Start() int

Returns the byte position in the text where the segment starts.
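Because Start and End are byte positions with an exclusive end, they can be used directly as slice bounds on the original byte slice. The sketch below shows this with hand-written offsets instead of real Segment values; `wordAt` is a hypothetical helper, and the offsets rely only on the fact that each of these characters takes 3 bytes in UTF-8.

```go
package main

import "fmt"

// wordAt returns the text between two byte offsets; end is exclusive,
// matching the convention described for Start and End.
func wordAt(text []byte, start, end int) string {
	return string(text[start:end])
}

func main() {
	text := []byte("中华人民共和国中央人民政府")
	// Hypothetical offsets: the first 7 characters occupy bytes 0 to 21,
	// and the remaining 6 characters run to the end of the slice.
	fmt.Println(wordAt(text, 0, 21))
	fmt.Println(wordAt(text, 21, len(text)))
}
```

Counting characters instead of bytes would produce wrong bounds for any non-ASCII text, which is why the byte convention matters here.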

func (*Segment) Token ¶

func (s *Segment) Token() *Token

Returns the token information.

type Segmenter ¶

type Segmenter struct {
	// contains filtered or unexported fields
}

The segmenter struct.

func (*Segmenter) AddDictionary ¶

func (seg *Segmenter) AddDictionary(text, freqText, pos string)

Adds an entry to the dictionary.

func (*Segmenter) Close ¶

func (seg *Segmenter) Close()

Releases resources.

func (*Segmenter) Dictionary ¶

func (seg *Segmenter) Dictionary() *Dictionary

Returns the dictionary used by the segmenter.

func (*Segmenter) InternalSegment ¶

func (seg *Segmenter) InternalSegment(bytes []byte, searchMode bool) []Segment

func (*Segmenter) LoadDictionary ¶

func (seg *Segmenter) LoadDictionary(files string)

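Loads dictionaries from files.

Multiple dictionary files can be loaded; separate the file names with ",". Dictionaries listed first are loaded first, for example

"user_dict.txt,general_dict.txt"

When a token appears in both the user dictionary and the general dictionary, the user dictionary entry takes precedence.

The dictionary format is (one token per line):

token-text frequency part-of-speech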
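The line format and the precedence rule can be sketched as follows. This is only an illustration under stated assumptions: it requires exactly three whitespace-separated fields per line (the real loader may be more lenient), and the names `entry`, `parseLine`, and `merge` are hypothetical and do not exist in this package.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// entry is a hypothetical parsed dictionary line.
type entry struct {
	text string
	freq int
	pos  string
}

// parseLine parses one "text frequency part-of-speech" line.
func parseLine(line string) (entry, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return entry{}, fmt.Errorf("want 3 fields, got %d in %q", len(fields), line)
	}
	freq, err := strconv.Atoi(fields[1])
	if err != nil {
		return entry{}, fmt.Errorf("bad frequency in %q: %w", line, err)
	}
	return entry{text: fields[0], freq: freq, pos: fields[2]}, nil
}

// merge keeps the first occurrence of each token, so entries from
// dictionaries passed earlier take precedence over later ones.
func merge(dicts ...[]entry) map[string]entry {
	out := map[string]entry{}
	for _, d := range dicts {
		for _, e := range d {
			if _, seen := out[e.text]; !seen {
				out[e.text] = e
			}
		}
	}
	return out
}

func main() {
	user := []entry{{"人口", 9, "n"}}
	general := []entry{{"人口", 1, "v"}, {"中国", 5, "ns"}}
	merged := merge(user, general)
	fmt.Println(merged["人口"].freq, merged["中国"].freq)

	e, err := parseLine("1号店 3 n")
	fmt.Println(e, err)
}
```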

func (*Segmenter) RefreshDictionary ¶

func (seg *Segmenter) RefreshDictionary()

Refreshes the dictionary.

func (*Segmenter) Segment ¶

func (seg *Segmenter) Segment(bytes []byte) []Segment

Segments text.

Input:

bytes	byte slice of UTF-8 text

Output:

[]Segment	the resulting segments

type Text ¶

type Text []byte

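A string type that can represent:

a single character, such as "中" or "国"; in English, one such unit is a whole word
a token, such as "中国" or "人口"
a passage of text, such as "中国有十三亿人口"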

type Token ¶

type Token struct {
	// contains filtered or unexported fields
}

A token.

func (*Token) Frequency ¶

func (token *Token) Frequency() int

Returns the token's frequency in the corpus.

func (*Token) Pos ¶

func (token *Token) Pos() string

Returns the token's part-of-speech tag.

func (*Token) Segments ¶

func (token *Token) Segments() []*Segment

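Returns the finer-grained segmentation of this token's text. For example, the token "中华人民共和国中央人民政府" has two sub-segments, "中华人民共和国" and "中央人民政府". Sub-segments can themselves have sub-segments, forming a tree; traversing this tree yields every fine-grained split of the token. This is mainly used by search engines for full-text search over a passage.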
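A post-order walk is one way to traverse such a tree so that finer splits come before the word containing them. The sketch below uses a hypothetical `node` type, not the package's Token, and the tree shape is an assumption chosen so that the output matches the search-mode example given for SegmentsToString.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for a token with sub-segments;
// it is not sego's Token type.
type node struct {
	text     string
	pos      string
	children []*node
}

// flatten appends "text/pos" for every descendant of n and then for n
// itself (post-order), so finer splits precede the enclosing word.
func flatten(n *node, out []string) []string {
	for _, c := range n.children {
		out = flatten(c, out)
	}
	return append(out, n.text+"/"+n.pos)
}

func main() {
	// Assumed tree shape, for illustration only.
	root := &node{"中华人民共和国", "ns", []*node{
		{"中华", "nz", nil},
		{"人民共和国", "nt", []*node{
			{"人民", "n", nil},
			{"共和国", "ns", []*node{{"共和", "nz", nil}}},
		}},
	}}
	fmt.Println(strings.Join(flatten(root, nil), " "))
}
```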

func (*Token) Text ¶

func (token *Token) Text() string

Returns the token text.

func (*Token) TextEquals ¶

func (token *Token) TextEquals(string string) bool

Directories ¶

Path	Synopsis
server
tools
