Kagome is an open source Japanese morphological analyzer written in pure golang.
The dictionary/statistical models such as MeCab-IPADIC, UniDic (unidic-mecab) and so on, are able to be embedded in binaries.
Improvements from v1.
- Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
- Brushed up and added several APIs.
Dictionaries
Experimental Features
Segmentation mode for search
Kagome has segmentation mode for search such as Kuromoji.
- Normal: Regular segmentation
- Search: Use a heuristic to do additional segmentation useful for search
- Extended: Similar to search mode, but also uni-gram unknown words
Untokenized |
Normal |
Search |
Extended |
関西国際空港 |
関西国際空港 |
関西 国際 空港 |
関西 国際 空港 |
日本経済新聞 |
日本経済新聞 |
日本 経済 新聞 |
日本 経済 新聞 |
シニアソフトウェアエンジニア |
シニアソフトウェアエンジニア |
シニア ソフトウェア エンジニア |
シニア ソフトウェア エンジニア |
デジカメを買った |
デジカメ を 買っ た |
デジカメ を 買っ た |
デ ジ カ メ を 買っ た |
Programming example
package main
import (
"fmt"
"strings"
"github.com/ikawaha/kagome-dict/ipa"
"github.com/ikawaha/kagome/v2/tokenizer"
)
func main() {
t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
if err != nil {
panic(err)
}
// wakati
fmt.Println("---wakati---")
seg := t.Wakati("すもももももももものうち")
fmt.Println(seg)
// tokenize
fmt.Println("---tokenize---")
tokens := t.Tokenize("すもももももももものうち")
for _, token := range tokens {
features := strings.Join(token.Features(), ",")
fmt.Printf("%s\t%v\n", token.Surface, features)
}
}
output:
---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
Commands
Install
Go
env GO111MODULE=on go get -u github.com/ikawaha/kagome/v2
Homebrew tap
brew install ikawaha/kagome/kagome
Usage
$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
[tokenize] - command line tokenize (*default)
server - run tokenize server
lattice - lattice viewer
version - show version
tokenize [-file input_file] [-dict dic_file] [-userdict userdic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)]
-dict string
dict
-file string
input file
-mode string
tokenize mode (normal|search|extended) (default "normal")
-simple
display abbreviated dictionary contents
-sysdict string
system dict type (ipa|uni) (default "ipa")
-udict string
user dict
Tokenize command
% kagome
すもももももももものうち
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
Server command
API
Start a server and try to access the "/tokenize" endpoint.
% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .
Web App
Start a server and access http://localhost:6060
.
(To draw a lattice, demo application uses graphviz . You need graphviz installed.)
% kagome server &
Lattice command
A debug tool of tokenize process outputs a lattice in graphviz dot format.
% kagome lattice 私は鰻 | dot -Tpng -o lattice.png
Docker
Licence
MIT