Documentation ¶
Overview ¶
Package tokenizer is a Japanese morphological analyzer library.
Example (Tokenize_mode) ¶
d, err := dict.LoadDictFile(testDictPath)
if err != nil {
	panic(err)
}
t, err := New(d)
if err != nil {
	panic(err)
}
for _, mode := range []TokenizeMode{Normal, Search, Extended} {
	tokens := t.Analyze("関西国際空港", mode)
	fmt.Printf("---%s---\n", mode)
	for _, token := range tokens {
		if token.Class == DUMMY {
			// BOS: Begin Of Sentence, EOS: End Of Sentence.
			fmt.Printf("%s\n", token.Surface)
			continue
		}
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
Output:
Index ¶
- Constants
- type Option
- type Token
- func (t Token) BaseForm() (string, bool)
- func (t Token) FeatureAt(i int) (string, bool)
- func (t Token) Features() []string
- func (t Token) InflectionalForm() (string, bool)
- func (t Token) InflectionalType() (string, bool)
- func (t Token) POS() []string
- func (t Token) Pronunciation() (string, bool)
- func (t Token) Reading() (string, bool)
- func (t Token) String() string
- type TokenClass
- type TokenizeMode
- type Tokenizer
- func (t Tokenizer) Analyze(input string, mode TokenizeMode) (tokens []Token)
- func (t Tokenizer) AnalyzeGraph(w io.Writer, input string, mode TokenizeMode) (tokens []Token)
- func (t Tokenizer) Dot(w io.Writer, input string) (tokens []Token)
- func (t Tokenizer) Tokenize(input string) []Token
- func (t Tokenizer) Wakati(input string) []string
Examples ¶
Constants ¶
const (
	// DUMMY represents the dummy token.
	DUMMY = TokenClass(lattice.DUMMY)
	// KNOWN represents a token that is in the dictionary.
	KNOWN = TokenClass(lattice.KNOWN)
	// UNKNOWN represents a token that is not in the dictionary.
	UNKNOWN = TokenClass(lattice.UNKNOWN)
	// USER represents a token that is in the user dictionary.
	USER = TokenClass(lattice.USER)
)
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Token ¶
type Token struct {
	ID      int
	Class   TokenClass
	Start   int
	End     int
	Surface string
	// contains filtered or unexported fields
}
Token represents a morph of a sentence.
func (Token) InflectionalForm ¶
InflectionalForm returns the inflectional form feature if exists.
func (Token) InflectionalType ¶
InflectionalType returns the inflectional type feature if exists.
func (Token) Pronunciation ¶
Pronunciation returns the pronunciation feature if exists.
type TokenClass ¶
TokenClass represents the token class.
func (TokenClass) String ¶
func (c TokenClass) String() string
String returns string representation of a token class.
type TokenizeMode ¶
type TokenizeMode int
TokenizeMode represents a tokenization mode.
const (
	// Kagome has segmentation modes for search, such as Kuromoji:
	//   Normal: regular segmentation
	//   Search: use a heuristic to do additional segmentation useful for search
	//   Extended: similar to search mode, but also unigram unknown words

	// Normal is the normal tokenize mode.
	Normal TokenizeMode = iota + 1
	// Search is the tokenize mode for search.
	Search
	// Extended is the experimental tokenize mode.
	Extended

	// BosEosID marks the beginning or the end of a sentence.
	BosEosID = lattice.BosEosID
)
func (TokenizeMode) String ¶
func (m TokenizeMode) String() string
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer represents a morphological analyzer.
func (Tokenizer) Analyze ¶
func (t Tokenizer) Analyze(input string, mode TokenizeMode) (tokens []Token)
Analyze tokenizes a sentence in the specified mode.
func (Tokenizer) AnalyzeGraph ¶
AnalyzeGraph returns morphs of a sentence and exports a lattice graph to dot format.
func (Tokenizer) Dot ¶
Dot returns morphs of a sentence and exports a lattice graph to dot format in standard tokenize mode.