tokenizer

package
v1.11.2
Published: Nov 21, 2019 License: Apache-2.0 Imports: 9 Imported by: 61

Documentation

Overview

Package tokenizer is a Japanese morphological analyzer library.

Constants

const (
	// DUMMY represents the dummy token.
	DUMMY = TokenClass(lattice.DUMMY)
	// KNOWN represents the token in the dictionary.
	KNOWN = TokenClass(lattice.KNOWN)
	// UNKNOWN represents the token which is not in the dictionary.
	UNKNOWN = TokenClass(lattice.UNKNOWN)
	// USER represents the token in the user dictionary.
	USER = TokenClass(lattice.USER)
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Dic added in v1.0.0

type Dic struct {
	// contains filtered or unexported fields
}

Dic represents a dictionary.

func NewDic added in v1.0.0

func NewDic(path string) (Dic, error)

NewDic loads a dictionary from a file.

func NewDicSimple added in v1.7.0

func NewDicSimple(path string) (Dic, error)

NewDicSimple loads a dictionary from a file without contents.

func SysDic added in v1.0.0

func SysDic() Dic

SysDic returns the system dictionary (IPA dictionary).

func SysDicIPA added in v1.3.0

func SysDicIPA() Dic

SysDicIPA returns the IPA dictionary as the system dictionary.

func SysDicIPASimple added in v1.7.0

func SysDicIPASimple() Dic

SysDicIPASimple returns the simple IPA dictionary as the system dictionary (without contents).

func SysDicSimple added in v1.7.0

func SysDicSimple() Dic

SysDicSimple returns the system dictionary (IPA dictionary without contents).

func SysDicUni added in v1.3.0

func SysDicUni() Dic

SysDicUni returns the UniDic dictionary as the system dictionary.

func SysDicUniSimple added in v1.7.0

func SysDicUniSimple() Dic

SysDicUniSimple returns the simple UniDic dictionary as the system dictionary (without contents).

type Token added in v1.0.0

type Token struct {
	ID      int
	Class   TokenClass
	Start   int
	End     int
	Surface string
	// contains filtered or unexported fields
}

Token represents a morph of a sentence.

func (Token) Features added in v1.0.0

func (t Token) Features() []string

Features returns the contents (features) of a token.

func (Token) Pos added in v1.4.0

func (t Token) Pos() string

Pos returns the first element of features.

func (Token) String added in v1.0.0

func (t Token) String() string

String returns a string representation of a token.

type TokenClass added in v1.0.0

type TokenClass lattice.NodeClass

TokenClass represents the token type.

func (TokenClass) String added in v1.0.0

func (c TokenClass) String() string

type TokenizeMode added in v1.0.0

type TokenizeMode int

TokenizeMode represents a tokenization mode.

const (
	// Normal is the normal tokenize mode.
	Normal TokenizeMode = iota + 1
	// Search is the tokenize mode for search.
	Search
	// Extended is the experimental tokenize mode.
	Extended
	// BosEosID means the beginning of a sentence or the end of a sentence.
	BosEosID = lattice.BosEosID
)
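Because the mode constants are declared with `iota + 1`, they start at 1, so the zero value of the type is not one of the declared modes. A minimal sketch using a local mirror of the declaration (the `mode` type and the `valid` helper are hypothetical and not part of the package):

```go
package main

import "fmt"

// mode mirrors the shape of TokenizeMode for illustration only.
type mode int

const (
	normal   mode = iota + 1 // 1
	search                   // 2
	extended                 // 3
)

// valid reports whether m is one of the three declared modes.
// This helper is hypothetical; the package does not document one.
func valid(m mode) bool { return m >= normal && m <= extended }

func main() {
	var zero mode // zero value, not a declared mode
	fmt.Println(normal, search, extended, valid(zero)) // prints "1 2 3 false"
}
```

A practical consequence is that an uninitialized TokenizeMode variable does not silently equal Normal, so it is safer to pass one of the named constants explicitly.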

type Tokenizer added in v0.0.2

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer represents a morphological analyzer.

func New added in v1.0.0

func New() (t Tokenizer)

New creates a default tokenizer.

func NewWithDic added in v1.0.0

func NewWithDic(d Dic) (t Tokenizer)

NewWithDic creates a tokenizer with the specified dictionary.

func NewWithDicPath added in v1.8.0

func NewWithDicPath(p string) (Tokenizer, error)

NewWithDicPath creates a tokenizer with a dictionary loaded from the given path.

func (Tokenizer) Analyze added in v1.0.0

func (t Tokenizer) Analyze(input string, mode TokenizeMode) (tokens []Token)

Analyze tokenizes a sentence in the specified mode.

func (Tokenizer) AnalyzeGraph added in v1.2.0

func (t Tokenizer) AnalyzeGraph(w io.Writer, input string, mode TokenizeMode) (tokens []Token)

AnalyzeGraph returns the morphs of a sentence and exports the lattice graph in DOT format.

func (Tokenizer) Dot added in v1.0.0

func (t Tokenizer) Dot(w io.Writer, input string) (tokens []Token)

Dot returns the morphs of a sentence and exports the lattice graph in DOT format, using the standard tokenize mode.

func (*Tokenizer) SetDic added in v1.0.0

func (t *Tokenizer) SetDic(d Dic)

SetDic sets the dictionary to d.

func (*Tokenizer) SetUserDic added in v0.0.2

func (t *Tokenizer) SetUserDic(d UserDic)

SetUserDic sets the user dictionary to d.

func (Tokenizer) Tokenize added in v0.0.2

func (t Tokenizer) Tokenize(input string) []Token

Tokenize analyzes a sentence in the standard tokenize mode.

type UserDic added in v1.0.0

type UserDic struct {
	// contains filtered or unexported fields
}

UserDic represents a user dictionary.

func NewUserDic added in v1.0.0

func NewUserDic(path string) (UserDic, error)

NewUserDic builds a user dictionary from a file.

type UserDicRecord added in v1.5.0

type UserDicRecord struct {
	Text   string   `json:"text"`
	Tokens []string `json:"tokens"`
	Yomi   []string `json:"yomi"`
	Pos    string   `json:"pos"`
}

UserDicRecord represents a record of the user dictionary file format.

type UserDicRecords added in v1.5.0

type UserDicRecords []UserDicRecord

UserDicRecords represents user dictionary data.

func NewUserDicRecords added in v1.5.0

func NewUserDicRecords(r io.Reader) (UserDicRecords, error)

NewUserDicRecords loads user dictionary data from io.Reader.

func (UserDicRecords) Len added in v1.5.0

func (u UserDicRecords) Len() int

func (UserDicRecords) Less added in v1.5.0

func (u UserDicRecords) Less(i, j int) bool

func (UserDicRecords) NewUserDic added in v1.5.0

func (u UserDicRecords) NewUserDic() (UserDic, error)

NewUserDic builds a user dictionary.

func (UserDicRecords) Swap added in v1.5.0

func (u UserDicRecords) Swap(i, j int)
