Documentation ¶
Index ¶
- Constants
- type Automaton
- type Bits
- type DaTokenizer
- func (dat *DaTokenizer) GetSize() int
- func (dat *DaTokenizer) LoadFactor() float64
- func (dat *DaTokenizer) Save(file string) (n int64, err error)
- func (dat *DaTokenizer) TransCount() int
- func (dat *DaTokenizer) Transduce(r io.Reader, w io.Writer) bool
- func (dat *DaTokenizer) TransduceTokenWriter(r io.Reader, w *TokenWriter) bool
- func (DaTokenizer) Type() string
- func (dat *DaTokenizer) WriteTo(w io.Writer) (n int64, err error)
- type MatrixTokenizer
- func (mat *MatrixTokenizer) Save(file string) (n int64, err error)
- func (mat *MatrixTokenizer) Transduce(r io.Reader, w io.Writer) bool
- func (mat *MatrixTokenizer) TransduceTokenWriter(r io.Reader, w *TokenWriter) bool
- func (MatrixTokenizer) Type() string
- func (mat *MatrixTokenizer) WriteTo(w io.Writer) (n int64, err error)
- type TokenWriter
- type Tokenizer
Constants ¶
const (
	DEBUG            = false
	DAMAGIC          = "DATOK"
	VERSION          = uint16(1)
	FIRSTBIT  uint32 = 1 << 31
	SECONDBIT uint32 = 1 << 30
	RESTBIT   uint32 = ^uint32(0) &^ (FIRSTBIT | SECONDBIT)
)
const (
	PROPS  = 1
	SIGMA  = 2
	STATES = 3
	NONE   = 4
)
const (
	MAMAGIC = "MATOK"
	EOT     = 4
)
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Automaton ¶
type Automaton struct {
	// contains filtered or unexported fields
}
Automaton is the intermediate representation of the tokenizer.
func LoadFomaFile ¶
func LoadFomaFile(file string) *Automaton
LoadFomaFile reads the FST from a foma file and creates an internal representation, provided it follows the tokenizer's convention.
func ParseFoma ¶
func ParseFoma(ior io.Reader) *Automaton
ParseFoma reads the FST from a foma file reader and creates an internal representation, provided it follows the tokenizer's convention.
func (*Automaton) ToDoubleArray ¶
func (auto *Automaton) ToDoubleArray() *DaTokenizer
ToDoubleArray turns the intermediate tokenizer representation into a double array representation.
This is based on Mizobuchi et al. (2000), p. 128.
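Example ¶
A sketch of the full compilation pipeline. The file names and the module path github.com/KorAP/datok are illustrative assumptions, and the nil check presumes LoadFomaFile returns nil on failure:

package main

import (
	"log"

	"github.com/KorAP/datok"
)

func main() {
	// Parse the foma FST (hypothetical file name) into the
	// intermediate Automaton representation.
	auto := datok.LoadFomaFile("mytokenizer.fst")
	if auto == nil {
		log.Fatalln("unable to parse foma file")
	}

	// Convert to the double array representation and persist it.
	dat := auto.ToDoubleArray()
	if _, err := dat.Save("mytokenizer.datok"); err != nil {
		log.Fatalln(err)
	}
}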
func (*Automaton) ToMatrix ¶
func (auto *Automaton) ToMatrix() *MatrixTokenizer
ToMatrix turns the intermediate tokenizer into a matrix representation.
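Example ¶
The matrix representation is built analogously; a sketch reusing the imports and hypothetical file names from the example above:

auto := datok.LoadFomaFile("mytokenizer.fst")
if auto == nil {
	log.Fatalln("unable to parse foma file")
}

// Convert to the matrix representation and persist it.
mat := auto.ToMatrix()
if _, err := mat.Save("mytokenizer.matok"); err != nil {
	log.Fatalln(err)
}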
type DaTokenizer ¶
type DaTokenizer struct {
	// contains filtered or unexported fields
}
DaTokenizer represents a tokenizer implemented as a Double Array FSA.
func LoadDatokFile ¶
func LoadDatokFile(file string) *DaTokenizer
LoadDatokFile reads a double array represented tokenizer from a file.
func ParseDatok ¶
func ParseDatok(ior io.Reader) *DaTokenizer
ParseDatok reads a double array represented tokenizer from an io.Reader.
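Example ¶
A sketch for parsing from an arbitrary io.Reader, here an opened file (imports as above, plus os; the file name is hypothetical):

f, err := os.Open("mytokenizer.datok")
if err != nil {
	log.Fatalln(err)
}
defer f.Close()

// ParseDatok accepts any io.Reader, e.g. a network stream or an
// embedded file, not just a file on disk.
dat := datok.ParseDatok(f)
if dat == nil {
	log.Fatalln("unable to parse double array tokenizer")
}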
func (*DaTokenizer) LoadFactor ¶
func (dat *DaTokenizer) LoadFactor() float64
LoadFactor returns the load factor as defined in Kanda et al. (2018), i.e. the proportion of non-empty elements to all elements.
func (*DaTokenizer) Save ¶
func (dat *DaTokenizer) Save(file string) (n int64, err error)
Save stores the double array data in a file.
func (*DaTokenizer) TransCount ¶
func (dat *DaTokenizer) TransCount() int
TransCount returns the number of transitions (aka arcs) in the finite state automaton.
func (*DaTokenizer) TransduceTokenWriter ¶
func (dat *DaTokenizer) TransduceTokenWriter(r io.Reader, w *TokenWriter) bool
TransduceTokenWriter transduces an input string against the double array FSA. The rules are always greedy. If the automaton fails, it takes the last possible token-ending branch.
Based on Mizobuchi et al. (2000), p. 129, with additional support for IDENTITY, UNKNOWN and EPSILON transitions and NONTOKEN and TOKENEND handling.
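Example ¶
A minimal transduction sketch using the plain Transduce method listed in the index, writing the tokenized text to standard output (imports as above, plus os and strings; the file name is hypothetical):

dat := datok.LoadDatokFile("mytokenizer.datok")
if dat == nil {
	log.Fatalln("unable to load double array tokenizer")
}

// Tokenize a string and write the result to stdout.
dat.Transduce(strings.NewReader("This is a sentence! And this is another."), os.Stdout)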
type MatrixTokenizer ¶
type MatrixTokenizer struct {
	// contains filtered or unexported fields
}
MatrixTokenizer represents a tokenizer implemented as a matrix FSA.
func LoadMatrixFile ¶
func LoadMatrixFile(file string) *MatrixTokenizer
LoadMatrixFile reads a matrix represented tokenizer from a file.
func ParseMatrix ¶
func ParseMatrix(ior io.Reader) *MatrixTokenizer
ParseMatrix reads a matrix represented tokenizer from an io.Reader.
func (*MatrixTokenizer) Save ¶
func (mat *MatrixTokenizer) Save(file string) (n int64, err error)
Save stores the matrix data in a file.
func (*MatrixTokenizer) TransduceTokenWriter ¶
func (mat *MatrixTokenizer) TransduceTokenWriter(r io.Reader, w *TokenWriter) bool
TransduceTokenWriter transduces an input string against the matrix FSA. The rules are always greedy. If the automaton fails, it takes the last possible token-ending branch.
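Example ¶
The matrix tokenizer is used the same way as the double array variant; a sketch with a hypothetical file name:

mat := datok.LoadMatrixFile("mytokenizer.matok")
if mat == nil {
	log.Fatalln("unable to load matrix tokenizer")
}
mat.Transduce(strings.NewReader("This is a sentence!"), os.Stdout)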
type TokenWriter ¶
type TokenWriter struct {
	SentenceEnd func(int)
	TextEnd     func(int)
	Flush       func() error
	Token       func(int, []rune)
}
func NewTokenWriter ¶
func NewTokenWriter(w io.Writer, flags Bits) *TokenWriter
NewTokenWriter creates a new token writer based on the given options.
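Example ¶
Because the callback fields are exported, a TokenWriter can also be assembled by hand, e.g. to collect tokens in a slice instead of writing them out. A sketch; reading the int argument as an offset is an assumption:

dat := datok.LoadDatokFile("mytokenizer.datok")

// Collect surface tokens instead of writing them to an io.Writer.
var tokens []string
tw := &datok.TokenWriter{
	Token: func(offset int, buf []rune) {
		// Assumption: buf holds the surface form of the current token.
		tokens = append(tokens, string(buf))
	},
	SentenceEnd: func(offset int) {},
	TextEnd:     func(offset int) {},
	Flush:       func() error { return nil },
}

dat.TransduceTokenWriter(strings.NewReader("A sentence."), tw)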