Documentation ¶
Index ¶
- Constants
- type Automaton
- type Bits
- type DaTokenizer
- func (dat *DaTokenizer) GetSize() int
- func (dat *DaTokenizer) LoadFactor() float64
- func (dat *DaTokenizer) Save(file string) (n int64, err error)
- func (dat *DaTokenizer) TransCount() int
- func (dat *DaTokenizer) Transduce(r io.Reader, w io.Writer) bool
- func (dat *DaTokenizer) TransduceTokenWriter(r io.Reader, w *TokenWriter) bool
- func (DaTokenizer) Type() string
- func (dat *DaTokenizer) WriteTo(w io.Writer) (n int64, err error)
- type MatrixTokenizer
- func (mat *MatrixTokenizer) Save(file string) (n int64, err error)
- func (mat *MatrixTokenizer) Transduce(r io.Reader, w io.Writer) bool
- func (mat *MatrixTokenizer) TransduceTokenWriter(r io.Reader, w *TokenWriter) bool
- func (MatrixTokenizer) Type() string
- func (mat *MatrixTokenizer) WriteTo(w io.Writer) (n int64, err error)
- type TokenWriter
- type Tokenizer
Constants ¶
const (
	DEBUG            = false
	DAMAGIC          = "DATOK"
	VERSION          = uint16(1)
	FIRSTBIT  uint32 = 1 << 31
	SECONDBIT uint32 = 1 << 30
	RESTBIT   uint32 = ^uint32(0) &^ (FIRSTBIT | SECONDBIT)
)
const (
	PROPS  = 1
	SIGMA  = 2
	STATES = 3
	NONE   = 4
)
const (
	MAMAGIC = "MATOK"
	EOT     = 4
)
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Automaton ¶
type Automaton struct {
	// contains filtered or unexported fields
}
Automaton is the intermediate representation of the tokenizer.
func LoadFomaFile ¶
func LoadFomaFile(file string) *Automaton
LoadFomaFile reads the FST from a foma file and creates an internal representation, provided it follows the tokenizer's convention.
func ParseFoma ¶
func ParseFoma(ior io.Reader) *Automaton
ParseFoma reads the FST from a foma file reader and creates an internal representation, provided it follows the tokenizer's convention.
func (*Automaton) ToDoubleArray ¶
func (auto *Automaton) ToDoubleArray() *DaTokenizer
ToDoubleArray turns the intermediate tokenizer representation into a double array representation.
This is based on Mizobuchi et al. (2000), p. 128.
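Example ¶
A sketch of the full compilation pipeline. The file names and the module path github.com/KorAP/datok are illustrative assumptions, and the nil check presumes LoadFomaFile returns nil on failure:

package main

import (
	"log"

	"github.com/KorAP/datok"
)

func main() {
	// Parse the foma FST (hypothetical file name) into the
	// intermediate Automaton representation.
	auto := datok.LoadFomaFile("mytokenizer.fst")
	if auto == nil {
		log.Fatalln("unable to parse foma file")
	}

	// Convert to the double array representation and persist it.
	dat := auto.ToDoubleArray()
	if _, err := dat.Save("mytokenizer.datok"); err != nil {
		log.Fatalln(err)
	}
}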
func (*Automaton) ToMatrix ¶
func (auto *Automaton) ToMatrix() *MatrixTokenizer
ToMatrix turns the intermediate tokenizer into a matrix representation.
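Example ¶
The matrix representation is built analogously; a sketch reusing the imports and hypothetical file names from the example above:

auto := datok.LoadFomaFile("mytokenizer.fst")
if auto == nil {
	log.Fatalln("unable to parse foma file")
}

// Convert to the matrix representation and persist it.
mat := auto.ToMatrix()
if _, err := mat.Save("mytokenizer.matok"); err != nil {
	log.Fatalln(err)
}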
type DaTokenizer ¶
type DaTokenizer struct {
	// contains filtered or unexported fields
}
DaTokenizer represents a tokenizer implemented as a Double Array FSA.
func LoadDatokFile ¶
func LoadDatokFile(file string) *DaTokenizer
LoadDatokFile reads a double array represented tokenizer from a file.
func ParseDatok ¶
func ParseDatok(ior io.Reader) *DaTokenizer
ParseDatok reads a double array represented tokenizer from an io.Reader.
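Example ¶
A sketch for parsing from an arbitrary io.Reader, here an opened file (imports as above, plus os; the file name is hypothetical):

f, err := os.Open("mytokenizer.datok")
if err != nil {
	log.Fatalln(err)
}
defer f.Close()

// ParseDatok accepts any io.Reader, e.g. a network stream or an
// embedded file, not just a file on disk.
dat := datok.ParseDatok(f)
if dat == nil {
	log.Fatalln("unable to parse double array tokenizer")
}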
func (*DaTokenizer) LoadFactor ¶
func (dat *DaTokenizer) LoadFactor() float64
LoadFactor returns the load factor as defined in Kanda et al. (2018), i.e. the proportion of non-empty elements to all elements.
func (*DaTokenizer) Save ¶
func (dat *DaTokenizer) Save(file string) (n int64, err error)
Save stores the double array data in a file.
func (*DaTokenizer) TransCount ¶
func (dat *DaTokenizer) TransCount() int
TransCount returns the number of transitions (aka arcs) in the finite state automaton.
func (*DaTokenizer) TransduceTokenWriter ¶
func (dat *DaTokenizer) TransduceTokenWriter(r io.Reader, w *TokenWriter) bool
TransduceTokenWriter transduces an input string against the double array FSA. The rules are always greedy. If the automaton fails, it takes the last possible token-ending branch.
Based on Mizobuchi et al. (2000), p. 129, with additional support for IDENTITY, UNKNOWN and EPSILON transitions and NONTOKEN and TOKENEND handling.
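Example ¶
A minimal transduction sketch using the plain Transduce method listed in the index, writing the tokenized text to standard output (imports as above, plus os and strings; the file name is hypothetical):

dat := datok.LoadDatokFile("mytokenizer.datok")
if dat == nil {
	log.Fatalln("unable to load double array tokenizer")
}

// Tokenize a string and write the result to stdout.
dat.Transduce(strings.NewReader("This is a sentence! And this is another."), os.Stdout)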
type MatrixTokenizer ¶
type MatrixTokenizer struct {
	// contains filtered or unexported fields
}
MatrixTokenizer represents a tokenizer implemented as a matrix FSA.
func LoadMatrixFile ¶
func LoadMatrixFile(file string) *MatrixTokenizer
LoadMatrixFile reads a matrix represented tokenizer from a file.
func ParseMatrix ¶
func ParseMatrix(ior io.Reader) *MatrixTokenizer
ParseMatrix reads a matrix represented tokenizer from an io.Reader.
func (*MatrixTokenizer) Save ¶
func (mat *MatrixTokenizer) Save(file string) (n int64, err error)
Save stores the matrix data in a file.
func (*MatrixTokenizer) TransduceTokenWriter ¶
func (mat *MatrixTokenizer) TransduceTokenWriter(r io.Reader, w *TokenWriter) bool
TransduceTokenWriter transduces an input string against the matrix FSA. The rules are always greedy. If the automaton fails, it takes the last possible token-ending branch.
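Example ¶
The matrix tokenizer is used the same way as the double array variant; a sketch with a hypothetical file name:

mat := datok.LoadMatrixFile("mytokenizer.matok")
if mat == nil {
	log.Fatalln("unable to load matrix tokenizer")
}
mat.Transduce(strings.NewReader("This is a sentence!"), os.Stdout)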
type TokenWriter ¶
type TokenWriter struct {
	SentenceEnd func(int)
	TextEnd     func(int)
	Flush       func() error
	Token       func(int, []rune)
}
func NewTokenWriter ¶
func NewTokenWriter(w io.Writer, flags Bits) *TokenWriter
NewTokenWriter creates a new token writer based on the given options.
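Example ¶
Because the callback fields are exported, a TokenWriter can also be assembled by hand, e.g. to collect tokens in a slice instead of writing them out. A sketch; reading the int argument as an offset is an assumption:

dat := datok.LoadDatokFile("mytokenizer.datok")

// Collect surface tokens instead of writing them to an io.Writer.
var tokens []string
tw := &datok.TokenWriter{
	Token: func(offset int, buf []rune) {
		// Assumption: buf holds the surface form of the current token.
		tokens = append(tokens, string(buf))
	},
	SentenceEnd: func(offset int) {},
	TextEnd:     func(offset int) {},
	Flush:       func() error { return nil },
}

dat.TransduceTokenWriter(strings.NewReader("A sentence."), tw)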