uax29

package module
v1.12.5
Published: May 26, 2023 License: MIT Imports: 0 Imported by: 0

README

This package tokenizes (splits) text into words, sentences and graphemes, based on Unicode text segmentation (UAX #29), for Unicode version 13.0.0. Details and usage are in the respective packages:

uax29/words

uax29/sentences

uax29/graphemes
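
To illustrate why grapheme boundaries are a separate concern from bytes or runes, here is a minimal sketch using only the Go standard library. It does not use this package's API; the helper name counts is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper for this illustration.
// It reports the byte length and the rune count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT.
	// A reader perceives this as a single character (one grapheme cluster).
	s := "e\u0301"

	nBytes, nRunes := counts(s)
	fmt.Println("bytes:", nBytes) // bytes: 3
	fmt.Println("runes:", nRunes) // runes: 2
}
```

Neither count matches the single character a reader sees, which is the gap that grapheme cluster segmentation under UAX #29 is meant to close.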

Why tokenize?

Any time our code operates on individual words, we are tokenizing. Often, we do it ad hoc, such as splitting on spaces, which gives inconsistent results. The Unicode standard is better: it is multi-lingual, and handles punctuation, special characters, etc.
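
As a small illustration of the ad hoc approach described above, the following sketch splits on whitespace using only the standard library. The helper name naiveWords is hypothetical and is not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveWords is a hypothetical helper for this illustration.
// It splits on whitespace only, the ad hoc approach described above.
func naiveWords(s string) []string {
	return strings.Fields(s)
}

func main() {
	// Punctuation stays attached to the neighboring word.
	for _, w := range naiveWords("Hello, world! (It's 3.14.)") {
		fmt.Printf("%q\n", w)
	}

	// Text written without spaces comes back as a single token.
	fmt.Println(len(naiveWords("你好世界")))
}
```

The tokens come back as "Hello,", "world!", "(It's" and "3.14.)", with punctuation glued on, and the Chinese string is returned as one token. These are the kinds of inconsistencies that a rules-based segmenter following UAX #29 is designed to avoid.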

Conformance

We use the official Unicode test suites.

See also

jargon, a text pipelines package for CLI and Go, which consumes this package.

Prior art

blevesearch/segment

rivo/uniseg

Other language implementations

JavaScript

Rust

Java

Python

Documentation

Overview

Package uax29 provides Unicode text segmentation (UAX #29) for words, sentences and graphemes.

See the words, sentences, and graphemes packages for details and usage.

For more information on the UAX #29 spec: https://unicode.org/reports/tr29/

Directories

Path          Synopsis
gen           Package main generates tries of Unicode properties by calling go generate at the repository root
triegen       Package triegen implements a code generator for a trie for associating unsigned integer values with UTF-8 encoded runes.
graphemes     Package graphemes implements Unicode grapheme cluster boundaries: https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
iterators     Package iterators is a support (base types) package for other packages in UAX29.
filter        Package filter provides methods for filtering via Scanners and Segmenters.
transformer   Package transformer provides a few handy transformers, for use with Scanner and Segmenter.
sentences     Package sentences implements Unicode sentence boundaries: https://unicode.org/reports/tr29/#Sentence_Boundaries
words         Package words implements Unicode word boundaries: https://unicode.org/reports/tr29/#Word_Boundaries
