uax29

v1.14.3
Published: Aug 29, 2024 License: MIT Imports: 0 Imported by: 0

README

This package tokenizes (splits) text into words, sentences and graphemes, based on Unicode text segmentation (UAX #29), for Unicode version 15.0.0. Details and usage are in the respective packages:

uax29/words

uax29/sentences

uax29/graphemes

uax29/phrases

Why tokenize?

Any time our code operates on individual words, we are tokenizing. Often, we do it ad hoc, such as splitting on spaces, which gives inconsistent results. The Unicode standard is better: it is multi-lingual, and handles punctuation, special characters, etc.
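
To make the inconsistency concrete, here is a minimal sketch of the ad hoc approach using only the standard library. The helper name splitOnSpaces is hypothetical and is not part of uax29; it only shows what whitespace splitting does to punctuation.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnSpaces is a hypothetical helper showing the ad hoc approach:
// it splits only on whitespace, so punctuation stays glued to words.
func splitOnSpaces(text string) []string {
	return strings.Fields(text)
}

func main() {
	for _, token := range splitOnSpaces("Hello, 世界. Nice dog!") {
		fmt.Printf("%q\n", token)
	}
	// Prints "Hello,", "世界.", "Nice" and "dog!": the comma, period and
	// exclamation mark are not separated from the words.
}
```

A search for "dog" would miss the token "dog!" here, which is the kind of inconsistency that a standards-based tokenizer avoids.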

Uses

The uax29 module has 4 tokenizers. From largest to smallest unit: sentences → phrases → words → graphemes. Words is the most common use.
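
Graphemes, the smallest unit, are user-perceived characters, and one grapheme can span several code points. The sketch below uses only the standard library (no uax29 calls) to show why counting bytes or runes does not give a character count; the helper name counts is ours, for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of bytes and the number of runes (code points) in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as a single
	// character, and UAX #29 treats it as a single grapheme cluster.
	accented := "e\u0301"

	nBytes, nRunes := counts(accented)
	fmt.Println(nBytes, nRunes) // 3 2
}
```

Neither count is 1, so code that needs user-perceived characters (truncation, cursor movement) is a candidate for grapheme segmentation.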

You might use this for inverted indexes, full-text search, TF-IDF, BM25, embeddings, etc. Anything that needs word boundaries.
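
As a rough illustration of one such use, the following sketch builds a tiny inverted index. The tokenize function is a stand-in written for this example (an assumption, not the uax29 API): it splits on non-letter, non-digit characters and lowercases. In real use, a UAX #29 word segmenter would take its place. The names tokenize and buildIndex are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// tokenize is a stand-in tokenizer (an assumption for this sketch): it splits
// on anything that is not a letter or digit and lowercases each token.
func tokenize(text string) []string {
	isSeparator := func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	}
	tokens := strings.FieldsFunc(text, isSeparator)
	for i, t := range tokens {
		tokens[i] = strings.ToLower(t)
	}
	return tokens
}

// buildIndex maps each token to the IDs of the documents that contain it.
// IDs are slice positions and appear in increasing order.
func buildIndex(docs []string) map[string][]int {
	index := make(map[string][]int)
	for id, doc := range docs {
		seen := make(map[string]bool) // record each token once per document
		for _, token := range tokenize(doc) {
			if !seen[token] {
				seen[token] = true
				index[token] = append(index[token], id)
			}
		}
	}
	return index
}

func main() {
	docs := []string{"Nice dog!", "The dog barks.", "Hello, 世界."}
	index := buildIndex(docs)

	terms := make([]string, 0, len(index))
	for term := range index {
		terms = append(terms, term)
	}
	sort.Strings(terms) // map iteration order is random, so sort for stable output

	for _, term := range terms {
		fmt.Printf("%s -> %v\n", term, index[term])
	}
}
```

The quality of the index depends entirely on where the tokenizer places word boundaries, which is why a consistent, multi-lingual definition matters.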

If you're doing embeddings, the definition of “meaningful unit” will depend on your application. You might choose sentences, phrases, words, or a combination.

Conformance

We use the official Unicode test suites.

Quick start

go get "github.com/clipperhouse/uax29/words"

import (
	"fmt"
	"log"

	"github.com/clipperhouse/uax29/words"
)

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := words.NewSegmenter(text)            // A segmenter is an iterator over the words

for segments.Next() {                           // Next() returns true until end of data or error
	fmt.Printf("%q\n", segments.Bytes())        // Do something with the current token
}

if err := segments.Err(); err != nil {          // Check the error
	log.Fatal(err)
}

See also

jargon, a text pipelines package for CLI and Go, which consumes this package.

Prior art

blevesearch/segment

rivo/uniseg

Other language implementations

C# (also by me)

JavaScript

Rust

Java

Python

Documentation

Overview

Package uax29 provides Unicode text segmentation (UAX #29) for words, sentences and graphemes.

See the words, sentences, and graphemes packages for details and usage.

For more information on the UAX #29 spec: https://unicode.org/reports/tr29/

Directories

Path          Synopsis
gen           Package main generates tries of Unicode properties by calling go generate at the repository root
triegen       Package triegen implements a code generator for a trie for associating unsigned integer values with UTF-8 encoded runes.
graphemes     Package graphemes implements Unicode grapheme cluster boundaries: https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
iterators     Package iterators is a support (base types) package for other packages in UAX29.
filter        Package filter provides methods for filtering via Scanners and Segmenters.
transformer   Package transformer provides a few handy transformers, for use with Scanner and Segmenter.
phrases       Package phrases implements Unicode phrase boundaries: https://unicode.org/reports/tr29/#phrase_Boundaries
sentences     Package sentences implements Unicode sentence boundaries: https://unicode.org/reports/tr29/#Sentence_Boundaries
words         Package words implements Unicode word boundaries: https://unicode.org/reports/tr29/#Word_Boundaries
