Package words v1.14.3

Published: Aug 29, 2024 License: MIT Imports: 3 Imported by: 3

README

An implementation of word boundaries from Unicode text segmentation (UAX 29), for Unicode version 15.0.0.

Quick start

go get "github.com/clipperhouse/uax29/words"
import "github.com/clipperhouse/uax29/words"

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := words.NewSegmenter(text)            // A segmenter is an iterator over the words

for segments.Next() {                           // Next() returns true until end of data or error
	fmt.Printf("%q\n", segments.Bytes())        // Do something with the current token
}

if segments.Err() != nil {                      // Check the error
	log.Fatal(segments.Err())
}

Documentation

Note: this package will return all tokens, including whitespace and punctuation — it's not strictly “words” in the common sense. If you wish to omit things like whitespace and punctuation, you can use a filter (see below). For our purposes, “segment”, “word”, and “token” are used synonymously.

Conformance

We use the Unicode test suite; see the status badge in the repository for current conformance results.

APIs

If you have a []byte

Use Segmenter for bounded memory and best performance:

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := words.NewSegmenter(text)            // A segmenter is an iterator over the words

for segments.Next() {                           // Next() returns true until end of data or error
	fmt.Printf("%q\n", segments.Bytes())        // Do something with the current word
}

if segments.Err() != nil {                      // Check the error
	log.Fatal(segments.Err())
}

Use SegmentAll() if you prefer brevity, and are not too concerned about allocations.

segments := words.SegmentAll(text)             // Returns a slice of byte slices; each slice is a word

fmt.Printf("Words: %q\n", segments)

If you have an io.Reader

Use Scanner

r := getYourReader()                            // from a file or network maybe
scanner := words.NewScanner(r)

for scanner.Scan() {                            // Scan() returns true until error or EOF
	fmt.Println(scanner.Text())                 // Do something with the current word
}

if scanner.Err() != nil {                       // Check the error
	log.Fatal(scanner.Err())
}

Performance

On a Mac M2 laptop, we see around 160MB/s, which works out to around 40 million words (tokens, really) per second.

You should see approximately constant memory when using Segmenter or Scanner, independent of data size. When using SegmentAll(), expect memory to be O(n) on the number of words (one slice per word, 24 bytes).

Invalid inputs

Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

Your pipeline should probably include a call to utf8.Valid().
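For example, a minimal sketch of validating up front; readYourInput() is a hypothetical stand-in for however you load data, and utf8.Valid comes from the standard library's unicode/utf8 package:

text := readYourInput()                         // from a file or network, say

if !utf8.Valid(text) {                          // unicode/utf8 from the standard library
	log.Fatal("input is not valid UTF-8")       // or repair/skip, as your pipeline prefers
}

segments := words.NewSegmenter(text)            // segment once the input is known-valid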

Filters

You can add a filter to a Scanner or Segmenter.

For example, the Segmenter / Scanner returns all tokens, split by word boundaries. This includes things like whitespace and punctuation, which may not be what one means by “words”. By using a filter, you can omit them.

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := words.NewSegmenter(text)
segments.Filter(filter.Wordlike)

for segments.Next() {
	// Note that whitespace and punctuation are omitted.
	fmt.Printf("%q\n", segments.Bytes())
}

if segments.Err() != nil {
	log.Fatal(segments.Err())
}

You can write your own filters (predicates), with arbitrary logic, by implementing a func([]byte) bool. You can also create a filter based on Unicode categories with the filter.Contains and filter.Entirely methods.
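As a sketch, a custom predicate might keep only tokens longer than one rune; this example is illustrative, not part of the package, and utf8.RuneCount is from the standard library:

moreThanOneRune := func(token []byte) bool {    // any func([]byte) bool works as a filter
	return utf8.RuneCount(token) > 1
}

segments := words.NewSegmenter(text)
segments.Filter(moreThanOneRune)                // single-rune tokens are now omitted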

Transforms

Tokens can be modified by adding a transformer to a Scanner or Segmenter.

You might wish to lowercase all the words, for example:

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := words.NewSegmenter(text)
segments.Transform(transformer.Lower)

for segments.Next() {
	// Note that tokens come out lowercase
	fmt.Printf("%q\n", segments.Bytes())
}

if segments.Err() != nil {
	log.Fatal(segments.Err())
}

Here are a few more examples.

We use the x/text/transform package. We can accept anything that implements the transform.Transformer interface. Many things in x/text do that, such as runes, normalization, casing, and encoding.
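As a sketch, one might strip diacritics by chaining transformers from golang.org/x/text (norm, runes); this assumes the Segmenter's Transform accepts any transform.Transformer, as described above:

// uses golang.org/x/text/transform, golang.org/x/text/runes,
// golang.org/x/text/unicode/norm, and the standard unicode package
stripMarks := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)

segments := words.NewSegmenter([]byte("Café résumé"))
segments.Transform(stripMarks)

for segments.Next() {
	fmt.Printf("%q\n", segments.Bytes())        // tokens come out as "Cafe", " ", "resume"
}

if segments.Err() != nil {
	log.Fatal(segments.Err())
}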

See also this stemming package.

Limitations

This package follows the basic UAX #29 specification. For more idiomatic treatment of words across languages, there is more that can be done; see the “Notes” section of the standard:

It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.

I also found this article helpful.

Documentation

Overview

Package words implements Unicode word boundaries: https://unicode.org/reports/tr29/#Word_Boundaries

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func BleveIdeographic added in v1.12.0

func BleveIdeographic(token []byte) bool

BleveIdeographic determines if a token is comprised of ideographs, by the Bleve segmenter's definition. It is the union of Han, Katakana, & Hiragana. See https://github.com/blevesearch/segment/blob/master/segment_words.rl and search for uses of "Ideo". This API is experimental.

func BleveNumeric added in v1.12.0

func BleveNumeric(token []byte) bool

BleveNumeric determines if a token is numeric, by the Bleve segmenter's definition; see https://github.com/blevesearch/segment/blob/master/segment_words.rl#L199-L207. This API is experimental.
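Since these predicates have the func([]byte) bool shape, they can presumably be passed to a Segmenter's or Scanner's Filter, like any other predicate described in the README above. A hedged sketch:

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := []byte("世界 hello 123")

	seg := words.NewSegmenter(text)
	seg.Filter(words.BleveNumeric) // keep only tokens Bleve considers numeric

	for seg.Next() {
		fmt.Printf("%q\n", seg.Bytes())
	}

	if err := seg.Err(); err != nil {
		log.Fatal(err)
	}
}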

func NewScanner

func NewScanner(r io.Reader) *iterators.Scanner

NewScanner returns a Scanner, to tokenize words per https://unicode.org/reports/tr29/#Word_Boundaries. Iterate through words by calling Scan() until false, then check Err(). See also the bufio.Scanner docs.

Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/iterators/filter"
	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := "Hello, 世界. Nice dog! 👍🐶"
	r := strings.NewReader(text)

	sc := words.NewScanner(r)
	sc.Filter(filter.Wordlike) // let's exclude whitespace & punctuation

	// Scan returns true until error or EOF
	for sc.Scan() {
		// Do something with the token (segment)
		fmt.Printf("%q\n", sc.Text())
	}

	// Gotta check the error!
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:

"Hello"
"世"
"界"
"Nice"
"dog"
"👍"
"🐶"

func NewSegmenter added in v1.7.0

func NewSegmenter(data []byte) *iterators.Segmenter

NewSegmenter returns a Segmenter, which is an iterator over the source text. Iterate while Next() is true, and access the segmented words via Bytes().

Example
package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/uax29/iterators/filter"
	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := []byte("Hello, 世界. Nice dog! 👍🐶")

	seg := words.NewSegmenter(text)
	seg.Filter(filter.Wordlike) // let's exclude whitespace & punctuation

	// Next returns true until error or end of data
	for seg.Next() {
		// Do something with the token (segment)
		fmt.Printf("%q\n", seg.Bytes())
	}

	// Gotta check the error!
	if err := seg.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:

"Hello"
"世"
"界"
"Nice"
"dog"
"👍"
"🐶"

func SegmentAll added in v1.7.0

func SegmentAll(data []byte) [][]byte

SegmentAll will iterate through all tokens and collect them into a [][]byte. This is a convenience method -- if you will be allocating such a slice anyway, this will save you some code. The downside is that this allocation is unbounded -- O(n) on the number of tokens. Use Segmenter for more bounded memory usage.

Example
package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := []byte("Hello, 世界. Nice dog! 👍🐶")

	segments := words.SegmentAll(text)
	fmt.Printf("%q\n", segments)
}
Output:

["Hello" "," " " "世" "界" "." " " "Nice" " " "dog" "!" " " "👍" "🐶"]

func SplitFunc added in v1.2.0

func SplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error)

SplitFunc is a bufio.SplitFunc implementation of word segmentation, for use with bufio.Scanner.
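Per the above, SplitFunc plugs directly into a bufio.Scanner via its Split method. A minimal sketch:

package main

import (
	"bufio"
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	r := strings.NewReader("Hello, 世界. Nice dog! 👍🐶")

	sc := bufio.NewScanner(r)
	sc.Split(words.SplitFunc) // tokenize on UAX #29 word boundaries

	for sc.Scan() {
		fmt.Printf("%q\n", sc.Text())
	}

	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}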

Types

This section is empty.
