phrases


README

An implementation of “phrase boundaries”, a variation on word boundaries from Unicode text segmentation (UAX #29).

“Phrases” are not a Unicode standard; they are our own definition, which we think may be useful. We define a phrase as “a series of words separated only by spaces”. Punctuation breaks phrases, and emojis are treated as words.

Quick start

go get github.com/clipperhouse/uax29/phrases
text := []byte("Hello, 世界. Nice — and totally adorable — dog; perhaps the “best one”! 🏆 🐶")

phrase := phrases.NewSegmenter(text)

// Next returns true until error or end of data
for phrase.Next() {
	// Do something with the phrase
	fmt.Printf("%q\n", phrase.Bytes())
}

// Gotta check the error!
if err := phrase.Err(); err != nil {
	log.Fatal(err)
}
// Output: "Hello"
// ","
// " "
// "世"
// "界"
// "."
// " Nice "
// "—"
// " and totally adorable "
// "—"
// " dog"
// ";"
// " perhaps the "
// "“"
// "best one"
// "”"
// "!"
// " 🏆 🐶"

Documentation

Note: this package returns all tokens, including punctuation and whitespace, so it is not strictly “phrases” in the common sense. If you wish to omit certain tokens, use a filter (see below). For our purposes, “segment”, “phrase”, and “token” are used synonymously.

APIs

If you have a []byte

Use Segmenter for bounded memory and best performance:

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := phrases.NewSegmenter(text)          // A segmenter is an iterator over the phrases

for segments.Next() {                           // Next() returns true until end of data or error
	fmt.Printf("%q\n", segments.Bytes())        // Do something with the current phrase
}

if segments.Err() != nil {                      // Check the error
	log.Fatal(segments.Err())
}

Use SegmentAll() if you prefer brevity, and are not too concerned about allocations.

segments := phrases.SegmentAll(text)            // Returns a slice of byte slices; each slice is a phrase

fmt.Printf("phrases: %q\n", segments)

If you have an io.Reader

Use Scanner

r := getYourReader()                            // from a file or network maybe
scanner := phrases.NewScanner(r)

for scanner.Scan() {                            // Scan() returns true until error or EOF
	fmt.Println(scanner.Text())                 // Do something with the current phrase
}

if scanner.Err() != nil {                       // Check the error
	log.Fatal(scanner.Err())
}

Performance

On a Mac M2 laptop, we see around 240 MB/s, which works out to around 30 million phrases (tokens, really) per second.

You should see approximately constant memory when using Segmenter or Scanner, independent of data size. When using SegmentAll(), expect memory to be O(n) on the number of phrases (one slice per phrase).

Uses

The uax29 module has 4 tokenizers, from coarsest to finest: sentences → phrases → words → graphemes.

For best results, you may wish to first split sentences, and then split phrases within those sentences.

If you're doing embeddings, the definition of “meaningful unit” will depend on your application. You might choose sentences, phrases, words, or all of the above. You can tokenize the tokens of other tokenizers.
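
For illustration, here is a minimal sketch of that nesting, assuming the sibling sentences package in this module (which offers the same Segmenter API): split sentences first, then split phrases within each sentence.

text := []byte("Hello, 世界. Nice dog! 👍🐶")

sents := sentences.NewSegmenter(text)           // outer iterator: sentences

for sents.Next() {
	phrase := phrases.NewSegmenter(sents.Bytes())   // inner iterator: phrases within this sentence
	for phrase.Next() {
		fmt.Printf("%q\n", phrase.Bytes())
	}
	if err := phrase.Err(); err != nil {
		log.Fatal(err)
	}
}

if err := sents.Err(); err != nil {
	log.Fatal(err)
}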

Invalid inputs

Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

Your pipeline should probably include a call to utf8.Valid().
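
A minimal sketch of such a guard, before handing text to the tokenizer:

text := getYourText()                           // from user input, a file, or the network

if !utf8.Valid(text) {
	// Reject or sanitize here; bytes.ToValidUTF8 from the
	// standard library is one way to replace bad sequences.
	log.Fatal("invalid UTF-8 input")
}

segments := phrases.NewSegmenter(text)          // safe to tokenize now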

Filters

You can add a filter to a Scanner or Segmenter.

For example, the tokenizer returns all tokens, split by phrase boundaries. This may include things like punctuation, which may not be what one means by “phrases”. By using a filter, you can omit them.

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := phrases.NewSegmenter(text)
segments.Filter(filter.Wordlike)

for segments.Next() {
	fmt.Printf("%q\n", segments.Bytes())
}

if segments.Err() != nil {
	log.Fatal(segments.Err())
}

You can write your own filters (predicates), with arbitrary logic, by implementing a func([]byte) bool. You can also create a filter based on Unicode categories with the filter.Contains and filter.Entirely functions.
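
For example, here is a sketch of a custom (hypothetical) predicate that drops single-rune tokens, such as lone spaces and punctuation; any func([]byte) bool will do:

longerThanOneRune := func(token []byte) bool {
	return utf8.RuneCount(token) > 1
}

segments := phrases.NewSegmenter(text)
segments.Filter(longerThanOneRune)

for segments.Next() {
	fmt.Printf("%q\n", segments.Bytes())        // only multi-rune tokens come out
}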

Transforms

Tokens can be modified by adding a transformer to a Scanner or Segmenter.

You might wish to lowercase all the phrases, for example:

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := phrases.NewSegmenter(text)
segments.Transform(transformer.Lower)

for segments.Next() {
	// Note that tokens come out lowercase
	fmt.Printf("%q\n", segments.Bytes())
}

if segments.Err() != nil {
	log.Fatal(segments.Err())
}

Here are a few more examples.

We use the x/text/transform package. We can accept anything that implements the transform.Transformer interface. Many things in x/text do that, such as runes, normalization, casing, and encoding.
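
As a sketch, assuming Transform accepts any transform.Transformer as described above, one might strip diacritics by chaining the x/text/transform, x/text/runes, and x/text/unicode/norm packages:

stripAccents := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)

segments := phrases.NewSegmenter(text)
segments.Transform(stripAccents)                // “café” comes out as “cafe”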

See also this stemming package.

Limitations

This package derives from the basic UAX #29 specification. For more idiomatic treatment of phrases across languages, there is more that could be done; see the “Notes” section of the standard:

It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.

I also found this article helpful.

Documentation

Overview

Package phrases implements Unicode phrase boundaries: https://unicode.org/reports/tr29/#phrase_Boundaries

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewScanner

func NewScanner(r io.Reader) *iterators.Scanner

NewScanner returns a Scanner, to tokenize phrases per https://unicode.org/reports/tr29/#phrase_Boundaries. Iterate through phrases by calling Scan() until false, then check Err(). See also the bufio.Scanner docs.

Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/phrases"
)

func main() {
	text := "Hello, 世界. Nice — and adorable — dog; perhaps the “best one”! 🏆 🐶"
	r := strings.NewReader(text)

	sc := phrases.NewScanner(r)

	// Scan returns true until error or EOF
	for sc.Scan() {
		// Do something with the token (segment)
		fmt.Printf("%q\n", sc.Text())
	}

	// Gotta check the error!
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:

"Hello"
","
" "
"世"
"界"
"."
" Nice "
"—"
" and adorable "
"—"
" dog"
";"
" perhaps the "
"“"
"best one"
"”"
"!"
" 🏆 🐶"

func NewSegmenter

func NewSegmenter(data []byte) *iterators.Segmenter

NewSegmenter returns a Segmenter, which is an iterator over the source text. Iterate while Next() is true, and access the segmented phrases via Bytes().

Example
package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/uax29/phrases"
)

func main() {
	text := []byte("Hello, 世界. Nice — and adorable — dog; perhaps the “best one”! 🏆 🐶")

	phrase := phrases.NewSegmenter(text)

	// Next returns true until error or end of data
	for phrase.Next() {
		// Do something with the phrase
		fmt.Printf("%q\n", phrase.Bytes())
	}

	// Gotta check the error!
	if err := phrase.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:

"Hello"
","
" "
"世"
"界"
"."
" Nice "
"—"
" and adorable "
"—"
" dog"
";"
" perhaps the "
"“"
"best one"
"”"
"!"
" 🏆 🐶"

func SegmentAll

func SegmentAll(data []byte) [][]byte

SegmentAll will iterate through all tokens and collect them into a [][]byte. This is a convenience method; if you will be allocating such a slice anyway, this will save you some code. The downside is that this allocation is unbounded, O(n) on the number of tokens. Use Segmenter for bounded memory usage.

Example
package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/phrases"
)

func main() {
	text := []byte("Hello, 世界. Nice — and adorable — dog; perhaps the best one! 👍🐶")

	segments := phrases.SegmentAll(text)
	fmt.Printf("%q\n", segments)
}
Output:

["Hello" "," " " "世" "界" "." " Nice " "—" " and adorable " "—" " dog" ";" " perhaps the best one" "!" " 👍🐶"]

func SplitFunc

func SplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error)

SplitFunc is a bufio.SplitFunc implementation of phrase segmentation, for use with bufio.Scanner.
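
For example, plugging it into a standard bufio.Scanner might look like this:

r := strings.NewReader("Hello, 世界. Nice dog! 👍🐶")

sc := bufio.NewScanner(r)
sc.Split(phrases.SplitFunc)                     // phrase boundaries instead of the default lines

for sc.Scan() {
	fmt.Printf("%q\n", sc.Text())
}

if err := sc.Err(); err != nil {
	log.Fatal(err)
}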

Types

This section is empty.
