sentences

package
v1.14.3
Published: Aug 29, 2024 License: MIT Imports: 3 Imported by: 2

README

An implementation of sentence boundaries from Unicode text segmentation (UAX 29), for Unicode version 15.0.0.

Quick start

go get "github.com/clipperhouse/uax29/sentences"
import "github.com/clipperhouse/uax29/sentences"

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := sentences.NewSegmenter(text)        // A segmenter is an iterator over the sentences

for segments.Next() {                           // Next() returns true until end of data or error
	fmt.Printf("%q\n", segments.Bytes())        // Do something with the current sentence
}

if err := segments.Err(); err != nil {          // Check the error
	log.Fatal(err)
}

Documentation

For our purposes, “segment”, “sentence”, and “token” are used synonymously.

Conformance

We use the Unicode test suite.

APIs

If you have a []byte

Use Segmenter for bounded memory and best performance:

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := sentences.NewSegmenter(text)        // A segmenter is an iterator over the sentences

for segments.Next() {                           // Next() returns true until end of data or error
	fmt.Printf("%q\n", segments.Bytes())        // Do something with the current sentence
}

if err := segments.Err(); err != nil {          // Check the error
	log.Fatal(err)
}

Use SegmentAll() if you prefer brevity, are not too concerned about allocations, or would be populating a [][]byte anyway.

text := []byte("Hello, 世界. Nice dog! 👍🐶")
segments := sentences.SegmentAll(text)          // Returns a slice of byte slices; each slice is a sentence

fmt.Printf("Sentences: %q\n", segments)

If you have an io.Reader

Use Scanner (which is a bufio.Scanner, those docs will tell you what to do).

r := getYourReader()                            // from a file or network maybe
scanner := sentences.NewScanner(r)

for scanner.Scan() {                            // Scan() returns true until error or EOF
	fmt.Println(scanner.Text())                 // Do something with the current sentence
}

if err := scanner.Err(); err != nil {           // Check the error
	log.Fatal(err)
}

Performance

On a Mac laptop, we see around 35 MB/s, which works out to around 180 thousand sentences per second.

You should see approximately constant memory when using Segmenter or Scanner, independent of data size. When using SegmentAll(), expect memory to be O(n) on the number of sentences.

Invalid inputs

Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

Your pipeline should probably include a call to utf8.Valid().

Documentation

Overview

Package sentences implements Unicode sentence boundaries: https://unicode.org/reports/tr29/#Sentence_Boundaries

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewScanner added in v1.0.3

func NewScanner(r io.Reader) *iterators.Scanner

NewScanner returns a Scanner, to tokenize sentences per https://unicode.org/reports/tr29/#Sentence_Boundaries. Iterate through sentences by calling Scan() until false, then check Err(). See also the bufio.Scanner docs.

Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/sentences"
)

func main() {
	text := "Hello, 世界. “Nice dog! 👍🐶”, they said."
	reader := strings.NewReader(text)

	scanner := sentences.NewScanner(reader)

	// Scan returns true until error or EOF
	for scanner.Scan() {
		fmt.Printf("%q\n", scanner.Text())
	}

	// Gotta check the error!
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:

"Hello, 世界. "
"“Nice dog! "
"👍🐶”, they said."

func NewSegmenter added in v1.7.0

func NewSegmenter(data []byte) *iterators.Segmenter

NewSegmenter returns a Segmenter, which is an iterator over the source text. Iterate while Next() is true, and access the segmented sentences via Bytes().

Example
package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/uax29/sentences"
)

func main() {
	text := []byte("Hello, 世界. “Nice dog! 👍🐶”, they said.")

	segments := sentences.NewSegmenter(text)

	// Next returns true until error or end of data
	for segments.Next() {
		fmt.Printf("%q\n", segments.Bytes())
	}

	// Gotta check the error!
	if err := segments.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:

"Hello, 世界. "
"“Nice dog! "
"👍🐶”, they said."

func SegmentAll added in v1.7.0

func SegmentAll(data []byte) [][]byte

SegmentAll will iterate through all tokens and collect them into a [][]byte. This is a convenience method -- if you will be allocating such a slice anyway, this will save you some code. The downside is that this allocation is unbounded -- O(n) on the number of tokens. Use Segmenter for more bounded memory usage.

Example
package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/sentences"
)

func main() {
	text := []byte("Hello, 世界. “Nice dog! 👍🐶”, they said.")

	segments := sentences.SegmentAll(text)
	fmt.Printf("%q\n", segments)
}
Output:

["Hello, 世界. " "“Nice dog! " "👍🐶”, they said."]

func SplitFunc added in v1.2.0

func SplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error)

SplitFunc is a bufio.SplitFunc implementation of sentence segmentation, for use with bufio.Scanner.

Types

This section is empty.
