segmenter

package
v0.1.2 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 2, 2024 License: BSD-3-Clause, Unlicense Imports: 2 Imported by: 7

Documentation

Overview

Package segmenter implements the Unicode rules used to segment a paragraph of text into graphemes, words, and lines. In particular, it provides a way of delimiting line break opportunities.

The API of the package follows the very nice iterator pattern proposed in github.com/npillmayer/uax, but uses a somewhat simpler internal implementation, inspired by Pango.

The reference documentation is at https://unicode.org/reports/tr14 (line breaking) and https://unicode.org/reports/tr29 (text segmentation).

Types

type Grapheme

type Grapheme struct {
	// Text is a subslice of the original input slice, containing the delimited grapheme
	Text []rune
	// Offset is the start of the grapheme in the input rune slice
	Offset int
}

Grapheme is the content of a grapheme delimited by the segmenter.

type GraphemeIterator

type GraphemeIterator struct {
	// contains filtered or unexported fields
}

GraphemeIterator provides a convenient way of iterating over the graphemes delimited by a `Segmenter`.

func (*GraphemeIterator) Grapheme

func (gr *GraphemeIterator) Grapheme() Grapheme

Grapheme returns the current `Grapheme`.

func (*GraphemeIterator) Next

func (gr *GraphemeIterator) Next() bool

Next advances the iterator and returns true if there is still a grapheme to process; otherwise it returns false.

type Line

type Line struct {
	// Text is a subslice of the original input slice, containing the delimited line
	Text []rune
	// Offset is the start of the line in the input rune slice
	Offset int
	// IsMandatoryBreak is true if breaking (at the end of the line)
	// is mandatory
	IsMandatoryBreak bool
}

Line is the content of a line delimited by the segmenter.
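Because Text is a subslice of the input, it covers the range starting at Offset in the input and shares memory with it. The sketch below illustrates both points. It redeclares a local `Line` type that mirrors the documented fields so that it compiles on its own, and it uses a hand-built value rather than real segmenter output; `end` and `detach` are hypothetical helpers, not part of the package.

```go
package main

import "fmt"

// Line mirrors the fields documented above; it is redeclared here only so
// the sketch compiles without importing the package.
type Line struct {
	Text             []rune
	Offset           int
	IsMandatoryBreak bool
}

// end returns the index in the input just past the line, so that
// input[l.Offset:end(l)] covers the same runes as l.Text.
func end(l Line) int { return l.Offset + len(l.Text) }

// detach returns a copy of the line whose Text no longer aliases the input.
func detach(l Line) Line {
	l.Text = append([]rune(nil), l.Text...)
	return l
}

func main() {
	input := []rune("one\ntwo")
	// Hand-built value standing in for what an iterator would return.
	second := Line{Text: input[4:7], Offset: 4}
	kept := detach(second)

	input[4] = 'T' // mutating the input is visible through the subslice
	fmt.Println(string(second.Text), string(kept.Text), end(second))
	// prints: Two two 7
}
```

Copying with something like `detach` is only needed when the input slice will be modified or reused while the segments are still in use.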

type LineIterator

type LineIterator struct {
	// contains filtered or unexported fields
}

LineIterator provides a convenient way of iterating over the lines delimited by a `Segmenter`.

func (*LineIterator) Line

func (li *LineIterator) Line() Line

Line returns the current `Line`.

func (*LineIterator) Next

func (li *LineIterator) Next() bool

Next advances the iterator and returns true if there is still a line to process; otherwise it returns false.

type Segmenter

type Segmenter struct {
	// contains filtered or unexported fields
}

Segmenter is the entry point of the package.

Usage:

var seg Segmenter
seg.Init(paragraph) // paragraph is the []rune text to segment
iter := seg.LineIterator()
for iter.Next() {
	line := iter.Line()
	_ = line // do something with line
}
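The same Next/accessor loop applies to the line, grapheme, and word iterators, so it can be factored into one generic helper that takes the two methods as method values. This is a minimal sketch, assuming Go 1.18 or later for generics. `collect` and the stand-in `sliceIter` are hypothetical names that are not part of the package; `sliceIter` exists only so the sketch runs without importing it.

```go
package main

import "fmt"

// collect drains any iterator that follows the Next/accessor pattern
// documented above. It is a hypothetical helper, not part of the package.
func collect[T any](next func() bool, current func() T) []T {
	var out []T
	for next() {
		out = append(out, current())
	}
	return out
}

// sliceIter is a stand-in iterator used only to make this sketch runnable.
// It mimics the documented contract: Next advances, then the accessor
// returns the current item.
type sliceIter struct {
	items []string
	pos   int // index of the next item to visit
}

// Next advances to the next item and reports whether there was one.
func (it *sliceIter) Next() bool {
	if it.pos >= len(it.items) {
		return false
	}
	it.pos++
	return true
}

// Item returns the current item; it must be called after a successful Next.
func (it *sliceIter) Item() string { return it.items[it.pos-1] }

func main() {
	it := &sliceIter{items: []string{"first ", "second ", "third"}}
	fmt.Println(len(collect(it.Next, it.Item)))
}
```

Given the signatures documented on this page, the equivalent call for a `*LineIterator` would presumably be `collect(iter.Next, iter.Line)`.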

func (*Segmenter) GraphemeIterator

func (sg *Segmenter) GraphemeIterator() *GraphemeIterator

GraphemeIterator returns an iterator over the graphemes delimited in [Init].

func (*Segmenter) Init

func (seg *Segmenter) Init(paragraph []rune)

Init resets the segmenter storage with the given input, and computes the attributes required to segment the text.
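Since Init takes a []rune, the Offset fields reported by the iterators count runes, not bytes. When the text started out as a string, a rune offset has to be converted before it can index that string. The sketch below uses only the standard library; `byteOffset` is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteOffset converts a rune index into s (such as a documented Offset
// field) to the corresponding byte index. It returns len(s) when runeIdx
// is at or past the end of the string.
func byteOffset(s string, runeIdx int) int {
	n := 0
	for i := range s { // i is the byte index where each rune starts
		if n == runeIdx {
			return i
		}
		n++
	}
	return len(s)
}

func main() {
	s := "h\u00e9llo w\u00f6rld"
	runes := []rune(s) // the form expected by Init
	// Byte length and rune length differ as soon as the text is not ASCII.
	fmt.Println(len(s), len(runes), utf8.RuneCountInString(s))
	// Byte index of the rune at index 6.
	fmt.Println(byteOffset(s, 6))
}
```

The conversion is linear in the length of the string, so for many lookups over the same text it may be worth building a table of byte offsets once.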

func (*Segmenter) LineIterator

func (sg *Segmenter) LineIterator() *LineIterator

LineIterator returns an iterator over the lines delimited in [Init].

func (*Segmenter) WordIterator added in v0.1.2

func (sg *Segmenter) WordIterator() *WordIterator

WordIterator returns an iterator over the words delimited in [Init].

type Word added in v0.1.2

type Word struct {
	// Text is a subslice of the original input slice, containing the delimited word
	Text []rune
	// Offset is the start of the word in the input rune slice
	Offset int
}

Word is the content of a word delimited by the segmenter.

More precisely, a word is formed by runes with the [Alphabetic] property, or with a General_Category of Number, delimited by the Word Boundary Unicode Property.

See also https://unicode.org/reports/tr29/#Word_Boundary_Rules, http://unicode.org/reports/tr44/#Alphabetic and http://unicode.org/reports/tr44/#General_Category_Values
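The rune classification described above can be approximated with the tables in Go's standard `unicode` package. This sketch covers only which runes may belong to a word, not the boundary rules of UAX #29, and it is not necessarily how the package performs the check. `isAlphabetic` and `isWordRune` are hypothetical names, and the result depends on the Unicode version shipped with the Go toolchain.

```go
package main

import (
	"fmt"
	"unicode"
)

// isAlphabetic approximates the derived Alphabetic property from UAX #44
// using the tables shipped with Go's unicode package.
func isAlphabetic(r rune) bool {
	return unicode.In(r,
		unicode.L,  // all letters: Lu, Ll, Lt, Lm, Lo
		unicode.Nl, // letter numbers
		unicode.Other_Alphabetic,
		unicode.Other_Lowercase,
		unicode.Other_Uppercase,
	)
}

// isWordRune reports whether r could take part in a word under the
// description above: Alphabetic, or a General_Category of Number.
func isWordRune(r rune) bool {
	return isAlphabetic(r) || unicode.IsNumber(r)
}

func main() {
	for _, r := range "a9, \u00e9" {
		fmt.Printf("%q %v\n", r, isWordRune(r))
	}
}
```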

type WordIterator added in v0.1.2

type WordIterator struct {
	// contains filtered or unexported fields
}

func (*WordIterator) Next added in v0.1.2

func (gr *WordIterator) Next() bool

Next advances the iterator and returns true if there is still a word to process; otherwise it returns false.

func (*WordIterator) Word added in v0.1.2

func (gr *WordIterator) Word() Word

Word returns the current `Word`.
