Documentation
Overview
Package segmenter implements Unicode rules used to segment a paragraph of text according to several criteria. In particular, it provides a way of delimiting line break opportunities.
The API of the package follows the very nice iterator pattern proposed in github.com/npillmayer/uax, but uses a somewhat simpler internal implementation, inspired by Pango.
The reference documentation is at https://unicode.org/reports/tr14 and https://unicode.org/reports/tr29.
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Grapheme
type Grapheme struct {
	// Text is a subslice of the original input slice, containing the delimited grapheme
	Text []rune

	// Offset is the start of the grapheme in the input rune slice
	Offset int
}
Grapheme is the content of a grapheme delimited by the segmenter.
type GraphemeIterator
type GraphemeIterator struct {
// contains filtered or unexported fields
}
GraphemeIterator provides a convenient way of iterating over the graphemes delimited by a `Segmenter`.
func (*GraphemeIterator) Grapheme
func (gr *GraphemeIterator) Grapheme() Grapheme
Grapheme returns the current `Grapheme`.
func (*GraphemeIterator) Next
func (gr *GraphemeIterator) Next() bool
Next returns true if there is still a grapheme to process, and advances the iterator; otherwise it returns false.
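The Next/Grapheme pair is meant to be used in a scanner-style loop. The sketch below is not this package's implementation: it is a self-contained stand-in (the names `toyGraphemeIterator` and `newToyGraphemeIterator` are hypothetical) that only attaches combining marks to the preceding rune, which is a small subset of the rules in https://unicode.org/reports/tr29. It is meant to illustrate how `Text` and `Offset` relate to the input rune slice.

```go
package main

import (
	"fmt"
	"unicode"
)

// Grapheme mirrors the documented struct: Text is a subslice of the input,
// Offset is the index of its first rune in the input.
type Grapheme struct {
	Text   []rune
	Offset int
}

// toyGraphemeIterator is a hypothetical stand-in for GraphemeIterator.
// Assumption: it only groups a base rune with the combining marks that
// follow it, ignoring every other grapheme cluster rule.
type toyGraphemeIterator struct {
	src  []rune
	pos  int // start of the next grapheme
	last Grapheme
}

func newToyGraphemeIterator(src []rune) *toyGraphemeIterator {
	return &toyGraphemeIterator{src: src}
}

// Next advances to the next grapheme and reports whether one was found.
func (it *toyGraphemeIterator) Next() bool {
	if it.pos >= len(it.src) {
		return false
	}
	start := it.pos
	end := start + 1
	// Extend the cluster over any following combining marks (category M).
	for end < len(it.src) && unicode.Is(unicode.M, it.src[end]) {
		end++
	}
	it.last = Grapheme{Text: it.src[start:end], Offset: start}
	it.pos = end
	return true
}

// Grapheme returns the grapheme found by the last successful call to Next.
func (it *toyGraphemeIterator) Grapheme() Grapheme { return it.last }

func main() {
	input := []rune("e\u0301a") // 'e' + combining acute accent, then 'a'
	it := newToyGraphemeIterator(input)
	for it.Next() {
		g := it.Grapheme()
		fmt.Printf("offset=%d runes=%d\n", g.Offset, len(g.Text))
	}
}
```

Because `Text` is a subslice rather than a copy, `input[g.Offset:g.Offset+len(g.Text)]` designates the same runes as `g.Text` in this sketch.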
type Line
type Line struct {
	// Text is a subslice of the original input slice, containing the delimited line
	Text []rune

	// Offset is the start of the line in the input rune slice
	Offset int

	// IsMandatoryBreak is true if breaking (at the end of the line)
	// is mandatory
	IsMandatoryBreak bool
}
Line is the content of a line delimited by the segmenter.
type LineIterator
type LineIterator struct {
// contains filtered or unexported fields
}
LineIterator provides a convenient way of iterating over the lines delimited by a `Segmenter`.
func (*LineIterator) Next
func (li *LineIterator) Next() bool
Next returns true if there is still a line to process, and advances the iterator; otherwise it returns false.
type Segmenter
type Segmenter struct {
// contains filtered or unexported fields
}
Segmenter is the entry point of the package.
Usage:
var seg Segmenter
seg.Init(...)
iter := seg.LineIterator()
for iter.Next() {
	... // do something with iter.Line()
}
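To make the Init / LineIterator / Next / Line flow concrete without depending on the package, here is a minimal runnable sketch of the same pattern. Everything prefixed with `toy` is hypothetical, including the `Init` parameter type, and the breaking logic is a deliberate simplification: a break opportunity after each run of spaces, a mandatory break after a newline, and (a choice of this sketch) a mandatory break at the end of the input. The actual rules are those of https://unicode.org/reports/tr14.

```go
package main

import "fmt"

// Line mirrors the documented struct.
type Line struct {
	Text             []rune
	Offset           int
	IsMandatoryBreak bool
}

// toySegmenter is a hypothetical stand-in for Segmenter.
type toySegmenter struct {
	text []rune
}

// Init stores the input; a real segmenter would also compute break attributes.
func (s *toySegmenter) Init(text []rune) { s.text = text }

// LineIterator returns an iterator over the lines of the stored input.
func (s *toySegmenter) LineIterator() *toyLineIterator {
	return &toyLineIterator{src: s.text}
}

type toyLineIterator struct {
	src  []rune
	pos  int // start of the next line
	last Line
}

// Next advances to the next line and reports whether one was found.
func (it *toyLineIterator) Next() bool {
	if it.pos >= len(it.src) {
		return false
	}
	start := it.pos
	for i := start; i < len(it.src); i++ {
		atEnd := i+1 == len(it.src)
		switch {
		case it.src[i] == '\n' || atEnd:
			// Assumption of this sketch: newline and end of input are mandatory breaks.
			it.last = Line{Text: it.src[start : i+1], Offset: start, IsMandatoryBreak: true}
		case it.src[i] == ' ' && it.src[i+1] != ' ':
			// Break opportunity after the last space of a run.
			it.last = Line{Text: it.src[start : i+1], Offset: start}
		default:
			continue
		}
		it.pos = i + 1
		return true
	}
	return false
}

// Line returns the line found by the last successful call to Next.
func (it *toyLineIterator) Line() Line { return it.last }

func main() {
	var seg toySegmenter
	seg.Init([]rune("ab cd\nef"))
	iter := seg.LineIterator()
	for iter.Next() {
		l := iter.Line()
		fmt.Printf("%q offset=%d mandatory=%t\n", string(l.Text), l.Offset, l.IsMandatoryBreak)
	}
}
```

A layout engine would typically treat the non-mandatory lines as candidates to be merged until the available width is exhausted, and always start a new visual line after a mandatory one.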
func (*Segmenter) GraphemeIterator
func (sg *Segmenter) GraphemeIterator() *GraphemeIterator
GraphemeIterator returns an iterator over the graphemes delimited in [Init].
func (*Segmenter) Init
Init resets the segmenter storage with the given input, and computes the attributes required to segment the text.
func (*Segmenter) LineIterator
func (sg *Segmenter) LineIterator() *LineIterator
LineIterator returns an iterator over the lines delimited in [Init].
func (*Segmenter) WordIterator added in v0.1.2
func (sg *Segmenter) WordIterator() *WordIterator
WordIterator returns an iterator over the words delimited in [Init].
type Word added in v0.1.2
type Word struct {
	// Text is a subslice of the original input slice, containing the delimited word
	Text []rune

	// Offset is the start of the word in the input rune slice
	Offset int
}
Word is the content of a word delimited by the segmenter.
More precisely, a word is formed by runes with the [Alphabetic] property, or with a General_Category of Number, delimited by the Word Boundary Unicode Property.
See also https://unicode.org/reports/tr29/#Word_Boundary_Rules, http://unicode.org/reports/tr44/#Alphabetic and http://unicode.org/reports/tr44/#General_Category_Values
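The rune-level test described above can be approximated with the standard `unicode` package. The sketch below is an assumption about how such a check could be written, not the package's own code: the helper names `isAlphabetic` and `isWordRune` are hypothetical, and the Alphabetic property is rebuilt from the letter categories, the Nl category, and the `Other_Alphabetic`, `Other_Lowercase` and `Other_Uppercase` tables, following its definition in UAX #44. How the package combines this test with the word boundary rules is not shown here.

```go
package main

import (
	"fmt"
	"unicode"
)

// isAlphabetic approximates the derived Alphabetic property:
// letters (category L), letter numbers (Nl), and the Other_* contributory tables.
func isAlphabetic(r rune) bool {
	return unicode.IsLetter(r) ||
		unicode.In(r, unicode.Nl, unicode.Other_Alphabetic,
			unicode.Other_Lowercase, unicode.Other_Uppercase)
}

// isWordRune reports whether r is Alphabetic or has a General_Category of
// Number (Nd, Nl or No), the two conditions named in the description of Word.
func isWordRune(r rune) bool {
	return isAlphabetic(r) || unicode.IsNumber(r)
}

func main() {
	for _, r := range "a9é ,-" {
		fmt.Printf("%q word-forming: %t\n", r, isWordRune(r))
	}
}
```

The tables shipped with the Go standard library follow the Unicode version of the Go release in use, so results for recently added characters may differ from a segmenter built against another Unicode version.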
type WordIterator added in v0.1.2
type WordIterator struct {
// contains filtered or unexported fields
}
func (*WordIterator) Next added in v0.1.2
func (gr *WordIterator) Next() bool
Next returns true if there is still a word to process, and advances the iterator; otherwise it returns false.
func (*WordIterator) Word added in v0.1.2
func (gr *WordIterator) Word() Word
Word returns the current `Word`.