Documentation ¶
Overview ¶
Package uax is about Unicode Annexes and their algorithms.
Description ¶
From the Unicode Consortium:
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.
[...]
A string of Unicode‐encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end‐of‐sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user.
[...]
There are many different ways to divide text elements corresponding to user‐perceived characters, words, and sentences, and the Unicode Standard does not restrict the ways in which implementations can produce these divisions.
This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments. For example, reliable detection of word boundaries in languages such as Thai, Lao, Chinese, or Japanese requires the use of dictionary lookup, analogous to English hyphenation.
Package Contents ¶
Implementations of specific UAX algorithms is done in the various sub-packages of uax. The driver type for some of the breaking-algorithms sits in sub-package segment and will use breaker-algorithms from other sub-packages.
Base package uax provides some of the necessary means to implement UAX breaking algorithms. Please note that it is in now way mandatory to use the supporting types and functions of this package. Implementors of additional breaking algorithms are free to ignore some or all of the helpers and instead implement their breaking algorithms from scratch.
Every implementation of UAX breaking algorithms has to handle the trade-off between efficiency and understandability. Algorithms as described in the Unicodes Annex documents are no easy read when considering all the details and edge cases. Getting it 100% right therefore sometimes may be tricky. Implementations in the sub-packages of uax try to strike a balance between efficiency and readability. The helper classes of uax allow implementors to transform UAX-rules into fairly readable small functions. From a maintenance point-of-view this is preferrable to huge and complex cascades of if-statements, which may provide better performance, but are hard to understand. Most of the breaking algorithms within sub-packages of uax therefore utilize the helper types from package uax.
We perform segmenting Unicode text based on rules, which are short regular expressions, i.e. finite state automata. This corresponds well with the formal UAX description of rules (except for the Bidi-rules, which are better understood as rules for a context-sensitive grammar). Every step within a rule is performed by executing a function. This function recognizes a single code-point class and returns another function. The returned function represents the expectation for the next code-point(-class). These kind of matching by function is continued until a rule is accepted or aborted.
An example let's consider rule WB13b “Do not break from extenders” from UAX#29:
ExtendNumLet x (ALetter | Hebrew_Letter| Numeric | Katakana)
The 'x' denotes a suppressed break. All the identifiers are UAX#29-specific classes for code-points. Matching them will call two functions in sequence:
rule_WB13b( … ) // match ExtendNumLet -> finish_WB13b( … ) // match any of ALetter … Katakana
The final return value will either signal an accept or abort.
The uax helper to perform this kind of matching is called Recognizer. A set of Recognizers comprises an NFA and will match break opportunities for a given UAX rule-set. Recognizers receive rune events and therefore implement interface RuneSubscriber.
Rune Events ¶
Walking the runes (= code-points) of a Unicode text and firing rules to match segments will produce a high fluctuation of short-lived Recognizers. Every Recognizer will have to react to the next rune read. Package uax provides a publish-subscribe mechanism for signalling new runes to all active Recognizers.
The default rune-publisher will distribute rune events to rune-subscribers and collect return values. Subscribers are required to return active matches and possible break-opportunities (or suppression thereof). After all subscribers are done consuming the rune, the publisher harvests subscribers which have ended their life-cycle (i.e., either accepted or aborted). Dead subscribers are flagging this with Done()==true and get unsubscribed.
Breaking algorithms are performed by `UnicodeBreaker`s (an interface type). The UnicodeBreakers in sub-packages of this package utilize UnicodePublishers as described above. The segment-driver needs one or more UnicodeBreakers to perform breaking logic.
Penalties ¶
Algorithms in this package will signal break opportunities for Unicode text. However, breaks are not signalled with true/false, but rather with a weighted “penalty.” Every break is connoted with an integer value, representing the desirability of the break. Negative values denote a negative penalty, i.e., a merit. High enough penalties signal the complete suppression of a break opportunity, causing the segmenter to not report this break.
The UnicodeBreakers in this package (including sub-packages) will apply the following logic:
(1) Mandatory breaks will have a penalty/merit of ≤ -10000 (uax.InfinitePenalty).
(2) Inhibited breaks will have penalty ≥ 10000 (uax.InfiniteMerits).
(3) Neutral positions will have a penalty of 0. The segmenter can be configured to regard the zero value as breakable or not.
The segmenter will aggregate penalties from its breakers and output aggregated penalties to the client.
______________________________________________________________________
License ¶
This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:
SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'
You may use the project under the terms of either license.
Licenses are reproduced in the license file in the root folder of this module.
Copyright © 2021 Norbert Pillmayer <norbert@pillmayer.com>
Index ¶
- Constants
- func PositionOfFirstLegalRune(s string) (int, []byte)
- type DefaultRunePublisher
- func (pq *DefaultRunePublisher) Fix(at int)
- func (pq DefaultRunePublisher) Len() int
- func (pq *DefaultRunePublisher) Pop() RuneSubscriber
- func (pq *DefaultRunePublisher) PopDone() RuneSubscriber
- func (rpub *DefaultRunePublisher) PublishRuneEvent(r rune, codePointClass int) (int, []int)
- func (pq *DefaultRunePublisher) Push(subscr RuneSubscriber)
- func (rpub *DefaultRunePublisher) SubscribeMe(rsub RuneSubscriber) RunePublisher
- func (pq DefaultRunePublisher) Top() RuneSubscriber
- type NfaStateFn
- type Recognizer
- type RunePublisher
- type RuneSubscriber
- type UnicodeBreaker
Constants ¶
const ( InfinitePenalty = 10000 InfiniteMerits = -10000 )
We define constants for flagging break points as infinitely bad and infinitely good, respectively.
Variables ¶
This section is empty.
Functions ¶
func PositionOfFirstLegalRune ¶
PositionOfFirstLegalRune returns a legal Unicode code point start position and cut-off prefix, if any.
Types ¶
type DefaultRunePublisher ¶
type DefaultRunePublisher struct {
// contains filtered or unexported fields
}
DefaultRunePublisher is a type to organize RuneSubscribers.
Rune publishers have to maintain a list of subscribers. Subscribers are then notified on the arrival of new runes (code-points) by sending them rune-events. When a subscriber is done with consuming runes (subscribers are often short-lived), it signals Done()=true.
The DefaultRunePublisher data structure "prioritizes" subscribers with Done()=true within a queue. It maintains a "gap" position between done and not-done. The queue grows as needed.
A DefaultRunePublisher implements RunePublisher.
func NewRunePublisher ¶
func NewRunePublisher() *DefaultRunePublisher
NewRunePublisher creates a new default RunePublisher.
func (*DefaultRunePublisher) Fix ¶
func (pq *DefaultRunePublisher) Fix(at int)
Fix signals that the Done()-flag of a subscriber has changed: inform the queue to let it re-organize.
func (DefaultRunePublisher) Len ¶
func (pq DefaultRunePublisher) Len() int
Len returns the number of subscribers held.
func (*DefaultRunePublisher) Pop ¶
func (pq *DefaultRunePublisher) Pop() RuneSubscriber
Pop the topmost subscriber.
func (*DefaultRunePublisher) PopDone ¶
func (pq *DefaultRunePublisher) PopDone() RuneSubscriber
PopDone pops the topmost subscriber if it is Done(), otherwise return nil. If the method returns nil, the queue either is empty or holds subscribers with Done()=false only (i.e., subscribers still active).
func (*DefaultRunePublisher) PublishRuneEvent ¶
func (rpub *DefaultRunePublisher) PublishRuneEvent(r rune, codePointClass int) (int, []int)
PublishRuneEvent triggers a rune event notification to all subscribers. Rune events include the rune (code-point) and an optional code-point class for the rune.
Return values are: the longest active match and a slice of penalties. These values usually are collected from the RuneSubscribers. Penalties will be overwritten by the next call to PublishRuneEvent(). Clients will have to make a copy if they want to preserve penalty values.
Interface RunePublisher
func (*DefaultRunePublisher) Push ¶
func (pq *DefaultRunePublisher) Push(subscr RuneSubscriber)
Push puts a new subscriber to the queue.
func (*DefaultRunePublisher) SubscribeMe ¶
func (rpub *DefaultRunePublisher) SubscribeMe(rsub RuneSubscriber) RunePublisher
SubscribeMe lets a client subscribe to a RunePublisher.
Part of interface RunePublisher.
func (DefaultRunePublisher) Top ¶
func (pq DefaultRunePublisher) Top() RuneSubscriber
Top subscriber in queue. If there is at last one Done() subscriber, top() will return one.
type NfaStateFn ¶
type NfaStateFn func(*Recognizer, rune, int) NfaStateFn
NfaStateFn represents a state in a non-deterministic finite automata. Functions of type NfaStateFn try to match a rune (Unicode code-point). The caller may provide a third argument, which should be a rune class. Rune (code-point) classes are described in various Unicode standards and annexes. One such annex, UAX#29, describes classes to help splitting up text into graphemes or words. An example class may be a class of western language alphabetic characters “AL”, of which runes ‘X’ and ‘é’ would be part of.
The first argument is a Recognizer (see the definition of type Recognizer in this package), which carries this state function.
NfaStateFn – after matching a rune – must return another NfaStateFn, which will then in turn be called to process the next rune. The process of matching a string will stop as soon as a NfaStateFn returns nil.
func DoAbort ¶
func DoAbort(rec *Recognizer) NfaStateFn
DoAbort returns a state function which signals abort.
func DoAccept ¶
func DoAccept(rec *Recognizer, penalties ...int) NfaStateFn
DoAccept returns a state function which signals accept, together with break penalties for matched runes (in reverse sequence).
type Recognizer ¶
type Recognizer struct { Expect int // next code-point to expect; semantics are up to the client MatchLen int // length of active match UserData interface{} // clients may need to store additional information // contains filtered or unexported fields }
A Recognizer represents an automata to recognize sequences of runes (i.e. Unicode code-points). Its main functionality is performed by an embedded NfaStateFn. The first NfaStateFn to use is provided with the constructor.
Recognizer's state functions must be careful to increment MatchLen with each matched rune. Failing to do so may result in incorrect splits of text.
Semantics of Expect and UserData are up to the client and not used by the default mechanism.
It is not mandatory to use Recognizers for segmenting text. The type is provided for easier implementation of types implementing UnicodeBreaker. Recognizers implement interface RuneSubscriber and UnicodeBreakers will use a UnicodePublisher to interact with them.
func NewPooledRecognizer ¶
func NewPooledRecognizer(cpClass int, stateFn NfaStateFn) *Recognizer
NewPooledRecognizer returns a new Recognizer, pre-filled with an expected code-point class and a state function. The Recognizer is pooled for efficiency.
func NewRecognizer ¶
func NewRecognizer(codePointClass int, next NfaStateFn) *Recognizer
NewRecognizer creates a new Recognizer. This is rarely used, as clients rather should call NewPooledRecognizer().
see NewPooledRecognizer.
func (*Recognizer) Done ¶
func (rec *Recognizer) Done() bool
Done is used by a Recognizer that it is done matching runes. If MatchLength() > 0 is has been accepting a sequence of runes, otherwise it has aborted to further try a match.
Interface RuneSubscriber
func (*Recognizer) MatchLength ¶
func (rec *Recognizer) MatchLength() int
MatchLength is part of interface RuneSubscriber.
func (*Recognizer) RuneEvent ¶
func (rec *Recognizer) RuneEvent(r rune, codePointClass int) []int
RuneEvent is part of interface RuneSubscriber.
func (*Recognizer) String ¶
func (rec *Recognizer) String() string
Simple stringer for debugging purposes.
func (*Recognizer) Unsubscribed ¶
func (rec *Recognizer) Unsubscribed()
Unsubscribed signals to a Recognizer that it has been unsubscribed from a RunePublisher; usually after the Recognizer's NfaStateFn has returned nil.
Interface RuneSubscriber
type RunePublisher ¶
type RunePublisher interface { SubscribeMe(RuneSubscriber) RunePublisher // subscribe an additional rune subscriber PublishRuneEvent(r rune, codePointClass int) (longestDistance int, penalties []int) }
A RunePublisher notifies subscribers with rune events: a new rune has been read and the subscriber – usually a recognizer rule – has to react to it.
UnicodeBreakers are not required to use the RunePublisher/RuneSubscriber pattern, but it is convenient to stick to it. UnicodeBreakers often rely on sets of rules, which are tested interleavingly. To releave UnicodeBreakers from managing rune-distribution to all the rules, it may be advantageous hold a RunePublisher within a UnicodeBreaker and let all rules implement the RuneSubscriber interface.
type RuneSubscriber ¶
type RuneSubscriber interface { RuneEvent(r rune, codePointClass int) []int // receive a new code-point MatchLength() int // length (in # of code-point) of the match up to now Done() bool // is this subscriber done? Unsubscribed() // this subscriber has been unsubscribed }
A RuneSubscriber is a receiver of rune events, i.e. messages to process a new code-point (rune). If they can match the rune, they will expect further runes, otherwise they abort. To they are finished, either by accepting or rejecting input, they set Done() to true. A successful acceptance of input is signalled by Done()==true and MatchLength()>0.
type UnicodeBreaker ¶
type UnicodeBreaker interface { CodePointClassFor(rune) int StartRulesFor(rune, int) ProceedWithRune(rune, int) LongestActiveMatch() int Penalties() []int }
UnicodeBreaker represents a logic to split up Unicode sequences into smaller parts. They are used by Segmenters to supply breaking logic.
Directories ¶
Path | Synopsis |
---|---|
Package emoji implements Unicode UTS #51 emoji classes.
|
Package emoji implements Unicode UTS #51 emoji classes. |
internal/generator
Package for a generator for UTS#51 Emoji character classes.
|
Package for a generator for UTS#51 Emoji character classes. |
Package grapheme implements Unicode Annex #29 grapheme breaking.
|
Package grapheme implements Unicode Annex #29 grapheme breaking. |
internal/generator
Package for a generator for UAX#29 Grapheme classes.
|
Package for a generator for UAX#29 Grapheme classes. |
internal
|
|
tablegen
tablegen is a helper CLI to create Go source files from Unicode Character Data files.
|
tablegen is a helper CLI to create Go source files from Unicode Character Data files. |
ucdparse
Package ucdparse provides a parser for Unicode Character Database files.
|
Package ucdparse provides a parser for Unicode Character Database files. |
Package segment is about Unicode text segmenting.
|
Package segment is about Unicode text segmenting. |
Package shaping provides tables corresponding to Unicode® Character Data tables relevant for text shaping.
|
Package shaping provides tables corresponding to Unicode® Character Data tables relevant for text shaping. |
Package uax11 provides utilities for Unicode® Standard Annex #11 “East Asian Width”.
|
Package uax11 provides utilities for Unicode® Standard Annex #11 “East Asian Width”. |
Package uax14 implements Unicode Annex #14 line wrap.
|
Package uax14 implements Unicode Annex #14 line wrap. |
internal/generator
Package generator is a generator for UAX#14 classes.
|
Package generator is a generator for UAX#14 classes. |
Package uax29 implements Unicode Annex #29 word breaking.
|
Package uax29 implements Unicode Annex #29 word breaking. |
internal/generator
Package for a generator for UAX#29 word breaking classes.
|
Package for a generator for UAX#29 word breaking classes. |