Documentation ¶
Overview ¶
Package norm contains types and functions for normalizing Unicode strings.
Index ¶
- Constants
- type Form
- func (f Form) Append(out []byte, src ...byte) []byte
- func (f Form) AppendString(out []byte, src string) []byte
- func (f Form) Bytes(b []byte) []byte
- func (f Form) FirstBoundary(b []byte) int
- func (f Form) FirstBoundaryInString(s string) int
- func (f Form) IsNormal(b []byte) bool
- func (f Form) IsNormalString(s string) bool
- func (f Form) LastBoundary(b []byte) int
- func (f Form) NextBoundary(b []byte, atEOF bool) int
- func (f Form) NextBoundaryInString(s string, atEOF bool) int
- func (f Form) Properties(s []byte) Properties
- func (f Form) PropertiesString(s string) Properties
- func (f Form) QuickSpan(b []byte) int
- func (f Form) QuickSpanString(s string) int
- func (f Form) Reader(r io.Reader) io.Reader
- func (Form) Reset()
- func (f Form) Span(b []byte, atEOF bool) (n int, err error)
- func (f Form) SpanString(s string, atEOF bool) (n int, err error)
- func (f Form) String(s string) string
- func (f Form) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
- func (f Form) Writer(w io.Writer) io.WriteCloser
- type Iter
- type Properties
Constants ¶
const (
	// Version is the Unicode edition from which the tables are derived.
	Version = "12.0.0"

	// MaxTransformChunkSize indicates the maximum number of bytes that Transform
	// may need to write atomically for any Form. Making a destination buffer at
	// least this size ensures that Transform can always make progress and that
	// the user does not need to grow the buffer on an ErrShortDst.
	MaxTransformChunkSize = 35 + maxNonStarters*4
)
const GraphemeJoiner = "\u034F"
GraphemeJoiner is inserted after maxNonStarters non-starter runes.
const MaxSegmentSize = maxByteBufferSize
MaxSegmentSize is the maximum size of a byte buffer needed to consider any sequence of starter and non-starter runes for the purpose of normalization.
Types ¶
type Form ¶
type Form int
A Form denotes a canonical representation of Unicode code points. The Unicode-defined normalization and equivalence forms are:
NFC   Unicode Normalization Form C
NFD   Unicode Normalization Form D
NFKC  Unicode Normalization Form KC
NFKD  Unicode Normalization Form KD
For a Form f, this documentation uses the notation f(x) to mean the bytes or string x converted to the given form. A position n in x is called a boundary if conversion to the form can proceed independently on both sides:
f(x) == append(f(x[0:n]), f(x[n:])...)
References: https://unicode.org/reports/tr15/ and https://unicode.org/notes/tn5/.
func (Form) Append ¶
Append returns f(append(out, src...)). The buffer out must be nil, empty, or equal to f(out).
func (Form) AppendString ¶
AppendString returns f(append(out, []byte(src)...)). The buffer out must be nil, empty, or equal to f(out).
func (Form) FirstBoundary ¶
FirstBoundary returns the position i of the first boundary in b or -1 if b contains no boundary.
func (Form) FirstBoundaryInString ¶
FirstBoundaryInString returns the position i of the first boundary in s or -1 if s contains no boundary.
func (Form) IsNormalString ¶
IsNormalString returns true if s == f(s).
func (Form) LastBoundary ¶
LastBoundary returns the position i of the last boundary in b or -1 if b contains no boundary.
func (Form) NextBoundary ¶
NextBoundary reports the index of the boundary between the first and next segment in b or -1 if atEOF is false and there are not enough bytes to determine this boundary.
Example ¶
package main

import (
	"fmt"

	"github.com/xhit/unicode/norm"
)

func main() {
	s := norm.NFD.String("Mêlée")

	for i := 0; i < len(s); {
		d := norm.NFC.NextBoundaryInString(s[i:], true)
		fmt.Printf("%[1]s: %+[1]q\n", s[i:i+d])
		i += d
	}
}

Output:

M: "M"
ê: "e\u0302"
l: "l"
é: "e\u0301"
e: "e"
func (Form) NextBoundaryInString ¶
NextBoundaryInString reports the index of the boundary between the first and next segment in s or -1 if atEOF is false and there are not enough bytes to determine this boundary.
func (Form) Properties ¶
func (f Form) Properties(s []byte) Properties
Properties returns properties for the first rune in s.
func (Form) PropertiesString ¶
func (f Form) PropertiesString(s string) Properties
PropertiesString returns properties for the first rune in s.
func (Form) QuickSpan ¶
QuickSpan returns a boundary n such that b[0:n] == f(b[0:n]). It is not guaranteed to return the largest such n.
func (Form) QuickSpanString ¶
QuickSpanString returns a boundary n such that s[0:n] == f(s[0:n]). It is not guaranteed to return the largest such n.
func (Form) Reader ¶
Reader returns a new reader that implements Read by reading data from r and returning f(data).
func (Form) Reset ¶
func (Form) Reset()
Reset implements the Reset method of the transform.Transformer interface.
func (Form) Span ¶
Span implements transform.SpanningTransformer. It returns a boundary n such that b[0:n] == f(b[0:n]). It is not guaranteed to return the largest such n.
func (Form) SpanString ¶
SpanString returns a boundary n such that s[0:n] == f(s[0:n]). It is not guaranteed to return the largest such n.
func (Form) Transform ¶
Transform implements the Transform method of the transform.Transformer interface. It may need to write segments of up to MaxSegmentSize at once. To guarantee progress, users should either catch ErrShortDst and grow dst, or make dst at least MaxTransformChunkSize bytes long.
type Iter ¶
type Iter struct {
// contains filtered or unexported fields
}
An Iter iterates over a string or byte slice, while normalizing it to a given Form.
Example ¶
package main

import (
	"bytes"
	"fmt"
	"io"
	"unicode/utf8"

	"github.com/xhit/unicode/norm"
)

// EqualSimple uses a norm.Iter to compare two non-normalized
// strings for equivalence.
func EqualSimple(a, b string) bool {
	var ia, ib norm.Iter
	ia.InitString(norm.NFKD, a)
	ib.InitString(norm.NFKD, b)
	for !ia.Done() && !ib.Done() {
		if !bytes.Equal(ia.Next(), ib.Next()) {
			return false
		}
	}
	return ia.Done() && ib.Done()
}

// FindPrefix finds the longest common prefix of ASCII characters
// of a and b.
func FindPrefix(a, b string) int {
	i := 0
	for ; i < len(a) && i < len(b) && a[i] < utf8.RuneSelf && a[i] == b[i]; i++ {
	}
	return i
}

// EqualOpt is like EqualSimple, but optimizes the special
// case for ASCII characters.
func EqualOpt(a, b string) bool {
	n := FindPrefix(a, b)
	a, b = a[n:], b[n:]
	var ia, ib norm.Iter
	ia.InitString(norm.NFKD, a)
	ib.InitString(norm.NFKD, b)
	for !ia.Done() && !ib.Done() {
		if !bytes.Equal(ia.Next(), ib.Next()) {
			return false
		}
		if n := int64(FindPrefix(a[ia.Pos():], b[ib.Pos():])); n != 0 {
			ia.Seek(n, io.SeekCurrent)
			ib.Seek(n, io.SeekCurrent)
		}
	}
	return ia.Done() && ib.Done()
}

var compareTests = []struct{ a, b string }{
	{"aaa", "aaa"},
	{"aaa", "aab"},
	{"a\u0300a", "\u00E0a"},
	{"a\u0300\u0320b", "a\u0320\u0300b"},
	{"\u1E0A\u0323", "\x44\u0323\u0307"},
	// A character that decomposes into multiple segments
	// spans several iterations.
	{"\u3304", "\u30A4\u30CB\u30F3\u30AF\u3099"},
}

func main() {
	for i, t := range compareTests {
		r0 := EqualSimple(t.a, t.b)
		r1 := EqualOpt(t.a, t.b)
		fmt.Printf("%d: %v %v\n", i, r0, r1)
	}
}

Output:

0: true true
1: false false
2: true true
3: true true
4: true true
5: true true
func (*Iter) InitString ¶
InitString initializes i to iterate over src after normalizing it to Form f.
func (*Iter) Next ¶
Next returns f(i.input[i.Pos():n]), where n is a boundary of i.input. For any input a and b for which f(a) == f(b), subsequent calls to Next will return the same segments. Modifying runes are grouped together with the preceding starter, if such a starter exists. Although not guaranteed, n will typically be the smallest possible n.
type Properties ¶
type Properties struct {
// contains filtered or unexported fields
}
Properties provides access to normalization properties of a rune.
func (Properties) BoundaryAfter ¶
func (p Properties) BoundaryAfter() bool
BoundaryAfter returns true if subsequent runes cannot combine with or otherwise interact with this or previous runes.
func (Properties) BoundaryBefore ¶
func (p Properties) BoundaryBefore() bool
BoundaryBefore returns true if this rune starts a new segment and cannot combine with any rune on the left.
func (Properties) CCC ¶
func (p Properties) CCC() uint8
CCC returns the canonical combining class of the underlying rune.
func (Properties) Decomposition ¶
func (p Properties) Decomposition() []byte
Decomposition returns the decomposition for the underlying rune or nil if there is none.
func (Properties) LeadCCC ¶
func (p Properties) LeadCCC() uint8
LeadCCC returns the CCC of the first rune in the decomposition. If there is no decomposition, LeadCCC equals CCC.
func (Properties) Size ¶
func (p Properties) Size() int
Size returns the length of UTF-8 encoding of the rune.
func (Properties) TrailCCC ¶
func (p Properties) TrailCCC() uint8
TrailCCC returns the CCC of the last rune in the decomposition. If there is no decomposition, TrailCCC equals CCC.