Documentation ¶
Overview ¶
Package norm contains types and functions for normalizing Unicode strings.
Index ¶
- Constants
- type Form
- func (f Form) Append(out []byte, src ...byte) []byte
- func (f Form) AppendString(out []byte, src string) []byte
- func (f Form) Bytes(b []byte) []byte
- func (f Form) FirstBoundary(b []byte) int
- func (f Form) FirstBoundaryInString(s string) int
- func (f Form) IsNormal(b []byte) bool
- func (f Form) IsNormalString(s string) bool
- func (f Form) LastBoundary(b []byte) int
- func (f Form) Properties(s []byte) Properties
- func (f Form) PropertiesString(s string) Properties
- func (f Form) QuickSpan(b []byte) int
- func (f Form) QuickSpanString(s string) int
- func (f Form) Reader(r io.Reader) io.Reader
- func (f Form) String(s string) string
- func (f Form) Writer(w io.Writer) io.WriteCloser
- type Iter
- type Properties
Constants ¶
const MaxSegmentSize = maxByteBufferSize
const Version = "6.2.0"
Version is the Unicode edition from which the tables are derived.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Form ¶
type Form int
A Form denotes a canonical representation of Unicode code points. The Unicode-defined normalization and equivalence forms are:
NFC   Unicode Normalization Form C
NFD   Unicode Normalization Form D
NFKC  Unicode Normalization Form KC
NFKD  Unicode Normalization Form KD
For a Form f, this documentation uses the notation f(x) to mean the bytes or string x converted to the given form. A position n in x is called a boundary if conversion to the form can proceed independently on both sides:
f(x) == append(f(x[0:n]), f(x[n:])...)
References: http://unicode.org/reports/tr15/ and http://unicode.org/notes/tn5/.
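The boundary definition above can be checked mechanically. The following is a minimal sketch that uses only the standard library. toyForm is a hypothetical stand-in for a Form that composes a single sequence ("a" followed by U+0300 becomes U+00E0); it is not the package's algorithm, and isBoundary is likewise an illustrative helper.

```go
package main

import (
	"bytes"
	"fmt"
)

// toyForm is a hypothetical stand-in for a Form: it composes only the
// sequence "a" + U+0300 (combining grave accent) into U+00E0 and leaves
// all other bytes untouched. It always returns a fresh slice.
func toyForm(x []byte) []byte {
	return bytes.ReplaceAll(x, []byte("a\u0300"), []byte("\u00E0"))
}

// isBoundary reports whether converting x[:n] and x[n:] separately
// yields the same bytes as converting x as a whole.
func isBoundary(x []byte, n int) bool {
	split := append(toyForm(x[:n]), toyForm(x[n:])...)
	return bytes.Equal(toyForm(x), split)
}

func main() {
	x := []byte("a\u0300b") // 'a' (1 byte), U+0300 (2 bytes), 'b' (1 byte)
	for n := 0; n <= len(x); n++ {
		fmt.Printf("n=%d boundary=%v\n", n, isBoundary(x, n))
	}
}
```

In this sketch positions 0, 3 and 4 are boundaries, while positions 1 and 2 are not, because splitting there separates "a" from the accent it would otherwise combine with.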
func (Form) Append ¶
Append returns f(append(out, src...)). The buffer out must be nil, empty, or equal to f(out).
func (Form) AppendString ¶
AppendString returns f(append(out, []byte(src)...)). The buffer out must be nil, empty, or equal to f(out).
func (Form) FirstBoundary ¶
FirstBoundary returns the position i of the first boundary in b or -1 if b contains no boundary.
func (Form) FirstBoundaryInString ¶
FirstBoundaryInString returns the position i of the first boundary in s or -1 if s contains no boundary.
func (Form) IsNormalString ¶
IsNormalString returns true if s == f(s).
func (Form) LastBoundary ¶
LastBoundary returns the position i of the last boundary in b or -1 if b contains no boundary.
func (Form) Properties ¶
func (f Form) Properties(s []byte) Properties
Properties returns properties for the first rune in s.
func (Form) PropertiesString ¶
func (f Form) PropertiesString(s string) Properties
PropertiesString returns properties for the first rune in s.
func (Form) QuickSpan ¶
QuickSpan returns a boundary n such that b[0:n] == f(b[0:n]). It is not guaranteed to return the largest such n.
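To illustrate why the result need not be the largest possible n, here is a deliberately conservative sketch built on the standard library alone. asciiQuickSpan is a hypothetical helper, not the package's implementation. It assumes only that ASCII text is unchanged by all four forms and that an ASCII character never combines with the character before it.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// asciiQuickSpan is a hypothetical, conservative stand-in for QuickSpan.
// It returns a position n such that b[0:n] is pure ASCII (and therefore
// already normalized) and n is a boundary.
func asciiQuickSpan(b []byte) int {
	n := 0
	for n < len(b) && b[n] < utf8.RuneSelf {
		n++
	}
	// If a non-ASCII rune follows, it might combine with the last ASCII
	// character, so step back one byte to stay on a boundary.
	if n < len(b) && n > 0 {
		n--
	}
	return n
}

func main() {
	fmt.Println(asciiQuickSpan([]byte("abc")))        // 3
	fmt.Println(asciiQuickSpan([]byte("aba\u0300c"))) // 2
}
```

A table-driven implementation can usually return a larger n than this sketch, for instance by skipping over non-ASCII runes that are known to be stable under the form.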
func (Form) QuickSpanString ¶
QuickSpanString returns a boundary n such that s[0:n] == f(s[0:n]). It is not guaranteed to return the largest such n.
type Iter ¶
type Iter struct {
// contains filtered or unexported fields
}
An Iter iterates over a string or byte slice, while normalizing it to a given Form.
Example ¶
package main

import (
	"bytes"
	"exp/norm"
	"fmt"
	"unicode/utf8"
)

// EqualSimple uses a norm.Iter to compare two non-normalized
// strings for equivalence.
func EqualSimple(a, b string) bool {
	var ia, ib norm.Iter
	ia.InitString(norm.NFKD, a)
	ib.InitString(norm.NFKD, b)
	for !ia.Done() && !ib.Done() {
		if !bytes.Equal(ia.Next(), ib.Next()) {
			return false
		}
	}
	return ia.Done() && ib.Done()
}

// FindPrefix finds the longest common prefix of ASCII characters
// of a and b.
func FindPrefix(a, b string) int {
	i := 0
	for ; i < len(a) && i < len(b) && a[i] < utf8.RuneSelf && a[i] == b[i]; i++ {
	}
	return i
}

// EqualOpt is like EqualSimple, but optimizes the special
// case for ASCII characters.
func EqualOpt(a, b string) bool {
	n := FindPrefix(a, b)
	a, b = a[n:], b[n:]
	var ia, ib norm.Iter
	ia.InitString(norm.NFKD, a)
	ib.InitString(norm.NFKD, b)
	for !ia.Done() && !ib.Done() {
		if !bytes.Equal(ia.Next(), ib.Next()) {
			return false
		}
		if n := int64(FindPrefix(a[ia.Pos():], b[ib.Pos():])); n != 0 {
			ia.Seek(n, 1)
			ib.Seek(n, 1)
		}
	}
	return ia.Done() && ib.Done()
}

var compareTests = []struct{ a, b string }{
	{"aaa", "aaa"},
	{"aaa", "aab"},
	{"a\u0300a", "\u00E0a"},
	{"a\u0300\u0320b", "a\u0320\u0300b"},
	{"\u1E0A\u0323", "\x44\u0323\u0307"},
	// A character that decomposes into multiple segments
	// spans several iterations.
	{"\u3304", "\u30A4\u30CB\u30F3\u30AF\u3099"},
}

func main() {
	for i, t := range compareTests {
		r0 := EqualSimple(t.a, t.b)
		r1 := EqualOpt(t.a, t.b)
		fmt.Printf("%d: %v %v\n", i, r0, r1)
	}
}

Output:
0: true true
1: false false
2: true true
3: true true
4: true true
5: true true
func (*Iter) InitString ¶
InitString initializes i to iterate over src after normalizing it to Form f.
func (*Iter) Next ¶
Next returns f(i.input[i.Pos():n]), where n is a boundary of i.input. For any input a and b for which f(a) == f(b), subsequent calls to Next will return the same segments. Modifying runes are grouped together with the preceding starter, if such a starter exists. Although not guaranteed, n will typically be the smallest possible n.
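The grouping of modifying runes with their preceding starter can be approximated with the standard library. The sketch below is an assumption-laden illustration, not the Iter implementation: it treats every rune in the Unicode category Mn (nonspacing mark) as a modifying rune and everything else as a starter, and it only splits the input without normalizing it. The helper name segments is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// segments splits s so that each piece begins with a rune that is not a
// nonspacing mark and carries all nonspacing marks that follow it.
// Using category Mn as the test for "modifying rune" is an approximation.
func segments(s string) []string {
	var out []string
	start := 0
	for i, r := range s {
		if i > 0 && !unicode.Is(unicode.Mn, r) {
			out = append(out, s[start:i])
			start = i
		}
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	for _, seg := range segments("a\u0300\u0320b") {
		fmt.Printf("%+q\n", seg)
	}
}
```

For the input above the sketch yields two pieces: "a" together with both combining marks, followed by "b" on its own.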
type Properties ¶
type Properties struct {
// contains filtered or unexported fields
}
Properties provides access to normalization properties of a rune.
func (Properties) BoundaryAfter ¶
func (p Properties) BoundaryAfter() bool
BoundaryAfter returns true if this rune cannot combine with runes to the right and always denotes the end of a segment.
func (Properties) BoundaryBefore ¶
func (p Properties) BoundaryBefore() bool
BoundaryBefore returns true if this rune starts a new segment and cannot combine with any rune on the left.
func (Properties) CCC ¶
func (p Properties) CCC() uint8
CCC returns the canonical combining class of the underlying rune.
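The combining class is what lets differently ordered marks compare equal after normalization: within a run of non-starters, marks are stably sorted by class. The following is a minimal sketch of that reordering step. The ccc function here is a hypothetical four-entry lookup table standing in for Properties.CCC, and reorder is an illustrative helper, not part of the package.

```go
package main

import (
	"fmt"
	"sort"
)

// ccc is a tiny hypothetical lookup table covering only the combining
// marks used in this example. Every other rune is treated as a starter
// (class 0).
func ccc(r rune) uint8 {
	switch r {
	case 0x0320, 0x0323: // combining minus sign below, combining dot below
		return 220
	case 0x0300, 0x0307: // combining grave accent, combining dot above
		return 230
	}
	return 0
}

// reorder stably sorts each run of non-starters by combining class and
// leaves starters where they are.
func reorder(s string) string {
	rs := []rune(s)
	for i := 0; i < len(rs); {
		if ccc(rs[i]) == 0 {
			i++
			continue
		}
		j := i
		for j < len(rs) && ccc(rs[j]) != 0 {
			j++
		}
		run := rs[i:j] // shares storage with rs, so sorting updates rs
		sort.SliceStable(run, func(a, b int) bool {
			return ccc(run[a]) < ccc(run[b])
		})
		i = j
	}
	return string(rs)
}

func main() {
	a := reorder("a\u0300\u0320b")
	b := reorder("a\u0320\u0300b")
	fmt.Println(a == b) // true
}
```

The stable sort matters: marks with the same class keep their relative order, since swapping them could change how the text renders.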
func (Properties) Decomposition ¶
func (p Properties) Decomposition() []byte
Decomposition returns the decomposition for the underlying rune or nil if there is none.
func (Properties) LeadCCC ¶
func (p Properties) LeadCCC() uint8
LeadCCC returns the CCC of the first rune in the decomposition. If there is no decomposition, LeadCCC equals CCC.
func (Properties) Size ¶
func (p Properties) Size() int
Size returns the length in bytes of the UTF-8 encoding of the rune.
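For valid UTF-8 input, the same quantity can be obtained from the standard unicode/utf8 package. This sketch does not use the Properties type at all, and firstRuneSize is a hypothetical helper name. How Size behaves on invalid input is not stated here, so the comparison is limited to well-formed strings.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRuneSize returns the number of bytes that encode the first rune
// of s, computed with unicode/utf8 rather than with Properties.
func firstRuneSize(s string) int {
	_, size := utf8.DecodeRuneInString(s)
	return size
}

func main() {
	for _, s := range []string{"a", "\u00E0", "\u3304"} {
		fmt.Println(firstRuneSize(s)) // 1, then 2, then 3
	}
}
```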
func (Properties) TrailCCC ¶
func (p Properties) TrailCCC() uint8
TrailCCC returns the CCC of the last rune in the decomposition. If there is no decomposition, TrailCCC equals CCC.