Documentation ¶
Overview ¶
Package fallback implements Unicode Character Fallback Substitutions using the Unicode CLDR 41.0 supplemental data file characters.xml, and an algorithm for enumerating every canonically equivalent string.
This can be useful for robustly parsing Unicode strings where for practical reasons (e.g. missing keyboard keys, missing font support) certain fallbacks have been used, or for picking a sensible default when certain Unicode strings cannot be displayed (e.g. missing font support).
Note that care must be taken not to change the meaning of a text. For example, superscript two '²' has a (last-resort) Character Fallback Substitution to the digit '2' via NFKC normalisation, but the two have entirely different meanings. Similarly, the string "1½" changes meaning if naively converted to "11/2". The Unicode Character Fallback Substitutions rules as implemented in this package would produce "1 1/2", but this doesn't help for superscript two.
See the (withdrawn draft) Unicode Technical Report 30: CHARACTER FOLDINGS, as well as the earlier draft Unicode Technical Report 25: CHARACTER FOLDINGS, for commentary.
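The caveat about changing meaning can be illustrated with the standard library alone. The following is a minimal sketch, not this package's implementation: naiveHalf and spacedHalf are hypothetical helpers, and the replacement strings are hard-coded assumptions rather than values read from CLDR data.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveHalf swaps '½' for "1/2" with no regard for neighbouring characters.
// Hypothetical helper for illustration; not part of package fallback.
func naiveHalf(s string) string {
	return strings.ReplaceAll(s, "½", "1/2")
}

// spacedHalf swaps '½' for " 1/2" (note the leading space) and trims the
// result, so a preceding digit is not fused with the fraction.
// Hypothetical helper for illustration; not part of package fallback.
func spacedHalf(s string) string {
	return strings.TrimSpace(strings.ReplaceAll(s, "½", " 1/2"))
}

func main() {
	fmt.Println(naiveHalf("1½"))  // 11/2: reads as eleven halves
	fmt.Println(spacedHalf("1½")) // 1 1/2: still one and a half
	// No spacing trick rescues a superscript: ten squared becomes 102.
	fmt.Println(strings.ReplaceAll("10²", "²", "2"))
}
```

The leading space fixes the fraction case only. For superscripts the caller has to decide whether a lossy fallback is acceptable at all.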
Example (Combinations) ¶
input := []string{
	"a�b�c",
	"d�e�f",
	"w",
	"x�y�z",
}
it := must.Result(combinations(input))
xs := lazy.ToSlice(it)
must.Equal(3*3*1*3, len(xs))
sort.Slice(xs, func(i int, j int) bool {
	return operator.LT(xs[i], xs[j])
})
for _, x := range xs {
	fmt.Println(x)
}
Output:
adwx
adwy
adwz
aewx
aewy
aewz
afwx
afwy
afwz
bdwx
bdwy
bdwz
bewx
bewy
bewz
bfwx
bfwy
bfwz
cdwx
cdwy
cdwz
cewx
cewy
cewz
cfwx
cfwy
cfwz
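The Combinations example enumerates one choice per position, which is a Cartesian product. The package's internal combinations helper is not reproduced here; the sketch below shows the same idea with a hypothetical, eager cartesian function that takes the alternatives as pre-split slices, so it makes no assumption about how the package encodes them.

```go
package main

import "fmt"

// cartesian returns every string formed by picking exactly one alternative
// for each position, in positional order.
// Hypothetical helper for illustration; eager, unlike a lazy iterator.
func cartesian(alternatives [][]string) []string {
	results := []string{""}
	for _, alts := range alternatives {
		next := make([]string, 0, len(results)*len(alts))
		for _, prefix := range results {
			for _, alt := range alts {
				next = append(next, prefix+alt)
			}
		}
		results = next
	}
	return results
}

func main() {
	xs := cartesian([][]string{
		{"a", "b", "c"},
		{"d", "e", "f"},
		{"w"},
		{"x", "y", "z"},
	})
	fmt.Println(len(xs)) // 3*3*1*3 = 27
	for _, x := range xs {
		fmt.Println(x)
	}
}
```

Because earlier positions vary slowest, the results come out already sorted when each list of alternatives is itself sorted.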
Example (Dstarts) ¶
a := dstarts('a')
_ = dstarts(0x2A600) // last item
for _, r := range string(a) {
	fmt.Printf("%c\n", r)
}
Output:
à
á
â
ã
ä
å
ā
ă
ą
ǎ
ȁ
ȃ
ȧ
ḁ
ạ
ả
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Equivalent ¶
Equivalent returns a lazy.It that produces all strings canonically equivalent to the input. Note that this is very expensive for large strings. Note also that the results do not include any Unicode Character Fallback Substitutions.
This is a clean-room implementation of Mark Davis's algorithm described at https://unicode.org/notes/tn5/#Enumerating_Equivalent_Strings
Example ¶
package main

import (
	"fmt"
	"strings"

	lazy "github.com/tawesoft/golib/v2/iter"
	"github.com/tawesoft/golib/v2/must"
	"github.com/tawesoft/golib/v2/text/fallback"
	"golang.org/x/text/unicode/runenames"
)

func main() {
	input := "\u0041\u030A\u0064\u0307\u0327"
	fmt.Printf("Input: %s %x %x\n", input, []rune(input), []byte(input))
	eq := must.Result(fallback.Equivalent(input))
	lazy.Walk(func(x string) {
		fmt.Printf("%s: %x = %s\n",
			x, []rune(x), lazy.Join(lazy.StringJoiner(", "),
				lazy.Map(strings.ToLower,
					lazy.Map[rune, string](runenames.Name, lazy.FromString(x)))))
	}, eq)

	/*
	   (Found in the Unicode ICU as a test case)

	   Results for: {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}

	   1: \u0041\u030A\u0064\u0307\u0327 = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   2: \u0041\u030A\u0064\u0327\u0307 = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   3: \u0041\u030A\u1E0B\u0327 = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   4: \u0041\u030A\u1E11\u0307 = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}
	   5: \u00C5\u0064\u0307\u0327 = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   6: \u00C5\u0064\u0327\u0307 = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   7: \u00C5\u1E0B\u0327 = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   8: \u00C5\u1E11\u0307 = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}
	   9: \u212B\u0064\u0307\u0327 = {ANGSTROM SIGN}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   10: \u212B\u0064\u0327\u0307 = {ANGSTROM SIGN}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   11: \u212B\u1E0B\u0327 = {ANGSTROM SIGN}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   12: \u212B\u1E11\u0307 = {ANGSTROM SIGN}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}
	*/

	// TODO for some reason our implementation is missing the two variants with an Angstrom Sign.
	// This is probably due to Go's Unicode version being older than the example.
	// Revisit once new Unicode versions land
}
Output:
Input: Åḑ̇ [41 30a 64 307 327] 41cc8a64cc87cca7
Åḑ̇: [41 30a 64 327 307] = latin capital letter a, combining ring above, latin small letter d, combining cedilla, combining dot above
Åḑ̇: [c5 64 327 307] = latin capital letter a with ring above, latin small letter d, combining cedilla, combining dot above
Åḑ̇: [41 30a 1e0b 327] = latin capital letter a, combining ring above, latin small letter d with dot above, combining cedilla
Åḑ̇: [c5 1e0b 327] = latin capital letter a with ring above, latin small letter d with dot above, combining cedilla
Å̧: [41 30a 327] = latin capital letter a, combining ring above, combining cedilla
Å̧: [c5 327] = latin capital letter a with ring above, combining cedilla
Åḑ̇: [41 30a 1e11 307] = latin capital letter a, combining ring above, latin small letter d with cedilla, combining dot above
Åḑ̇: [c5 1e11 307] = latin capital letter a with ring above, latin small letter d with cedilla, combining dot above
Å̇: [41 30a 307] = latin capital letter a, combining ring above, combining dot above
Å̇: [c5 307] = latin capital letter a with ring above, combining dot above
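One ingredient of enumerating canonically equivalent strings is that combining marks with different canonical combining classes may be reordered freely, while marks with the same class may not swap. The sketch below covers only that reordering step and is not the package's implementation: it ignores composition and decomposition, the ccc table is a tiny hand-written excerpt of Canonical_Combining_Class values, and the names canonicalOrder, permutations and equivalentOrders are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// ccc is a hand-written excerpt of Canonical_Combining_Class values,
// just large enough for this sketch.
var ccc = map[rune]int{
	0x0307: 230, // COMBINING DOT ABOVE
	0x030A: 230, // COMBINING RING ABOVE
	0x0327: 202, // COMBINING CEDILLA
}

// canonicalOrder returns a copy of marks stable-sorted by combining class.
func canonicalOrder(marks []rune) []rune {
	out := append([]rune(nil), marks...)
	sort.SliceStable(out, func(i, j int) bool { return ccc[out[i]] < ccc[out[j]] })
	return out
}

// permutations returns every ordering of marks.
func permutations(marks []rune) [][]rune {
	if len(marks) <= 1 {
		return [][]rune{append([]rune(nil), marks...)}
	}
	var out [][]rune
	for i := range marks {
		rest := make([]rune, 0, len(marks)-1)
		rest = append(rest, marks[:i]...)
		rest = append(rest, marks[i+1:]...)
		for _, p := range permutations(rest) {
			out = append(out, append([]rune{marks[i]}, p...))
		}
	}
	return out
}

// equivalentOrders returns, sorted, the orderings of marks whose canonical
// order matches that of the input.
func equivalentOrders(marks []rune) []string {
	want := string(canonicalOrder(marks))
	seen := map[string]bool{}
	var out []string
	for _, p := range permutations(marks) {
		s := string(p)
		if !seen[s] && string(canonicalOrder(p)) == want {
			seen[s] = true
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Dot above (230) and cedilla (202): both orders are equivalent.
	for _, s := range equivalentOrders([]rune{0x0307, 0x0327}) {
		fmt.Printf("%x\n", []rune(s))
	}
	// Dot above and ring above share class 230: only the given order survives.
	fmt.Println(len(equivalentOrders([]rune{0x0307, 0x030A})))
}
```

Generating every permutation is factorial in the number of marks, which is one reason enumeration becomes expensive for long strings.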
func Is ¶
Is returns true iff the provided string is a possible fallback string produced by Unicode Character Fallback Substitution rules applied to the input rune. Neither argument is required to be normalised on input.
For example,
Is('㎦', "㎞³") // true
Is('㎦', "km³") // true
Is('㎦', "km3") // true
Example ¶
package main

import (
	"fmt"

	"github.com/tawesoft/golib/v2/text/fallback"
)

func main() {
	type row struct {
		input       rune
		alternative string
	}
	rows := []row{
		{'㎦', "㎞³"},
		{'㎦', "km³"},
		{'㎦', "km3"},
		{'㎦', "foo"},
		{'²', "2"},
		{'½', "1⁄2"},  // 0x2044
		{'½', " 1/2"}, // 0x002F
	}
	for _, r := range rows {
		q := fallback.Is(r.input, r.alternative)
		fmt.Printf("Is %s a fallback for %c? %t\n", r.alternative, r.input, q)
	}
}
Output:
Is ㎞³ a fallback for ㎦? true
Is km³ a fallback for ㎦? true
Is km3 a fallback for ㎦? true
Is foo a fallback for ㎦? false
Is 2 a fallback for ²? true
Is 1⁄2 a fallback for ½? true
Is 1/2 a fallback for ½? true
func Subs ¶
Subs returns a complete list of strings that can be used as fallbacks for the input rune, in order of priority, according to the Unicode Character Fallback Substitutions rules.
Example ¶
package main

import (
	"fmt"

	"github.com/tawesoft/golib/v2/text/fallback"
)

func main() {
	rows := []rune{
		'㎦',
		'²',
		'½',
	}
	for _, r := range rows {
		fmt.Printf("=== %c ===\n", r)
		for _, s := range fallback.Subs(r) {
			fmt.Println(s)
		}
	}
}
Output:
=== ㎦ ===
㎦
km³
km3
=== ² ===
²
2
=== ½ ===
½
1/2
1⁄2
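A priority-ordered list such as the one Subs returns lends itself to picking the first candidate that the output medium can actually show. The sketch below assumes that use; firstDisplayable and isASCII are hypothetical helpers, isASCII merely stands in for a real font or terminal capability check, and the candidate list is hard-coded from the example output rather than obtained by calling the package.

```go
package main

import "fmt"

// isASCII reports whether every rune in s is ASCII. It is a stand-in for a
// real "can this be displayed?" check (an assumption of this sketch).
func isASCII(s string) bool {
	for _, r := range s {
		if r > 0x7F {
			return false
		}
	}
	return true
}

// firstDisplayable returns the first candidate accepted by ok, preserving
// the priority order of candidates. The boolean is false if none qualifies.
// Hypothetical helper for illustration; not part of package fallback.
func firstDisplayable(candidates []string, ok func(string) bool) (string, bool) {
	for _, c := range candidates {
		if ok(c) {
			return c, true
		}
	}
	return "", false
}

func main() {
	// Two fallbacks listed for '㎦' in the example output, highest priority first.
	candidates := []string{"km³", "km3"}
	if s, found := firstDisplayable(candidates, isASCII); found {
		fmt.Println(s) // km3
	}
}
```

Returning a boolean alongside the string lets the caller decide what to do when no fallback is acceptable, for instance refusing a substitution that would change the meaning of the text.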
Types ¶
This section is empty.