fallback

package

v2.8.1 (latest) Published: Feb 17, 2023 License: MIT Imports: 12 Imported by: 0

Repository

github.com/tawesoft/golib

Documentation ¶

Overview ¶

Package fallback implements Unicode Character Fallback Substitutions using the Unicode CLDR 41.0 supplemental data file characters.xml, and an algorithm for enumerating every canonically equivalent string.

This is useful for robustly parsing Unicode strings in which fallbacks have been used for practical reasons (e.g. missing keyboard keys or missing font support), or for picking a sensible default when a Unicode string cannot be displayed.

Note that care must be taken not to change the meaning of a text. For example, superscript two '²' has a (last resort) Character Fallback Substitution to the digit '2' via NFKC normalisation, but the two have entirely different meanings. Similarly, the string "1½" changes meaning if naively converted to "11/2". The Unicode Character Fallback Substitutions rules as implemented in this package would produce "1 1/2", but there is no comparable remedy for superscript two.

See the (withdrawn draft) Unicode Technical Report 30: CHARACTER FOLDINGS, as well as the earlier draft Unicode Technical Report 25: CHARACTER FOLDINGS, for commentary.
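The "1½" case above can be illustrated with a minimal standard-library sketch. The two replacers below are hypothetical one-entry tables written for this illustration; they are not the package's data or API. They only show why a fallback with a leading space preserves the meaning where a naive one does not.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical one-entry substitution tables, for illustration only.
var (
	// naive drops the fraction in place, gluing it to the preceding digit.
	naive = strings.NewReplacer("½", "1/2")
	// spaced prefixes the fallback with a space, as in the " 1/2" fallback
	// listed for '½' in this documentation.
	spaced = strings.NewReplacer("½", " 1/2")
)

func main() {
	in := "1½"
	fmt.Println(naive.Replace(in))  // 11/2  (reads as eleven halves)
	fmt.Println(spaced.Replace(in)) // 1 1/2 (still one and a half)
}
```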

Example (Combinations) ¶

input := []string{
	"a�b�c",
	"d�e�f",
	"w",
	"x�y�z",
}

it := must.Result(combinations(input))
xs := lazy.ToSlice(it)

must.Equal(3*3*1*3, len(xs))

sort.Slice(xs, func(i int, j int) bool {
	return operator.LT(xs[i], xs[j])
})

for _, x := range xs {
	fmt.Println(x)
}

Output:

adwx
adwy
adwz
aewx
aewy
aewz
afwx
afwy
afwz
bdwx
bdwy
bdwz
bewx
bewy
bewz
bfwx
bfwy
bfwz
cdwx
cdwy
cdwz
cewx
cewy
cewz
cfwx
cfwy
cfwz
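The output above is the Cartesian product of the alternatives at each position. As a rough sketch of that idea only, the hypothetical helper `product` below builds the same 27 strings eagerly from a slice of slices. It is an assumption made for illustration: the package's unexported `combinations` is lazy and takes a different input encoding.

```go
package main

import "fmt"

// product returns every string formed by picking one alternative for each
// position, in position order. Hypothetical helper, not the package's
// combinations function.
func product(choices [][]string) []string {
	out := []string{""}
	for _, alts := range choices {
		next := make([]string, 0, len(out)*len(alts))
		for _, prefix := range out {
			for _, alt := range alts {
				next = append(next, prefix+alt)
			}
		}
		out = next
	}
	return out
}

func main() {
	xs := product([][]string{
		{"a", "b", "c"},
		{"d", "e", "f"},
		{"w"},
		{"x", "y", "z"},
	})
	fmt.Println(len(xs), xs[0], xs[len(xs)-1]) // 27 adwx cfwz
}
```

Because the result grows multiplicatively with each position, an eager version like this is only reasonable for short inputs; a lazy iterator avoids holding every combination in memory at once.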

Example (Dstarts) ¶

a := dstarts('a')
_ = dstarts(0x2A600) // last item

for _, r := range string(a) {
	fmt.Printf("%c\n", r)
}

Output:

à
á
â
ã
ä
å
ā
ă
ą
ǎ
ȁ
ȃ
ȧ
ḁ
ạ
ả

Index ¶

func Equivalent(in string) (lazy.It[string], error)
func Is(r rune, s string) bool
func Subs(x rune) []string

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Equivalent ¶

func Equivalent(in string) (lazy.It[string], error)

Equivalent returns a lazy.It that produces all strings canonically equivalent to the input. Note that this is very expensive for large strings. Note also that the results do not include any Unicode Character Fallback Substitutions.

This is a clean-room implementation of Mark Davis's algorithm described at https://unicode.org/notes/tn5/#Enumerating_Equivalent_Strings
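To see why enumerating equivalent strings matters, note that canonically equivalent spellings are different byte sequences, so a plain `==` comparison treats them as unrelated. The sketch below uses only the standard library and three spellings of "Å" that appear in the ICU test case quoted in the example for this function. `distinctSpellings` is a hypothetical helper for this illustration.

```go
package main

import "fmt"

// distinctSpellings counts how many byte-wise different strings are in forms.
// Hypothetical helper, for illustration only.
func distinctSpellings(forms []string) int {
	seen := make(map[string]struct{})
	for _, s := range forms {
		seen[s] = struct{}{}
	}
	return len(seen)
}

func main() {
	forms := []string{
		"\u0041\u030A", // LATIN CAPITAL LETTER A + COMBINING RING ABOVE
		"\u00C5",       // LATIN CAPITAL LETTER A WITH RING ABOVE
		"\u212B",       // ANGSTROM SIGN
	}
	for _, s := range forms {
		fmt.Printf("%x equal to first: %t\n", []rune(s), s == forms[0])
	}
	fmt.Println("distinct byte strings:", distinctSpellings(forms)) // 3
}
```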

Example ¶

package main

import (
	"fmt"
	"strings"

	lazy "github.com/tawesoft/golib/v2/iter"
	"github.com/tawesoft/golib/v2/must"
	"github.com/tawesoft/golib/v2/text/fallback"
	"golang.org/x/text/unicode/runenames"
)

func main() {
	input := "\u0041\u030A\u0064\u0307\u0327"
	fmt.Printf("Input: %s %x %x\n", input, []rune(input), []byte(input))
	eq := must.Result(fallback.Equivalent(input))

	lazy.Walk(func(x string) {
		fmt.Printf("%s: %x = %s\n", x, []rune(x),
			lazy.Join(lazy.StringJoiner(", "),
				lazy.Map(strings.ToLower,
					lazy.Map[rune, string](runenames.Name,
						lazy.FromString(x)))))
	}, eq)

	/* (Found in the Unicode ICU as a test case)

	   Results for: {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}

	   1: \u0041\u030A\u0064\u0307\u0327
	    = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   2: \u0041\u030A\u0064\u0327\u0307
	    = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   3: \u0041\u030A\u1E0B\u0327
	    = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   4: \u0041\u030A\u1E11\u0307
	    = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}
	   5: \u00C5\u0064\u0307\u0327
	    = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   6: \u00C5\u0064\u0327\u0307
	    = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   7: \u00C5\u1E0B\u0327
	    = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   8: \u00C5\u1E11\u0307
	    = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}
	   9: \u212B\u0064\u0307\u0327
	    = {ANGSTROM SIGN}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   10: \u212B\u0064\u0327\u0307
	    = {ANGSTROM SIGN}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   11: \u212B\u1E0B\u0327
	    = {ANGSTROM SIGN}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   12: \u212B\u1E11\u0307
	    = {ANGSTROM SIGN}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}

	*/

	// TODO for some reason our implementation is missing the variants with an Angstrom Sign.
	//   This is probably due to Go's Unicode version being older than the example.
	//   Revisit once new Unicode versions land

}

Output:

Input: Åḑ̇ [41 30a 64 307 327] 41cc8a64cc87cca7
Åḑ̇: [41 30a 64 327 307] = latin capital letter a, combining ring above, latin small letter d, combining cedilla, combining dot above
Åḑ̇: [c5 64 327 307] = latin capital letter a with ring above, latin small letter d, combining cedilla, combining dot above
Åḑ̇: [41 30a 1e0b 327] = latin capital letter a, combining ring above, latin small letter d with dot above, combining cedilla
Åḑ̇: [c5 1e0b 327] = latin capital letter a with ring above, latin small letter d with dot above, combining cedilla
Å̧: [41 30a 327] = latin capital letter a, combining ring above, combining cedilla
Å̧: [c5 327] = latin capital letter a with ring above, combining cedilla
Åḑ̇: [41 30a 1e11 307] = latin capital letter a, combining ring above, latin small letter d with cedilla, combining dot above
Åḑ̇: [c5 1e11 307] = latin capital letter a with ring above, latin small letter d with cedilla, combining dot above
Å̇: [41 30a 307] = latin capital letter a, combining ring above, combining dot above
Å̇: [c5 307] = latin capital letter a with ring above, combining dot above

func Is ¶

func Is(r rune, s string) bool

Is returns true iff the provided string is a possible fallback string produced by Unicode Character Fallback Substitution rules applied to the input rune. Neither argument is required to be normalised on input.

For example,

Is('㎦', "㎞³") // true
Is('㎦', "km³") // true
Is('㎦', "km3") // true

Example ¶

package main

import (
	"fmt"

	"github.com/tawesoft/golib/v2/text/fallback"
)

func main() {
	type row struct {
		input       rune
		alternative string
	}

	rows := []row{
		{'㎦', "㎞³"},
		{'㎦', "km³"},
		{'㎦', "km3"},
		{'㎦', "foo"},
		{'²', "2"},
		{'½', "1⁄2"},  // 0x2044
		{'½', " 1/2"}, // 0x002F
	}

	for _, r := range rows {
		q := fallback.Is(r.input, r.alternative)
		fmt.Printf("Is %s a fallback for %c? %t\n",
			r.alternative, r.input, q)
	}

}

Output:

Is ㎞³ a fallback for ㎦? true
Is km³ a fallback for ㎦? true
Is km3 a fallback for ㎦? true
Is foo a fallback for ㎦? false
Is 2 a fallback for ²? true
Is 1⁄2 a fallback for ½? true
Is  1/2 a fallback for ½? true

func Subs ¶

func Subs(x rune) []string

Subs returns a complete list of strings that can be used as fallbacks for the input rune, in order of priority, according to the Unicode Character Fallback Substitutions rules.

Example ¶

package main

import (
	"fmt"

	"github.com/tawesoft/golib/v2/text/fallback"
)

func main() {
	rows := []rune{
		'㎦', '²', '½',
	}

	for _, r := range rows {
		fmt.Printf("=== %c ===\n", r)
		for _, s := range fallback.Subs(r) {
			fmt.Println(s)
		}
	}

}

Output:

=== ㎦ ===
㎦
km³
km3
=== ² ===
²
2
=== ½ ===
½
 1/2
1⁄2
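Because the list is in priority order, one plausible use is to take the first entry that the output device can show. The sketch below assumes that approach. `firstSupported`, `isASCII` and `isLatin1` are hypothetical helpers, and the candidate list is copied from the example output for '㎦' above rather than obtained by calling the package.

```go
package main

import (
	"fmt"
	"unicode"
)

// firstSupported returns the first candidate made up only of runes accepted
// by ok, or false if no candidate qualifies. Hypothetical helper.
func firstSupported(candidates []string, ok func(rune) bool) (string, bool) {
	for _, c := range candidates {
		all := true
		for _, r := range c {
			if !ok(r) {
				all = false
				break
			}
		}
		if all {
			return c, true
		}
	}
	return "", false
}

// isASCII and isLatin1 stand in for "what the output device can display".
func isASCII(r rune) bool  { return r <= unicode.MaxASCII }
func isLatin1(r rune) bool { return r <= unicode.MaxLatin1 }

func main() {
	// Copied from the documented output of Subs('㎦'), in priority order.
	candidates := []string{"㎦", "km³", "km3"}

	s, _ := firstSupported(candidates, isLatin1)
	fmt.Println(s) // km³
	s, _ = firstSupported(candidates, isASCII)
	fmt.Println(s) // km3
}
```

As the overview warns, a caller should still decide whether the chosen fallback keeps the meaning of the text; this selection only considers displayability.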

Types ¶

This section is empty.
