fallback

package
v2.8.1
Published: Feb 17, 2023 License: MIT Imports: 12 Imported by: 0

Documentation

Overview

Package fallback implements Unicode Character Fallback Substitutions using the Unicode CLDR 41.0 supplemental data file characters.xml, and an algorithm for enumerating every canonically equivalent string.

This can be useful for robustly parsing Unicode strings where for practical reasons (e.g. missing keyboard keys, missing font support) certain fallbacks have been used, or for picking a sensible default when certain Unicode strings cannot be displayed (e.g. missing font support).

Note that care must be taken not to change the meaning of a text. For example, superscript two '²' has a (last resort) Character Fallback Substitution to the digit '2' via NFKC normalisation, but the two have entirely different meanings. Similarly, the string "1½" changes meaning if naively converted to "11/2". The Unicode Character Fallback Substitutions rules as implemented in this package would produce "1 1/2", but this doesn't help for superscript two.
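The difference between a naive substitution and one that keeps a leading space can be sketched with the standard library alone. The table and the substitute function below are hypothetical and hand-written for illustration; they are not the CLDR data or the API of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// hypotheticalFallbacks is a hand-written table for illustration only.
// It is NOT the CLDR characters.xml data used by the package.
var hypotheticalFallbacks = map[rune]string{
	'½': " 1/2", // the leading space keeps "1½" from collapsing into "11/2"
	'²': "2",    // last-resort fallback; the meaning of the text may change
}

// substitute replaces every rune that has an entry in table with its
// fallback string and copies all other runes through unchanged.
func substitute(s string, table map[rune]string) string {
	var b strings.Builder
	for _, r := range s {
		if sub, ok := table[r]; ok {
			b.WriteString(sub)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(substitute("1½", hypotheticalFallbacks)) // 1 1/2
	fmt.Println(substitute("x²", hypotheticalFallbacks)) // x2
}
```

The second line of output shows the remaining hazard: "x2" no longer reads as "x squared", so a caller still has to decide whether a last-resort fallback is acceptable in context.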

See the (withdrawn draft) Unicode Technical Report 30: CHARACTER FOLDINGS, as well as the earlier draft Unicode Technical Report 25: CHARACTER FOLDINGS, for commentary.

Example (Combinations)
input := []string{
	"a�b�c",
	"d�e�f",
	"w",
	"x�y�z",
}

it := must.Result(combinations(input))
xs := lazy.ToSlice(it)

must.Equal(3*3*1*3, len(xs))

sort.Slice(xs, func(i int, j int) bool {
	return operator.LT(xs[i], xs[j])
})

for _, x := range xs {
	fmt.Println(x)
}
Output:

adwx
adwy
adwz
aewx
aewy
aewz
afwx
afwy
afwz
bdwx
bdwy
bdwz
bewx
bewy
bewz
bfwx
bfwy
bfwz
cdwx
cdwy
cdwz
cewx
cewy
cewz
cfwx
cfwy
cfwz
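The example above enumerates one choice per input position, which is a Cartesian product. A minimal, eager sketch of that idea follows, assuming the alternatives are already split into slices. The name product is hypothetical; the package's unexported combinations function returns a lazy iterator instead and may work differently.

```go
package main

import "fmt"

// product returns every string formed by picking exactly one alternative
// from each slot, in slot order. If any slot is empty the result is empty.
func product(slots [][]string) []string {
	out := []string{""}
	for _, alts := range slots {
		next := make([]string, 0, len(out)*len(alts))
		for _, prefix := range out {
			for _, alt := range alts {
				next = append(next, prefix+alt)
			}
		}
		out = next
	}
	return out
}

func main() {
	slots := [][]string{
		{"a", "b", "c"},
		{"d", "e", "f"},
		{"w"},
		{"x", "y", "z"},
	}
	xs := product(slots)
	fmt.Println(len(xs))              // 27
	fmt.Println(xs[0], xs[len(xs)-1]) // adwx cfwz
}
```

The result size is the product of the slot sizes (here 3*3*1*3), which is why an eager slice like this becomes impractical for long inputs and a lazy iterator is the better fit.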
Example (Dstarts)
a := dstarts('a')
_ = dstarts(0x2A600) // last item

for _, r := range string(a) {
	fmt.Printf("%c\n", r)
}
Output:

à
á
â
ã
ä
å
ā
ă
ą
ǎ
ȁ
ȃ
ȧ
ḁ
ạ
ả

Constants

This section is empty.

Variables

This section is empty.

Functions

func Equivalent

func Equivalent(in string) (lazy.It[string], error)

Equivalent returns a lazy.It that produces all strings canonically equivalent to the input. Note that this is very expensive for large strings, and that the results do not include any Unicode Character Fallback Substitutions.

This is a clean-room implementation of Mark Davis's algorithm described at https://unicode.org/notes/tn5/#Enumerating_Equivalent_Strings
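One ingredient of canonical equivalence is that adjacent combining marks with different canonical combining classes may appear in either order. The sketch below illustrates only that reordering step, using a hand-copied table of combining classes for three marks; the names ccc and canonicalOrder are hypothetical. It does not handle precomposed characters such as U+00C5 or U+212B, which require decomposition data, and it is not the package's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// ccc holds canonical combining classes for the three marks used below,
// hand-copied for illustration. Any rune not listed is treated as a
// starter (class 0), which is an assumption of this sketch.
var ccc = map[rune]int{
	0x030A: 230, // COMBINING RING ABOVE
	0x0307: 230, // COMBINING DOT ABOVE
	0x0327: 202, // COMBINING CEDILLA
}

// canonicalOrder stable-sorts each run of non-starters by combining class.
// Marks with equal classes keep their relative order.
func canonicalOrder(s string) string {
	rs := []rune(s)
	for i := 0; i < len(rs); {
		if ccc[rs[i]] == 0 {
			i++
			continue
		}
		j := i
		for j < len(rs) && ccc[rs[j]] != 0 {
			j++
		}
		run := rs[i:j] // shares storage with rs, so sorting edits rs in place
		sort.SliceStable(run, func(a, b int) bool {
			return ccc[run[a]] < ccc[run[b]]
		})
		i = j
	}
	return string(rs)
}

func main() {
	a := "\u0041\u030A\u0064\u0307\u0327"
	b := "\u0041\u030A\u0064\u0327\u0307"
	fmt.Println(canonicalOrder(a) == canonicalOrder(b)) // true
	fmt.Printf("%x\n", []rune(canonicalOrder(a)))       // [41 30a 64 327 307]
}
```

Two strings that differ only in the order of such marks map to the same canonically ordered form, so an enumeration of equivalent strings has to emit both orderings.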

Example
package main

import (
	"fmt"
	"strings"

	lazy "github.com/tawesoft/golib/v2/iter"
	"github.com/tawesoft/golib/v2/must"
	"github.com/tawesoft/golib/v2/text/fallback"
	"golang.org/x/text/unicode/runenames"
)

func main() {
	input := "\u0041\u030A\u0064\u0307\u0327"
	fmt.Printf("Input: %s %x %x\n", input, []rune(input), []byte(input))
	eq := must.Result(fallback.Equivalent(input))

	lazy.Walk(func(x string) {
		fmt.Printf("%s: %x = %s\n", x, []rune(x),
			lazy.Join(lazy.StringJoiner(", "),
				lazy.Map(strings.ToLower,
					lazy.Map[rune, string](runenames.Name,
						lazy.FromString(x)))))
	}, eq)

	/* (Found in the Unicode ICU as a test case)

	   Results for: {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}

	   1: \u0041\u030A\u0064\u0307\u0327
	    = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   2: \u0041\u030A\u0064\u0327\u0307
	    = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   3: \u0041\u030A\u1E0B\u0327
	    = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   4: \u0041\u030A\u1E11\u0307
	    = {LATIN CAPITAL LETTER A}{COMBINING RING ABOVE}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}
	   5: \u00C5\u0064\u0307\u0327
	    = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   6: \u00C5\u0064\u0327\u0307
	    = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   7: \u00C5\u1E0B\u0327
	    = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   8: \u00C5\u1E11\u0307
	    = {LATIN CAPITAL LETTER A WITH RING ABOVE}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}
	   9: \u212B\u0064\u0307\u0327
	    = {ANGSTROM SIGN}{LATIN SMALL LETTER D}{COMBINING DOT ABOVE}{COMBINING CEDILLA}
	   10: \u212B\u0064\u0327\u0307
	    = {ANGSTROM SIGN}{LATIN SMALL LETTER D}{COMBINING CEDILLA}{COMBINING DOT ABOVE}
	   11: \u212B\u1E0B\u0327
	    = {ANGSTROM SIGN}{LATIN SMALL LETTER D WITH DOT ABOVE}{COMBINING CEDILLA}
	   12: \u212B\u1E11\u0307
	    = {ANGSTROM SIGN}{LATIN SMALL LETTER D WITH CEDILLA}{COMBINING DOT ABOVE}

	*/

	// TODO for some reason our implementation is missing the two variants with an Angstrom Sign.
	//   This is probably due to Go's Unicode version being older than the example.
	//   Revisit once new Unicode versions land

}
Output:

Input: Åḑ̇ [41 30a 64 307 327] 41cc8a64cc87cca7
Åḑ̇: [41 30a 64 327 307] = latin capital letter a, combining ring above, latin small letter d, combining cedilla, combining dot above
Åḑ̇: [c5 64 327 307] = latin capital letter a with ring above, latin small letter d, combining cedilla, combining dot above
Åḑ̇: [41 30a 1e0b 327] = latin capital letter a, combining ring above, latin small letter d with dot above, combining cedilla
Åḑ̇: [c5 1e0b 327] = latin capital letter a with ring above, latin small letter d with dot above, combining cedilla
Å̧: [41 30a 327] = latin capital letter a, combining ring above, combining cedilla
Å̧: [c5 327] = latin capital letter a with ring above, combining cedilla
Åḑ̇: [41 30a 1e11 307] = latin capital letter a, combining ring above, latin small letter d with cedilla, combining dot above
Åḑ̇: [c5 1e11 307] = latin capital letter a with ring above, latin small letter d with cedilla, combining dot above
Å̇: [41 30a 307] = latin capital letter a, combining ring above, combining dot above
Å̇: [c5 307] = latin capital letter a with ring above, combining dot above

func Is

func Is(r rune, s string) bool

Is returns true iff the provided string is a possible fallback string produced by Unicode Character Fallback Substitution rules applied to the input rune. Neither argument is required to be normalised on input.

For example,

Is('㎦', "㎞³") // true
Is('㎦', "km³") // true
Is('㎦', "km3") // true
Example
package main

import (
	"fmt"

	"github.com/tawesoft/golib/v2/text/fallback"
)

func main() {
	type row struct {
		input       rune
		alternative string
	}

	rows := []row{
		{'㎦', "㎞³"},
		{'㎦', "km³"},
		{'㎦', "km3"},
		{'㎦', "foo"},
		{'²', "2"},
		{'½', "1⁄2"},  // 0x2044
		{'½', " 1/2"}, // 0x002F
	}

	for _, r := range rows {
		q := fallback.Is(r.input, r.alternative)
		fmt.Printf("Is %s a fallback for %c? %t\n",
			r.alternative, r.input, q)
	}

}
Output:

Is ㎞³ a fallback for ㎦? true
Is km³ a fallback for ㎦? true
Is km3 a fallback for ㎦? true
Is foo a fallback for ㎦? false
Is 2 a fallback for ²? true
Is 1⁄2 a fallback for ½? true
Is  1/2 a fallback for ½? true

func Subs

func Subs(x rune) []string

Subs returns a complete list of strings that can be used as fallbacks for the input rune, in order of priority, according to the Unicode Character Fallback Substitutions rules.

Example
package main

import (
	"fmt"

	"github.com/tawesoft/golib/v2/text/fallback"
)

func main() {
	rows := []rune{
		'㎦', '²', '½',
	}

	for _, r := range rows {
		fmt.Printf("=== %c ===\n", r)
		for _, s := range fallback.Subs(r) {
			fmt.Println(s)
		}
	}

}
Output:

=== ㎦ ===
㎦
km³
km3
=== ² ===
²
2
=== ½ ===
½
 1/2
1⁄2
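Because the list is ordered by priority, one way to pick a sensible default is to take the first candidate that the output medium can handle. The sketch below assumes a hypothetical predicate (isASCII stands in for a real font or encoding check) and uses candidates copied from the example output above rather than calling the package.

```go
package main

import "fmt"

// isASCII is a stand-in for a real "can this be displayed?" check.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

// firstAcceptable returns the first candidate accepted by ok, preserving
// the priority order of the input. The boolean reports whether any
// candidate was accepted.
func firstAcceptable(candidates []string, ok func(string) bool) (string, bool) {
	for _, c := range candidates {
		if ok(c) {
			return c, true
		}
	}
	return "", false
}

func main() {
	// Candidates copied from the example output for '㎦'.
	candidates := []string{"㎦", "km³", "km3"}
	s, found := firstAcceptable(candidates, isASCII)
	fmt.Println(s, found) // km3 true
}
```

A caller should still handle the case where nothing is accepted, since a rune with no suitable fallback leaves only the original character in the list.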

Types

This section is empty.
