Documentation ¶
Overview ¶
Package diacritics is a subpackage of package candidate that attempts to remove diacritical marks from extended Latin letters using one of two strategies.
- Strategy #1: Straight diacritics removal (NFKD -> strip Mn -> NFKC)
- Strategy #2: Apache Lucene ASCII folding
Index ¶
Constants ¶
This section is empty.
Variables ¶
var AsciiFoldTransformer = transform.Chain(
	norm.NFKC,
	&asciiFoldSpanningTransformer{},
)
AsciiFoldTransformer is a Unicode stream transformer that first normalizes its input to NFKC and then replaces each character with its ASCII-folded equivalent.
var AsciiFoldTranslateTable = map[rune]string{}/* 1240 elements not displayed */
AsciiFoldTranslateTable is the ASCII folding database, fetched from Apache Lucene's ASCIIFoldingFilter: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
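To illustrate how a rune-to-string table of this shape might be applied, here is a minimal sketch using only the standard library. The names sampleFoldTable and foldASCII are hypothetical, the four entries are a small illustrative stand-in for the real 1240-element table, and the package's own asciiFoldSpanningTransformer may well work differently.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleFoldTable is a tiny, hypothetical stand-in for a table shaped like
// AsciiFoldTranslateTable (map[rune]string). It is not the real data.
var sampleFoldTable = map[rune]string{
	'\u00C6': "AE", // Æ
	'\u00DF': "ss", // ß
	'\u00E9': "e",  // é
	'\u00F8': "o",  // ø
}

// foldASCII replaces every rune that has an entry in table with the mapped
// string and copies all other runes through unchanged.
func foldASCII(s string, table map[rune]string) string {
	var b strings.Builder
	b.Grow(len(s))
	for _, r := range s {
		if repl, ok := table[r]; ok {
			b.WriteString(repl)
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(foldASCII("Ærøskøbing", sampleFoldTable)) // AEroskobing
	fmt.Println(foldASCII("Straße", sampleFoldTable))     // Strasse
}
```

Because the map values are strings rather than runes, a single input character can expand to several ASCII characters, as with 'Æ' and 'ß' above.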
var StripDiacriticalMarksTransformer = transform.Chain(
	norm.NFKD,
	runes.Remove(runes.In(runedata.CombiningDiacriticalMarks)),
	norm.NFKC,
)
StripDiacriticalMarksTransformer is a Unicode stream transformer that removes as many combining diacritical marks from the input string as possible. It handles different representations of the same character where possible, such as 'ö' as a single precomposed code point (U+00F6) versus 'o' followed by a combining diaeresis (U+0308), which is two code points.
Removal is preceded by Unicode compatibility decomposition (NFKD), and the result is then recomposed (NFKC) to produce the final output.
Functions ¶
This section is empty.
Types ¶
This section is empty.