Documentation ¶
Overview ¶
String-processing helpers for fuzzy detection and normalized token matching against keyword lists.
Index ¶
- func SlugContainsExplicitSlur(raw string) string
- func SlugIsExplicitSlur(raw string) string
- func Slugify(orig string) string
- func TokenInSet(tok string, set []string) bool
- func TokenizeIdentifier(orig string) []string
- func TokenizeText(text string) []string
- func TokenizeTextSkippingCensorChars(text string) []string
- func TokenizeTextWithRegex(text string, nonTokenCharsRegex *regexp.Regexp) []string
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func SlugContainsExplicitSlur ¶
For a small set of frequently-abused explicit slurs, checks for a permissive set of "l33t-speak" variations of the keyword. This is intended to be used with pre-processed "slugs", which are strings with all whitespace, punctuation, and other non-alphanumeric characters removed. These could be pre-processed identifiers (like handles or record keys), or pre-processed free-form text.
If there is a match, returns a plain-text version of the slur.
This is a loose port of the 'hasExplicitSlur' function from the `@atproto/pds` TypeScript package.
func SlugIsExplicitSlur ¶
Variant of `SlugContainsExplicitSlur` where the entire slug must match.
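The difference between the "contains" and "is" variants can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the harmless placeholder keyword `noise`, the substitution table `leetClasses`, and the helper names `slugContainsKeyword` and `slugIsKeyword` are hypothetical and are not taken from the package, whose actual keyword list and patterns are not shown in this documentation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keyword is a harmless placeholder standing in for an entry in a keyword list.
const keyword = "noise"

// leetClasses maps a letter to a character class of look-alike substitutions.
// Hypothetical table for illustration only; not the package's actual rules.
var leetClasses = map[rune]string{
	'a': "[a4]",
	'e': "[e3]",
	'i': "[i1]",
	'o': "[o0]",
	's': "[s5]",
}

// leetPattern builds an unanchored regex source that matches permissive
// variants of word, one character class per letter.
func leetPattern(word string) string {
	var b strings.Builder
	for _, r := range word {
		if class, ok := leetClasses[r]; ok {
			b.WriteString(class)
		} else {
			b.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	return b.String()
}

var (
	containsRe = regexp.MustCompile(leetPattern(keyword))
	wholeRe    = regexp.MustCompile("^" + leetPattern(keyword) + "$")
)

// slugContainsKeyword returns the plain-text keyword if a variant appears
// anywhere in the slug, or "" if there is no match.
func slugContainsKeyword(slug string) string {
	if containsRe.MatchString(slug) {
		return keyword
	}
	return ""
}

// slugIsKeyword returns the plain-text keyword only if the entire slug is a
// variant of it, or "" otherwise.
func slugIsKeyword(slug string) string {
	if wholeRe.MatchString(slug) {
		return keyword
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", slugContainsKeyword("xxn01s3xx")) // "noise"
	fmt.Printf("%q\n", slugIsKeyword("xxn01s3xx"))       // ""
	fmt.Printf("%q\n", slugIsKeyword("n0i5e"))           // "noise"
}
```

Returning the plain-text keyword rather than a boolean mirrors the documented return type (`string`), with the empty string assumed here to mean "no match".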
func Slugify ¶
Takes an arbitrary string (e.g., an identifier or free-form text) and returns a version with all non-letter, non-digit characters removed and all remaining characters converted to lower case.
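A minimal standard-library sketch of the described behavior follows. The name `slugifySketch` is hypothetical, and this is not the package's implementation: in particular, it performs no Unicode normalization or folding, which the real function may or may not do.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// slugifySketch keeps only letters and digits and lower-cases them.
// Illustrative only; the package's Slugify may handle Unicode differently.
func slugifySketch(orig string) string {
	var b strings.Builder
	for _, r := range orig {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(unicode.ToLower(r))
		}
	}
	return b.String()
}

func main() {
	fmt.Println(slugifySketch("The-Handle.bsky.social")) // thehandlebskysocial
	fmt.Println(slugifySketch("Hello, World! 123"))      // helloworld123
}
```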
func TokenInSet ¶
Helper to check a single token against a list of tokens.
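As a rough sketch, such a check can be a linear scan over the list. The name `tokenInSetSketch` is hypothetical, and exact, case-sensitive comparison is an assumption; the documentation does not state how tokens are compared.

```go
package main

import "fmt"

// tokenInSetSketch reports whether tok equals any entry in set.
// Assumes exact, case-sensitive string comparison.
func tokenInSetSketch(tok string, set []string) bool {
	for _, v := range set {
		if v == tok {
			return true
		}
	}
	return false
}

func main() {
	keywords := []string{"spam", "scam"}
	fmt.Println(tokenInSetSketch("spam", keywords)) // true
	fmt.Println(tokenInSetSketch("Spam", keywords)) // false
}
```

Because the comparison in this sketch is exact, tokens would need to be normalized (for example, lower-cased by a tokenizer) before being checked against the list.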
func TokenizeIdentifier ¶
Splits an identifier into tokens. Removes any single-character tokens.
For example, the-handle.bsky.social would be split into ["the", "handle", "bsky", "social"].
func TokenizeText ¶
func TokenizeTextWithRegex ¶
Splits free-form text into tokens, applying lower-casing, Unicode normalization, and some Unicode folding.
The intent is for this to work similarly to an NLP tokenizer, as might be used in a full-text search engine, and to enable fast matching against a list of known tokens. It might eventually even do stemming, removing pluralization (trailing "s" for English), etc.
Types ¶
This section is empty.