Documentation ¶
Overview ¶
Package words implements Unicode word boundaries: https://unicode.org/reports/tr29/#Word_Boundaries
Index ¶
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func BleveIdeographic ¶ added in v1.12.0
BleveIdeographic determines if a token is comprised of ideographs, by the Bleve segmenter's definition. It is the union of Han, Katakana, & Hiragana. See https://github.com/blevesearch/segment/blob/master/segment_words.rl ...and search for uses of "Ideo". This API is experimental.
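The check described above — every rune in the token belonging to the union of Han, Katakana, and Hiragana — can be sketched with only the standard library. This is an illustration of the idea, not the package's implementation, and `isIdeographic` is a hypothetical name:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isIdeographic reports whether every rune in token is Han, Katakana,
// or Hiragana. A rough stdlib sketch of the check described above;
// not the package's own implementation.
func isIdeographic(token []byte) bool {
	if len(token) == 0 {
		return false
	}
	for i := 0; i < len(token); {
		r, size := utf8.DecodeRune(token[i:])
		if !unicode.Is(unicode.Han, r) &&
			!unicode.Is(unicode.Katakana, r) &&
			!unicode.Is(unicode.Hiragana, r) {
			return false
		}
		i += size
	}
	return true
}

func main() {
	fmt.Println(isIdeographic([]byte("世界")))   // Han: true
	fmt.Println(isIdeographic([]byte("カタカナ"))) // Katakana: true
	fmt.Println(isIdeographic([]byte("hello"))) // Latin: false
}
```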
func BleveNumeric ¶ added in v1.12.0
BleveNumeric determines if a token is Numeric using the Bleve segmenter's definition; see: https://github.com/blevesearch/segment/blob/master/segment_words.rl#L199-L207 This API is experimental.
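As a rough intuition for this kind of predicate, the sketch below approximates a numeric-token check with the standard library: decimal digits optionally separated by medial ',' or '.' (e.g. "1,000.5"). This is an assumption for illustration only — the authoritative definition is the one in the linked segment_words.rl, and `isNumericish` is a hypothetical name, not this package's API:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isNumericish is a loose approximation of a numeric-token check:
// decimal digits, optionally separated by medial ',' or '.'.
// Not the Bleve definition; see the linked ragel file for that.
func isNumericish(token []byte) bool {
	sawDigit := false
	prevDigit := false
	for i := 0; i < len(token); {
		r, size := utf8.DecodeRune(token[i:])
		switch {
		case unicode.IsDigit(r):
			sawDigit = true
			prevDigit = true
		case r == ',' || r == '.':
			// Separators must appear between digits.
			if !prevDigit {
				return false
			}
			prevDigit = false
		default:
			return false
		}
		i += size
	}
	// Must contain a digit and must not end on a separator.
	return sawDigit && prevDigit
}

func main() {
	fmt.Println(isNumericish([]byte("1,000.5"))) // true
	fmt.Println(isNumericish([]byte("v1.2")))    // false
}
```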
func NewScanner ¶
NewScanner returns a Scanner, to tokenize words per https://unicode.org/reports/tr29/#Word_Boundaries. Iterate through words by calling Scan() until false, then check Err(). See also the bufio.Scanner docs.
Example ¶
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/iterators/filter"
	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := "Hello, 世界. Nice dog! 👍🐶"
	r := strings.NewReader(text)

	sc := words.NewScanner(r)
	sc.Filter(filter.Wordlike) // let's exclude whitespace & punctuation

	// Scan returns true until error or EOF
	for sc.Scan() {
		// Do something with the token (segment)
		fmt.Printf("%q\n", sc.Text())
	}

	// Gotta check the error!
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:

"Hello"
"世"
"界"
"Nice"
"dog"
"👍"
"🐶"
func NewSegmenter ¶ added in v1.7.0
NewSegmenter returns a Segmenter, which is an iterator over the source text. Iterate while Next() is true, and access the segmented words via Bytes().
Example ¶
package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/uax29/iterators/filter"
	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := []byte("Hello, 世界. Nice dog! 👍🐶")

	seg := words.NewSegmenter(text)
	seg.Filter(filter.Wordlike) // let's exclude whitespace & punctuation

	// Next returns true until error or end of data
	for seg.Next() {
		// Do something with the token (segment)
		fmt.Printf("%q\n", seg.Bytes())
	}

	// Gotta check the error!
	if err := seg.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:

"Hello"
"世"
"界"
"Nice"
"dog"
"👍"
"🐶"
func SegmentAll ¶ added in v1.7.0
SegmentAll will iterate through all tokens and collect them into a [][]byte. This is a convenience method -- if you will be allocating such a slice anyway, this will save you some code. The downside is that this allocation is unbounded -- O(n) on the number of tokens. Use Segmenter for more bounded memory usage.
Example ¶
package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/words"
)

func main() {
	text := []byte("Hello, 世界. Nice dog! 👍🐶")

	segments := words.SegmentAll(text)
	fmt.Printf("%q\n", segments)
}
Output: ["Hello" "," " " "世" "界" "." " " "Nice" " " "dog" "!" " " "👍" "🐶"]
Types ¶
This section is empty.