uniseg

package

v1.21.9 (latest) | Published: Aug 6, 2024 | License: Apache-2.0, MIT | Imports: 1 | Imported by: 0

Repository

gitee.com/quant1x/gox

README ¶

Unicode Text Segmentation for Go

This Go package implements Unicode Text Segmentation according to Unicode Standard Annex #29 (Unicode version 12.0.0).

At this point, only the determination of grapheme cluster boundaries is implemented.

Background

In Go, strings are read-only slices of bytes. They can be turned into Unicode code points with a for ... range loop or by conversion: []rune(str). However, multiple code points may be combined into one user-perceived character, which the Unicode specification calls a "grapheme cluster". Here are some examples:

String	Bytes (UTF-8)	Code points (runes)	Grapheme clusters
Käse	6 bytes: `4b 61 cc 88 73 65`	5 code points: `4b 61 308 73 65`	4 clusters: `[4b],[61 308],[73],[65]`
🏳️‍🌈	14 bytes: `f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88`	4 code points: `1f3f3 fe0f 200d 1f308`	1 cluster: `[1f3f3 fe0f 200d 1f308]`
🇩🇪	8 bytes: `f0 9f 87 a9 f0 9f 87 aa`	2 code points: `1f1e9 1f1ea`	1 cluster: `[1f1e9 1f1ea]`

This package provides a tool to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
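
The first two columns of the table can be reproduced with the standard library alone. The following is a minimal sketch; the helper name byteAndRuneCounts is made up for illustration and is not part of this package. Counting the third column (grapheme clusters) is what this package adds.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts is a hypothetical helper (not part of uniseg). It returns
// the UTF-8 byte length of s and the number of code points in s.
func byteAndRuneCounts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "Käse" with a decomposed umlaut: 'a' followed by U+0308.
	b, r := byteAndRuneCounts("Ka\u0308se")
	fmt.Println(b, r) // 6 5

	// Regional indicators D and E, displayed as a single flag.
	b, r = byteAndRuneCounts("\U0001F1E9\U0001F1EA")
	fmt.Println(b, r) // 8 2
}
```

Neither number matches what a reader would count on screen (4 characters and 1 flag), which is the gap grapheme cluster segmentation closes.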

Installation

go get github.com/rivo/uniseg

Basic Example

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	gr := uniseg.NewGraphemes("👍🏼!")
	for gr.Next() {
		fmt.Printf("%x ", gr.Runes())
	}
	// Output: [1f44d 1f3fc] [21]
}

Documentation

Refer to https://godoc.org/github.com/rivo/uniseg for the package's documentation.

Dependencies

This package does not depend on any packages outside the standard library.

Your Feedback

Add your issue here on GitHub. Feel free to get in touch if you have any questions.

Version

Version tags will be introduced once Golang modules are official. Consider this version 0.1.

Documentation ¶

Overview ¶

Package uniseg implements Unicode Text Segmentation according to Unicode Standard Annex #29 (http://unicode.org/reports/tr29/).

At this point, only the determination of grapheme cluster boundaries is implemented.

Index ¶

func GraphemeClusterCount(s string) (n int)
type Graphemes
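- func NewGraphemes(s string) *Graphemes
- func (g *Graphemes) Bytes() []byte
- func (g *Graphemes) Next() bool
- func (g *Graphemes) Positions() (int, int)
- func (g *Graphemes) Reset()
- func (g *Graphemes) Runes() []rune
- func (g *Graphemes) Str() string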

Examples ¶

Graphemes

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func GraphemeClusterCount ¶

func GraphemeClusterCount(s string) (n int)

GraphemeClusterCount returns the number of user-perceived characters (grapheme clusters) for the given string. To calculate this number, it iterates through the string using the Graphemes iterator.
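
The counting strategy described above amounts to advancing an iterator until it is exhausted. Below is a minimal sketch of that loop, written against a hypothetical one-method interface so that it runs without this package installed. The names nexter, countClusters, and fixedIter are illustrative only, and fixedIter is merely a stand-in for a real iterator; this is not the package's actual implementation.

```go
package main

import "fmt"

// nexter is a hypothetical interface with the only method the count needs.
// Per the signature documented on this page, *Graphemes has Next() bool.
type nexter interface {
	Next() bool
}

// countClusters advances the iterator until Next reports false and returns
// how many times it advanced.
func countClusters(it nexter) (n int) {
	for it.Next() {
		n++
	}
	return n
}

// fixedIter is a stand-in iterator that yields a fixed number of items.
type fixedIter struct{ remaining int }

func (f *fixedIter) Next() bool {
	if f.remaining == 0 {
		return false
	}
	f.remaining--
	return true
}

func main() {
	fmt.Println(countClusters(&fixedIter{remaining: 4})) // 4
}
```

Assuming the signatures documented here, passing the result of NewGraphemes(s) to countClusters would be expected to give the same number as GraphemeClusterCount(s).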

Types ¶

type Graphemes ¶

type Graphemes struct {
	// contains filtered or unexported fields
}

Graphemes implements an iterator over Unicode extended grapheme clusters, as specified in Unicode Standard Annex #29. Grapheme clusters correspond to "user-perceived characters", which often consist of multiple code points. For example, the "woman kissing woman" emoji consists of 8 code points: woman + ZWJ + heavy black heart (2 code points) + ZWJ + kiss mark + ZWJ + woman. The rules described in Annex #29 must be applied to group those code points into clusters perceived by the user as one character.

Example ¶

Type example.

gr := NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
}

Output:

[1f44d 1f3fc] [21]

func NewGraphemes ¶

func NewGraphemes(s string) *Graphemes

NewGraphemes returns a new grapheme cluster iterator.

func (*Graphemes) Bytes ¶

func (g *Graphemes) Bytes() []byte

Bytes returns a byte slice which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, nil is returned.

func (*Graphemes) Next ¶

func (g *Graphemes) Next() bool

Next advances the iterator by one grapheme cluster and returns false if no clusters are left. This function must be called before the first cluster is accessed.

func (*Graphemes) Positions ¶

func (g *Graphemes) Positions() (int, int)

Positions returns the interval of the current grapheme cluster as byte positions into the original string. The first returned value "from" indexes the first byte and the second returned value "to" indexes the first byte that is not included anymore, i.e. str[from:to] is the current grapheme cluster of the original string "str". If Next() has not yet been called, both values are 0. If the iterator is already past the end, both values are 1.
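
The half-open interval convention can be illustrated without the package. The sketch below is not the Annex #29 algorithm: it uses a deliberately simplified rule (a nonspacing mark, category Mn, attaches to the preceding code point) and a made-up function naiveSpans, only to show how "from" and "to" byte positions slice the original string.

```go
package main

import (
	"fmt"
	"unicode"
)

// naiveSpans is a hypothetical helper (not part of uniseg). It returns
// half-open byte intervals [from, to) into s, grouping each nonspacing mark
// with the code point before it. Real grapheme segmentation has many more rules.
func naiveSpans(s string) [][2]int {
	var spans [][2]int
	for i, r := range s {
		if unicode.Is(unicode.Mn, r) && len(spans) > 0 {
			continue // the mark extends the current span
		}
		if len(spans) > 0 {
			spans[len(spans)-1][1] = i // close the previous span at this byte
		}
		spans = append(spans, [2]int{i, len(s)}) // end is provisional
	}
	return spans
}

func main() {
	s := "Ka\u0308se" // 'a' followed by U+0308 (combining diaeresis)
	for _, sp := range naiveSpans(s) {
		from, to := sp[0], sp[1]
		fmt.Printf("[%d:%d] % x\n", from, to, s[from:to])
	}
	// Output:
	// [0:1] 4b
	// [1:4] 61 cc 88
	// [4:5] 73
	// [5:6] 65
}
```

Because "to" is exclusive, consecutive intervals share a boundary and s[from:to] never needs a +1 adjustment.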

func (*Graphemes) Reset ¶

func (g *Graphemes) Reset()

Reset puts the iterator into its initial state such that the next call to Next() sets it to the first grapheme cluster again.

func (*Graphemes) Runes ¶

func (g *Graphemes) Runes() []rune

Runes returns a slice of runes (code points) which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, nil is returned.

func (*Graphemes) Str ¶

func (g *Graphemes) Str() string

Str returns a substring of the original string which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, an empty string is returned.
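
Bytes, Runes, and Str describe the same cluster in three representations. The following sketch uses only standard Go conversions to show how those representations relate for the first cluster of the example above (1f44d 1f3fc). The helper clusterViews is hypothetical and does not call this package.

```go
package main

import "fmt"

// clusterViews is a hypothetical helper (not part of uniseg). Given the code
// points of one cluster, it returns the equivalent string and UTF-8 bytes.
func clusterViews(runes []rune) (string, []byte) {
	s := string(runes)
	return s, []byte(s)
}

func main() {
	runes := []rune{0x1f44d, 0x1f3fc} // thumbs up + skin tone modifier
	s, b := clusterViews(runes)

	fmt.Printf("%x\n", runes)    // [1f44d 1f3fc]
	fmt.Printf("% x\n", b)       // f0 9f 91 8d f0 9f 8f bc
	fmt.Println(len(s), len(b))  // 8 8
	fmt.Println(string(b) == s)  // true
}
```

Which accessor to use depends on what the caller needs next: code point inspection, raw bytes for I/O, or a substring.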
