encoding

Version: v1.0.1
Published: Sep 20, 2024 | License: Apache-2.0 | Imports: 4 | Imported by: 2

README

encoding

Package encoding provides a number of encodings that are missing from the standard Go encoding package.

We hope that we can contribute these to the standard Go library someday. It turns out that some of these are useful for dealing with I/O streams coming from sources that are not UTF-8 friendly.

The UTF8 Encoder is also useful for situations where valid UTF-8 might be carried in streams that also contain invalid UTF-8; in particular I use it to cope with terminals that embed escape sequences in otherwise valid UTF-8.

Documentation

Overview

Package encoding provides a few of the encoding structures that are missing from the Go x/text/encoding tree.

Constants

const (
	// RuneError is an alias for the UTF-8 replacement rune, '\uFFFD'.
	RuneError = '\uFFFD'

	// RuneSelf is the rune below which UTF-8 and the Unicode values are
	// identical.  It's also the limit for ASCII.
	RuneSelf = 0x80

	// ASCIISub is the ASCII substitution character.
	ASCIISub = '\x1a'
)

Variables

ASCII represents the 7-bit US-ASCII scheme. It decodes directly to UTF-8 without change, as all ASCII values are legal UTF-8. Unicode values less than 128 (i.e. 7 bits) map 1:1 with ASCII. It encodes runes outside of that to 0x1A, the ASCII substitution character.
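The encoding rule described above can be sketched with the standard library alone. This is an illustration of the documented behavior, not the package's implementation; `encodeASCII` and the local constant `asciiSub` are hypothetical names introduced for this example.

```go
package main

import "fmt"

// asciiSub mirrors the documented value of ASCIISub (0x1A). It is
// redeclared here so the sketch has no external dependencies.
const asciiSub = 0x1a

// encodeASCII is a hypothetical helper: runes below 0x80 are copied as a
// single byte, and every other rune becomes the substitution byte.
func encodeASCII(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r < 0x80 {
			out = append(out, byte(r))
		} else {
			out = append(out, asciiSub)
		}
	}
	return out
}

func main() {
	// The non-ASCII rune 'ï' is replaced by a single 0x1A byte.
	fmt.Printf("%q\n", encodeASCII("naïve"))
}
```

Note that the substitution is lossy: the original rune cannot be recovered from the encoded bytes.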

EBCDIC represents the 8-bit EBCDIC scheme, found in some mainframe environments. If you don't know what this is, consider yourself lucky.

var ISO8859_1 encoding.Encoding

ISO8859_1 represents the 8-bit ISO8859-1 scheme. Each byte decodes to the Unicode code point with the same numeric value, so Unicode values less than 256 (i.e. 8 bits) map 1:1 with 8859-1. Only the ASCII range passes through byte-for-byte; bytes 0x80 and above become two-byte UTF-8 sequences. It encodes runes outside of that range to 0x1A, the ASCII substitution character.
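A standard-library sketch of the 1:1 mapping between 8859-1 bytes and Unicode values may help here. It is not the package's code, and `decodeLatin1` is a hypothetical name. The point it illustrates is that the numeric value of each byte is preserved as a rune, while the UTF-8 byte sequence for values of 0x80 and above is two bytes long.

```go
package main

import "fmt"

// decodeLatin1 is a hypothetical helper: each input byte becomes the rune
// with the same numeric value; converting the runes to a string yields
// their UTF-8 encoding.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// 0xE9 is 'é' in ISO8859-1; as UTF-8 it occupies two bytes.
	s := decodeLatin1([]byte{'c', 'a', 'f', 0xE9})
	fmt.Println(s, len(s)) // prints "café 5"
}
```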

var ISO8859_9 encoding.Encoding

ISO8859_9 represents the 8-bit ISO8859-9 scheme.

var UTF8 encoding.Encoding = validUtf8{}

UTF8 is an encoding for UTF-8. All it does is verify that its input is valid UTF-8. The main reason for its existence is that it will detect and report ErrShortSrc or ErrShortDst (from golang.org/x/text/transform), whereas the Nop encoding just passes every byte through, blithely.
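To make the distinction between invalid input and input that is merely cut short more concrete, here is a rough sketch using only `unicode/utf8`. It is not how the package implements its transformer; `checkUTF8` and its string states are hypothetical. The "short" state corresponds loosely to the situation in which a transformer would ask for more source bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// checkUTF8 is a hypothetical helper. It returns the number of leading
// bytes that form complete, valid UTF-8, plus one of three states:
//   "ok"      - the whole input is valid
//   "short"   - the input ends in an incomplete sequence
//   "invalid" - an invalid byte was found at offset n
func checkUTF8(b []byte) (n int, state string) {
	for n < len(b) {
		if !utf8.FullRune(b[n:]) {
			return n, "short"
		}
		r, size := utf8.DecodeRune(b[n:])
		if r == utf8.RuneError && size == 1 {
			return n, "invalid"
		}
		n += size
	}
	return n, "ok"
}

func main() {
	for _, in := range []string{"café", "caf\xc3", "a\xffb"} {
		n, state := checkUTF8([]byte(in))
		fmt.Printf("%q: %d valid bytes, %s\n", in, n, state)
	}
}
```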

Functions

This section is empty.

Types

type Charmap

type Charmap struct {
	transform.NopResetter

	// The map between bytes and runes.  To indicate that a specific
	// byte value is invalid for a character set, use the rune
	// utf8.RuneError.  Values that are absent from this map will
	// be assumed to have the identity mapping -- that is the default
	// is to assume ISO8859-1, where all 8-bit characters have the same
	// numeric value as their Unicode runes.  (Not to be confused with
	// the UTF-8 values, which *will* be different for non-ASCII runes.)
	//
	// If no values less than RuneSelf are changed (or have non-identity
	// mappings), then the character set is assumed to be an ASCII
	// superset, and certain assumptions and optimizations become
	// available for ASCII bytes.
	Map map[byte]rune

	// The ReplacementChar is the byte value to use for substitution.
	// It should normally be ASCIISub for ASCII encodings.  This may be
	// unset (left to zero) for mappings that are strictly ASCII supersets.
	// In that case ASCIISub will be assumed instead.
	ReplacementChar byte
	// contains filtered or unexported fields
}

Charmap is a structure for setting up encodings that transform between UTF-8 and an 8-bit character set. It has some ideas borrowed from golang.org/x/text/encoding/charmap, but it uses a different implementation. This implementation uses maps, and supports user-defined maps.

We do assume that a character map has a reasonable substitution character, and that valid encodings are stable (exactly a 1:1 map) and stateless (that is, there is no shift character or anything like that). Hence this approach will not work for many East Asian character sets.
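The mapping rules documented for Map and ReplacementChar can be sketched without the package itself. The type below is a hypothetical stand-in, not the real Charmap: `table`, `overrides`, `sub`, `decode`, and `encode` are names invented for this example, and the linear search in `encode` is chosen for brevity (an actual implementation would more likely precompute a reverse map).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// table is a hypothetical stand-in for a Charmap: overrides plays the
// role of Map, and sub plays the role of ReplacementChar.
type table struct {
	overrides map[byte]rune
	sub       byte
}

// decode maps one byte to a rune. Bytes absent from overrides use the
// identity mapping, as with ISO8859-1.
func (t table) decode(c byte) rune {
	if r, ok := t.overrides[c]; ok {
		return r
	}
	return rune(c)
}

// encode maps one rune back to a byte by scanning all 256 byte values.
// Runes with no byte equivalent become the substitution byte, which
// defaults to 0x1A when sub is left at zero.
func (t table) encode(r rune) byte {
	if r != utf8.RuneError {
		for i := 0; i < 256; i++ {
			if t.decode(byte(i)) == r {
				return byte(i)
			}
		}
	}
	if t.sub != 0 {
		return t.sub
	}
	return 0x1a
}

func main() {
	// Assumed example mapping: byte 0xA4 is the euro sign, as in ISO8859-15.
	t := table{overrides: map[byte]rune{0xA4: '€'}}
	fmt.Printf("%c\n", t.decode(0xA4))
	fmt.Printf("%#x %#x %#x\n", t.encode('€'), t.encode('A'), t.encode('日'))
}
```

Because byte 0xA4 has been reassigned to '€' in this example, the rune U+00A4 no longer has a byte equivalent and would also encode to the substitution byte; this is the 1:1 property the paragraph above relies on.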

Measurement shows little or no difference in performance between this map-based approach and the one in golang.org/x/text/encoding/charmap. The difference was down to a couple of nsec/op, with no consistent pattern as to which ran faster. Conversion to UTF-8 takes about 25 nsec/op; conversion in the reverse direction takes about 100 nsec/op. (The larger cost for conversion from UTF-8 is most likely due to the need to convert the UTF-8 byte stream to a rune before conversion.)

func (*Charmap) Init

func (c *Charmap) Init()

Init initializes internal values of a character map. This should be done early, to minimize the cost of allocation of transforms later. It is not strictly necessary however, as the allocation functions will arrange to call it if it has not already been done.

func (*Charmap) NewDecoder

func (c *Charmap) NewDecoder() *encoding.Decoder

NewDecoder returns a Decoder that converts from the 8-bit character set to UTF-8. Unknown mappings, if any, are mapped to '\uFFFD'.

func (*Charmap) NewEncoder

func (c *Charmap) NewEncoder() *encoding.Encoder

NewEncoder returns an Encoder that converts from UTF-8 to the 8-bit character set. Unknown mappings are mapped to 0x1A.
