Documentation
Overview
Package encoding provides a few of the encoding structures that are missing from the Go x/text/encoding tree.
Constants
const (
	// RuneError is an alias for the UTF-8 replacement rune, '\uFFFD'.
	RuneError = '\uFFFD'

	// RuneSelf is the rune below which UTF-8 and the Unicode values are
	// identical. It's also the limit for ASCII.
	RuneSelf = 0x80

	// ASCIISub is the ASCII substitution character.
	ASCIISub = '\x1a'
)
Variables
var ASCII encoding.Encoding
ASCII represents the 7-bit US-ASCII scheme. It decodes directly to UTF-8 without change, as all ASCII values are legal UTF-8. Unicode values less than 128 (i.e. 7 bits) map 1:1 with ASCII. It encodes runes outside of that to 0x1A, the ASCII substitution character.
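The substitution rule described here can be sketched with the standard library alone. This is only an illustration of the behavior, not the package's implementation; encodeASCII and asciiSub are hypothetical names introduced for the example.

```go
package main

import "fmt"

// asciiSub mirrors the documented ASCIISub value (0x1A).
// It is redefined here so the sketch stands on its own.
const asciiSub = 0x1A

// encodeASCII is a hypothetical helper, not part of this package.
// Runes below 0x80 are copied as single bytes; every other rune
// is replaced by the substitution character.
func encodeASCII(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r < 0x80 {
			out = append(out, byte(r))
		} else {
			out = append(out, asciiSub)
		}
	}
	return out
}

func main() {
	// U+00EF (i with diaeresis) is outside ASCII, so it is substituted.
	fmt.Printf("%q\n", encodeASCII("na\u00efve"))
}
```

Note that the substitution is lossy: once a rune has been replaced by 0x1A, the original text cannot be recovered by decoding.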
var EBCDIC encoding.Encoding
EBCDIC represents the 8-bit EBCDIC scheme, found in some mainframe environments. If you don't know what this is, consider yourself lucky.
var ISO8859_1 encoding.Encoding
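ISO8859_1 represents the 8-bit ISO8859-1 scheme. Each byte decodes to the Unicode code point with the same numeric value, because Unicode values less than 256 (i.e. 8 bits) map 1:1 with 8859-1. Note that this is not the same as the UTF-8 byte values: bytes below 0x80 are unchanged, while bytes of 0x80 and above become two-byte UTF-8 sequences. It encodes runes outside of that range to 0x1A, the ASCII substitution character.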
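To make the relationship between ISO8859-1 bytes, Unicode code points, and UTF-8 concrete, here is a minimal standard-library sketch. decodeLatin1 is a hypothetical helper written for this example and is not how the package itself is implemented.

```go
package main

import "fmt"

// decodeLatin1 is a hypothetical helper, not part of this package.
// Each input byte becomes the rune with the same numeric value;
// converting the runes to a string produces the UTF-8 encoding.
func decodeLatin1(src []byte) string {
	runes := make([]rune, len(src))
	for i, b := range src {
		runes[i] = rune(b)
	}
	return string(runes)
}

func main() {
	in := []byte{0x63, 0x61, 0x66, 0xE9} // "café" in ISO8859-1
	out := decodeLatin1(in)
	// The input is 4 bytes; the UTF-8 output is 5 bytes because
	// 0xE9 (U+00E9) needs a two-byte UTF-8 sequence.
	fmt.Println(out, len(in), len(out))
}
```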
var ISO8859_9 encoding.Encoding
ISO8859_9 represents the 8-bit ISO8859-9 scheme.
var UTF8 encoding.Encoding = validUtf8{}
UTF8 is an encoding for UTF-8. All it does is verify that its input is valid UTF-8. The main reason for its existence is that it will detect and report transform.ErrShortSrc or transform.ErrShortDst, whereas the Nop encoding blithely passes every byte through.
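The difference between a truncated sequence and an invalid one is the heart of the short-source case. The sketch below shows one way to tell them apart using only unicode/utf8. The names checkUTF8, errShortSrc, and errInvalid are stand-ins invented for this example; how the real encoding treats invalid bytes is not stated above, so returning an error for them is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// Stand-in errors for this sketch only. The real package reports
// errors defined in golang.org/x/text/transform.
var (
	errShortSrc = errors.New("short source buffer")
	errInvalid  = errors.New("invalid UTF-8")
)

// checkUTF8 is a hypothetical helper. It returns the number of
// bytes verified. If the buffer ends in the middle of a sequence
// and more input may follow (atEOF is false), it reports
// errShortSrc; otherwise a bad byte yields errInvalid.
func checkUTF8(src []byte, atEOF bool) (int, error) {
	n := 0
	for n < len(src) {
		r, size := utf8.DecodeRune(src[n:])
		if r == utf8.RuneError && size == 1 {
			// Either a bad byte or a sequence cut off by the
			// end of the buffer; FullRune tells them apart.
			if !atEOF && !utf8.FullRune(src[n:]) {
				return n, errShortSrc
			}
			return n, errInvalid
		}
		n += size
	}
	return n, nil
}

func main() {
	// 0xE2 0x82 is the start of a three-byte sequence.
	partial := []byte{'a', 0xE2, 0x82}
	fmt.Println(checkUTF8(partial, false)) // more input may follow
	fmt.Println(checkUTF8(partial, true))  // nothing more is coming
}
```

A pass-through transform would copy the two trailing bytes without complaint, which is the behavior the UTF8 encoding exists to avoid.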
Functions
This section is empty.
Types
type Charmap
type Charmap struct {
	transform.NopResetter

	// The map between bytes and runes. To indicate that a specific
	// byte value is invalid for a character set, use the rune
	// utf8.RuneError. Values that are absent from this map will
	// be assumed to have the identity mapping -- that is, the default
	// is to assume ISO8859-1, where all 8-bit characters have the same
	// numeric value as their Unicode runes. (Not to be confused with
	// the UTF-8 values, which *will* be different for non-ASCII runes.)
	//
	// If no values less than RuneSelf are changed (or have non-identity
	// mappings), then the character set is assumed to be an ASCII
	// superset, and certain assumptions and optimizations become
	// available for ASCII bytes.
	Map map[byte]rune

	// The ReplacementChar is the byte value to use for substitution.
	// It should normally be ASCIISub for ASCII encodings. This may be
	// unset (left to zero) for mappings that are strictly ASCII supersets.
	// In that case ASCIISub will be assumed instead.
	ReplacementChar byte
	// contains filtered or unexported fields
}
Charmap is a structure for setting up encodings for 8-bit character sets, so that text can be transformed between UTF-8 and the 8-bit character set. It has some ideas borrowed from golang.org/x/text/encoding/charmap, but it uses a different implementation. This implementation uses maps, and supports user-defined maps.
We do assume that a character map has a reasonable substitution character, and that valid encodings are stable (exactly a 1:1 map) and stateless (that is, there is no shift character or anything like that). Hence this approach will not work for many East Asian character sets.
Measurement shows little or no difference in the performance of the two approaches. The difference was down to a couple of nsec/op, with no consistent pattern as to which ran faster. The conversion to UTF-8 takes about 25 nsec/op. The conversion in the reverse direction takes about 100 nsec/op. (The larger cost for conversion from UTF-8 is most likely due to the need to convert the UTF-8 byte stream to a rune before conversion.)
func (*Charmap) Init
func (c *Charmap) Init()
Init initializes internal values of a character map. This should be done early, to minimize the cost of allocation of transforms later. It is not strictly necessary, however, as the allocation functions will arrange to call it if it has not already been done.
func (*Charmap) NewDecoder
NewDecoder returns a Decoder that converts from the 8-bit character set to UTF-8. Unknown mappings, if any, are mapped to '\uFFFD'.
func (*Charmap) NewEncoder
NewEncoder returns a Transformer that converts from UTF-8 to the 8-bit character set. Unknown mappings are mapped to 0x1A.