base62

package module
v1.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 3, 2022 License: MIT Imports: 4 Imported by: 81

README

base62

GoDoc Go Report Card Issues GitHub release MIT License

base62 is a correctly implemented, compact and fast implementation of Base62 encoding/decoding algorithm. It is inspired by the java implementation by glowfall.

This Base62 implementation can encode/decode both integers and bytes of arbitrary length, the correctness is tested using a large number of randomly generated bytes.

It is much faster than big.Int based implementation and is not much slower than typical Base64 implementations. See the benchmark results below.

Why Base62

In comparison with Base64/Base32, Base62 is more friendly to human:

  • it contains only alpha-numerical symbols, no special characters
  • can be validated by eyes and more simple regexp
  • can be fully selected by mouse double-click in any text editors and browser address bar
  • it's compact and generates shorter strings than Base32
  • it's the most natural and unambiguous way to encode your data in human-readable form :)

Variations of Base62 algorithm cann be widely used to represent authentication data in printable and easy-copyable form, for example to encode OAuth 2.0 access_token data.

Usage

// Basic usage.
Encode(src []byte) []byte
EncodeToString(src []byte) string
Decode(src []byte) ([]byte, error)
DecodeString(src string) ([]byte, error)
FormatInt(num int64) []byte
FormatUint(num uint64) []byte
ParseInt(src []byte) (int64, error)
ParseUint(src []byte) (uint64, error)

// Providing a dst buffer, you may reuse buffers to reduce memory allocation.
EncodeToBuf(dst []byte, src []byte) []byte
DecodeToBuf(dst []byte, src []byte) ([]byte, error)
AppendInt(dst []byte, num int64) []byte
AppendUint(dst []byte, num uint64) []byte

// Or you may use a custom encoding alphabet.
enc := NewEncoding("...my-62-byte-string-alphabet...")
enc.XXX()

Benchmark

Benchmark_Encode-12                     11654754                97.41 ns/op
Benchmark_Decode-12                     15481666                73.60 ns/op
Benchmark_EncodeToString-12             11950086                99.54 ns/op
Benchmark_DecodeString-12               16301325                74.36 ns/op

Benchmark_EncodeToBuf-12                13855840                84.68 ns/op
Benchmark_DecodeToBuf-12                97695962                12.21 ns/op

Benchmark_EncodeInteger-12              29119437                41.30 ns/op
Benchmark_DecodeInteger-12             120328183                 9.917 ns/op

Benchmark_Encode_BigInt-12               1000000              1048 ns/op

Benchmark_Base64_EncodeToString-12      17803440                70.12 ns/op
Benchmark_Base64_DecodeString-12        19884616                55.09 ns/op

Benchmark_Base64_Encode-12              68163142                17.93 ns/op
Benchmark_Base64_Decode-12              41990004                28.25 ns/op

How it works

Encoding Base64 and Base32 both are bit-aligned, 6 bits for Base64 (2^6 = 64) and 5 bits for Base32 (2^5 = 32), but Base62 is not bit-aligned, it's inefficient to do divmod operation for non-bit-aligned integers, typical Base62 implementations are BigInt based, which encodes input data block by block to get better performance, (e.g. https://github.com/keybase/saltpack/blob/master/encoding/basex).

64 characters can fully represent 6 bits, but 62 characters can not, if each character represents 5 bits, that is the Base32 encoding.

A naive BigInt based algorithm gives the shortest result, which is roughly like this (e.g. https://github.com/eknkc/basex/blob/6baac8ea8b19cc66d125286d213770fec0691867/basex.go#L46):

digits := []int{0}
for i := 0; i < len(src); i++ {
    carry := int(src[i])

    for j := 0; j < len(digits); j++ {
        carry += digits[j] << 8
        digits[j] = carry % e.base
        carry = carry / e.base
    }

    for carry > 0 {
        digits = append(digits, carry%e.base)
        carry = carry / e.base
    }
}
// map to encoding alphabet
_ = digits

This works but the time complexity is O(n^2) where n is the length of src. If the input can be very large, the cost is unacceptable.

Improved algorithm splits the input into blocks, which reduce the time complexity to O(kn) where k is the block size, and n is the length of src, it generates slightly longer result than naive BigInt algorithm, but it is worth for the reduced time complexity. When k is n, it degrades to the naive algorithm, when k is 1, the output length is twice of the input, we don't want that. 32 is chosen as the block size in library saltpack/encoding/basex.

Inspired by the java implementation by glowfall, this library is not BigInt based, it encodes and decodes any arbitrary bytes in O(n) time complexity. Here is a brief description of the algorithm.

This library uses a variadic length encoding, for each 6 bits, if the value is in range [0, 62), it can be directly map to the 62 characters, if the value is 62 or 63 which exceeds 61, we turn to encoding just the lower 5 bits. If we find a way to recognize the 5 bits pattern, then we can correctly decode it back to the source data.

The binary representation of 62 and 63 is:

62 - 0b_0011_1110
63 - 0b_0011_1111

They have a common mask 0b_0011_1110, in range [0, 62), there are another two integers have a similar mask 0b_0001_1110, 30 and 31, which is:

30 - 0b_0001_1110
31 - 0b_0001_1111

The four integers 30, 31, 62, 63 share a common mask 0b_0001_1110, while all other integers in range [0, 64) don't share the mask, i.e. for all other integers, the expression value & 0b_0001_1110 == 0b_0001_1110 evaluates to false.

This is the key point!

We define a compactMask as 0b_0001_1110.

When encoding, for each 6 bits integer x, if x & compactMask == compactMask is true, it must be one of 30, 31, 62 or 63, we just encode the lower 5 bits, which are 0b_0001_1110 (30) and 0b_0001_1111 (31), we leave the 6-th bit to next byte.

When decoding, for each encoded byte x, we check x & compactMask == compactMask, when it is false, we know that the byte represents 6 bits in the raw data, it is in range [0, 30) or [32, 62), else it represents 5 bits in the raw data, it is 30 or 31.

That is it, by using variadic length encoding, we successfully limit the value range to [0, 62), we get very compact result and a simple O(n) time complexity for encoding and decoding data of arbitrary length.

Compatibility

This library guarantees that it can correctly decode data encoded by itself.

The encoded result is not compatible with BigInt based algorithm, (e.g. saltpack/encoding/basex, GMP and GnuPG).

The algorithm may be ported to other languages in future (if someone does a porting, I'm glad to link it here :-) ).

Changelog

v1.1.0 - 2022/1/3
  1. Refactor the encoding code to be simpler and cleaner, as a bonus, it gives better performance which is 1.5X~ faster than the old.
  2. Add a brief description of the algorithm and state the compatibility with other Base62 implementations.
v1.0.0 - 2021/10/23

First stable release, the package has been used in several small and medium projects.

This release adds new APIs which help to reuse buffers to reduce memory allocation.

Documentation

Index

Constants

This section is empty.

Variables

View Source
var StdEncoding = NewEncoding(encodeStd)

StdEncoding is the default base62 encoding using alphabet [A-Za-z0-9].

Functions

func AppendInt

func AppendInt(dst []byte, num int64) []byte

AppendInt appends the base62 representation of the integer num using StdEncoding, to dst and returns the extended buffer.

func AppendUint

func AppendUint(dst []byte, num uint64) []byte

AppendUint appends the base62 representation of the unsigned integer num using StdEncoding, to dst and returns the extended buffer.

func Decode

func Decode(src []byte) ([]byte, error)

Decode decodes src using StdEncoding, returns the decoded bytes.

If src contains invalid base62 data, it will return nil and CorruptInputError.

func DecodeString

func DecodeString(src string) ([]byte, error)

DecodeString returns the bytes represented by the base62 string src using StdEncoding.

func DecodeToBuf

func DecodeToBuf(dst []byte, src []byte) ([]byte, error)

DecodeToBuf decodes src using StdEncoding, appending the decoded bytes to dst. If dst has not enough capacity, it copies dst and returns the extended buffer.

If src contains invalid base62 data, it will return nil and CorruptInputError.

func Encode

func Encode(src []byte) []byte

Encode encodes src using StdEncoding, returns the encoded bytes.

func EncodeToBuf

func EncodeToBuf(dst []byte, src []byte) []byte

EncodeToBuf encodes src using StdEncoding, appending the encoded bytes to dst. If dst has not enough capacity, it copies dst and returns the extended buffer.

func EncodeToString

func EncodeToString(src []byte) string

EncodeToString returns a base62 string representation of src using StdEncoding.

func FormatInt

func FormatInt(num int64) []byte

FormatInt encodes an integer num to base62 using StdEncoding.

func FormatUint

func FormatUint(num uint64) []byte

FormatUint encodes an unsigned integer num to base62 using StdEncoding.

func ParseInt

func ParseInt(src []byte) (int64, error)

ParseInt returns an integer from its base62 representation using StdEncoding.

If src contains invalid base62 data, it returns 0 and CorruptInputError.

func ParseUint

func ParseUint(src []byte) (uint64, error)

ParseUint returns an unsigned integer from its base62 representation using StdEncoding.

If src contains invalid base62 data, it returns 0 and CorruptInputError.

Types

type CorruptInputError

type CorruptInputError int64

func (CorruptInputError) Error

func (e CorruptInputError) Error() string

type Encoding

type Encoding struct {
	// contains filtered or unexported fields
}

An Encoding is a radix 62 encoding/decoding scheme, defined by a 62-character alphabet.

func NewEncoding

func NewEncoding(encoder string) *Encoding

NewEncoding returns a new Encoding defined by the given alphabet, which must be a 62-byte string that does not contain CR / LF ('\r', '\n').

func (*Encoding) AppendInt

func (enc *Encoding) AppendInt(dst []byte, num int64) []byte

AppendInt appends the base62 representation of the integer num, as generated by FormatInt, to dst and returns the extended buffer.

func (*Encoding) AppendUint

func (enc *Encoding) AppendUint(dst []byte, num uint64) []byte

AppendUint appends the base62 representation of the unsigned integer num, as generated by FormatUint, to dst and returns the extended buffer.

func (*Encoding) Decode

func (enc *Encoding) Decode(src []byte) ([]byte, error)

Decode decodes src using the encoding enc, returns the decoded bytes.

If src contains invalid base62 data, it will return nil and CorruptInputError.

func (*Encoding) DecodeString

func (enc *Encoding) DecodeString(src string) ([]byte, error)

DecodeString returns the bytes represented by the base62 string src.

func (*Encoding) DecodeToBuf

func (enc *Encoding) DecodeToBuf(dst []byte, src []byte) ([]byte, error)

DecodeToBuf decodes src using the encoding enc, appending the decoded bytes to dst. If dst has not enough capacity, it copies dst and returns the extended buffer.

If src contains invalid base62 data, it will return nil and CorruptInputError.

func (*Encoding) Encode

func (enc *Encoding) Encode(src []byte) []byte

Encode encodes src using the encoding enc, returns the encoded bytes.

func (*Encoding) EncodeToBuf

func (enc *Encoding) EncodeToBuf(dst []byte, src []byte) []byte

EncodeToBuf encodes src using the encoding enc, appending the encoded bytes to dst. If dst has not enough capacity, it copies dst and returns the extended buffer.

func (*Encoding) EncodeToString

func (enc *Encoding) EncodeToString(src []byte) string

EncodeToString returns a base62 string representation of src.

func (*Encoding) FormatInt

func (enc *Encoding) FormatInt(num int64) []byte

FormatInt encodes an integer num to base62 using the encoding enc.

func (*Encoding) FormatUint

func (enc *Encoding) FormatUint(num uint64) []byte

FormatUint encodes an unsigned integer num to base62 using the encoding enc.

func (*Encoding) ParseInt

func (enc *Encoding) ParseInt(src []byte) (int64, error)

ParseInt returns an integer from its base62 representation.

If src contains invalid base62 data, it returns 0 and CorruptInputError.

func (*Encoding) ParseUint

func (enc *Encoding) ParseUint(src []byte) (uint64, error)

ParseUint returns an unsigned integer from its base62 representation.

If src contains invalid base62 data, it returns 0 and CorruptInputError.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL