cmpbin

package
v0.0.0-...-feb63dc Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 13, 2025 License: Apache-2.0 Imports: 5 Imported by: 20

Documentation

Overview

Package cmpbin provides binary serialization routines which ensure that the serialized objects maintain the same sort order of the original inputs when sorted bytewise (i.e. with memcmp). Additionally, serialized objects are concatenatable, and the concatenated items will behave as if they're compared field-to-field. So, for example, comparing each string in a []string would compare the same way as comparing the concatenation of those strings encoded with cmpbin. Simply concatenating the strings without encoding them will NOT retain this property, as you could not distinguish []string{"a", "aa"} from []string{"aa", "a"}. With cmpbin, these two would unambiguously sort as ("a", "aa") < ("aa", "a").

Notes on particular serialization schemes:

- Numbers: The number encoding is less efficient on average than Varint ("encoding/binary") for small numbers (it has a minimum encoded size of 2 bytes), but is more efficient for large numbers (it has a maximum encoded size of 9 bytes for a 64 bit int, unlike the largest Varint which has a 10b representation).

Both signed and unsigned numbers are encoded with the same scheme, and will sort together as signed numbers. Decoding with the incorrect routine will result in an ErrOverflow/ErrUnderflow error if the actual value is out of range.

The scheme works like:

  • given an 2's compliment value V
  • extract the sign (S) and magnitude (M) of V
  • Find the position of the highest bit (P), minus 1.
  • write (bits):
  • SPPPPPPP MMMMMMMM MM000000
  • S is 1
  • P's are the log2(M)-1
  • M's are the magnitude of V
  • 0's are padding
  • Additionally, if the number is negative, invert the bits of all the bytes (e.g. XOR 0xFF). This makes the sign bit S 0 for negative numbers, and makes the ordering of the numbers correct when compared bytewise.

- Strings/[]byte Each byte in the encoded stream reserves the least significant bit as a stop bit (1 means that the string continues, 0 means that the string ends). The actual user data is shifted into the top 7 bits of every encoded byte. This results in a data inflation rate of 12.5%, but this overhead is constant (doesn't vary by the encoded content). Note that if space efficiency is very important and you are storing large strings on average, you could reduce the overhead by only placing the stop bit on every other byte or every 4th byte, etc. This would reduce the overhead to 6.25% or 3.125% accordingly (but would cause every string to round out to 2 or 4 byte chunks), and it would make the algorithm implementation more complex. The current implementation was chosen as good enough in light of the fact that pre-compressing regular data could save more than 12.5% overall, and that for incompressable data a commonly used encoding scheme (base64) has a full 25% overhead (and a generally more complex implementation).

- Floats Floats are tricky (really tricky) because they have lots of weird non-sortable special cases (like NaN). That said, for the majority of non-weird cases, the implementation here will sort real numbers the way that you would expect.

The implementation is derived from http://stereopsis.com/radix.html, and full credit for the original algorithm goes to Michael Herf. The algorithm is essentially:

  • if the number is positive, flip the top bit
  • if the number is negative, flip all the bits

Floats are not varint encoded, you could varint encode the mantissa (significand). This is only a 52 bit section, meaning that it is normally encoded with 6.5 bytes (a nybble is stolen from the second exponent byte). Assuming you used the numerical encoding above, shifted left by 4 bits, discarding the sign bit (since its already the MSb on the float, and then using 6 bits (instead of 7) to represent the number of significant bits in the mantissa (since there are only a maximum of 52), you could expect to see small-mantissa floats (of any characteristic) encoded in 3 bytes (this has 6 bits of mantissa), and the largest floats would have an encoded size of 9 bytes (with 2 wasted bits). However the implementation complexity would be higher.

The actual encoded values for special cases are (sorted high to low):

  • QNaN - 0xFFF8000000000000 // note that golang doesn't seem to actually have SNaN?
  • SNaN - 0xFFF0000000000001
  • +inf - 0xFFF0000000000000
  • MaxFloat64 - 0xFFEFFFFFFFFFFFFF
  • SmallestNonzeroFloat64 - 0x8000000000000001
  • 0 - 0x8000000000000000
  • -0 - 0x7FFFFFFFFFFFFFFF
  • -SmallestNonzeroFloat64 - 0x7FFFFFFFFFFFFFFE
  • -MaxFloat64 - 0x0010000000000000
  • -inf - 0x000FFFFFFFFFFFFF

Index

Constants

View Source
const (
	MaxIntLen16 = 3
	MaxIntLen32 = 5
	MaxIntLen64 = 9
)

MaxIntLenN is the maximum length of a cmpbin-encoded N-bit integer (signed or unsigned).

Variables

View Source
var ErrByteLimitExceeded = errors.New("cmbpin: too big! tried to read > cmpbin.ReadByteLimit")

ErrByteLimitExceeded is returned from ReadBytes and ReadString when they attempt to read more than ReadByteLimit bytes.

View Source
var ErrOverflow = errors.New("cmpbin: varint overflows")

ErrOverflow is returned when reading an number which is too large for the destination type (either uint64 or int64)

View Source
var ErrUnderflow = errors.New("cmpbin: uvarint underflows")

ErrUnderflow is returned when reading an number which is too small for the destination type (either uint64 or int64)

View Source
var ReadByteLimit = int(math.Ceil(2 * 1024 * 1024 * 8 / 7))

ReadByteLimit is the limit of how many bytes ReadBytes and ReadString are willing to deserialize before returning ErrByteLimitExceeded. It is currently set to allow 2MB of user data (taking encoding size overhead into account).

Functions

func ConcatBytes

func ConcatBytes(itms ...[]byte) []byte

ConcatBytes is a convenience invocation of bytes.Join(itms, nil)

func IncrementBytes

func IncrementBytes(bstr []byte) ([]byte, bool)

IncrementBytes attempts to increment a copy of bstr as if adding 1 to an integer.

If it overflows, the returned []byte will be all 0's, and the overflow bool will be true.

func InvertBytes

func InvertBytes(bs []byte) []byte

InvertBytes simply inverts all the bytes in bs.

func ReadBytes

func ReadBytes(buf io.ByteReader) (ret []byte, n int, err error)

ReadBytes reads an encoded []byte from buf, returning the number of bytes read, and any read error encountered.

func ReadFloat64

func ReadFloat64(buf io.Reader) (ret float64, n int, err error)

ReadFloat64 reads a memcmp-sortable float from buf (as written by WriteFloat64). It also returns the number of bytes read (8, unless an error occurs), and any read error encountered.

func ReadInt

func ReadInt(r io.ByteReader) (ret int64, n int, err error)

ReadInt decodes a cmpbin-encoded number from a ByteReader. It returns the decoded value and the number of bytes read. The error may be Err{Over,Under}flow if the number is out of bounds. It may also return an error if the ByteReader returns an error.

func ReadString

func ReadString(buf io.ByteReader) (ret string, n int, err error)

ReadString reads an encoded string from buf, returning the number of bytes read, and any read error encountered.

func ReadUint

func ReadUint(r io.ByteReader) (mag uint64, n int, err error)

ReadUint decodes a cmpbin-encoded positive number from a ByteReader. It returns the decoded value and the number of bytes read. The error may be Err{Over,Under}flow if the number is out of bounds. It may also return an error if the ByteReader returns an error.

func WriteBytes

func WriteBytes(buf io.ByteWriter, data []byte) (n int, err error)

WriteBytes writes an encoded []byte to buf, returning the number of bytes written, and any write error encountered.

func WriteFloat64

func WriteFloat64(buf io.Writer, v float64) (n int, err error)

WriteFloat64 writes a memcmp-sortable float to buf. It returns the number of bytes written (8, unless an error occurs), and any write error encountered.

func WriteInt

func WriteInt(w io.ByteWriter, val int64) (int, error)

WriteInt val as a cmpbin Int to the ByteWriter. Returns the number of bytes written. Only returns an error if the underlying ByteWriter returns an error.

func WriteString

func WriteString(buf io.ByteWriter, s string) (n int, err error)

WriteString writes an encoded string to buf, returning the number of bytes written, and any write error encountered.

func WriteUint

func WriteUint(w io.ByteWriter, mag uint64) (int, error)

WriteUint writes mag to the ByteWriter. Returns the number of bytes written. Only returns an error if the underlying ByteWriter returns an error.

Types

type InvertibleBytesBuffer

type InvertibleBytesBuffer interface {
	WriteableBytesBuffer
	SetInvert(inverted bool)
}

InvertibleBytesBuffer is just like Buffer, except that it also has a stateful SetInvert() method, which will cause all reads and writes to/from it to be inverted (e.g. every byte XOR 0xFF).

In contexts where you need comparable byte sequences, like datastore queries, it requires manipulating the sortable fields (e.g. synthesizing them, parsing them, etc.). In particular, when you have a reverse-sorted field (e.g. high to low instead of low to high), it's achieved by having all the bits inverted.

Serialization formats (like gae/service/datastore.Serialize) can include delimiter information, which the parsers only know to parse non-inverted. If we don't have this Invertible buffer, we'd basically have to invert every byte in the []byte array when we're trying to decode a reverse-ordered field (including the bytes of all fields after the one we intend to parse) so that the parser can consume as many bytes as it needs (and it only knows the number of bytes it needs as it decodes them). This InvertibleBytesBuffer lets that happen on the fly without having to flip the whole underlying []byte.

If you know you need it, you'll know it's the right thing. If you're not sure then you definitely don't need it!

See InvertBytes for doing one-off []byte inversions.

func Invertible

Invertible returns an InvertibleBuffer based on the Buffer.

type ReadableBytesBuffer

type ReadableBytesBuffer interface {
	Len() int

	Read([]byte) (int, error)
	ReadByte() (byte, error)
}

ReadableBytesBuffer is the interface which corresponds to the subset of *bytes.Reader required for reading.

type WriteableBytesBuffer

type WriteableBytesBuffer interface {
	ReadableBytesBuffer

	String() string
	Bytes() []byte

	Grow(int)

	Write([]byte) (int, error)
	WriteByte(c byte) error
	WriteString(s string) (int, error)
}

WriteableBytesBuffer is the interface which corresponds to the subset of *bytes.Buffer required for writing.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL