Documentation ¶
Overview ¶
Package cmpbin provides binary serialization routines which ensure that the serialized objects maintain the same sort order of the original inputs when sorted bytewise (i.e. with memcmp). Additionally, serialized objects are concatenatable, and the concatenated items will behave as if they're compared field-to-field. So, for example, comparing each string in a []string would compare the same way as comparing the concatenation of those strings encoded with cmpbin. Simply concatenating the strings without encoding them will NOT retain this property, as you could not distinguish []string{"a", "aa"} from []string{"aa", "a"}. With cmpbin, these two would unambiguously sort as ("a", "aa") < ("aa", "a").
Notes on particular serialization schemes:
- Numbers: The number encoding is less efficient on average than Varint ("encoding/binary") for small numbers (it has a minimum encoded size of 2 bytes), but is more efficient for large numbers (it has a maximum encoded size of 9 bytes for a 64 bit int, unlike the largest Varint which has a 10b representation).
Both signed and unsigned numbers are encoded with the same scheme, and will sort together as signed numbers. Decoding with the incorrect routine will result in an ErrOverflow/ErrUnderflow error if the actual value is out of range.
The scheme works like:
- given an 2's compliment value V
- extract the sign (S) and magnitude (M) of V
- Find the position of the highest bit (P), minus 1.
- write (bits):
- SPPPPPPP MMMMMMMM MM000000
- S is 1
- P's are the log2(M)-1
- M's are the magnitude of V
- 0's are padding
- Additionally, if the number is negative, invert the bits of all the bytes (e.g. XOR 0xFF). This makes the sign bit S 0 for negative numbers, and makes the ordering of the numbers correct when compared bytewise.
- Strings/[]byte Each byte in the encoded stream reserves the least significant bit as a stop bit (1 means that the string continues, 0 means that the string ends). The actual user data is shifted into the top 7 bits of every encoded byte. This results in a data inflation rate of 12.5%, but this overhead is constant (doesn't vary by the encoded content). Note that if space efficiency is very important and you are storing large strings on average, you could reduce the overhead by only placing the stop bit on every other byte or every 4th byte, etc. This would reduce the overhead to 6.25% or 3.125% accordingly (but would cause every string to round out to 2 or 4 byte chunks), and it would make the algorithm implementation more complex. The current implementation was chosen as good enough in light of the fact that pre-compressing regular data could save more than 12.5% overall, and that for incompressable data a commonly used encoding scheme (base64) has a full 25% overhead (and a generally more complex implementation).
- Floats Floats are tricky (really tricky) because they have lots of weird non-sortable special cases (like NaN). That said, for the majority of non-weird cases, the implementation here will sort real numbers the way that you would expect.
The implementation is derived from http://stereopsis.com/radix.html, and full credit for the original algorithm goes to Michael Herf. The algorithm is essentially:
- if the number is positive, flip the top bit
- if the number is negative, flip all the bits
Floats are not varint encoded, you could varint encode the mantissa (significand). This is only a 52 bit section, meaning that it is normally encoded with 6.5 bytes (a nybble is stolen from the second exponent byte). Assuming you used the numerical encoding above, shifted left by 4 bits, discarding the sign bit (since its already the MSb on the float, and then using 6 bits (instead of 7) to represent the number of significant bits in the mantissa (since there are only a maximum of 52), you could expect to see small-mantissa floats (of any characteristic) encoded in 3 bytes (this has 6 bits of mantissa), and the largest floats would have an encoded size of 9 bytes (with 2 wasted bits). However the implementation complexity would be higher.
The actual encoded values for special cases are (sorted high to low):
- QNaN - 0xFFF8000000000000 // note that golang doesn't seem to actually have SNaN?
- SNaN - 0xFFF0000000000001
- +inf - 0xFFF0000000000000
- MaxFloat64 - 0xFFEFFFFFFFFFFFFF
- SmallestNonzeroFloat64 - 0x8000000000000001
- 0 - 0x8000000000000000
- -0 - 0x7FFFFFFFFFFFFFFF
- -SmallestNonzeroFloat64 - 0x7FFFFFFFFFFFFFFE
- -MaxFloat64 - 0x0010000000000000
- -inf - 0x000FFFFFFFFFFFFF
Index ¶
- Constants
- Variables
- func ReadBytes(buf io.ByteReader) (ret []byte, n int, err error)
- func ReadFloat64(buf io.Reader) (ret float64, n int, err error)
- func ReadInt(r io.ByteReader) (ret int64, n int, err error)
- func ReadString(buf io.ByteReader) (ret string, n int, err error)
- func ReadUint(r io.ByteReader) (mag uint64, n int, err error)
- func WriteBytes(buf io.ByteWriter, data []byte) (n int, err error)
- func WriteFloat64(buf io.Writer, v float64) (n int, err error)
- func WriteInt(w io.ByteWriter, val int64) (int, error)
- func WriteString(buf io.ByteWriter, s string) (n int, err error)
- func WriteUint(w io.ByteWriter, mag uint64) (int, error)
Constants ¶
const ( MaxIntLen16 = 3 MaxIntLen32 = 5 MaxIntLen64 = 9 )
MaxIntLenN is the maximum length of a cmpbin-encoded N-bit integer (signed or unsigned).
Variables ¶
var ErrByteLimitExceeded = errors.New("cmbpin: too big! tried to read > cmpbin.ReadByteLimit")
ErrByteLimitExceeded is returned from ReadBytes and ReadString when they attempt to read more than ReadByteLimit bytes.
ErrOverflow is returned when reading an number which is too large for the destination type (either uint64 or int64)
ErrUnderflow is returned when reading an number which is too small for the destination type (either uint64 or int64)
ReadByteLimit is the limit of how many bytes ReadBytes and ReadString are willing to deserialize before returning ErrByteLimitExceeded. It is currently set to allow 2MB of user data (taking encoding size overhead into account).
Functions ¶
func ReadBytes ¶
func ReadBytes(buf io.ByteReader) (ret []byte, n int, err error)
ReadBytes reads an encoded []byte from buf, returning the number of bytes read, and any read error encountered.
func ReadFloat64 ¶
ReadFloat64 reads a memcmp-sortable float from buf (as written by WriteFloat64). It also returns the number of bytes read (8, unless an error occurs), and any read error encountered.
func ReadInt ¶
func ReadInt(r io.ByteReader) (ret int64, n int, err error)
ReadInt decodes a cmpbin-encoded number from a ByteReader. It returns the decoded value and the number of bytes read. The error may be Err{Over,Under}flow if the number is out of bounds. It may also return an error if the ByteReader returns an error.
func ReadString ¶
func ReadString(buf io.ByteReader) (ret string, n int, err error)
ReadString reads an encoded string from buf, returning the number of bytes read, and any read error encountered.
func ReadUint ¶
func ReadUint(r io.ByteReader) (mag uint64, n int, err error)
ReadUint decodes a cmpbin-encoded positive number from a ByteReader. It returns the decoded value and the number of bytes read. The erorr may be Err{Over,Under}flow if the number is out of bounds. It may also return an error if the ByteReader returns an error.
func WriteBytes ¶
func WriteBytes(buf io.ByteWriter, data []byte) (n int, err error)
WriteBytes writes an encoded []byte to buf, returning the number of bytes written, and any write error encountered.
func WriteFloat64 ¶
WriteFloat64 writes a memcmp-sortable float to buf. It returns the number of bytes written (8, unless an error occurs), and any write error encountered.
func WriteInt ¶
func WriteInt(w io.ByteWriter, val int64) (int, error)
WriteInt val as a cmpbin Int to the ByteWriter. Returns the number of bytes written. Only returns an error if the underlying ByteWriter returns an error.
Types ¶
This section is empty.