README ¶
Stream VByte SIMD Go
This is a repository that contains a port of Stream VByte to Go. Notably, this repo takes extra care to leverage SIMD techniques to achieve better performance. Currently, there is support for x86_64 architectures that have AVX and AVX2 hardware instructions. In cases where that is not available, or on non x86_64 architectures there is a portable scalar implementation. We also perform a runtime check to make sure that the necessary ISA is available and if not fallback to the scalar approach.
There are several existing implementations:
- Reference C/C++
- Rust
- Go
- Note: only has a scalar implementation which prompted this implementation with SIMD techniques.
Benchmarks
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
BenchmarkMemCopy8Uint32-12 447501031 2.645 ns/op 12096.16 MB/s
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg/decode
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
BenchmarkGet8uint32Fast-12 371818682 3.751 ns/op 8531.87 MB/s
BenchmarkGet8uint32DeltaFast-12 296250846 4.316 ns/op 7414.33 MB/s
BenchmarkGet8uint32Scalar-12 63901624 19.35 ns/op 1653.34 MB/s
BenchmarkGet8uint32DeltaScalar-12 60604498 19.16 ns/op 1670.40 MB/s
BenchmarkGet8uint32Varint-12 25399839 47.10 ns/op 679.48 MB/s
BenchmarkGet8uint32DeltaVarint-12 23248798 60.74 ns/op 526.82 MB/s
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg/encode
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
BenchmarkPut8uint32Fast-12 292122826 4.039 ns/op 7922.87 MB/s
BenchmarkPut8uint32DeltaFast-12 272535512 4.400 ns/op 7272.49 MB/s
BenchmarkPut8uint32Scalar-12 37495168 28.56 ns/op 1120.49 MB/s
BenchmarkPut8uint32DeltaScalar-12 39794214 30.37 ns/op 1053.50 MB/s
BenchmarkPut8uint32Varint-12 60481904 23.35 ns/op 1370.58 MB/s
BenchmarkPut8uint32DeltaVarint-12 52820625 23.12 ns/op 1384.15 MB/s
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg/stream/reader
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
BenchmarkReadAllFast/Count_1e0-12 99590433 12.22 ns/op 327.21 MB/s
BenchmarkReadAllFast/Count_1e1-12 28132491 42.56 ns/op 939.83 MB/s
BenchmarkReadAllFast/Count_1e2-12 11250654 108.5 ns/op 3688.09 MB/s
BenchmarkReadAllFast/Count_1e3-12 1652133 731.5 ns/op 5468.28 MB/s
BenchmarkReadAllFast/Count_1e4-12 170557 7038 ns/op 5683.43 MB/s
BenchmarkReadAllFast/Count_1e5-12 16976 71057 ns/op 5629.25 MB/s
BenchmarkReadAllFast/Count_1e6-12 1627 726072 ns/op 5509.10 MB/s
BenchmarkReadAllFast/Count_1e7-12 153 7730007 ns/op 5174.64 MB/s
BenchmarkReadAllDeltaFast/Count_1e0-12 94608050 12.62 ns/op 317.08 MB/s
BenchmarkReadAllDeltaFast/Count_1e1-12 26393953 45.83 ns/op 872.72 MB/s
BenchmarkReadAllDeltaFast/Count_1e2-12 9537255 127.1 ns/op 3147.00 MB/s
BenchmarkReadAllDeltaFast/Count_1e3-12 1372822 871.4 ns/op 4590.46 MB/s
BenchmarkReadAllDeltaFast/Count_1e4-12 144782 8283 ns/op 4829.28 MB/s
BenchmarkReadAllDeltaFast/Count_1e5-12 14524 82339 ns/op 4857.94 MB/s
BenchmarkReadAllDeltaFast/Count_1e6-12 1440 830627 ns/op 4815.64 MB/s
BenchmarkReadAllDeltaFast/Count_1e7-12 129 9211114 ns/op 4342.58 MB/s
BenchmarkReadAllScalar/Count_1e0-12 127831876 9.369 ns/op 426.95 MB/s
BenchmarkReadAllScalar/Count_1e1-12 35675473 34.60 ns/op 1155.90 MB/s
BenchmarkReadAllScalar/Count_1e2-12 5006214 244.4 ns/op 1636.39 MB/s
BenchmarkReadAllScalar/Count_1e3-12 440270 2505 ns/op 1597.03 MB/s
BenchmarkReadAllScalar/Count_1e4-12 49596 24991 ns/op 1600.59 MB/s
BenchmarkReadAllScalar/Count_1e5-12 5034 243156 ns/op 1645.04 MB/s
BenchmarkReadAllScalar/Count_1e6-12 495 2452840 ns/op 1630.76 MB/s
BenchmarkReadAllScalar/Count_1e7-12 48 24554217 ns/op 1629.05 MB/s
BenchmarkReadAllDeltaScalar/Count_1e0-12 121810405 9.815 ns/op 407.54 MB/s
BenchmarkReadAllDeltaScalar/Count_1e1-12 36277255 33.27 ns/op 1202.44 MB/s
BenchmarkReadAllDeltaScalar/Count_1e2-12 4936993 244.4 ns/op 1636.89 MB/s
BenchmarkReadAllDeltaScalar/Count_1e3-12 505640 2387 ns/op 1676.06 MB/s
BenchmarkReadAllDeltaScalar/Count_1e4-12 50539 23712 ns/op 1686.91 MB/s
BenchmarkReadAllDeltaScalar/Count_1e5-12 3604 332550 ns/op 1202.83 MB/s
BenchmarkReadAllDeltaScalar/Count_1e6-12 523 2346140 ns/op 1704.93 MB/s
BenchmarkReadAllDeltaScalar/Count_1e7-12 48 24650424 ns/op 1622.69 MB/s
BenchmarkReadAllVarint/Count_1e0-12 150280795 7.915 ns/op 505.35 MB/s
BenchmarkReadAllVarint/Count_1e1-12 21280968 56.51 ns/op 707.81 MB/s
BenchmarkReadAllVarint/Count_1e2-12 1987202 605.3 ns/op 660.86 MB/s
BenchmarkReadAllVarint/Count_1e3-12 209826 5730 ns/op 698.03 MB/s
BenchmarkReadAllVarint/Count_1e4-12 21142 56938 ns/op 702.51 MB/s
BenchmarkReadAllVarint/Count_1e5-12 2102 570193 ns/op 701.52 MB/s
BenchmarkReadAllVarint/Count_1e6-12 210 5751015 ns/op 695.53 MB/s
BenchmarkReadAllVarint/Count_1e7-12 20 57857580 ns/op 691.35 MB/s
BenchmarkReadAllDeltaVarint/Count_1e0-12 126817252 9.398 ns/op 425.62 MB/s
BenchmarkReadAllDeltaVarint/Count_1e1-12 18339328 64.93 ns/op 616.06 MB/s
BenchmarkReadAllDeltaVarint/Count_1e2-12 2222472 539.6 ns/op 741.34 MB/s
BenchmarkReadAllDeltaVarint/Count_1e3-12 210583 5694 ns/op 702.50 MB/s
BenchmarkReadAllDeltaVarint/Count_1e4-12 22628 53472 ns/op 748.06 MB/s
BenchmarkReadAllDeltaVarint/Count_1e5-12 2401 498257 ns/op 802.80 MB/s
BenchmarkReadAllDeltaVarint/Count_1e6-12 229 5231904 ns/op 764.54 MB/s
BenchmarkReadAllDeltaVarint/Count_1e7-12 27 41749967 ns/op 958.08 MB/s
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg/stream/writer
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
BenchmarkWriteAllFast/Count_1e0-12 54104296 22.19 ns/op 180.29 MB/s
BenchmarkWriteAllFast/Count_1e1-12 28924809 41.65 ns/op 960.39 MB/s
BenchmarkWriteAllFast/Count_1e2-12 7463600 157.7 ns/op 2537.13 MB/s
BenchmarkWriteAllFast/Count_1e3-12 996226 1195 ns/op 3347.49 MB/s
BenchmarkWriteAllFast/Count_1e4-12 104774 11201 ns/op 3571.20 MB/s
BenchmarkWriteAllFast/Count_1e5-12 10000 105197 ns/op 3802.39 MB/s
BenchmarkWriteAllFast/Count_1e6-12 928 1161927 ns/op 3442.56 MB/s
BenchmarkWriteAllFast/Count_1e7-12 108 9784601 ns/op 4088.06 MB/s
BenchmarkWriteAllDeltaFast/Count_1e0-12 51806442 22.73 ns/op 175.99 MB/s
BenchmarkWriteAllDeltaFast/Count_1e1-12 27048574 44.66 ns/op 895.72 MB/s
BenchmarkWriteAllDeltaFast/Count_1e2-12 7029151 169.7 ns/op 2357.49 MB/s
BenchmarkWriteAllDeltaFast/Count_1e3-12 911277 1309 ns/op 3055.99 MB/s
BenchmarkWriteAllDeltaFast/Count_1e4-12 95408 12211 ns/op 3275.75 MB/s
BenchmarkWriteAllDeltaFast/Count_1e5-12 10000 117217 ns/op 3412.48 MB/s
BenchmarkWriteAllDeltaFast/Count_1e6-12 1012 1179229 ns/op 3392.05 MB/s
BenchmarkWriteAllDeltaFast/Count_1e7-12 100 10847902 ns/op 3687.35 MB/s
BenchmarkWriteAllScalar/Count_1e0-12 55363352 21.93 ns/op 182.43 MB/s
BenchmarkWriteAllScalar/Count_1e1-12 18490597 64.52 ns/op 619.95 MB/s
BenchmarkWriteAllScalar/Count_1e2-12 2741785 430.3 ns/op 929.55 MB/s
BenchmarkWriteAllScalar/Count_1e3-12 292184 4148 ns/op 964.42 MB/s
BenchmarkWriteAllScalar/Count_1e4-12 29408 40316 ns/op 992.17 MB/s
BenchmarkWriteAllScalar/Count_1e5-12 3003 396536 ns/op 1008.73 MB/s
BenchmarkWriteAllScalar/Count_1e6-12 301 3961262 ns/op 1009.78 MB/s
BenchmarkWriteAllScalar/Count_1e7-12 31 40377081 ns/op 990.66 MB/s
BenchmarkWriteAllDeltaScalar/Count_1e0-12 55407040 21.84 ns/op 183.13 MB/s
BenchmarkWriteAllDeltaScalar/Count_1e1-12 17963426 66.39 ns/op 602.48 MB/s
BenchmarkWriteAllDeltaScalar/Count_1e2-12 2733224 436.3 ns/op 916.73 MB/s
BenchmarkWriteAllDeltaScalar/Count_1e3-12 292345 4186 ns/op 955.60 MB/s
BenchmarkWriteAllDeltaScalar/Count_1e4-12 29602 40638 ns/op 984.31 MB/s
BenchmarkWriteAllDeltaScalar/Count_1e5-12 2341 514412 ns/op 777.59 MB/s
BenchmarkWriteAllDeltaScalar/Count_1e6-12 309 3868614 ns/op 1033.96 MB/s
BenchmarkWriteAllDeltaScalar/Count_1e7-12 30 40247200 ns/op 993.86 MB/s
BenchmarkWriteAllVarint/Count_1e0-12 269642071 4.377 ns/op 913.77 MB/s
BenchmarkWriteAllVarint/Count_1e1-12 44656426 27.13 ns/op 1474.39 MB/s
BenchmarkWriteAllVarint/Count_1e2-12 5195884 230.9 ns/op 1732.13 MB/s
BenchmarkWriteAllVarint/Count_1e3-12 503438 2391 ns/op 1673.27 MB/s
BenchmarkWriteAllVarint/Count_1e4-12 51032 23717 ns/op 1686.59 MB/s
BenchmarkWriteAllVarint/Count_1e5-12 5292 230926 ns/op 1732.16 MB/s
BenchmarkWriteAllVarint/Count_1e6-12 522 2306455 ns/op 1734.26 MB/s
BenchmarkWriteAllVarint/Count_1e7-12 51 23197725 ns/op 1724.31 MB/s
BenchmarkWriteAllDeltaVarint/Count_1e0-12 194046319 6.211 ns/op 643.98 MB/s
BenchmarkWriteAllDeltaVarint/Count_1e1-12 44605569 26.84 ns/op 1490.52 MB/s
BenchmarkWriteAllDeltaVarint/Count_1e2-12 5672008 211.6 ns/op 1890.47 MB/s
BenchmarkWriteAllDeltaVarint/Count_1e3-12 552733 2182 ns/op 1833.22 MB/s
BenchmarkWriteAllDeltaVarint/Count_1e4-12 53224 22575 ns/op 1771.88 MB/s
BenchmarkWriteAllDeltaVarint/Count_1e5-12 5754 209450 ns/op 1909.76 MB/s
BenchmarkWriteAllDeltaVarint/Count_1e6-12 504 2362129 ns/op 1693.39 MB/s
BenchmarkWriteAllDeltaVarint/Count_1e7-12 70 16908016 ns/op 2365.74 MB/s
A note on the benchmarks: An array of random uint32's is generated and then encoded/decoded over and over again. An attempt is made to ensure that some of these benchmarks reflect the most probable real world performance metrics.
Stream VByte uses the same underlying format as Google's Group Varint approach. Lemire et al. wanted to see if there was a way to improve the performance even more and introduced a clever twist to enable better performance via SIMD techniques. The basic goal of the Group Varint format is to be able to achieve similar compression characteristics as the VByte format for integers and also be able to load and process them really quickly.
VByte format
The insight that backs the VByte encoding is noticing that you oftentimes don't need 32 bits to encode a 32-bit integer. Take for example an unsigned integer that is less than 2^8 (256). This integer will have bits set in the lowest byte of a 32-bit integer, while the remaining 3 bytes will simply be zeros.
111 in binary:
00000000 00000000 00000000 01101111
An approach you can take to compress this integer is to encode the integer using a variable number of bytes. For example, you can use the lower 7 bits to store data, i.e. bits from the original integer, and then use the MSB as a continuation bit. If the MSB bit is on, i.e. is 1, then more bytes are needed to decode this particular integer. Below is an example where you might need 2 bytes to store the number 1234.
1234 in binary:
00000000 00000000 00000100 11010010
Num compressed:
v v Continuation bits
0|0001001| 1|1010010|
^ ^ Data bits
If you want to decode this integer, you simply build up the number iteratively. I.e. you OR the last 7 bits of every byte shifted to the appropriate length to your 32-bit integer until you find a byte that doesn't have a continuation bit set. Note that this works the same for 64-bit numbers.
The problem with this approach is that it can introduce a lot of branch mis-predictions during encoding/decoding. During the decoding phase, you don't know ahead of time the number of bytes that were used to encode the integer you are currently processing and so you need to iterate until you find a byte without a continuation bit on. If you have integers that are nonuniform, i.e. integers that require random numbers of bytes to encode relative to one another, this can pose a challenge to the processor's branch predictor. These mis-predictions can cause major slowdowns in processor pipelines and so was born the Group Varint format.
Group Varint format
The Group Varint (varint-GB) format assumes that everything you hope to achieve, you can do with 32-bit integers. It introduces the concept of a control byte which is simply a byte that stores the encoded lengths of a group of 4 32-bit integers, hence Group Varint. 32-bit integers only require up to 4 bytes to properly encode. This means that you can represent their lengths with 2 bits using a zero-indexed length i.e. 0, 1, 2, and 3 to represent integers that require 1, 2, 3 and 4 bytes to encode, respectively.
00000000 00000000 00000000 01101111 = 111
00000000 00000000 00000100 11010010 = 1234
00000000 00001100 00001010 10000011 = 789123
01000000 00000000 00000000 00000000 = 1073741824
Num Len 2-bit control
----------------------------------
111 1 0b00
1234 2 0b01
789123 3 0b10
1073741824 4 0b11
Final Control byte
0b11100100
Encoded data (little endian right-to-left bottom-to-top)
0b01000000 0b00000000 0b00000000 0b00000000 0b00001100
0b00001010 0b10000011 0b00000100 0b11010010 0b01101111
You can then prefix every group of 4 encoded 32-bit integers with their control byte and then use it during decoding. The obvious downside is that you pay a storage cost of one byte for every 4 integers you want to encode. For 2^20 encoded integers, that's an extra 256 KB of extra space: totally marginal. The great upside, though, is that you've now removed almost all branches from your decoding phase. You know exactly how many data bytes you need to read from a buffer for a particular number and then can use branchless decoding.
package foo
import (
"encoding/binary"
)
func decodeOne(input []byte, size uint8) uint32 {
buf := make([]byte, 4)
copy(buf, input[:size])
// func (littleEndian) Uint32(b []byte) uint32 {
// _ = b[3]
// return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
// }
return binary.LittleEndian.Uint32(buf)
}
func main() {
ctrl := uint8(0b11_10_01_00)
data := []byte{
0b01101111, 0b11010010, 0b00000100,
0b10000011, 0b00001010, 0b00001100,
0b00000000, 0b00000000, 0b00000000,
0b01000000,
}
len0 := (ctrl & 3) + 1 // 1
len1 := (ctrl >> 2 & 3) + 1 // 2
len2 := (ctrl >> 4 & 3) + 1 // 3
len3 := (ctrl >> 6 & 3) + 1 // 4
_ = decodeOne(data, len0) // 111
_ = decodeOne(data[len0:], len1) // 1234
_ = decodeOne(data[len0+len1:], len2) // 789_123
_ = decodeOne(data[len0+len1+len2:], len3) // 1_073_741_824
}
Stream VByte format
Unfortunately, accelerating decoding of the varint-GB format with only SIMD techniques has proven unsuccessful. The below excerpt from the paper outlines why.
To understand why it might be difficult to accelerate the decoding of data compressed in the VARINT-GB format compared to the VARINT-G8IU format, consider that we cannot decode faster than we can access the control bytes. In VARINT-G8IU, the control bytes are conveniently always located nine compressed bytes apart. Thus while a control byte is being processed, or even before, our superscalar processor can load and start processing upcoming control bytes, as their locations are predictable. Instructions depending on these control bytes can be reordered by the processor for best performance. However, in the VARINT-GB format, there is a strong data dependency: the location of the next control byte depends on the current control byte. This increases the risk that the processor remains underutilized, delayed by the latency between issuing the load for the next control byte and waiting for it to be ready.
Additionally, they prove that decoding 4 integers at a time using 128-bit registers is faster than trying to decode a variable number of integers that fit into an 8-byte register, i.e. the varint-G8IU approach.
SIMD control byte generation algorithm
Lemire et al. have devised a brilliant SIMD algorithm for simultaneously generating two control bytes for a group of 8 integers. The best way to understand this algorithm is to understand how it works on a single integer and then assume it works in a vectorized form (it does). Going forward we'll use control bits stream to represent these control bytes we are building.
00000000 00000000 00000100 11010010 // 1234
Let's take one of the previous integers that we were looking at, 1234
, and walk through an example
of how the 2-bit control is generated for it using SIMD techniques. The goal is to be able to, for
any 32-bit integer, generate a 2-bit zero indexed length value. For example, if you have an integer
that requires 2 bytes to be encoded, we want for the algorithm to generate 0b01
.
00000000 00000000 00000100 11010010 // 1234
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // byte-min(1234, 0x0101)
00000000 00000000 00000001 00000001
The algorithm first uses a mask where every byte is equal to 1. If you perform a per-byte min operation on our integer and the 1's mask, the result will have a 1 at every byte that had a value in the original integer.
00000000 00000000 00000001 00000001
----------------------------------- // pack unsigned saturating 16-bit to 8-bit
00000000 00000000 00000000 11111111
Now you perform a 16-bit to 8-bit unsigned saturating pack operation. Practically this means that you're taking every 16-bit value and trying to shove that into 8 bits. If the 16-bit integer is larger than the largest unsigned integer 8 bits can support, the pack saturates to the largest unsigned 8-bit value.
Why this is performed will become more clear in the subsequent steps, however, at a high level, for every
integer you want to encode, you want for the MSB of two consecutive bytes in the control bits stream
to be representative of the final 2-bit control. For example, if you have a 3-byte integer, you want the
MSB of two consecutive bytes to be 1 and 0, in that order. The reason you would want this is that
there is a vector pack instruction that takes the MSB from every byte in the control bits stream
and packs it into the lowest byte. This would thus represent the value 0b10
in the final byte for
this 3-byte integer, which is what we want.
Performing a 16-bit to 8-bit unsigned saturating pack has the effect that you can use the saturation behavior to conditionally turn on the MSB of these bytes depending on which bytes have values in the original 32-bit integer.
00000000 00000000 00000000 11111111 // control bits stream
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // signed 16-bit min
00000000 00000000 00000000 11111111
We then take the 1's mask we used before and perform a signed 16-bit min operation. The reason for this is more clear if you look at an example using a 3-byte integer.
00000000 00001100 00001010 10000011 // 789123
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // byte-min(789123, 0x0101)
00000000 00000001 00000001 00000001
----------------------------------- // pack unsigned saturating 16-bit to 8-bit
00000000 00000000 00000001 11111111
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // signed 16-bit min
00000000 00000000 00000001 00000001
The signed 16-bit min operation has three important effects.
First, for 3-byte integers, it has the effect of turning off the MSB of the lowest byte. This is necessary
because a 3-byte integer should have a 2-bit control that is 0b10
and without this step using the MSB pack
operation would result in a 2-bit control that looks something like 0b_1
, where the lowest bit is on.
Obviously this is wrong, since only integers that require 2 or 4 bytes to encode should have that lower bit
on, i.e. 1 or 3 as a zero-indexed length.
Second, for 4-byte integers, the signed aspect has the effect of leaving both MSBs of the 2 bytes on. When using the
MSB pack operation later on, it will result in a 2-bit control value of 0b11
, which is what we want.
Third, for 1 and 2 byte integers, it has no effect. This is great for 2-byte values since the MSB will remain on and 1 byte values will not have any MSB on anyways, so it is effectively a noop in both scenarios.
00000000 00000000 00000000 11111111 // control bits stream (original 1234)
01111111 00000000 01111111 00000000 // 0x7F00 mask
----------------------------------- // add unsigned saturating 16-bit
01111111 00000000 01111111 11111111
Next, we take a mask with the value 0x7F00
and perform an unsigned saturating add to the control bits stream.
In the case for the integer 1234
this has no real effect. We maintain the MSB in the lowest byte. You'll note,
however, that the only byte that has its MSB on is the last one, so performing an MSB pack operation would result
in a value of 0b0001
, which is what we want. An example of this step on the integer 789123
might paint a clearer
picture.
00000000 00000000 00000001 00000001 // control bits stream (789123)
01111111 00000000 01111111 00000000 // 0x7F00 mask
----------------------------------- // add unsigned saturating 16-bit
01111111 00000000 11111111 00000001
You'll note here that the addition of 0x01
with 0x7F
in the upper byte results in the MSB of the resulting upper
byte turning on. The MSB in the lower byte remains off and now an MSB pack operation will resolve to 0b0010
,
which is what we want. The unsigned saturation behavior is really important for 4-byte numbers that only have
bits in the most significant byte on. An example below:
01000000 00000000 00000000 00000000 // 1073741824
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // byte-min(1073741824, 0x0101)
00000001 00000000 00000000 00000000
----------------------------------- // pack unsigned saturating 16-bit to 8-bit
00000000 00000000 11111111 00000000
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // signed 16-bit min
00000000 00000000 11111111 00000000
01111111 00000000 01111111 00000000 // 0x7F00 mask
----------------------------------- // add unsigned saturating 16-bit
01111111 00000000 11111111 11111111
Note here that because only the upper byte had a value in it, the lowest byte in the control bits stream remains
zero for the duration of the algorithm. This poses an issue, since for a 4-byte value, we want for the 2-bit
control to result in a value of 0b11
. Performing a 16-bit unsigned saturating addition has the effect of
turning on all bits in the lower byte, and thus we get a result with the MSB in the lower byte on.
01111111 00000000 11111111 00000001 // control bits stream (789123)
----------------------------------- // move byte mask
00000000 00000000 00000000 00000010 // 2-bit control
The final move byte mask is performed on the control bits stream, and we now have the result we wanted. Now that you see that this works for 1 integer, you know how it can work for 8 integers simultaneously, since we use vector instructions that operate on 128 bit registers.
SIMD integer packing/unpacking
The next problem to be solved is how to take a group of 4 integers, and compress it by removing extraneous/unused bytes so that all you're left with is a stream of data bytes with real information. Let's take two numbers from our examples above.
789123 1234
00000000 00001100 00001010 10000011 | 00000000 00000000 00000100 11010010
-------------------------------------------------------------------------
00001100 00001010 10000011 00000100 11010010 // packed
Here, we can use a shuffle operation. Vector shuffle operations rearrange the bytes in an input register according to some provided mask into a destination register. Every position in the mask stores an offset into the source vector stream that represents the data byte that should go into that position.
input [1234, 789123] (little endian R-to-L)
00000000 00001100 00001010 10000011 00000000 00000000 00000100 11010010
| | | | |
| | |____________________ | |
| |_____________________ | | |
|____________________ | | | |
v v v v v
0xff 0xff 0xff 0x06 0x05 0x04 0x01 0x00 // mask in hex
-----------------------------------------------------------------------
00000000 00000000 00000000 00001100 00001010 10000011 00000100 11010010 // packed
We keep a prebuilt lookup table that contains a mapping from control byte to the necessary mask and simply load that after we construct the control byte above. In addition, we keep a lookup table for a mapping from control bytes to total encoded length. This allows us to know by how much to increment the output pointer and overwrite, for example, the redundant upper 3 bytes in the above shuffle example.
Unpacking during decoding is the same as the above, but in reverse. We need to go from a packed format
to an unpacked memory format. We keep lookup tables to maintain a mapping from control byte to the reverse
shuffle mask, and then perform a shuffle operation to output to an uint32
array.
References
Directories ¶
Path | Synopsis |
---|---|
pkg
|
|
decode
Package decode provides an x86_64 implementation of two Stream VByte decoding algorithms, a normal decoding approach and one that incorporates differential coding.
|
Package decode provides an x86_64 implementation of two Stream VByte decoding algorithms, a normal decoding approach and one that incorporates differential coding. |
encode
Package encode provides an x86_64 implementation of two Stream VByte encoding algorithms, a normal encoding approach and one that incorporates differential coding.
|
Package encode provides an x86_64 implementation of two Stream VByte encoding algorithms, a normal encoding approach and one that incorporates differential coding. |