compress
This package is based on an optimized Deflate function, which is used by gzip/zip/zlib packages.
It offers slightly better compression at lower compression settings, and up to 3x faster encoding at highest compression level.
changelog
- Apr 2, 2018: Added huff0 en/decoder. Experimental for now, API may change.
- Mar 4, 2018: Added FSE Entropy en/decoder. Experimental for now, API may change.
- Nov 3, 2017: Add compression Estimate function.
- May 28, 2017: Reduce allocations when resetting decoder.
- Apr 02, 2017: Change back to official crc32, since changes were merged in Go 1.7.
- Jan 14, 2017: Reduce stack pressure due to array copies. See Issue #18625.
- Oct 25, 2016: Level 2-4 have been rewritten and now offers significantly better performance than before.
- Oct 20, 2016: Port zlib changes from Go 1.7 to fix zlib writer issue. Please update.
- Oct 16, 2016: Go 1.7 changes merged. Apples to apples this package is a few percent faster, but has a significantly better balance between speed and compression per level.
- Mar 24, 2016: Always attempt Huffman encoding on level 4-7. This improves base 64 encoded data compression.
- Mar 24, 2016: Small speedup for level 1-3.
- Feb 19, 2016: Faster bit writer, level -2 is 15% faster, level 1 is 4% faster.
- Feb 19, 2016: Handle small payloads faster in level 1-3.
- Feb 19, 2016: Added faster level 2 + 3 compression modes.
- Feb 19, 2016: Rebalanced compression levels, so there is a more even progresssion in terms of compression. New default level is 5.
- Feb 14, 2016: Snappy: Merge upstream changes.
- Feb 14, 2016: Snappy: Fix aggressive skipping.
- Feb 14, 2016: Snappy: Update benchmark.
- Feb 13, 2016: Deflate: Fixed assembler problem that could lead to sub-optimal compression.
- Feb 12, 2016: Snappy: Added AMD64 SSE 4.2 optimizations to matching, which makes easy to compress material run faster. Typical speedup is around 25%.
- Feb 9, 2016: Added Snappy package fork. This version is 5-7% faster, much more on hard to compress content.
- Jan 30, 2016: Optimize level 1 to 3 by not considering static dictionary or storing uncompressed. ~4-5% speedup.
- Jan 16, 2016: Optimization on deflate level 1,2,3 compression.
- Jan 8 2016: Merge CL 18317: fix reading, writing of zip64 archives.
- Dec 8 2015: Make level 1 and -2 deterministic even if write size differs.
- Dec 8 2015: Split encoding functions, so hashing and matching can potentially be inlined. 1-3% faster on AMD64. 5% faster on other platforms.
- Dec 8 2015: Fixed rare one byte out-of bounds read. Please update!
- Nov 23 2015: Optimization on token writer. ~2-4% faster. Contributed by @dsnet.
- Nov 20 2015: Small optimization to bit writer on 64 bit systems.
- Nov 17 2015: Fixed out-of-bound errors if the underlying Writer returned an error. See #15.
- Nov 12 2015: Added io.WriterTo support to gzip/inflate.
- Nov 11 2015: Merged CL 16669: archive/zip: enable overriding (de)compressors per file
- Oct 15 2015: Added skipping on uncompressible data. Random data speed up >5x.
usage
The packages are drop-in replacements for standard libraries. Simply replace the import path to use them:
old import |
new import |
compress/gzip |
github.com/klauspost/compress/gzip |
compress/zlib |
github.com/klauspost/compress/zlib |
archive/zip |
github.com/klauspost/compress/zip |
compress/flate |
github.com/klauspost/compress/flate |
You may also be interested in pgzip, which is a drop in replacement for gzip, which support multithreaded compression on big files and the optimized crc32 package used by these packages.
The packages contains the same as the standard library, so you can use the godoc for that: gzip, zip, zlib, flate.
Currently there is only minor speedup on decompression (mostly CRC32 calculation).
deflate optimizations
- Minimum matches are 4 bytes, this leads to fewer searches and better compression. (In Go 1.7)
- Stronger hash (iSCSI CRC32) for matches on x64 with SSE 4.2 support. This leads to fewer hash collisions. (Go 1.7 also has improved hashes)
- Literal byte matching using SSE 4.2 for faster match comparisons. (not in Go)
- Bulk hashing on matches. (In Go 1.7)
- Much faster dictionary indexing with
NewWriterDict()
/Reset()
. (In Go 1.7)
- Make Bit Coder faster by assuming we are on a 64 bit CPU. (In Go 1.7)
- Level 1 compression replaced by converted "Snappy" algorithm. (In Go 1.7)
- Uncompressible content is detected and skipped faster. (Only in BestSpeed in Go)
- A lot of branching eliminated by having two encoders for levels 4-6 and 7-9. (not in Go)
- All heap memory allocations eliminated. (In Go 1.7)
benchmark old ns/op new ns/op delta
BenchmarkEncodeDigitsSpeed1e4-4 554029 265175 -52.14%
BenchmarkEncodeDigitsSpeed1e5-4 3908558 2416595 -38.17%
BenchmarkEncodeDigitsSpeed1e6-4 37546692 24875330 -33.75%
BenchmarkEncodeDigitsDefault1e4-4 781510 486322 -37.77%
BenchmarkEncodeDigitsDefault1e5-4 15530248 6740175 -56.60%
BenchmarkEncodeDigitsDefault1e6-4 174915710 76498625 -56.27%
BenchmarkEncodeDigitsCompress1e4-4 769995 485652 -36.93%
BenchmarkEncodeDigitsCompress1e5-4 15450113 6929589 -55.15%
BenchmarkEncodeDigitsCompress1e6-4 175114660 73348495 -58.11%
BenchmarkEncodeTwainSpeed1e4-4 560122 275977 -50.73%
BenchmarkEncodeTwainSpeed1e5-4 3740978 2506095 -33.01%
BenchmarkEncodeTwainSpeed1e6-4 35542802 21904440 -38.37%
BenchmarkEncodeTwainDefault1e4-4 828534 549026 -33.74%
BenchmarkEncodeTwainDefault1e5-4 13667153 7528455 -44.92%
BenchmarkEncodeTwainDefault1e6-4 141191770 79952170 -43.37%
BenchmarkEncodeTwainCompress1e4-4 830050 545694 -34.26%
BenchmarkEncodeTwainCompress1e5-4 16620852 8460600 -49.10%
BenchmarkEncodeTwainCompress1e6-4 193326820 90808750 -53.03%
benchmark old MB/s new MB/s speedup
BenchmarkEncodeDigitsSpeed1e4-4 18.05 37.71 2.09x
BenchmarkEncodeDigitsSpeed1e5-4 25.58 41.38 1.62x
BenchmarkEncodeDigitsSpeed1e6-4 26.63 40.20 1.51x
BenchmarkEncodeDigitsDefault1e4-4 12.80 20.56 1.61x
BenchmarkEncodeDigitsDefault1e5-4 6.44 14.84 2.30x
BenchmarkEncodeDigitsDefault1e6-4 5.72 13.07 2.28x
BenchmarkEncodeDigitsCompress1e4-4 12.99 20.59 1.59x
BenchmarkEncodeDigitsCompress1e5-4 6.47 14.43 2.23x
BenchmarkEncodeDigitsCompress1e6-4 5.71 13.63 2.39x
BenchmarkEncodeTwainSpeed1e4-4 17.85 36.23 2.03x
BenchmarkEncodeTwainSpeed1e5-4 26.73 39.90 1.49x
BenchmarkEncodeTwainSpeed1e6-4 28.14 45.65 1.62x
BenchmarkEncodeTwainDefault1e4-4 12.07 18.21 1.51x
BenchmarkEncodeTwainDefault1e5-4 7.32 13.28 1.81x
BenchmarkEncodeTwainDefault1e6-4 7.08 12.51 1.77x
BenchmarkEncodeTwainCompress1e4-4 12.05 18.33 1.52x
BenchmarkEncodeTwainCompress1e5-4 6.02 11.82 1.96x
BenchmarkEncodeTwainCompress1e6-4 5.17 11.01 2.13x
- "Speed" is compression level 1
- "Default" is compression level 6
- "Compress" is compression level 9
- Test files are Digits (no matches) and Twain (plain text) .
As can be seen it shows a very good speedup all across the line.
Twain
is a much more realistic benchmark, and will be closer to JSON/HTML performance. Here speed is equivalent or faster, up to 2 times.
Without assembly. This is what you can expect on systems that does not have amd64 and SSE 4:
benchmark old ns/op new ns/op delta
BenchmarkEncodeDigitsSpeed1e4-4 554029 249558 -54.96%
BenchmarkEncodeDigitsSpeed1e5-4 3908558 2295216 -41.28%
BenchmarkEncodeDigitsSpeed1e6-4 37546692 22594905 -39.82%
BenchmarkEncodeDigitsDefault1e4-4 781510 579850 -25.80%
BenchmarkEncodeDigitsDefault1e5-4 15530248 10096561 -34.99%
BenchmarkEncodeDigitsDefault1e6-4 174915710 111470780 -36.27%
BenchmarkEncodeDigitsCompress1e4-4 769995 579708 -24.71%
BenchmarkEncodeDigitsCompress1e5-4 15450113 10266373 -33.55%
BenchmarkEncodeDigitsCompress1e6-4 175114660 110170120 -37.09%
BenchmarkEncodeTwainSpeed1e4-4 560122 260679 -53.46%
BenchmarkEncodeTwainSpeed1e5-4 3740978 2097372 -43.94%
BenchmarkEncodeTwainSpeed1e6-4 35542802 20353449 -42.74%
BenchmarkEncodeTwainDefault1e4-4 828534 646016 -22.03%
BenchmarkEncodeTwainDefault1e5-4 13667153 10056369 -26.42%
BenchmarkEncodeTwainDefault1e6-4 141191770 105268770 -25.44%
BenchmarkEncodeTwainCompress1e4-4 830050 642401 -22.61%
BenchmarkEncodeTwainCompress1e5-4 16620852 11157081 -32.87%
BenchmarkEncodeTwainCompress1e6-4 193326820 121780770 -37.01%
benchmark old MB/s new MB/s speedup
BenchmarkEncodeDigitsSpeed1e4-4 18.05 40.07 2.22x
BenchmarkEncodeDigitsSpeed1e5-4 25.58 43.57 1.70x
BenchmarkEncodeDigitsSpeed1e6-4 26.63 44.26 1.66x
BenchmarkEncodeDigitsDefault1e4-4 12.80 17.25 1.35x
BenchmarkEncodeDigitsDefault1e5-4 6.44 9.90 1.54x
BenchmarkEncodeDigitsDefault1e6-4 5.72 8.97 1.57x
BenchmarkEncodeDigitsCompress1e4-4 12.99 17.25 1.33x
BenchmarkEncodeDigitsCompress1e5-4 6.47 9.74 1.51x
BenchmarkEncodeDigitsCompress1e6-4 5.71 9.08 1.59x
BenchmarkEncodeTwainSpeed1e4-4 17.85 38.36 2.15x
BenchmarkEncodeTwainSpeed1e5-4 26.73 47.68 1.78x
BenchmarkEncodeTwainSpeed1e6-4 28.14 49.13 1.75x
BenchmarkEncodeTwainDefault1e4-4 12.07 15.48 1.28x
BenchmarkEncodeTwainDefault1e5-4 7.32 9.94 1.36x
BenchmarkEncodeTwainDefault1e6-4 7.08 9.50 1.34x
BenchmarkEncodeTwainCompress1e4-4 12.05 15.57 1.29x
BenchmarkEncodeTwainCompress1e5-4 6.02 8.96 1.49x
BenchmarkEncodeTwainCompress1e6-4 5.17 8.21 1.59x
So even without the assembly optimizations there is a general speedup across the board.
level 1-3 "snappy" compression
Levels 1 "Best Speed", 2 and 3 are completely replaced by a converted version of the algorithm found in Snappy, modified to be fully
compatible with the deflate bitstream (and thus still compatible with all existing zlib/gzip libraries and tools).
This version is considerably faster than the "old" deflate at level 1. It does however come at a compression loss, usually in the order of 3-4% compared to the old level 1. However, the speed is usually 1.75 times that of the fastest deflate mode.
In my previous experiments the most common case for "level 1" was that it provided no significant speedup, only lower compression compared to level 2 and sometimes even 3. However, the modified Snappy algorithm provides a very good sweet spot. Usually about 75% faster and with only little compression loss. Therefore I decided to replace level 1 with this mode entirely.
Input is split into blocks of 64kb of, and they are encoded independently (no backreferences across blocks) for the best speed. Contrary to Snappy the output is entropy-encoded, so you will almost always see better compression than Snappy. But Snappy is still about twice as fast as Snappy in deflate mode.
Level 2 and 3 have also been replaced. Level 2 is capable is matching between blocks and level 3 checks up to two hashes for matches it will try.
compression levels
This table shows the compression at each level, and the percentage of the output size compared to output
at the similar level with the standard library. Compression data is Twain
, see above.
(Not up-to-date after rebalancing)
Level |
Bytes |
% size |
1 |
194622 |
103.7% |
2 |
174684 |
96.85% |
3 |
170301 |
98.45% |
4 |
165253 |
97.69% |
5 |
161274 |
98.65% |
6 |
160464 |
99.71% |
7 |
160304 |
99.87% |
8 |
160279 |
99.99% |
9 |
160279 |
99.99% |
To interpret and example, this version of deflate compresses input of 407287 bytes to 161274 bytes at level 5, which is 98.6% of the size of what the standard library produces; 161274 bytes.
This means that from level 4 you can expect a compression level increase of a few percent. Level 1 is about 3% worse, as descibed above.
linear time compression (huffman only)
This compression library adds a special compression level, named ConstantCompression
, which allows near linear time compression. This is done by completely disabling matching of previous data, and only reduce the number of bits to represent each character.
This means that often used characters, like 'e' and ' ' (space) in text use the fewest bits to represent, and rare characters like '¤' takes more bits to represent. For more information see wikipedia or this nice video.
Since this type of compression has much less variance, the compression speed is mostly unaffected by the input data, and is usually more than 180MB/s for a single core.
The downside is that the compression ratio is usually considerably worse than even the fastest conventional compression. The compression raio can never be better than 8:1 (12.5%).
The linear time compression can be used as a "better than nothing" mode, where you cannot risk the encoder to slow down on some content. For comparison, the size of the "Twain" text is 233460 bytes (+29% vs. level 1) and encode speed is 144MB/s (4.5x level 1). So in this case you trade a 30% size increase for a 4 times speedup.
For more information see my blog post on Fast Linear Time Compression.
This is implemented on Go 1.7 as "Huffman Only" mode, though not exposed for gzip.
gzip/zip optimizations
- Uses the faster deflate
- Uses SSE 4.2 CRC32 calculations.
Speed increase is up to 3x of the standard library, but usually around 2x.
This is close to a real world benchmark as you will get. A 2.3MB JSON file. (NOTE: not up-to-date)
benchmark old ns/op new ns/op delta
BenchmarkGzipL1-4 95212470 59938275 -37.05%
BenchmarkGzipL2-4 102069730 76349195 -25.20%
BenchmarkGzipL3-4 115472770 82492215 -28.56%
BenchmarkGzipL4-4 153197780 107570890 -29.78%
BenchmarkGzipL5-4 203930260 134387930 -34.10%
BenchmarkGzipL6-4 233172100 145495400 -37.60%
BenchmarkGzipL7-4 297190260 197926950 -33.40%
BenchmarkGzipL8-4 512819750 376244733 -26.63%
BenchmarkGzipL9-4 563366800 403266833 -28.42%
benchmark old MB/s new MB/s speedup
BenchmarkGzipL1-4 52.11 82.78 1.59x
BenchmarkGzipL2-4 48.61 64.99 1.34x
BenchmarkGzipL3-4 42.97 60.15 1.40x
BenchmarkGzipL4-4 32.39 46.13 1.42x
BenchmarkGzipL5-4 24.33 36.92 1.52x
BenchmarkGzipL6-4 21.28 34.10 1.60x
BenchmarkGzipL7-4 16.70 25.07 1.50x
BenchmarkGzipL8-4 9.68 13.19 1.36x
BenchmarkGzipL9-4 8.81 12.30 1.40x
Multithreaded compression using pgzip comparison, Quadcore, CPU = 8:
(Not updated, old numbers)
benchmark old ns/op new ns/op delta
BenchmarkGzipL1 96155500 25981486 -72.98%
BenchmarkGzipL2 101905830 24601408 -75.86%
BenchmarkGzipL3 113506490 26321506 -76.81%
BenchmarkGzipL4 143708220 31761818 -77.90%
BenchmarkGzipL5 188210770 39602266 -78.96%
BenchmarkGzipL6 209812000 40402313 -80.74%
BenchmarkGzipL7 270015440 56103210 -79.22%
BenchmarkGzipL8 461359700 91255220 -80.22%
BenchmarkGzipL9 498361833 88755075 -82.19%
benchmark old MB/s new MB/s speedup
BenchmarkGzipL1 51.60 190.97 3.70x
BenchmarkGzipL2 48.69 201.69 4.14x
BenchmarkGzipL3 43.71 188.51 4.31x
BenchmarkGzipL4 34.53 156.22 4.52x
BenchmarkGzipL5 26.36 125.29 4.75x
BenchmarkGzipL6 23.65 122.81 5.19x
BenchmarkGzipL7 18.38 88.44 4.81x
BenchmarkGzipL8 10.75 54.37 5.06x
BenchmarkGzipL9 9.96 55.90 5.61x
snappy package
The standard snappy package has now been improved. This repo contains a copy of the snappy repo.
I would advise to use the standard package: https://github.com/golang/snappy
license
This code is licensed under the same conditions as the original Go code. See LICENSE file.