In my post last week, I analysed some of the most common compression tools and formats, and their compression speed and ratio. While that gives us a good idea of the performance of those tools, the analysis would be incomplete without looking at decompression. This is particularly true for database backups: in cases where the compression is performed outside of the production boxes, you may not care too much about compression times, since even a relatively slow compression will not affect the performance of your MySQL server (or whatever you are using). The decompression time, however, can be critical, as in many cases it directly influences the MTTR of your whole system.
Testing environment
I used the same OpenStreetMap node MySQL dump in CSV format that I mentioned in my previous post and, as some tools use the same format (and they should be compatible) but produced different compression ratios, I chose the smallest resulting file for each of them. Here is a table with the compressed size per format, as a reminder:
| format | size (bytes) |
|--------|--------------|
| original .csv (no compression) | 3700635579 |
| gzip | 585756379 |
| bzip2 | 508276130 |
| bzip2 (pbzip2-compressed) | 508782016 |
| 7z | 354107250 |
| lzip | 354095395 |
| lzo | 782234410 |
| lz4 | 816582329 |
Please note that while p7zip and lzip/plzip use the same algorithm, the file format is different. Also note the use of two different compressed files for bzip2: the reason for that will be clarified later.
The hardware specs were the same as for the last post: an almost-idle Intel Quad Core i7-3770 @ 3.40GHz with hyper-threading, exposing 8 CPUs to the kernel; 32 GB of RAM; 2 spinning disks of 3 TB in RAID 1. The filesystem was ext4 with the default OS options. The operating system had changed to CentOS 7, however.
The methodology was similar to before: for each tool, the following command was executed several times:
$ time [tool] -d -c < nodes.csv.[format] > nodes.csv
Except for dd and 7za, where the syntax is slightly different.
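For reference, the variants for those two look roughly like this (the paths and the dd block size are illustrative assumptions, not necessarily the exact options used):

# dd: plain copy of the uncompressed file to another partition, used as a baseline
$ time dd if=nodes.csv of=/mnt/other/nodes.csv bs=1M
# 7za: the 'e' command extracts, and -so sends the output to stdout
$ time 7za e -so nodes.csv.7z > nodes.csv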
The final file was stored on a different partition of the same RAID. The output was checked for correctness (the uncompressed file was exactly the same as the original one) and deleted after every execution. I will not repeat here my disclaimer about the filesystem cache usage, but I also added the dd results as a reference.
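The correctness check itself is not shown above; a simple way to do it, assuming a pristine copy of the original is kept around (the path here is hypothetical), would be:

# compare the decompressed output byte by byte against the original copy
$ cmp nodes.csv /path/to/original/nodes.csv && echo "files are identical"
# remove the output before the next run
$ rm nodes.csv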
Global results
This is my table of final results; the analysis and discussion follow below:
| method | compressed size (bytes) | compression ratio | median decompression time (seconds) | decompression speed (MB/s) | %CPU usage |
|--------|-------------------------|-------------------|-------------------------------------|----------------------------|------------|
| dd | 3700635579 | 100.00% | 9.996 | 353.061 | 100 - 43 |
| gzip | 585756379 | 15.83% | 17.391 | 202.933 | 99 - 26 |
| bzip2 | 508276130 | 13.73% | 55.616 | 63.457 | 100 - 45 |
| pigz | 585756379 | 15.83% | 7.115 | 496.023 | 172 - 62 |
| pbzip2 (bzip2-compressed) | 508276130 | 13.73% | 50.760 | 69.527 | 170 - 64 |
| pbzip2 (pbzip2-compressed) | 508782016 | 13.75% | 9.904 | 356.341 | 794 - 185 |
| lzip | 354095395 | 9.57% | 38.682 | 91.236 | 100 - 47 |
| 7za | 354107250 | 9.57% | 28.070 | 125.729 | 157 - 95 |
| plzip | 354095395 | 9.57% | 19.791 | 178.324 | 345 - 177 |
| lzop | 782234410 | 21.14% | 6.094 | 579.127 | 136 - 47 |
| lz4 | 816582329 | 22.07% | 3.176 | 1111.209 | 100 - 78 |
These data can be seen more easily, as usual, on a two-dimensional graph. Here the X axis represents the median decompression speed in MB/s (more is better) and the Y axis represents the compression ratio, as a percentage of the compressed size over the original size (less is better):
(not plotted: dd, as it would appear with a 100% compression ratio).
CPU usage was polled every second, and so was the memory usage, which in no test for any of the tools went over 1 MB.
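The post does not say which tool was used for the polling; one possible way to collect similar numbers, assuming the sysstat package is installed, is pidstat:

# report CPU (-u) and memory (-r) usage every second for processes whose name contains "gzip"
$ pidstat -u -r -C gzip 1
# meanwhile, in another terminal, run the decompression being measured
$ time gzip -d -c < nodes.csv.gz > nodes.csv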
In this case I have plotted the function y = x*0.01+12 over the points and, while there is a clear tendency for better compression ratios to require more time to decompress, the correlation is weaker than in the compression case.
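If you want to reproduce the graph, a minimal gnuplot sketch would do; it assumes a hypothetical results.dat file with the decompression speed in the first column and the compression ratio in the second:

$ gnuplot -persist -e "set xlabel 'decompression speed (MB/s)'; set ylabel 'compression ratio (%)'; plot 'results.dat' using 1:2 with points title 'tools', x*0.01+12 title 'y = x*0.01+12'"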
The last thing I want to remark about the global results is that I have not tried variations in parameters for decompression, as in most cases there are few or no options for this process, and the algorithms will do essentially the same thing for a file that was created with --fast as for one created with --best.
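To illustrate the point with gzip (file names are just examples): the level only matters when the file is created; the decompression command is identical in both cases:

# two files created with different compression levels...
$ gzip --fast -c nodes.csv > nodes-fast.csv.gz
$ gzip --best -c nodes.csv > nodes-best.csv.gz
# ...are decompressed with the very same command
$ time gzip -d -c < nodes-fast.csv.gz > nodes.csv
$ time gzip -d -c < nodes-best.csv.gz > nodes.csv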
Decompressing gzip and bzip2 formats
Unsurprisingly, the gzip file took less time to decompress than the bzip2 one with the generic GNU tools (17 seconds vs. 56). I used GNU gzip 1.5 and bzip2 1.0.6. I have said everything I had to say about the advantages and disadvantages of using the most standard tools, so I will not repeat myself, but I want to reiterate that gzip is a great tool for fast compression processes when there is no alternative, as it reached a throughput of almost 203 MB/s when decompressing our test file.
Of course, the next step was testing decompressing in parallel, and for that I had pigz 2.3.1 and pbzip2 v1.1.6. As a side note, I would like to mention that, at the time of this writing, there were no rpm packages for pbzip2 for CentOS 7 in the base distribution nor on EPEL (which is currently in beta for version 7). I used the package for RHEL6.
However, when looking at the pigz results we can see that, while there is certainly an improvement in speed (just over 7 seconds), it is not as dramatic as the 4x+ improvement that we had for compression. Also, if we look at the CPU usage, the maximum %CPU barely goes over 170. I found the reason for that in the documentation: while pigz uses several threads for read and write I/O, it is unable to parallelise the core gzip decompression algorithm. The improvement over standard gzip is there, however, with almost 500 MB/s of decompression bandwidth.
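For reference, a sketch of the pigz run; -p is accepted on decompression too, but, as described in the documentation, the extra threads only help with reading, writing and check calculation:

$ time pigz -d -c -p 8 < nodes.csv.gz > nodes.csv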
When checking pbzip2, on my first try I realised that there was no parallelization at all, and that the timing results were essentially the same as with regular bzip2. I searched the documentation and found the reason: parallel decompression is possible (unlike with gzip), but only for files created by pbzip2 itself. In other words, both bzip2 and pbzip2 create files with a compatible format (each can decompress the other's output), but parallelization is only fully possible if the files are created and decompressed with pbzip2. To test that second case, I took the best-compressed file from my previous results (which was slightly larger than the one created with bzip2) and repeated the tests. That is why there are two different rows for pbzip2 in the global results.
In that second scenario, pbzip2 was a real improvement over bzip2, obtaining decompression rates of 356 MB/s, roughly equivalent to the results of a raw filesystem copy.
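A sketch of the two scenarios (file names are illustrative):

# file created with standard bzip2: pbzip2 reads it, but decompression stays single-threaded
$ time pbzip2 -d -c < nodes-bzip2.csv.bz2 > nodes.csv
# file created with pbzip2 itself: decompression can use all the processors (-p8)
$ pbzip2 -c nodes.csv > nodes-pbzip2.csv.bz2
$ time pbzip2 -d -c -p8 < nodes-pbzip2.csv.bz2 > nodes.csv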
As expected, multi-threaded decompression is a clear advantage on SMP systems, with the usual caveats about the extra resources consumed and the fact that, as just seen, it is not possible for all file formats.
LZMA decompression
The next group to test is the LZMA-based tools: lzip 1.7, p7zip 9.20 and plzip 1.2-rc2. Again, lzip was not available in EPEL-7, so the RHEL6 version was used, and plzip was compiled from source, as we had to do previously.
The LZMA algorithm was classified as slow but providing good compression in our previous results. A similar thing can be said for decompression: both lzip and 7za provide decompression times of around 30-40 seconds, with throughputs near 100 MB/s. Although p7zip seems to be a bit better parallelised than lzip (with %CPU usage reaching over 150), both provide an essentially single-threaded decompression algorithm. Plzip provides better parallelization, reaching a maximum %CPU of 345, but the throughput never reaches 200 MB/s.
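For reference, a sketch of the plzip invocation; -n sets the number of threads and, as far as I understand, parallel decompression also requires a multimember file like the ones plzip itself produces:

$ time plzip -d -c -n 8 < nodes.csv.lz > nodes.csv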
The general evaluation is that they are clearly better tools than single-threaded gzip and bzip2, as they provide similar decompression bandwidths but with much better compression ratios.
Fast tools: lzop and lz4
Finally, we are left with the fast compression and decompression tools, in our tests lzop v1.03 and lz4 r121. In this case we can confirm the claims that lz4, while providing compression speed similar to lzop, is faster for decompression: it almost doubles the rate (579 MB/s for lzop vs. 1111 MB/s for lz4). Obviously, the only reason those results are possible is that the filesystem cache is kicking in, so take these results with due precaution. But it shows what kind of decompression bandwidth can be achieved when disk latency is not the bottleneck.
When the test time is so short, I would recommend repeating it with larger file sizes and/or limiting the effect of the filesystem cache. I will leave that as homework for the reader.
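If you want to try the latter, the simplest way on Linux to limit the cache effect between runs is to flush and drop the page cache (this requires root privileges and affects the whole system, so do not do it on a busy server):

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches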
Conclusion
Aside from the limitations found in several of the tools regarding decompression parallelization (pigz, pbzip2), there were no highly surprising results. Fast compression tools are fast to decompress (I have become a fan of lz4) and best-compression tools are slower (plzip seems to work very well if we are not constrained by time or CPU). As usual, I will leave you with my recommendation to always test in your own environment, with your own files and machines.
Which compression tool(s) do you use for MySQL (or any other database backups)? Leave me a comment here or on Twitter.