Last updated: April 16, 2024
Along with the ZIP and contemporary 7-Zip formats, gzip is one of the most used compression formats and mechanisms.
In this short tutorial, we focus on xz for compressing and decompressing files in the Linux command line.
It’s fairly well known that ZIP is the standard cross-platform archiving tool and format. Similarly, gzip with tar is the standard archiving and compression tool in Linux. So, why use xz at all?
xz usually creates much smaller archives than gzip while accepting most of the same command-line options. Therefore, we can often treat xz as a drop-in replacement for gzip. We test the claim of smaller archives later.
The disadvantage of xz is that it doesn’t ship with all Linux distributions. Yet, we can install it with many native package managers such as yum and apt.
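As a minimal sketch, we can check whether xz is already available before relying on it. The package names in the hints (xz-utils on Debian-based systems, xz on RHEL-based systems) are the usual ones, but they're assumptions that may differ on other distributions:

```shell
#!/bin/sh
# Check whether xz is on the PATH; print its version if so, or an install hint if not.
if command -v xz >/dev/null 2>&1; then
    xz --version
    xz_status=present
else
    echo "xz not found. Typical install commands:" >&2
    echo "  sudo apt install xz-utils   # Debian, Ubuntu" >&2
    echo "  sudo yum install xz         # RHEL, CentOS" >&2
    xz_status=missing
fi
echo "xz is $xz_status"
```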
Let’s use xz to compress a single file.
Apart from the program name, the usage is identical to that of gzip:
$ xz -v data.csv
This command compresses the file data.csv and replaces it with the file data.csv.xz. The -v option makes xz display progress information.
xz supports the compression levels 0-9, which cover the 1-9 range of gzip and add level 0. The default compression level is 6. However, unlike gzip, that default compression level isn’t usually a good compromise between speed and compression ratio.
So, let’s compress a file with the low compression level 1:
$ xz -v1 data.csv
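To get a feel for how the level affects the result, here's a small sketch that compresses the same file at levels 1 and 6 and prints the sizes. The generated data.csv is only a hypothetical stand-in for a real file, and the actual numbers depend on the data:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Generate a compressible sample file (hypothetical stand-in for data.csv).
awk 'BEGIN { for (i = 0; i < 50000; i++) printf "%d,sensor-%d,%d\n", i, i % 7, (i * 3) % 101 }' > data.csv

# -c writes to standard output, so data.csv stays in place for both runs.
xz -1 -c data.csv > data.level1.xz
xz -6 -c data.csv > data.level6.xz

# Print the size in bytes of the original and of each compressed copy.
wc -c data.csv data.level1.xz data.level6.xz
```

Because of -c, the input file isn't replaced, which makes it easy to try several levels on the same data.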
Just as gzip comes with gunzip, xz comes with the unxz command, which is equivalent to xz -d.
Here, we use the -d option to decompress a single file:
$ xz -dv data.csv.xz
This decompresses the file data.csv.xz and replaces it with data.csv. Again, the -v option also displays progress information.
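The following sketch runs a full round trip on a tiny sample file and checks that the restored file matches the original. The file names and contents are made up for illustration; -t tests the integrity of the compressed file without writing any output:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Small sample file plus a copy to compare against later.
printf 'id,value\n1,alpha\n2,beta\n' > data.csv
cp data.csv original.csv

xz data.csv          # replaces data.csv with data.csv.xz
xz -t data.csv.xz    # tests integrity without writing any output
xz -d data.csv.xz    # restores data.csv and removes data.csv.xz

# Compare the restored file with the untouched copy.
roundtrip=failed
cmp data.csv original.csv && roundtrip=ok
echo "round trip: $roundtrip"
```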
Just like gzip, xz compresses each file separately and can’t bundle several files into one archive.
That’s why we usually leverage the tar archiving utility in combination with xz to compress multiple files or entire directories:
$ tar cJvf archive.tar.xz *.csv
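Let’s break down this command: c creates a new archive, J compresses the archive with xz, v lists each file as tar processes it, and f sets the name of the archive file.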
Notably, unlike xz and gzip, tar doesn’t delete the input files after it creates the archive.
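As a quick sketch, we can create an archive from two sample files, list its contents, and confirm that the inputs are still there. The file names are hypothetical, and the example assumes a tar build with xz support, such as GNU tar or bsdtar:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Two small sample inputs.
printf 'a,1\n' > first.csv
printf 'b,2\n' > second.csv

# c: create, J: compress with xz, f: archive file name.
tar cJf archive.tar.xz *.csv

# t lists the archive contents without extracting anything.
tar tJf archive.tar.xz

# The input files are still present next to the archive.
ls
```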
Which xz compression level does tar pick? It depends on the version of tar, but it’s usually the default compression level 6.
Still, tar enables setting the compression program through the --use-compress-program option. We use this option to set the compression level since it accepts command-line arguments. Here, we specify the low compression level 1:
$ tar cvf archive.tar.xz --use-compress-program='xz -1' *.csv
Notably, we remove the J option because --use-compress-program already sets the compression program.
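If a tar build doesn't accept arguments in the compression program string, a pipe is a common alternative. This is only a sketch with made-up file names: tar writes the uncompressed archive to standard output, and xz compresses that stream with the level we pick:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'a,1\n' > first.csv
printf 'b,2\n' > second.csv

# "f -" sends the uncompressed archive to standard output;
# xz -1 compresses the stream and the shell redirects it into the archive file.
tar cf - *.csv | xz -1 > archive.tar.xz

# Verify the result by listing the archive contents.
tar tJf archive.tar.xz
```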
Decompressing a tar archive with xz is also a single step and identical to gzip (except for the different file extension):
$ tar xvf archive.tar.xz
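Again, let’s see what each option does: x extracts the files from the archive, v lists each file as tar processes it, and f sets the name of the archive file.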
Again, the archive isn’t deleted after the operation. Notably, we don’t have to tell tar to decompress with xz as tar does this automatically by inspecting the file and detecting the xz compression.
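To illustrate, this sketch extracts an archive into a separate directory with -C and compares the result with the originals. It assumes a tar implementation that detects the compression automatically, such as GNU tar or bsdtar, and the file and directory names are hypothetical:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'a,1\n' > first.csv
printf 'b,2\n' > second.csv
tar cJf archive.tar.xz first.csv second.csv

# -C extracts into the given directory; no J needed, tar detects xz itself.
mkdir restored
tar xf archive.tar.xz -C restored

# The extracted copies match the originals, and the archive is still there.
cmp first.csv restored/first.csv
cmp second.csv restored/second.csv
ls archive.tar.xz restored
```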
Unlike gzip, xz supports multithreading directly, which speeds up compression.
In versions before 5.6.0, xz uses just a single thread by default, while newer versions default to multithreaded mode. We can specify the number of threads with the -T option. A value of 0 tells xz to use one thread for every available CPU core. That’s generally a good default value to use:
$ xz -vT0 data.csv
Alternatively, we can set an explicit thread count, such as the 3 in this example:
$ xz -vT3 data.csv
Unlike unpigz, decompression with xz doesn’t benefit from multithreading by default. If we want to employ faster decompression, we’d have to use multithreaded compression as we did above.
Even then, more than two or three threads don’t usually present much improvement, if any.
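As a rough sketch, we can compress the same generated file once with a single thread and once with -T0, then confirm that both results decompress to identical data. The sample data is a hypothetical stand-in, and the two compressed sizes can differ slightly because multithreaded mode splits the input into blocks:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Generate a few megabytes of compressible sample data.
awk 'BEGIN { for (i = 0; i < 200000; i++) printf "%d,sensor-%d,%d\n", i, i % 7, (i * 3) % 101 }' > data.csv

# Same input, one thread versus one thread per available core.
xz -T1 -c data.csv > single.xz
xz -T0 -c data.csv > multi.xz

# Both files must decompress to exactly the original content.
xz -dc single.xz | cmp - data.csv
xz -dc multi.xz | cmp - data.csv

# Compare the compressed sizes in bytes.
wc -c single.xz multi.xz
```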
There are two main ways to use multithreading with tar and xz.
Previously, we specified the compression level with the --use-compress-program option. Now, we enable multithreading through the same --use-compress-program option by setting the number of threads with the command-line options.
Here, we again use one thread for every CPU core:
$ tar cvf archive.tar.xz --use-compress-program='xz -1T0' *.csv
While decompression with xz doesn’t benefit from multithreading by default, we can still use the same options:
$ tar xvf archive.tar.xz --use-compress-program='xz -dT3'
Thus, we again use -d with a specific thread count (3).
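Another way to set the options for xz is to use the XZ_DEFAULTS and XZ_OPT environment variables, which xz reads itself, even when tar starts it. XZ_DEFAULTS holds user-specific or system-wide default options, while XZ_OPT passes options to xz when we can’t set them directly on its command line, such as when tar runs xz for us.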
So, in general, we use XZ_DEFAULTS in a .bashrc or similar initialization script, while XZ_OPT generally helps in specific sessions or local scripts.
Let’s see the compression example from earlier with XZ_OPT:
$ XZ_OPT='-T0 -1' tar cJvf archive.tar.xz *.csv
Similarly, we can perform a decompression:
$ XZ_OPT='-d -T0' tar xJvf archive.tar.xz
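The sketch below shows both variables in one place. xz parses XZ_DEFAULTS first, then XZ_OPT, and finally its own command line, so later sources override earlier ones. The file names are hypothetical, and -k simply keeps the input file:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'a,1\n' > first.csv
printf 'b,2\n' > second.csv

# XZ_OPT applies only to this command; the xz process that tar starts reads it.
XZ_OPT='-1 -T0' tar cJf archive.tar.xz *.csv
tar tJf archive.tar.xz

# XZ_DEFAULTS supplies baseline options; the command line adds -k on top.
XZ_DEFAULTS='-T0' xz -k first.csv
ls
```

In a real setup, XZ_DEFAULTS would typically be exported once in an initialization script instead of being set per command.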
Notably, we shouldn’t expect much improvement in either case due to the general way the algorithm works when decompressing.
Since version 5.4.1, xz provides support for parallel decompression with -T0. Yet, TAR files require a sequential read. Because of this, the process might need to preread a number of blocks. To do this, xz expects the archive to be compressed with the multithreading option, which splits the data into multiple blocks and records their sizes in the block headers.
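To see whether a file is split into several blocks, we can inspect it with the --list option. In this sketch, the sample data and the 256 KiB block size are arbitrary choices for illustration, and the block count is read from the third column of the file line in the --robot output:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Generate a few megabytes of compressible sample data.
awk 'BEGIN { for (i = 0; i < 200000; i++) printf "%d,sensor-%d,%d\n", i, i % 7, (i * 3) % 101 }' > data.csv

# Multithreaded compression with a small block size, so the file has many blocks.
xz -T0 --block-size=256KiB -c data.csv > blocks.xz

# Human-readable summary, including the "Blocks" column.
xz --list blocks.xz

# Machine-readable form: on the "file" line, the third field is the block count.
blocks=$(xz --robot --list blocks.xz | awk -F'\t' '$1 == "file" { print $3 }')
echo "blocks: $blocks"
```

A file that reports a single block gives xz nothing to decompress in parallel.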
Because of this, if multithreading is a must, we usually turn to algorithms like Zstd.
As we already noted, xz usually creates smaller archives than gzip.
To test this claim, we used the same 818 MB CSV file, and the same computer with six CPU cores and hyperthreading. This is the same setup we used to test gzip in Linux.
We compared xz to pigz, a gzip implementation that uses multithreading for faster compression and decompression:
With compression level 5, xz produced the smallest archive at 29 MB, which is 69% smaller than pigz with the same setup. However, xz took nearly 18 times as long at 70 seconds. Compression levels six and beyond hugely increased the compression time for a negligible 1% reduction in archive size.
So, we’ve demonstrated that xz does indeed create much smaller archives than gzip, sometimes at the price of time.
In this short article, we first saw when we might choose xz over ZIP and gzip.
Then, we learned how to compress and decompress single files with xz. Next, we looked at how we can use tar with xz to compress and decompress multiple files and directories.
Finally, we discovered how multithreading speeds up the compression on modern computers.