Using Compressed Data in Linux

Linux has a variety of tools for working with compressed data. This article will describe how to use them, and why.

Compression programs look for patterns in the data, and then replace the original file with a file that describes those patterns. Nothing is lost--that description contains all the information needed to recreate the original file. The description will usually be smaller than the original file, but how much smaller will depend on the data itself and the compression scheme used.
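To see how much the result depends on the data, here is a small sketch using gzip (one of the commands described later in this article). The file names zeros.dat and random.dat are made up for the illustration: one file is nothing but repeated bytes, the other is random and has no patterns to describe.

```shell
# Scratch directory so none of your own files are touched;
# remove it when you are finished.
workdir=$(mktemp -d)
cd "$workdir"

# Two 100,000-byte files: one highly repetitive, one random.
head -c 100000 /dev/zero    > zeros.dat
head -c 100000 /dev/urandom > random.dat

# gzip -c writes the compressed data to standard output and
# leaves the original file in place, so we can just count bytes.
zeros_size=$(gzip -c zeros.dat | wc -c)
random_size=$(gzip -c random.dat | wc -c)

echo "repetitive data: 100000 -> $zeros_size bytes"
echo "random data:     100000 -> $random_size bytes"
```

The repetitive file should shrink to a tiny fraction of its size, while the random file should stay at roughly its original size. Real data sets typically fall somewhere in between.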

The catch is that it takes CPU time to compress or uncompress a file. Thus compression is really a way to trade CPU power for disk space. For files you use constantly, this may not be a good trade. But we strongly encourage you to compress any data sets you are not using on a regular basis. The SSCC's current disk space was quite costly and we hope to avoid adding to it any sooner than necessary.

This article will not attempt to cover all the available compression tools or all the things they can do, just the most common uses. Full details are available by typing man and then the name of the command in Linux (e.g. man compress).

compress/uncompress

The compress and uncompress commands are very easy to use:

compress file

will replace file with the compressed file file.Z (think zipped). You can uncompress a file with:

uncompress file

which replaces the compressed file with the original. uncompress doesn't care if you include the .Z at the end or not--it will find the file either way.

gzip/gunzip

gzip and gunzip work in exactly the same way:

gzip file

will replace file with the compressed file.gz.

gunzip file

will replace the compressed file with the original.
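As a quick check that nothing is lost, the following sketch compresses a file, uncompresses it again, and compares checksums taken before and after. The file mydata.txt and its contents are hypothetical, standing in for one of your own files.

```shell
# Scratch directory; remove it when you are finished.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical example file with very repetitive contents.
yes "some repetitive survey data" | head -n 5000 > mydata.txt
before=$(cksum < mydata.txt)

gzip mydata.txt      # mydata.txt is replaced by mydata.txt.gz
ls                   # only mydata.txt.gz is listed

gunzip mydata.txt    # gunzip finds mydata.txt.gz on its own
after=$(cksum < mydata.txt)

# The checksums match if the round trip preserved every byte.
[ "$before" = "$after" ] && echo "contents are identical"
```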

bzip2/bunzip2

bzip2 and bunzip2 are another variation on the same theme.

bzip2 file

will replace file with the compressed file.bz2.

bunzip2 file.bz2

will replace the compressed file with the original. Note that in this case you must type the .bz2 at the end of the name of the file to be uncompressed.

zip/unzip

zip works slightly differently in that it asks you to name the compressed file:

zip compressedFile file

will create compressedFile.zip (the .zip is added automatically), containing a compressed version of file. The original file is not removed.

unzip compressedFile

will recreate the original file. The compressed file is not removed.

Which Command Should I Use?

Unfortunately, which command will work best depends on the exact properties of the file you're working with. bzip2 will usually give the best compression, while .zip files are more easily used on Windows.
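If you want to know which command does best on your own data, you can simply try them. The following is a rough sketch rather than a benchmark: sample.txt is a made-up file standing in for your data, and the loop skips any tool that is not installed. It relies on the -c option, which compress, gzip and bzip2 all accept, to write the compressed data to standard output instead of replacing the file.

```shell
# Scratch directory; remove it when you are finished.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample: 20,000 lines of mildly varied text.
seq 1 20000 | sed 's/^/observation number /' > sample.txt
original_size=$(wc -c < sample.txt)
echo "original: $original_size bytes"

# Try each tool that is installed. With -c the result goes to
# standard output, so sample.txt itself is never replaced.
for tool in compress gzip bzip2; do
    if command -v "$tool" > /dev/null 2>&1; then
        size=$("$tool" -c sample.txt | wc -c)
        echo "$tool: $size bytes"
    else
        echo "$tool: not installed, skipped"
    fi
done
```

To test a real file, replace sample.txt with its name. The original is left untouched, so you can then compress it with whichever command came out ahead.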

How Do I Uncompress this File?

Suppose you've obtained a file, perhaps via email or from the web, and you know it's compressed but you don't know what program was used to compress it. Look at the last letters of the file name, following the period:

Last Letters of the File Name    Program It Was Probably Compressed With
.Z                               compress
.gz                              gzip
.bz2                             bzip2
.zip                             zip (possibly a Windows program like WinZip)

Note that unzip will handle .zip files created on Windows just fine. Feel free to just experiment: if you try to uncompress a file using a program that can't read the needed format, it will just give you an error message and quit.
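The table above can also be written as a small shell function. This is only a sketch: suggest_tool is a hypothetical helper name, and it looks at nothing but the end of the file name, so it is no smarter than reading the table yourself.

```shell
# Suggest an uncompress command based on how the file name ends.
# The match is case-sensitive, so .Z and .z are different endings.
suggest_tool() {
    case "$1" in
        *.Z)   echo "uncompress" ;;
        *.gz)  echo "gunzip" ;;
        *.bz2) echo "bunzip2" ;;
        *.zip) echo "unzip" ;;
        *)     echo "unknown" ;;
    esac
}

suggest_tool survey.dta.gz    # prints: gunzip
suggest_tool archive.zip      # prints: unzip
suggest_tool notes.txt        # prints: unknown
```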

zcat/bzcat

The zcat command reads a compressed file and sends the results to the standard output (use bzcat with bzip2). Just typing

zcat file

where file is a compressed file, will display the contents of the file on the screen. But the real point is to use the results in other programs. For example, to see the results one page at a time, pipe the output to the more command:

zcat file | more
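The same idea works with any program that reads standard input. As a small sketch, the following counts the lines in a compressed file without ever creating an uncompressed copy on disk. The file numbers.txt is made up for the example.

```shell
# Scratch directory; remove it when you are finished.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical data file: the numbers 1 to 1000, one per line.
seq 1 1000 > numbers.txt
gzip numbers.txt                  # leaves only numbers.txt.gz

# Look at the first few lines, then count all of them.
zcat numbers.txt.gz | head -n 3
line_count=$(zcat numbers.txt.gz | wc -l)
echo "lines: $line_count"

ls    # numbers.txt.gz is still the only file
```

Because zcat only writes to standard output, the compressed file stays exactly as it was, and no extra disk space is used.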

Both SAS and Stata can read directly from the output of the zcat command. For instructions see Using Compressed Data in SAS or Using Stata on Linux. Note that SAS has compression built in as a data set option. Stata users should consider using the user-written gzsave and gzuse commands. These act just like the regular save and use commands, but the file on disk is compressed just as if you had used gzip on it.

Last Revised: 5/30/2012