Using Compressed Data in SAS

SAS has a variety of tools for working with compressed data. This article will describe how to use them, and why.

Compression programs look for patterns in the data, and then replace the original file with a file that describes those patterns. Nothing is lost--that description contains all the information needed to recreate the original file. Normally the description is smaller than the original file, but how much smaller will depend on the data itself and the compression scheme used. With the compression schemes built into SAS, the compressed file can even be bigger than the original in rare cases.

The catch is that it takes CPU time to compress or uncompress a file. Thus compression is really a way to trade CPU power for disk space. For files you use constantly, this may not be a good trade. But we strongly encourage you to compress any data sets you are not using on a regular basis.

Using Compressed SAS Data Sets

SAS has built-in tools for compressing SAS data sets. Simply add the compress option after the data set name in your data statement. It can take one of two values: yes (or equivalently char) and binary. These use two different compression schemes: Run Length Encoding (RLE) and Ross Data Compression (RDC) respectively. The scheme you use will affect the amount of compression you get, but it's hard to predict which will work better for a particular data set. You may want to experiment and see, or just choose one and go with it.

The complete syntax is simply:

data output (compress=yes);

or

data output (compress=binary);

where output should be replaced by the name of the data set you wish to create. The data step then proceeds as usual.
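
If you do want to experiment, one way is to write the same data with each scheme and compare the results. The following is a minimal sketch, assuming a hypothetical existing data set called mydata; the names char_copy and binary_copy are also just placeholders. SAS normally writes a note to the log reporting how much compression changed the size of each data set, and proc contents lists the scheme in its Compressed attribute.

```sas
/* mydata is a hypothetical existing data set; substitute your own. */

/* Copy 1: RLE compression (compress=char is the same as compress=yes) */
data char_copy (compress=char);
    set mydata;
run;

/* Copy 2: RDC compression */
data binary_copy (compress=binary);
    set mydata;
run;

/* Check the Compressed attribute and the size of each copy */
proc contents data=char_copy;
run;

proc contents data=binary_copy;
run;
```

Keep whichever copy is smaller and delete the other, so the experiment does not end up costing you disk space.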

Using a compressed data set takes no special syntax whatsoever. SAS will recognize that the data set is compressed and uncompress each observation automatically as it reads it. Of course it takes additional CPU time to both read and write compressed data.

Both of the compression schemes available in SAS have the drawback that they add 12 bytes per observation, and if the data set has many observations but few variables this overhead can be greater than the amount of space saved. One other drawback: Stat/Transfer cannot read compressed data sets. If you anticipate using Stat/Transfer, do not compress your data, or make an uncompressed copy before attempting to run Stat/Transfer.
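
Making an uncompressed copy can be done with an ordinary data step. This is a sketch only: compressed_data and for_transfer are hypothetical names, and compress=no is stated explicitly in case compression has been turned on by default in your session.

```sas
/* compressed_data is a hypothetical compressed data set.       */
/* for_transfer is the uncompressed copy to give Stat/Transfer. */
data for_transfer (compress=no);
    set compressed_data;
run;
```

Delete the uncompressed copy once the transfer is done so you are not storing the data twice.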

Reading Compressed Text Files

Linux has its own set of tools for compressing files, notably compress/uncompress, gzip/gunzip, bzip2/bunzip2 and zip/unzip. They are described in Using Compressed Data in Linux. Windows has another set of compression tools, the most common being Winzip and its relatives (note that the Linux programs uncompress and unzip can both handle Windows zip files, as can zcat). It is very common to receive data as text files that have been compressed using one of these programs.

Often the easiest way to read in these files is to first uncompress them using Linux tools and then read the uncompressed data directly--just remember not to keep both the compressed files and the uncompressed files once you're done. You can temporarily store the uncompressed files in /temp/fivedays if they are too large to fit in the space available to you.

But it is possible to have SAS read compressed files directly (with the exception of files compressed using bzip2, since it is not compatible with zcat). This can be handy if the files are extremely large, or if you are working with many compressed files.

The key is to define a SAS filename which is actually the output of the zcat command. The zcat command simply writes out the uncompressed contents of a compressed file. SAS can use the output of zcat just as if it were a file. Before your data step, insert the following in your SAS program:

filename reader pipe 'zcat file';

where file is the name of the compressed file you want to read. Now in your data step, your infile statement just uses reader:

infile reader;

Options such as delimiter still work as usual if needed. Here's an example of a full data step that reads a compressed text file:

filename reader pipe 'zcat compressed.Z';

data plain;
infile reader delimiter=" ";
input x1 x2 x3 x4 x5 x6 x7 x8 x9 x10;
run;
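
The same approach can cover several compressed files at once, because zcat accepts more than one file name and writes their contents out one after another. The sketch below assumes three hypothetical files, part1.Z, part2.Z and part3.Z, that all share the layout of the example above and have no header lines.

```sas
/* part1.Z, part2.Z and part3.Z are hypothetical file names.     */
/* zcat uncompresses them in order and SAS sees a single stream. */
filename reader pipe 'zcat part1.Z part2.Z part3.Z';

data combined;
    infile reader delimiter=" ";
    input x1-x10;   /* shorthand for x1 x2 ... x10 */
run;
```

If the files do have header lines, this will not work as written, since the headers from the second and later files would appear in the middle of the data.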

Writing Compressed Text Files

You can have SAS output compressed text files as well. This is equivalent to having SAS write normal text and then compressing it, but can save you a step (and this way it is all documented in your SAS program). Once again, the key is a filename containing a compression command:

filename writer pipe 'compress > file.Z';

The .Z is essential so uncompress will later recognize the output as a compressed file. You can then use that filename anywhere SAS writes text. Two examples:

data _null_;
set plain;
file writer;
put x1-x10;
run;

proc export data=plain outfile=writer dbms=dlm;
delimiter=',';
run;

Please note that proc export will insist on briefly opening a SAS window as it runs. It will disappear very quickly, but this does mean that you must have the ability to display Linux graphics in your session or a job using proc export will hang. This is true whether you're compressing the output of proc export or not.
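
If you want to confirm that a compressed text file was written correctly, you can read it back through zcat and compare it with the original. This is a sketch under some assumptions: it checks the file written by the data _null_ example above (space-separated values for x1-x10 in file.Z), and the names check and roundtrip are placeholders. Because put writes numbers as text, values with many significant digits may not match exactly, so small differences reported by proc compare are not necessarily a sign of trouble.

```sas
/* Read back the compressed file written by the data _null_ step */
filename check pipe 'zcat file.Z';

data roundtrip;
    infile check;
    input x1-x10;
run;

/* Compare the values read back against the original data set */
proc compare base=plain compare=roundtrip;
run;
```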

Last Revised: 5/10/2006