SAS has a variety of tools for working with compressed data. This publication
will describe how to use them, and why.
Compression programs look for patterns in the data, and then replace the
original file with a file that describes those patterns.
Nothing is lost--that description contains all the information
needed to recreate the original
file. Normally the description is smaller than the original
file, but how much smaller will depend on the data itself
and the compression scheme used.
With the compression schemes built into SAS, the compressed
file can even be bigger than the original in rare
cases.
The catch is that it takes CPU time to compress or uncompress a
file. Thus compression is really a way to trade CPU power for disk
space. For files you use constantly, this may not be a good trade.
But we strongly encourage you to compress any data sets you are
not using on a regular basis.
Using Compressed SAS Data Sets
SAS has built-in tools for compressing SAS data sets. Simply add the compress option
to your data statement. It can take one of two values: yes (or
equivalently char)
and binary. These will use two different
compression schemes (RLE and RDC respectively). The scheme
you use will affect the amount of compression you get, but
it is hard to predict in advance which will work better
for a particular data set. You may want to experiment
and see, or just choose one and go with it.
The complete syntax is simply:
data output (compress=yes);
or
data output (compress=binary);
where output should
be replaced by the name of the data set you wish to create.
The data step then proceeds as usual.
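As a minimal sketch of the experiment suggested above, the following makes two compressed copies of the same data set, one with each scheme. The data set names original, copy_rle and copy_rdc are hypothetical. After each step SAS writes a note to the log reporting by what percentage compression changed the size of the data set, so you can compare the two and keep whichever came out smaller.

```sas
/* original is a hypothetical existing data set */
data copy_rle (compress=yes);     /* RLE scheme */
    set original;
run;

data copy_rdc (compress=binary);  /* RDC scheme */
    set original;
run;
```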
Using a compressed data set takes no special syntax whatsoever. SAS will
recognize that the data set is compressed and uncompress
each observation automatically as it reads it. Of course
it takes additional CPU time to both read and write compressed
data.
Both of the compression schemes available in SAS have the drawback that
they add 12 bytes per observation, and if the data set has
many observations but few variables this overhead can be
greater than the amount of space saved. One other drawback:
Stat/Transfer cannot read compressed data sets. If you anticipate
using Stat/Transfer, do not compress your data, or make an
uncompressed copy before attempting to run Stat/Transfer.
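If you need an uncompressed copy for Stat/Transfer, one way to make it is to copy the data set with compress=no. This is only a sketch: compressed and uncompressed are hypothetical data set names.

```sas
/* compressed is a hypothetical existing compressed data set */
data uncompressed (compress=no);
    set compressed;
run;
```

Once Stat/Transfer has finished, the uncompressed copy can be deleted so that it does not take up disk space.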
Reading Compressed Text Files
Linux has its own set of tools for compressing files, notably compress/uncompress,
gzip/gunzip, bzip2/bunzip2 and zip/unzip. They are described
in Using
Compressed Data in Linux. Windows has another set of compression tools,
the most common being WinZip and its relatives (note that
the Linux programs uncompress and unzip can
both handle Windows zip files, as can zcat).
It is very common to receive data as text files that
have been compressed using one of these programs.
Often the easiest way to read in these files is to first uncompress them
using Linux tools and then read the uncompressed data
directly--just remember not to keep both the compressed files
and the uncompressed files once you're done. You can temporarily
store the uncompressed files in /temp/fivedays if
they are too large to fit in the space available to
you.
But it is possible to have SAS read compressed
files directly (with the exception of files
compressed using bzip2, since it is not
compatible with zcat).
This can be handy if the files are extremely large,
or if you are working with many compressed files.
The key is to define a SAS filename which is actually the output of the zcat command.
The zcat command simply reads out the contents
of a compressed file. SAS can use the output of zcat just
as if it were a file. Before your data step, insert the following
in your SAS program:
filename reader pipe 'zcat file';
where file is
the name of the compressed file you want to read. Now in
your data step, your infile statement just uses reader:
infile reader;
Options such as delimiter still work as
usual if needed. Here's an example of a full data step that
reads a compressed text file:
filename reader pipe 'zcat compressed.Z';
data plain;
infile reader delimiter=" ";
input x1 x2 x3 x4 x5 x6 x7 x8 x9 x10;
run;
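The same approach should extend to several compressed files at once, since zcat writes out the contents of each file it is given, one after another. The sketch below assumes two hypothetical files, part1.gz and part2.gz, that share the same ten-column layout and have no header rows.

```sas
/* part1.gz and part2.gz are hypothetical file names */
filename reader pipe 'zcat part1.gz part2.gz';
data combined;
    infile reader delimiter=" ";
    input x1 x2 x3 x4 x5 x6 x7 x8 x9 x10;
run;
```

SAS sees a single stream of text, so the observations from the second file simply follow those from the first.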
Writing Compressed Text Files
You can have SAS output compressed text files as well. This is equivalent
to having SAS write normal text and then compressing it,
but can save you a step (and this way it is all documented
in your SAS program). Once again, the key is a filename
containing a compression command:
filename writer pipe 'compress > file.Z';
The .Z is essential so uncompress will
later recognize the output as a compressed file. You can
then use that filename anywhere SAS writes
text. Two examples:
data _null_;
set plain;
file writer;
put x1-x10;
run;
proc export data=plain outfile=writer dbms=dlm;
delimiter=',';
run;
Please note that proc export will insist on briefly opening a SAS window
as it runs. It will disappear very quickly, but this does
mean that you must have the ability to display Linux graphics
in your session or a job using proc export will hang. This
is true whether you're compressing the output of proc export
or not.
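The same pattern should work with other compression programs that read from standard input. For example, gzip compresses whatever is piped to it when it is not given a file name, so a gzip version might look like the following sketch. The fileref gzwriter and the file name file.gz are hypothetical, and plain is the data set from the earlier examples.

```sas
/* file.gz is a hypothetical output file name */
filename gzwriter pipe 'gzip > file.gz';
data _null_;
    set plain;
    file gzwriter;
    put x1-x10;
run;
```

Here the .gz extension plays the same role as .Z does for compress: it lets gunzip later recognize the output as a compressed file.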