SAS implements value labels (e.g. 0 is male, 1 is female) by allowing you
to define custom formats. The difficulty is that these formats
are not saved as part of the data they label. This publication
discusses two ways of storing SAS formats so that they can
be used in subsequent programs.
As an example, consider a fabricated data set of individuals. For
each individual you have their gender, their age, and their
income. You want to do three things with this data:
read it in and prepare it for analysis, get basic summary
statistics, and regress income on the other variables. Because
you don't want to re-run all the previous steps as you debug
the one you're working on, you put each step in a separate
SAS program. However, you want to apply the same
value labels to gender in all three programs, so
you need a way to store the custom format between programs.
The Task
First let's go through what you want to accomplish, ignoring the issue
of formats for the moment.
Start by reading in the fabricated data. If this were actual data
you'd probably use a combination of infile and input,
but we'll use datalines and
put the data right in the data step.
data 'incomedata';
input gender age income;
datalines;
0 50 60000
1 45 80000
1 30 25000
0 25 18000
1 72 40000
;
run;
Note that quoting the data set name stores the data in a permanent data set
(a file called incomedata.sas7bdat in the current directory) so you can
use it in later programs.
Next you want some basic summary statistics. So run proc
freq and proc means:
proc freq data='incomedata';
run;
proc means data='incomedata';
run;
Of course in the real world you'd probably do something more sophisticated.
Finally we'll run a regression.
proc reg data='incomedata';
model income=gender age;
run;
This is very simple as well--in reality something this easy could
all go in one program, but we'll keep them separate for pedagogical
purposes.
Review: Defining and Using Formats
Formats in SAS are defined using proc format, and are applied to variables
using the format statement. So to apply a label to the gender variable,
the first step is to define a format that associates 0 with
male and 1 with female. We'll call it genderformat.
proc format;
value genderformat
0= 'male'
1= 'female'
;
run;
Next you need to associate that format with the gender
variable:
format gender genderformat.;
This statement must of course be part of a data or proc step. This could
be a separate data step just to apply the format, or
it could be added to an existing data or proc step.
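As a minimal sketch of the second possibility, the format statement can go directly in a proc step, so that the labels apply only to that procedure's output. The tables statement is added here purely for illustration and is not part of the original programs, and the sketch assumes incomedata has already been created in the current directory.

```sas
/* Define the format for this program */
proc format;
  value genderformat
    0 = 'male'
    1 = 'female'
  ;
run;

/* Apply the format inside the proc step itself; */
/* the stored data set is left unchanged.        */
proc freq data='incomedata';
  tables gender;               /* illustrative: tabulate gender only */
  format gender genderformat.;
run;
```

Because the format statement lives in the proc step, no extra data step is needed just to attach the label.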
The difficulty is that, by default, genderformat is stored in the temporary
WORK library and goes away as soon as the program (the SAS session) that
defines it ends. So how can you use it in all three programs?
Including the Format Definitions in All Your Programs
One option is to simply include all the formatting code in every SAS program
that uses
genderformat. The first program, the one
that just reads in the data, produces no output (other than
the data set) and does not need to know about the format.
In fact, including the format statement in that data step
would complicate things, as we'll see later. But the other
two do need the format, and thus need the code that handles
it. Here is the complete code of those
programs, including the formatting.
Summary statistics:
proc format;
value genderformat
0= 'male'
1= 'female'
;
run;
data formatteddata;
set 'incomedata';
format gender genderformat.;
run;
proc freq data=formatteddata;
run;
proc means data=formatteddata;
run;
Regression:
proc format;
value genderformat
0= 'male'
1= 'female'
;
run;
data formatteddata;
set 'incomedata';
format gender genderformat.;
run;
proc reg data=formatteddata;
model income=gender age;
run;
The disadvantage of this approach is obvious: the programs are now about
three times as long. In reality it could be much worse--many
data sets include pages and pages of value labels, making
your programs extremely long and somewhat cumbersome to work
with. And if you wanted to make any changes to a
format, you'd have to change the copy in each
program. On the other hand, this method is straightforward
to implement.
Saving Formats in a Catalog File
The alternative is to save the format in a separate file SAS calls a catalog.
Then subsequent programs can refer to this catalog when
they need the format.
In order to save the format, you'll add a library option to the proc
format statement, telling SAS where to put it (you'll
need to define the library first with a libname statement). The program
below will create a file called gender.sas7bcat in your current directory.
You'll also need to tell SAS to look for formats in that
catalog file. This is done with the fmtsearch option.
Since genderformat is
now permanent, you can make the association between gender and genderformat permanent
by including the format statement in the data
step that reads in the data. Here's the complete
code for the data preparation program:
libname dir ".";
proc format library=dir.gender;
value genderformat
0= 'male'
1= 'female'
;
run;
options fmtsearch=(dir.gender);
data 'incomedata';
format gender genderformat.;
input gender age income;
datalines;
0 50 60000
1 45 80000
1 30 25000
0 25 18000
1 72 40000
;
run;
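To check what was actually stored, one option is to print the contents of the catalog. This is only a sketch, and it assumes the data preparation program above has already been run in the same directory. The fmtlib option asks proc format to list the formats in the catalog it is pointed at.

```sas
/* Point the libref at the directory that holds gender.sas7bcat */
libname dir ".";

/* List the formats stored in the dir.gender catalog */
proc format library=dir.gender fmtlib;
run;
```

If genderformat does not appear in the listing, the later programs will not be able to find it either.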
The payoff comes in the next two programs. When they load incomedata,
the file will tell them that gender should
be formatted using genderformat.
They'll then look for genderformat, so you'll
need to tell them where to look using the fmtsearch option.
But that's it!
Summary Statistics:
libname dir ".";
options fmtsearch=(dir.gender);
proc freq data='incomedata';
run;
proc means data='incomedata';
run;
Regression:
libname dir ".";
options fmtsearch=(dir.gender);
proc reg data='incomedata';
model income=gender age;
run;
Obviously this is much shorter than redefining the format in each program,
especially if you've got a lot of formats. But there is a
catch. SAS now knows that gender needs to be formatted using
genderformat. And if genderformat is unavailable, SAS will by default
report an error and refuse to use the data set at all. You'll need to make
sure the catalog file stays with the data set, and that everyone
who uses it knows how to set the fmtsearch option.
If you ever do get stuck with a data set that has been formatted using
formats you don't have access to, the following trick can
be useful: tell SAS to change the format
to nothing in the data step that first loads the data. The
following format statement will clear all formats
from a data set. The result may not be pretty, but it
will be usable:
format _ALL_;
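Two related alternatives, sketched here under assumptions rather than taken from the programs above, are to turn off the error with the nofmterr system option, or to strip the format associations from the stored file with proc datasets. The libref dir and the data set name incomedata are reused from the earlier examples. Note that the proc datasets step modifies the permanent data set in place.

```sas
/* Alternative 1: let SAS carry on when a format cannot be found. */
/* Values are then displayed unformatted.                         */
options nofmterr;

/* Alternative 2: remove the format associations from the file.   */
/* Assumes incomedata.sas7bdat is in the current directory.       */
libname dir ".";
proc datasets library=dir nolist;
  modify incomedata;
  format _all_;      /* no format name given: clears all formats */
quit;
```

The first alternative leaves the data set untouched, which may be preferable if the missing catalog is likely to turn up later.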
So which method should you use? Most likely, as you read about the two
techniques, one or the other seemed easier to you. Whichever
that was, go with it. It really is just a matter of personal preference.