|
ASCII text format is a popular way to disseminate data because it is the absolute
lowest common denominator of formats. Any computer made in the last twenty
years can read ASCII text (not that you'd want to use the edln program that
came with early versions of DOS). On the other hand, programs cannot directly
analyze ASCII data, so you will need to "read" it into the program
you are using. SAS and Stata have powerful tools for writing programs that
can interpret ASCII data sets. This document will not discuss writing such
programs (please refer to the appropriate user manual). We will discuss recognizing
ASCII data, the end of line controversy that can cause problems using it,
and two shortcuts that can often make programing unnecessary.
Recognizing ASCII Data
How do you know if your data is in ASCII format? Hopefully the data source
explained how the data was formatted, so a few extra moments with the documentation
could save you having to reinvent the wheel. On a PC the standard extension
for ASCII text is (as in ).
Unfortunately many data sources seem to feel this should be used only for
text intended to be read. Sometimes is
used for data in ASCII format, but this is not at all standard.
Fortunately it is easy to find out. Open the document in any text editor such
as on a PC or
or in Linux. If the result is readable,
the data is in ASCII text. If it is full of strange characters or makes no sense
at all, it is in some other format. While you're in the editor, take a look
at how the data is structured; this information will be useful later. Alternatively
in Linux you can use or
to view the data but these commands will want to display the entire data set,
which is probably more than you need.
The End of Line Muddle
ASCII (American Standard Code for Information Interchange)
is a standard way of representing text as numbers. It actually was developed
for teletype machines and predates computers. This leads to a minor headache
in using it.
When a teletype machine hit the end of a line, it had to
do two things: move the print carriage back to the left of the page, and move
down one line. ASCII thus specified that the way to signal the end of a line
was to send , which are
special characters meaning "carriage return, linefeed."
When Microsoft was putting DOS together, they decided to adhere strictly to
the ASCII standard, and they continued to do so when they developed Windows.
Thus on a PC, the end of a line in an ASCII text file is denoted by .
However, the people who wrote Linux decided they weren't making teletypes and
having two characters at the end of a line was redundant. Thus in a Linux ASCII
file, the end of a line is denoted by just .
To make the confusion complete, on a Macintosh the end of a line is denoted
by (but at least they're consistent:
that's why the key on the keyboard that goes to a new line is labeled "Return"
instead of "Enter").
Recognizing End of Line Problems
How will this affect you? The most common scenario is that you've written some
beautiful SAS code to read the data, and for some reason it can't read one variable.
When you open the data set in a Linux editor like
or , it then turns out that the problem
variable is the last variable on each line, and attached to the number are the
characters (thus your program decides the
whole thing isn't a number). is
and 's way of printing the special character
, also known as Control-M (don't ask).
Apparently your data was created on a PC, and UNIX doesn't know what to do with
the extra character.
The reverse can happen when you download a text file and try to open it in
on a PC. If everything is on one line,
with black boxes where the line breaks ought to be, your file was created in
Linux. knows that
is a new line, but when it sees just
it shows a black box.
Picking up after the Programmers
Linux has two simple utilities that can eliminate this problem for you (Windows
does not, though some programs can deal with it). The program
strips out the extraneous characters
from PC files so that UNIX will read them properly, while
adds characters to Linux files so
PC's can read them properly. First put your file in your Linux home directory
(mapped as the drive if you're using Windows and logging into the PRIMO
domain). Then in a Linux session type
cat InFile | dos2unix > OutFile
or
cat InFile | unix2dos > OutFile
at the command prompt, where InFile represents
the file you want converted, and OutFile represents
the name you want to give the result.
If you're interested in the details, the cat
command by default simply prints InFile to
the screen, but the pipe character ( | ) sends the printout to the dos2unix
program instead. dos2unix strips out the
characters and prints the result to the screen by default, but the redirect
character (>) sends it to OutFile instead.
unix2dos works in the same way.
Once you've done this you will be able to read your data set without any further
difficulties (due to the end of line muddle anyway).
Using Excel to Read ASCII Data
The first shortcut you may be able to use is to read your data with Excel.
Excel has built in functionality for converting ASCII text into an Excel spreadsheet.
It is not nearly as powerful or flexible as a programming language like SAS
or Stata, but it is quick, easy, and allows you to see and fix the results
as you go. It also will handle end of lines properly regardless of the source.
You can then use Stat/Transfer or DBMS/Copy to convert the spreadsheet to
the format you want. For simple files, Excel may be the quickest way to read
your ASCII text data.
In Excel, select and then locate
the file you wish to convert. Note that by default Excel will only show Excel
spreadsheets. At the bottom of the dialog box where it says , change this to .
To begin the process, simply select your data file. Excel will recognize it
is not in Excel format, and launch a "wizard" (a set of dialog boxes
to help you accomplish a specific task) to gather information about how to
read it.

Delimited Files
The easiest files to convert are "delimited" files. These files
contain a specific character that separates variables. As an example, consider
a data set with one observation and five variables, with the values 1, 2,
3, 4, and 5. A delimited text file contains these five numbers separated by
a delimiting character, or a character that signifies the start of a new variable.
If the delimiter is a comma the file looks like this:
1,2,3,4,5
If the delimiter is a space, it looks like this:
1 2 3 4 5
If it is a tab, it looks like this:
1 2
3 4
5
All of these are common delimiters. If you suspect the file may be delimited,
make sure is selected and click. Excel will then present this dialog box:
In this case Excel is correct that the delimiting character is a tab. Note
that it has drawn vertical lines showing what will be placed in each column.
This is exactly what we want, so we could click
and our data would be read into Excel (note that step 3, which can be accessed
by clicking again, is rarely needed).
But let's consider what happens when Excel is wrong

In this case I forced Excel to use a comma as the delimiter. This is the wrong
character as can be seen by examining the output. No columns have been drawn
because Excel didn't find any commas. If I clicked
now, all the data would be read into a single column, which would be useless.
In this example Excel identified the correct delimiter immediately and no further
intervention was needed. But this doesn't always occur and sometimes you may
need to experiment until the data looks right. Let's try a real world example.

This is panel data containing a variety of economic and social indicators for
a large number of countries for many years. It is formatted such that each
row gives the country code, the variable name, and then its value in each
year. Note also that there is nothing underneath 1950; this country had no
data for 1950.

The variables here are separated by spaces, but not just one space. Note that
Excel creates a column for every space, which is not what we want at all.
Excel does have a tool for eliminating this problem. If we select Treat consecutive
delimiters as one it will ignore the extra spaces, giving the following result:

This looks great except for one minor detail: remember those missing values?
The data set used blank space to represent missing values as well as space
being the delimiter (don't do this, but it's common). Thus Excel ignored all
the missing values. This is a major distortion of the data and obviously unacceptable.
Fixed Width Files
There's no good way to read this file using delimiting characters because it's
not a delimited file. In fact it is set up with columns of fixed width (and
by default Excel will suggest treating it this way). So we need to go back
to step 1 and select , then click
.

Now Excel will present us a view of our data with vertical lines representing
columns. The difference is that we can draw in the columns ourselves. The
directions are given in the dialog box. We want a column break between each
year. To know exactly where we need to scroll right until we get a year with
some data, and then scroll down to see exactly how it's lined up. In this
case we can see that it's important to have the column end immediately after
each value or we could easily cut off the first few digits of the next value.

Excel will create columns when it can figure out where they should go (you
can move them if Excel is wrong), and in fact these were created by Excel.
However it could not figure out what to do for the first couple of decades
because so much data is missing, so we have to add those columns by hand.
Note that this process is prone to error, so check your data carefully after
it is converted. This is the disadvantage of fixed width data files as compared
to delimited files.
One final hint: our data set has labels in the first row. This is very useful
for us, but will confuse your statistical software. Fortunately we can cut
them out right at the beginning. Back in step 1, simply set to 2. Excel will still let you look at row 1 in step
2 when you are setting the columns (though you will have to scroll up to see
them), but when it creates the spreadsheet the labels will be gone.
Excel does have limits to the size of file it can handle (65,536 rows and 256
columns) but it doesn't always warn you when these are exceeded. If you find
that half your data is gone after conversion, that's where it went. This brings
up an important point: data conversion is an imperfect process, so always
save your results in a new file and keep the original intact. There may be
a workaround: use Excel and see what data it is able to read. If it can read
all of each observation but not all the observations you can use a text editor
to remove the observations it successfully read from (a copy of) your original
data set. Then have Excel open your data set again to create a spreadsheet
with the next set of observations. Repeat until you've read everything, then
use your statistical software to combine the files. Realistically it may be
easier in such cases to write a program in your preferred statistical software
to begin with.
Let Someone Else Do the Work
There is another method of reading text files that is far superior when it
is available: let someone else do it (and I don't mean your Research Assistant).
In many cases the people that created the data set also wrote programs that
can read it into SAS or Stata (such programs are much easier to write when
you know exactly what format the data set is in; perhaps because you just
wrote the code to put it in that format!). So take a few minutes to read the
web site or the documentation related to your data, and see what resources
are available. With any luck, you won't need this document at all.
|