SSCC Publications

Working with Data Sets in ASCII Text Format

Printer Friendly Version

Last Revised:8/26/2004

ASCII text format is a popular way to disseminate data because it is the absolute lowest common denominator of formats. Any computer made in the last twenty years can read ASCII text (not that you'd want to use the edln program that came with early versions of DOS). On the other hand, programs cannot directly analyze ASCII data, so you will need to "read" it into the program you are using. SAS and Stata have powerful tools for writing programs that can interpret ASCII data sets. This document will not discuss writing such programs (please refer to the appropriate user manual). We will discuss recognizing ASCII data, the end of line controversy that can cause problems using it, and two shortcuts that can often make programing unnecessary.

Recognizing ASCII Data

How do you know if your data is in ASCII format? Hopefully the data source explained how the data was formatted, so a few extra moments with the documentation could save you having to reinvent the wheel. On a PC the standard extension for ASCII text is .txt (as in data.txt). Unfortunately many data sources seem to feel this should be used only for text intended to be read. Sometimes .dat is used for data in ASCII format, but this is not at all standard.

Fortunately it is easy to find out. Open the document in any text editor such as Notepad on a PC or vi or Emacs in Linux. If the result is readable, the data is in ASCII text. If it is full of strange characters or makes no sense at all, it is in some other format. While you're in the editor, take a look at how the data is structured; this information will be useful later. Alternatively in Linux you can use type or more to view the data but these commands will want to display the entire data set, which is probably more than you need.

The End of Line Muddle

ASCII (American Standard Code for Information Interchange) is a standard way of representing text as numbers. It actually was developed for teletype machines and predates computers. This leads to a minor headache in using it.

When a teletype machine hit the end of a line, it had to do two things: move the print carriage back to the left of the page, and move down one line. ASCII thus specified that the way to signal the end of a line was to send <CR><LF>, which are special characters meaning "carriage return, linefeed."

When Microsoft was putting DOS together, they decided to adhere strictly to the ASCII standard, and they continued to do so when they developed Windows. Thus on a PC, the end of a line in an ASCII text file is denoted by <CR><LF>.

However, the people who wrote Linux decided they weren't making teletypes and having two characters at the end of a line was redundant. Thus in a Linux ASCII file, the end of a line is denoted by just <LF>.

To make the confusion complete, on a Macintosh the end of a line is denoted by <CR> (but at least they're consistent: that's why the key on the keyboard that goes to a new line is labeled "Return" instead of "Enter").

Recognizing End of Line Problems

How will this affect you? The most common scenario is that you've written some beautiful SAS code to read the data, and for some reason it can't read one variable. When you open the data set in a Linux editor like vi or Emacs, it then turns out that the problem variable is the last variable on each line, and attached to the number are the characters ^M (thus your program decides the whole thing isn't a number). ^M is vi and emac's way of printing the special character <CR>, also known as Control-M (don't ask). Apparently your data was created on a PC, and UNIX doesn't know what to do with the extra character.

The reverse can happen when you download a text file and try to open it in Notepad on a PC. If everything is on one line, with black boxes where the line breaks ought to be, your file was created in Linux. Notepad knows that <CR><LF> is a new line, but when it sees just <LF> it shows a black box.

Picking up after the Programmers

Linux has two simple utilities that can eliminate this problem for you (Windows does not, though some programs can deal with it). The program dos2unix strips out the extraneous <CR> characters from PC files so that UNIX will read them properly, while unix2dos adds <CR> characters to Linux files so PC's can read them properly. First put your file in your Linux home directory (mapped as the Z: drive if you're using Windows and logging into the PRIMO domain). Then in a Linux session type

cat InFile | dos2unix > OutFile

or

cat InFile | unix2dos > OutFile

at the command prompt, where InFile represents the file you want converted, and OutFile represents the name you want to give the result.

If you're interested in the details, the cat command by default simply prints InFile to the screen, but the pipe character ( | ) sends the printout to the dos2unix program instead. dos2unix strips out the <CR> characters and prints the result to the screen by default, but the redirect character (>) sends it to OutFile instead. unix2dos works in the same way.

Once you've done this you will be able to read your data set without any further difficulties (due to the end of line muddle anyway).

Using Excel to Read ASCII Data

The first shortcut you may be able to use is to read your data with Excel. Excel has built in functionality for converting ASCII text into an Excel spreadsheet. It is not nearly as powerful or flexible as a programming language like SAS or Stata, but it is quick, easy, and allows you to see and fix the results as you go. It also will handle end of lines properly regardless of the source. You can then use Stat/Transfer or DBMS/Copy to convert the spreadsheet to the format you want. For simple files, Excel may be the quickest way to read your ASCII text data.

In Excel, select File…Open and then locate the file you wish to convert. Note that by default Excel will only show Excel spreadsheets. At the bottom of the dialog box where it says Files of type, change this to All Files (*.*). To begin the process, simply select your data file. Excel will recognize it is not in Excel format, and launch a "wizard" (a set of dialog boxes to help you accomplish a specific task) to gather information about how to read it.

Text Import Wizard Step 1 (Set to Delimited)

Delimited Files

The easiest files to convert are "delimited" files. These files contain a specific character that separates variables. As an example, consider a data set with one observation and five variables, with the values 1, 2, 3, 4, and 5. A delimited text file contains these five numbers separated by a delimiting character, or a character that signifies the start of a new variable. If the delimiter is a comma the file looks like this:

1,2,3,4,5

If the delimiter is a space, it looks like this:

1 2 3 4 5

If it is a tab, it looks like this:

1       2       3       4       5

All of these are common delimiters. If you suspect the file may be delimited, make sure Delimited is selected and click Next. Excel will then present this dialog box:


Text Import Wizard Step 2 (Delimiter is Tab)

In this case Excel is correct that the delimiting character is a tab. Note that it has drawn vertical lines showing what will be placed in each column. This is exactly what we want, so we could click Finish and our data would be read into Excel (note that step 3, which can be accessed by clicking Next again, is rarely needed). But let's consider what happens when Excel is wrong…

Text Import Wizard Step 2 (Delimiter is Comma)

In this case I forced Excel to use a comma as the delimiter. This is the wrong character as can be seen by examining the output. No columns have been drawn because Excel didn't find any commas. If I clicked Finish now, all the data would be read into a single column, which would be useless.

In this example Excel identified the correct delimiter immediately and no further intervention was needed. But this doesn't always occur and sometimes you may need to experiment until the data looks right. Let's try a real world example.

Text Import Wizard Step 1 (New Data Set)

This is panel data containing a variety of economic and social indicators for a large number of countries for many years. It is formatted such that each row gives the country code, the variable name, and then its value in each year. Note also that there is nothing underneath 1950; this country had no data for 1950.

Text Import Wizard Step 2 (Delimiter is Space)

The variables here are separated by spaces, but not just one space. Note that Excel creates a column for every space, which is not what we want at all. Excel does have a tool for eliminating this problem. If we select Treat consecutive delimiters as one it will ignore the extra spaces, giving the following result:

Text Import WIzard Step 2 (Treat Consecutive Delmiters as One)

This looks great except for one minor detail: remember those missing values? The data set used blank space to represent missing values as well as space being the delimiter (don't do this, but it's common). Thus Excel ignored all the missing values. This is a major distortion of the data and obviously unacceptable.

Fixed Width Files

There's no good way to read this file using delimiting characters because it's not a delimited file. In fact it is set up with columns of fixed width (and by default Excel will suggest treating it this way). So we need to go back to step 1 and select Fixed Width, then click Next.

Text Import Wizard Step 2 (Set Column Breaks)

Now Excel will present us a view of our data with vertical lines representing columns. The difference is that we can draw in the columns ourselves. The directions are given in the dialog box. We want a column break between each year. To know exactly where we need to scroll right until we get a year with some data, and then scroll down to see exactly how it's lined up. In this case we can see that it's important to have the column end immediately after each value or we could easily cut off the first few digits of the next value.

Text Import Wizard Step 2 (very little space between columns)

Excel will create columns when it can figure out where they should go (you can move them if Excel is wrong), and in fact these were created by Excel. However it could not figure out what to do for the first couple of decades because so much data is missing, so we have to add those columns by hand. Note that this process is prone to error, so check your data carefully after it is converted. This is the disadvantage of fixed width data files as compared to delimited files.

One final hint: our data set has labels in the first row. This is very useful for us, but will confuse your statistical software. Fortunately we can cut them out right at the beginning. Back in step 1, simply set Start import at row: to 2. Excel will still let you look at row 1 in step 2 when you are setting the columns (though you will have to scroll up to see them), but when it creates the spreadsheet the labels will be gone.

Excel does have limits to the size of file it can handle (65,536 rows and 256 columns) but it doesn't always warn you when these are exceeded. If you find that half your data is gone after conversion, that's where it went. This brings up an important point: data conversion is an imperfect process, so always save your results in a new file and keep the original intact. There may be a workaround: use Excel and see what data it is able to read. If it can read all of each observation but not all the observations you can use a text editor to remove the observations it successfully read from (a copy of) your original data set. Then have Excel open your data set again to create a spreadsheet with the next set of observations. Repeat until you've read everything, then use your statistical software to combine the files. Realistically it may be easier in such cases to write a program in your preferred statistical software to begin with.

Let Someone Else Do the Work

There is another method of reading text files that is far superior when it is available: let someone else do it (and I don't mean your Research Assistant). In many cases the people that created the data set also wrote programs that can read it into SAS or Stata (such programs are much easier to write when you know exactly what format the data set is in; perhaps because you just wrote the code to put it in that format!). So take a few minutes to read the web site or the documentation related to your data, and see what resources are available. With any luck, you won't need this document at all.

UW Home Page Article on the Carillon Tower