Research groups sometimes find it useful to enter data in Excel and then convert it to the format of the statistical software package they'll use for analysis later. This article will describe best practices that can help avoid problems in doing so.
Some statistical packages can be used for data entry just as easily as Excel, and then you'll never have to worry about conversion issues. In general, once you know a statistical package, use it rather than Excel whenever possible. (Don't use Excel as a way to avoid becoming an expert user of your chosen statistical package.) However, if your research group includes people who enter data but do not do analysis and thus don't know a statistical package, Excel can be a good tool for them. Just be sure the instructions you give them will lead them to create files that can be used by statistical packages.
Your goal should be to create a machine-readable file that can be read into your preferred statistical package by a program, without human intervention beyond starting that program. Ideally the process will be 100% reproducible with no opportunity for human error.
Each Excel file should contain exactly one matrix of data. Do not use more than one worksheet per file. The file should not include titles, notes, or other metadata other than a row of variable names—any needed metadata can be placed in an accompanying file. (Exception: Stat/Transfer will by default ignore anything after a blank row. If you use Stat/Transfer, you can place notes and such at the bottom of the spreadsheet after a blank row.)
Each row will be one observation and each column one variable. (The definition of “observation” will depend on the project.) The first row should contain variable names.
Do not use color to convey information.
One easy way to make sure everyone in a group uses the right file structure is to provide a blank starting file that already has that structure in place. Individuals will open it, enter data, and save the result with a new name (ideally the starting file will be read-only).
Variable names must be compatible with the software to be used. In general that means they must be less than 32 characters long, must not start with a number, and cannot contain spaces. Use capital letters or underscores to separate words (e.g. healthStatus or health_status).
Variables should be designated as either numeric or text. Under no circumstances should anything but numbers be placed in columns designated for numeric variables. If explanations are likely to be needed, create a separate text variable for those explanations.
Examples of characters that do not belong in numeric variables include:
$ % , ( ) < > ~ :
Numeric variables should not contain ranges or lists of numbers. If a question allows for multiple answers ("check all that apply"), create separate indicator variables for "respondent checked option 1", "respondent checked option 2", etc. rather than putting a list of answers that were chosen in a single variable.
Indicator variables (e.g. Yes/No or True/False variables) should usually be entered as numbers with 1 for Yes/True and 0 for No/False. Choose variable names that match the meaning of the indicator. For example, use female rather than gender so that 1 clearly means "Yes, this person is female."
Communicate how percentages should be recorded (e.g. whether 80% will be recorded as 80 or 0.8).
Different programs store dates in different ways, and automatic conversion of dates is often problematic. It's safer to store dates as sets of simple numeric variables such as year, month, and day. If you choose to use a date format, make sure everyone uses the same one, ideally the ISO date format (YYYY-MM-DD). Always include the century in years (1999 not 99, so the following year is 2000 rather than 100).
Changing the Data
Once data are entered, they should never be changed in Excel, even to correct clear errors. Changes should be made programmatically (i.e. use Stata do files, SAS programs, R scripts, or SPSS syntax files) so that the changes are documented, auditable, and reproducible. If you need to correct an error in the data, the code that makes the change should be accompanied by a comment that explains it.
Ideally, converting the file from Excel format to the format of your preferred statistical package will be done by a program as well. Some statistical packages have commands that can read Excel files (e.g. Stata's import excel or SAS's proc import) so you can include those commands along with the rest of your code. Alternatively, some packages allow you to include a command in your program that runs Stat/Transfer. See Using Stat/Transfer for details.
Using the Stat/Transfer Graphical User Interface is acceptable, especially if you have a small number of files to convert. If you have to make any changes to the default settings be sure to record exactly what settings you used.
Always keep a copy of the file in exactly the form you received it. Consider also saving a copy in CSV (comma separated variable) format so you never have to worry about Excel going away, losing access to Excel, or future versions of Excel not being able to read your file. See http://researchdata.wisc.edu/manage-your-data/data-formats/ for more discussion of "sustainable" data formats.
Last Revised: 11/20/2012