How to work with big datasets in Stata without running out of memory.
When you work with a data set in Stata, Stata must load the entire data set into the computer's memory (RAM). Fortunately, laptops today have more memory than most servers did 20 years ago, and most people never have to worry about how much memory Stata is using. But if you work with big datasets, you need to be careful: trying to use more memory than you have will end badly, and if you're working on one of SSCC's servers it will affect everyone else who is using that server.
Do I need to worry about memory?
You only need to worry about memory if the size of your data set is close to the amount of memory in the computer you're using, and if it's bigger you definitely have a problem. The number of observations or variables in your data set won't tell you that, since the amount of memory they take up varies. But a data set takes up the same amount of space in memory as it does on disk. There are many ways to see how big a file on disk is, but here are a few:
Windows: Open Windows Explorer, find the data set, right-click on it, and choose Properties. Alternatively, go to View and select Details, and you'll see how big all your files are.
Mac: Open the Finder, find the data set, right-click (or Ctrl-click) on it, and choose Get Info.
Linux: Use cd to get to the proper directory, then type:
ls -lh dataset.dta
where dataset should be replaced by the actual name of your data set.
Assuming you can load the data into Stata, Stata will tell you how much memory it is using. Look in the Properties window: Size tells you how much space the data set itself takes up, and Memory tells you the total amount of memory Stata is currently using.
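You can also get these numbers from the command line. The following is a minimal sketch, assuming a reasonably recent version of Stata (the memory command was introduced in Stata 12) and using the auto data set that ships with Stata as a stand-in for your own data. The scalar name dataBytes is just an illustration.

```stata
* Example data that ships with Stata; substitute your own data set
sysuse auto, clear

* describe stores the number of observations in r(N) and the
* width of one observation, in bytes, in r(width)
quietly describe
scalar dataBytes = r(N) * r(width)
display "Approximate size of the data in memory: " dataBytes " bytes"

* memory gives a detailed report of what Stata has allocated
memory
```

Multiplying observations by width gives the size of the data itself. The memory report will show a larger total because Stata also needs some memory for overhead.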
How much is too much?
That depends on the computer you're using:
|Typical laptop or desktop
|SSCC Condor Server
*This is a policy limit: we ask you not to use more than 80GB of memory on Winstat or SiloLDS. The servers have more than 80GB, but it must be shared with others, and these servers are very sensitive to running out of memory.
**Depending on which Condor server your job is assigned to.
How much memory you need also depends on what you plan to do. Obviously if you plan to add variables or observations to your data set you'll need more memory. You should start paying attention any time you're using more than about half of the memory available to you.
Reducing the Size of Your Data Set
There are several things you can do that will probably shrink your data set.
Drop Unneeded Data
If there are variables or observations in your data set that you will not use, use the drop command to get rid of them (or keep if that's easier). You can always get them back later by changing the data wrangling do file that dropped them and running it again.
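As a short sketch of both approaches, using the auto data set that ships with Stata as a stand-in (which variables and observations to drop is an assumption made purely for illustration):

```stata
* Example data that ships with Stata
sysuse auto, clear

* Drop variables you will not use
drop headroom trunk

* Drop observations you will not use
drop if missing(rep78)

* Or list what you want to keep, if that list is shorter
keep make price mpg rep78
```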
If the full data set is too large to load at all, you can load just the part you want by giving the use command a variable list to act on and/or an if condition. If you do this, then the name of the data set goes after the word using. For example, the following will only load variables x and y and observations where x is not missing from a data set called bigdata.
use x y if x<. using bigdata
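The infile and infix commands accept similar if and in qualifiers for reading in part of a text file, and import delimited has rowrange() and colrange() options that serve the same purpose.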
This probably shouldn't be your permanent solution, because when you load a subset of a data set Stata must still read the entire data set to find the parts to be loaded. You'll be able to load the subset more quickly in the future if you save it as its own file.
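The load-a-subset-then-save pattern might look like the following sketch. To keep it self-contained, it saves the auto data set to a temporary file that plays the role of the big data set. In real work you would use your own file names in place of the bigdata and subset temporary files, which are assumptions of this example.

```stata
* Create a stand-in for the big data set (for illustration only)
sysuse auto, clear
tempfile bigdata subset
save "`bigdata'"

* Load only three variables, and only observations where price is not missing
use make price mpg if price < . using "`bigdata'", clear

* Save the subset as its own file so it loads quickly next time
save "`subset'"

* Later do files can start from the smaller file
use "`subset'", clear
```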
Use Smaller Variable Types
For most people, the amount of memory or disk space saved by thinking about variable types isn't worth the effort. But if you work with big data sets, it helps to know that Stata has five different types of numeric variables, and using the right one can save a significant amount of memory. Three of these are integer types, distinguished by the range of numbers they can store:
|Type|Range|Bytes of Memory used
|byte|-100 to 100|1
|int|-32,000 to 32,000|2
|long|-2,000,000,000 to 2,000,000,000|4
Type help datatypes for more details, including the exact ranges, but these are easy to remember.
Two variable types, float and double, store numbers with fractions. They can both store very large numbers, but differ in how many digits of accuracy they have:
|Type|Digits of Accuracy|Bytes of Memory used
|float|7|4
|double|16|8
The default type is float. To create a variable with a type other than float, specify the type right after the gen command and before the variable name. So instead of:
gen adult = (age>=18)
use:
gen byte adult = (age>=18)
Note that if you tell Stata to make a variable an integer type, it will discard any fractional part. If you run:
gen int x = 1.9
then x will be created and set to 1.
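The choice of type matters for accuracy as well as memory. The sketch below stores a nine-digit number (think of an ID number) in several types. The variable names are made up for illustration. Because a float only has about seven digits of accuracy, it cannot store the number exactly, while a long or a double can.

```stata
* Start with an empty data set containing one observation
clear
set obs 1

* The default type, float, cannot hold nine digits exactly
gen idFloat = 123456789

* long and double can
gen long idLong = 123456789
gen double idDouble = 123456789

* Integer types discard any fractional part
gen int truncated = 1.9

* Show all the digits rather than scientific notation
format idFloat idLong idDouble %12.0f
list
```

Both float and long use four bytes, so for large integers like these, long gives you exact values at no extra cost in memory.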
The compress command will examine the data in memory, determine if any variables can be stored in a smaller data type without losing any precision, and convert those that can be. (It has nothing to do with compressing files on disk.) Use it early in your project to compress the data you start with. But you can also run it periodically as an alternative to thinking carefully about the proper type for each new variable you create.
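Here is a small sketch of compress at work, again using the auto data set as a stand-in. The variable expensive and the scalars widthBefore and widthAfter are names invented for this example. The width reported by describe is the number of bytes used per observation.

```stata
* Example data that ships with Stata
sysuse auto, clear

* Created as a float (4 bytes) even though it only contains 0 and 1
gen expensive = (price > 6000)

* Record the bytes per observation before compressing
quietly describe
scalar widthBefore = r(width)

* Convert every variable to the smallest type that loses no precision
compress

* Record the bytes per observation afterward
quietly describe
scalar widthAfter = r(width)

display "Bytes per observation before: " widthBefore ", after: " widthAfter
```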
Shorten Strings or Encode Them
Strings require one byte of memory per character for western (ASCII) characters. However, string variables are the same length for all observations. Thus if you have a string variable that contains "Yes", "No", or "I don't know" then the variable will be set to length 12 and use 12 bytes per observation so it can store "I don't know".
If you changed "I don't know" to "DK", then the string only needs to use 3 bytes per observation (for "Yes"). If you changed the three values to "Y", "N", and "D", then it only needs to use 1 byte, though the meaning of "D" is not at all obvious. However, Stata does not actually shrink string variable types when you shorten their values. Run the compress command to actually shrink the variable types.
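The following sketch illustrates this with a tiny made-up data set. The variable name answer is an assumption of the example.

```stata
* Build a small example data set with a 12-character string variable
clear
input str12 answer
"Yes"
"No"
"I don't know"
end

* Shorten the longest value
replace answer = "DK" if answer == "I don't know"

* The variable is still a str12 at this point
describe answer

* compress shrinks it to the length of the longest remaining value
compress
describe answer
```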
Encoding a string variable containing "Yes", "No", and "I don't know" as a numeric variable containing 1, 2, and 3 will also reduce its memory usage to 1 byte per observation, and you can set value labels containing the full content of the string. If the string represents a categorical variable, encoding it will allow you to use it in analysis. The encode command will create the numeric variable and set the value labels for you, and we recommend doing so. Note that encode creates the new variable as a long, so run compress afterward to shrink it to a byte.
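A minimal sketch of encoding, using a made-up data set. The names answer and answerCode are assumptions of the example. Note that encode assigns the numbers in alphabetical order of the string values, so check the value labels rather than assuming which number goes with which answer.

```stata
* Build a small example data set with a string variable
clear
input str12 answer
"Yes"
"No"
"I don't know"
"Yes"
end

* Create a labeled numeric version of the string variable
encode answer, generate(answerCode)

* The string version is no longer needed
drop answer

* Shrink the new variable to the smallest type that can hold it
compress

* See which number was assigned to each string
label list answerCode
list
```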
Drop Intermediate Results
If you create variables to store intermediate results, drop them as soon as you're done with them. For example, the following code creates a variable called incomePovertyRatio just so it can create an indicator variable lowIncome that identifies subjects whose income is less than 150% of the poverty level:
gen incomePovertyRatio = income/povertyLevel
gen lowIncome = (incomePovertyRatio < 1.5)
Since incomePovertyRatio is only needed to create lowIncome, you can drop it as soon as lowIncome has been created.
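Put together, that might look like the sketch below. The three observations are made-up numbers used only so the example can run on its own.

```stata
* Made-up example data
clear
input income povertyLevel
20000 15000
50000 15000
18000 12880
end

* Intermediate result
gen incomePovertyRatio = income/povertyLevel

* The variable we actually want; an indicator fits in a byte
gen byte lowIncome = (incomePovertyRatio < 1.5)

* Free the memory used by the intermediate result right away
drop incomePovertyRatio
```

In a case this simple you could also skip the intermediate variable entirely by putting income/povertyLevel directly in the expression that creates lowIncome.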
Break the Data into Smaller Pieces
If a data set is too big to load into memory, for some tasks you can break it into a set of smaller data sets and work on them one at a time. There might be a categorical variable in the data set such that a separate data set for each category would work well, or you can break it up by observation number. You'll then want to use loops to act on all the individual data sets: Stata Programming Essentials will teach you the basics of loops and Stata Programming Tools briefly discusses looping over a list of files.
However, many tasks, including almost all analysis, need the entire data set to be loaded into memory. Breaking the data set into smaller pieces probably only makes sense if you can shrink the size of each piece so that in the end you can combine them all into a single data set that can be loaded into memory.
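The following sketch shows one way the split, shrink, and recombine process could work. It assumes the data can be split by a categorical variable, and it uses the auto data set and its foreign variable as stand-ins. The temporary files full, piece0, and piece1 are names invented for the example. In real work the pieces would be regular files on disk and the full data set would never be loaded all at once.

```stata
* Create a stand-in for a data set too big to load (for illustration only)
sysuse auto, clear
tempfile full
save "`full'"

* Work on one category at a time
forvalues f = 0/1 {
    * Load only the observations in this category
    use if foreign == `f' using "`full'", clear

    * Shrink the piece: keep what you need and use the smallest types
    keep make price foreign
    compress

    * Save the smaller piece
    tempfile piece`f'
    save "`piece`f''"
}

* Combine the shrunken pieces into a single data set
use "`piece0'", clear
append using "`piece1'"
```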
Last Revised: 12/2/2021