SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

Introduction

In a lot of data cleaning and analysis, especially exploratory analysis, you reach a point where you find yourself reusing your code by copying, pasting, and editing a variable name.

For example, if I wanted to calculate the mean, standard deviation, and number of non-missing observations for several variables in a data frame, I might start by writing:

mean(mtcars$mpg)
sd(mtcars$mpg)
sum(!is.na(mtcars$mpg))

# copy, paste, and edit the above:
mean(mtcars$wt)
sd(mtcars$wt)
sum(!is.na(mtcars$wt))

Doing this once or twice is not a horrible thing: it’s pretty quick, it doesn’t require a lot of thought, and your code might still be pretty readable. But by the fourth or fifth time you do this, it is no longer so quick and your code is becoming harder to read. Additionally, if you ever need to make a change (such as fixing a bug), you now have to remember to make it in four or five places.

This is usually where it becomes worth your while to think through and implement loops, and to package multiple steps as functions.

In R we have two major approaches to using loops: explicit for loops, and the apply family of functions.
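As a rough illustration of the two approaches, the sketch below computes the mean of a few mtcars columns both ways. The choice of columns (including hp, which does not appear in the example above) and the object names vars, means_loop, and means_apply are assumptions made only for this sketch.

```r
# Columns to summarize; mtcars ships with base R.
# The selection of columns here is illustrative only.
vars <- c("mpg", "wt", "hp")

# Approach 1: explicit for loop.
# Preallocate a named result vector, then fill it one variable at a time.
means_loop <- numeric(length(vars))
names(means_loop) <- vars
for (v in vars) {
  means_loop[v] <- mean(mtcars[[v]])
}

# Approach 2: the apply family.
# sapply() calls mean() on each selected column and returns a named vector.
means_apply <- sapply(mtcars[vars], mean)

means_loop
means_apply
```

Both objects should hold the same named vector of means; the sapply() version simply leaves the bookkeeping (preallocating and indexing) to R.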

In addition, when we want to loop through different data with the same sequence of steps, R makes it simple to define these as functions.
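One possible way to package the three steps from the opening example is sketched below. The function name summarize_var is hypothetical, and the na.rm = TRUE arguments are an added assumption so that the mean and standard deviation stay consistent with the count of non-missing observations.

```r
# Hypothetical helper (not from the original text): bundles the three
# summary steps so they can be reused on any numeric vector.
summarize_var <- function(x) {
  c(
    mean = mean(x, na.rm = TRUE),  # assumption: ignore missing values
    sd   = sd(x, na.rm = TRUE),    # assumption: ignore missing values
    n    = sum(!is.na(x))          # number of non-missing observations
  )
}

# Use it on a single variable ...
summarize_var(mtcars$mpg)

# ... or on several columns at once, giving one column of results per variable.
summary_table <- sapply(mtcars[c("mpg", "wt")], summarize_var)
summary_table
```

With the steps defined once, a later change (for example, adding a median) only has to be made inside the function.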

  • Explicit Loops
  • Applying Functions
  • Writing Functions