Supporting Statistical Analysis for Research

# 1 Explicit Loops

It can be easiest to think about a repetition problem in explicit loops (for loops), because all of the mechanics of statement execution remain visible in your code. We’ll want to think about what steps are repeated (the body of a for loop), what data objects vary across repetitions and how we want to sequence them, and if we are collecting results, where to place them.

• repeated statements
• varying data
• a results object

## 1.1 Loop over variables in a data frame

One of the most common situations you will encounter is where you want to repeat the same actions for several variables in a data frame.

Suppose you wanted a table with means, standard deviations, and the number of non-missing observations of all your analysis variables. As an example, let’s look at a pared down version of the mtcars data.

cars <- mtcars[, c("mpg", "wt", "disp", "hp")]

Calculating the individual values for a table is easy enough.

mean(cars$mpg) [1] 20.09062 sd(cars$mpg)
[1] 6.026948
sum(!is.na(cars$mpg)) [1] 32 These are the statements we want to repeat, while the column in the data frame varies. We even have a function that will return all the means as a single data object … but nothing similar for standard deviations or counts. colMeans(cars)  mpg wt disp hp 20.09062 3.21725 230.72188 146.68750  To get something similar for standard deviations and for counts we can loop over the variables. We’ll use a for loop (see help("for")) for (var in seq) expression The keywords for and in are required, as are the parentheses. ### 1.1.1 Loop over an index Consider this code: for (i in 1:4) { print(i) print(sd(cars[,i])) } [1] 1 [1] 6.026948 [1] 2 [1] 0.9784574 [1] 3 [1] 123.9387 [1] 4 [1] 68.56287 Here seq is a vector (a “sequence”) of variable positions, 1:4. The var is i, a variable that will take on the value of each element in seq. In this example, each i will be the position of a variable in cars. The expression being evaluated is a compound expression, i.e. the two expressions enclosed in braces, in order. { print(i) print(sd(cars[,i])) } We make the print explicit here, because data objects are only printed automatically at the “top” level of execution. To save our results as a table, we’ll want to give the table a name. One way to do this here would be to place the standard deviations within a vector, v. In order to address position i in v efficiently, we need to setup (“initialize”) the vector first. In order to make sense of the result, we will also name the elements of our vector. v <- rep(NA, 4) names(v) <- names(cars) for (i in 1:4) { v[i] <- sd(cars[,i]) } v  mpg wt disp hp 6.0269481 0.9784574 123.9386938 68.5628685  Because our expression is now a single statement, the braces are no longer necessary, and this could even be written on one line. (I think using braces makes it easier to read, even when written on one line. That makes it easier to debug if you make a mistake.) for (i in 1:4) v[i] <- sd(cars[,i]) At this point we might take our vector of column means and our vector of standard deviations and combine them in a data frame or a matrix. To complete our table with observation counts we could write yet another loop. Or we could think about doing all of this work with a single loop. For the latter approach we want to collect our results in a matrix with named dimensions. v <- matrix(NA, nrow=4, ncol=3) rownames(v) <- names(cars) # variable names for row names colnames(v) <- c("mean", "sd", "N") # statistics names for col names for (i in 1:4) { v[i, "mean"] <- mean(cars[,i]) v[i, "sd"] <- sd(cars[,i]) v[i, "N"] <- sum(!is.na(cars[,i])) } v  mean sd N mpg 20.09062 6.0269481 32 wt 3.21725 0.9784574 32 disp 230.72188 123.9386938 32 hp 146.68750 68.5628685 32 ### 1.1.2 Loop over variable names Because a data frame always has named columns, we could just as easily use names as positions to index a data frame. This can be especially helpful if we are trying to analyze a few selected variables from a larger data frame (for example, using mtcars.) Here, the sequence is a character vector of column names. analysis_vars <- c("mpg", "wt", "disp", "hp") for (i in analysis_vars) { print(i) print(sd(mtcars[,i])) } [1] "mpg" [1] 6.026948 [1] "wt" [1] 0.9784574 [1] "disp" [1] 123.9387 [1] "hp" [1] 68.56287 Our code is very similar to indexing by position, without needing to create a subset of our data frame or trying to count columns. v <- matrix(NA, nrow=length(analysis_vars), ncol=3) rownames(v) <- analysis_vars colnames(v) <- c("mean", "sd", "N") for (i in analysis_vars) { v[i, "mean"] <- mean(mtcars[,i]) v[i, "sd"] <- sd(mtcars[,i]) v[i, "N"] <- sum(!is.na(mtcars[,i])) } v  mean sd N mpg 20.09062 6.0269481 32 wt 3.21725 0.9784574 32 disp 230.72188 123.9386938 32 hp 146.68750 68.5628685 32 ## 1.2 Loops over rows in a data frame Because our binary operators and many other functions are “vectorized”, we often don’t need to think about looping over the rows of a data frame. Code like dfr <- data.frame(x=rnorm(15), y=runif(15)) dfr$z <- dfr$x + dfr$y

is implicitly looping over the rows of dfr. But occasionally you will find you need to operate within each row, and simple vectorization will not work.

As an example consider the situation where survey respondents have answered five related questions, each on a scale of 1 to 5. You want to construct a scale that is the mean of each person’s responses, but there are some missing answers.

# simulate data for this problem
set.seed(20210205)
q <- as.data.frame(matrix(sample(c(1:5,NA), 35, replace=TRUE), ncol=5))
q
  V1 V2 V3 V4 V5
1  4  2  3  1  3
2  5  3  1 NA  5
3 NA  3  5  3 NA
4  5  5  3  2  5
5  3  5  2 NA NA
6  5  5  1 NA  1
7 NA  1  3  5  1

The vectorized approach will only work where there are no missing values.

(V1 + V2 + V3 + V4 + V5)/5

With a loop, we can make use of the mean function and it’s na.rm argument.

Vscale        <- rep(NA, nrow(q))
names(Vscale) <- row.names(q)

for (i in 1:nrow(q)) {
v <- as.matrix(q[i,])
# an odd coercion problem here
# because mean() does not have a data.frame method
Vscale[i] <- mean(v, na.rm=TRUE)
}

Vscale
       1        2        3        4        5        6        7
2.600000 3.500000 3.666667 4.000000 3.333333 3.000000 2.500000 

(For this specific problem of row means, there are two more graceful solutions. One is the rowMeans function; the other is to apply the mean function, as discussed below.)

## 1.3 Loop over groups

When exploring relationships in our data, we often want to analyze one variable within groups denoted by values in another, categorical variable.

Suppose we wanted to see the means and standard deviations of mpg within different levels of the cyl variable of mtcars.

As the data set is given, cyl is a numeric variable, and we could loop over a vector of unique numeric values

cylvals <- unique(mtcars$cyl) cylvals [1] 6 4 8 The value order here doesn’t matter for running a for loop, but we do need to pay attention in order to properly interpret our results! And these values would not be convenient for indexing a results table. We could also coerce cyl to a factor, since that is how we are using it in this analysis. Then we could loop over the levels of the factor. cars <- mtcars cars$cyl <- factor(mtcars$cyl) cyllevels <- levels(cars$cyl)
cyllevels
[1] "4" "6" "8"

Here, keep in mind that these levels are value labels, not numbers!

cyltable <- matrix(NA, nrow=nlevels(cars$cyl), ncol=2) # dimnames() print more nicely than rownames() and colnames() dimnames(cyltable) <- list(cyls=cyllevels, stats=c("mean", "sd")) for (i in cyllevels) { v <- cars$mpg[cars\$cyl==i]
cyltable[i, "mean"] <- mean(v)
cyltable[i, "sd"]   <- sd(v)
}

cyltable
    stats
cyls     mean       sd
4 26.66364 4.509828
6 19.74286 1.453567
8 15.10000 2.560048

## 1.4 Loop over parameter values

We are not limited to sequences of integers, nor do the numbers have to represent column positions. The sequence can be any arbitrary vector, in any arbitrary order.

Suppose we wanted to simulate random samples of various sizes in order to better understand how the standard error of the mean behaves.

for (n in c(5, 10, 50, 100, 500, 1000)) {
cat(n, "\n")   # use cat() instead of print() for simpler output
cat(sd(rnorm(n)/sqrt(n)), "\n")
cat("\n")
}
5
0.3710791

10
0.2181523

50
0.1489542

100
0.09735525

500
0.0420442

1000
0.03186828 

While this example gives us the general idea that bigger samples yield smaller standard errors, we should really replicate these samples many times. For this we can use a loop within a loop.

samples <- c(5, 10, 50, 100, 500, 1000)
reps  <- 200

results <- matrix(NA, ncol=length(samples), nrow=reps,
dimnames=list(rep=1:reps,
size=samples))

for (n in samples) {
for (r in 1:reps){
results[r, as.character(n)] <- (sd(rnorm(n))/sqrt(n))
}
}

colMeans(results)
         5         10         50        100        500       1000
0.41996018 0.30445001 0.14096484 0.10018702 0.04454776 0.03158740 

And we might notice that a 100-fold increase in sample size gives us an extra decimal place of precision in our estimates.

## 1.5 Loop over data objects

Our sequence does not have to be a vector. Lists are also ordered objects, so it is possible to loop over the items in a list. Understood as a list, a data frame is a list of vectors. So our standard deviation problem could be approached by

cars <- mtcars[, 1:4]

for (i in cars) {
print(sd(i))
}
[1] 6.026948
[1] 1.785922
[1] 123.9387
[1] 68.56287

This gives us very clean and simple code. However, this leaves us without a convenient index for collecting results - this technique is most useful when we aren’t saving results as a single data object.