Supporting Statistical Analysis for Research

2 Apply Functions

Writing for loops can often be abstracted into two related tasks:

• Writing down the steps to loop over, the loop “body”
• Setting up an object to collect the result, “initialization”

When writing for loops itself becomes repetitive (that is, when we have many similar loops), the first task can be handled by writing functions. The second task, “initialization”, can be handled by the apply family of functions, which build the result object for us. We can

• apply a function to columns of a data frame or matrix
• apply a function to rows of a data frame or matrix
• tapply a function to groups of values
• lapply a function to items in a list

(and more)
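To make the two tasks concrete, here is a minimal sketch that computes the column means of mtcars twice, once with an explicit for loop and once with apply. The object names (loop_means, apply_means) are illustrative only.

```r
# Explicit loop: we create the result object and fill it in ourselves.
loop_means <- numeric(ncol(mtcars))        # "initialization"
names(loop_means) <- names(mtcars)
for (j in seq_along(mtcars)) {
  loop_means[j] <- mean(mtcars[[j]])       # loop "body"
}

# Apply style: the function plays the role of the body,
# and the result object is built for us.
apply_means <- apply(mtcars, 2, mean)

# Compare the two results
all.equal(loop_means, apply_means)
```

Both versions do the same work; the apply version simply leaves the bookkeeping to R.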

2.1 Apply a function to variables in a data frame

Returning to the table of means and standard deviations, we have

apply(X, MARGIN, FUN, ...)

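where X is a data frame or matrix, MARGIN determines whether you are looping over columns (2) or rows (1), and FUN is the function you wish to employ. Note that apply converts a data frame to a matrix before looping, so it works best when all of the columns are numeric, as they are in mtcars.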

cmeans <- apply(mtcars, 2, mean)
csds   <- apply(mtcars, 2, sd)

data.frame(means=cmeans, stddev=csds)
          means      stddev
mpg   20.090625   6.0269481
cyl    6.187500   1.7859216
disp 230.721875 123.9386938
hp   146.687500  68.5628685
drat   3.596563   0.5346787
wt     3.217250   0.9784574
qsec  17.848750   1.7869432
vs     0.437500   0.5040161
am     0.406250   0.4989909
gear   3.687500   0.7378041
carb   2.812500   1.6152000

Here, each use of apply returns a named vector automatically.
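To see what MARGIN does in isolation, the following sketch uses a small made-up matrix (m is a hypothetical example object, not part of the data used elsewhere in this chapter).

```r
# A small hypothetical matrix: 2 rows, 3 columns, filled column by column.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

row_sums <- apply(m, 1, sum)   # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)   # MARGIN = 2: one result per column

row_sums   # named by row:    r1 = 9, r2 = 12
col_sums   # named by column: a = 3, b = 7, c = 11
```

The names of the result come from the dimension that was looped over.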

The ... (ellipsis) in our syntax diagram indicates that we can include additional arguments, which are passed on to the function FUN. So if we return to our simulated survey responses with missing values, we can write

qmeans <- apply(q, 2, mean, na.rm=TRUE)
qsds   <- apply(q, 2, sd,   na.rm=TRUE)

data.frame(means=qmeans, stddev=qsds)
      means    stddev
V1 4.400000 0.8944272
V2 3.428571 1.6183472
V3 2.571429 1.3972763
V4 2.750000 1.7078251
V5 3.000000 2.0000000
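As a small self-contained illustration of how extra arguments travel through ..., the sketch below uses a made-up data frame. Here resp is a hypothetical stand-in and is not the q object used above.

```r
# Hypothetical stand-in for survey responses; NOT the q object used above.
resp <- data.frame(V1 = c(5, 4, NA), V2 = c(1, NA, 3))

# Without na.rm, any NA in a column makes that column's mean NA.
na_means <- apply(resp, 2, mean)

# na.rm = TRUE is not an argument of apply; it is handed on to mean().
clean_means <- apply(resp, 2, mean, na.rm = TRUE)

# Equivalent: spell the extra argument out in an anonymous function.
anon_means <- apply(resp, 2, function(x) mean(x, na.rm = TRUE))

na_means
clean_means
```

The anonymous-function form is longer, but it is handy when the function you want needs more than a few extra arguments.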

2.2 Apply a function to rows in a data frame

This is simply a matter of changing the MARGIN.

qrowmeans <- apply(q, 1, mean, na.rm=TRUE)
qrowsds   <- apply(q, 1, sd,   na.rm=TRUE)

data.frame(means=qrowmeans, stddev=qrowsds)

The result now has one row for each row of q, rather than one row for each column (V1 to V5).
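A minimal row-wise sketch, again on made-up data (the matrix resp_m is hypothetical), shows the shape of the result. For the common case of row means, base R also provides rowMeans, which should agree with the apply version.

```r
# Hypothetical responses: 3 respondents (rows) by 3 questions (columns).
resp_m <- matrix(c(5, 4, NA,
                   1, NA, 3,
                   2, 2, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("V1", "V2", "V3")))

row_means <- apply(resp_m, 1, mean, na.rm = TRUE)
row_sds   <- apply(resp_m, 1, sd,   na.rm = TRUE)

# One row of output per respondent
data.frame(means = row_means, stddev = row_sds)

# rowMeans is a specialized base R shortcut for the first calculation
all.equal(row_means, rowMeans(resp_m, na.rm = TRUE))
```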

2.3 Apply a function to groups of observations

Returning to the problem of means and standard deviations within groups defined by cyl in mtcars, we switch to tapply.

tapply(X, INDEX, FUN, ...)

Here X is usually a vector, and the INDEX is a factor, something that can be coerced into a factor, or a list of factors.

mean_bycyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
sd_bycyl   <- tapply(mtcars$mpg, mtcars$cyl, sd)

data.frame(mpg_mean=mean_bycyl, mpg_sd=sd_bycyl)
  mpg_mean   mpg_sd
4 26.66364 4.509828
6 19.74286 1.453567
8 15.10000 2.560048
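When INDEX is a list of factors, tapply cross-classifies the groups. The sketch below groups mpg by both cyl and am; the object name mean_by_cyl_am and the list names are illustrative choices.

```r
# Group mpg by two variables at once: cylinders and transmission type.
mean_by_cyl_am <- tapply(mtcars$mpg,
                         list(cyl = mtcars$cyl, am = mtcars$am),
                         mean)

mean_by_cyl_am        # a matrix: rows are cyl groups, columns are am groups
dim(mean_by_cyl_am)   # 3 levels of cyl by 2 levels of am
```

With one grouping factor the result is a named vector; with two it is a matrix, with one cell per combination of groups.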

2.4 Apply a function to a list

Here we have two functions: the aptly named lapply, and sapply. The former always returns a list; the latter simplifies its result where it can, and will (usually) return a named vector.

lapply(X, FUN, ...)
sapply(X, FUN, ...)

means_list   <- lapply(mtcars, mean)
means_vector <- sapply(mtcars, mean)

means_vector
       mpg        cyl       disp         hp       drat         wt       qsec
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750
        vs         am       gear       carb
  0.437500   0.406250   3.687500   2.812500

means_list
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725

$qsec
[1] 17.84875

$vs
[1] 0.4375

$am
[1] 0.40625

$gear
[1] 3.6875

$carb
[1] 2.8125
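The "(usually)" above matters when FUN returns more than one value. The sketch below shows sapply simplifying to a matrix in that case. It also shows vapply, a related base R function not otherwise covered in this section, which asks you to declare the shape of each result in advance. The object names are illustrative.

```r
# Two values per column: sapply simplifies to a matrix, not a vector.
stats_matrix <- sapply(mtcars, function(x) c(mean = mean(x), sd = sd(x)))
dim(stats_matrix)   # 2 rows (mean, sd) by 11 columns (one per variable)

# Transposing gives one row per variable, like the earlier tables.
stats_table <- t(stats_matrix)

# vapply behaves like sapply, but fails loudly if FUN does not return
# what FUN.VALUE promises (here, a single number per column).
means_checked <- vapply(mtcars, mean, FUN.VALUE = numeric(1))
```

In scripts where the shape of the result must be predictable, lapply or vapply is often a safer choice than sapply.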