SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

3 Writing Functions - Basics

Eventually you will find that you want to write your own functions, and R is designed to make this very easy. For many of us this first comes up when we have several steps we want to do in sequence (an algorithm) using one of the apply functions. These steps might be a sequence of R expressions, or they might even be a nested sequence of functions that we want to refer to by a simple name.

As an example, consider a function to count the missing values (NAs) in a vector. We might apply this to the columns of a data frame, to the rows, or within groups specified by the values of some other vector.

# setup, a matrix with about 20% missing data
set.seed(20141117)
dm <- matrix(sample(0:9, 100, replace=TRUE), ncol=10)
dm[dm<2] <- NA
dm
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    4    2    9    5    6   NA    9    5    5     2
 [2,]    2    6    2    6    3    7    7    9    9     4
 [3,]   NA    4    7    4    4    7    9   NA    5     8
 [4,]   NA   NA    8    2    9    8    9   NA    5     7
 [5,]    6    3   NA    7    3   NA    7    4    7     3
 [6,]   NA    5    3    3    9   NA    5   NA    6    NA
 [7,]    3    8    3   NA    6   NA    7   NA   NA     3
 [8,]    7    7   NA    4    2    9    8   NA    8     9
 [9,]    8   NA    6    5    3    6    8   NA   NA    NA
[10,]   NA    3    9    5    3    2    9    5    5     6

We can count the total number of missing values in our matrix:

sum(is.na(dm))
[1] 23

But we have a problem using this code when we try to count NAs within each column:

apply(dm, 2, sum(is.na))
Error in sum(is.na): invalid 'type' (builtin) of argument

What we need is a function to use with apply.

3.1 Defining a New function

Defining a function is pretty simple, really. Typically a function has a name, an argument list (parameters, or "formals"), a body (the expressions that act on the arguments), and a return value.

name <- function(arg1, arg2 ...){
    expression(arg1)
    ...
    value <- expression
    return(value)
}

(A good place to look for basic documentation on writing functions is in An Introduction to R, "Chapter 10: Writing your own functions". See also help("function").)

In our example, sum(is.na()) will be our expression or body, as our argument we will use v to stand for an arbitrary data object, and we want to return a scalar count that we'll call rv. we'll give this function the name nmiss.

nmiss <- function( v) {
  rv <- sum(is.na(v))
  return(rv)
}

# Then use the function with the matrix
nmiss(dm)
[1] 23
# and use the function with apply()
apply(dm,1,nmiss) # missing per row
 [1] 1 0 2 3 2 4 4 2 4 1

We can store these results in another object, in the usual way

dm.missing <- nmiss(dm)
dm.missing
[1] 23

First notice that a new object, nmiss, has been added to our workspace, the global environment.

Notice also that the objects v and rv (two arbitrary names for data objects, i.e. two placeholders in our definition) do not appear in our workspace. Think of them as local to the nmiss function, or as objects within the nmiss enclosure or environment.

As objects in our workspace, functions have class, and they can be printed, which shows us the details of how they were defined. You can do this with any function, not just those you define yourself!

class(nmiss) # functions have class
[1] "function"
nmiss        # as an "object" it can be printed
function( v) {
  rv <- sum(is.na(v))
  return(rv)
}
<bytecode: 0x00000000148701c0>

3.1.1 More About Returns

Either the return() object, or the last expression evaluated is returned by the function.

For example, this is a common way of specifying "rv" as the returned object:

nmiss <- function( v) {
  rv <- sum(is.na(v))
  rv
  #return(rv)
}

nmiss(dm)
[1] 23

This does NOT work, returning nothing (not even an error!, just a NULL value):

nmiss <- function( v) {
  rv <- sum(is.na(v))
}
nmiss(dm)

The last expression evaluated was an assignment, "<-", a function used for it's side effect.

But the following example does return what we want. We don't need to name the object we want to return, we can simply return the value of the last expression evaluated.

nmiss <- function( v) {
  sum(is.na(v))
}
nmiss(dm)
[1] 23

3.1.2 One-liners and Anonymous Functions

Notice the last example could be written on one line:

nmiss <- function( v) { sum(is.na(v))}
nmiss(dm)
[1] 23

And because the body is a single expression we don't actually need braces for compound expressions:

nmiss <- function( v) sum(is.na(v))
nmiss(dm)
[1] 23

It is not uncommon to see simple functions both defined and used within the same expression:

apply(dm, 1, function( v) sum(is.na(v))) # number of NAs per row
 [1] 1 0 2 3 2 4 4 2 4 1

The last example is often called an "anonymous function".

Here is another anonymous function (as an exercise, trace the order in which the various objects and functions are evaluated):

(function( v) sum(is.na(v)))(dm)
[1] 23

3.1.3 Style

Arguably, skipping the return makes one-liners and anonymous functions easier to read and debug. But as you move on to more complicated functions, especially those that may conditionally have different sorts of return value, you will find it easier to read and debug if you do include the return expressions.


3.2 Exercises

  1. In the Motor Trend car tests data, mtcars, vehicle weight, wt, is reported in thousands of pounds. Write a function that converts thousands-of-pounds to kilgrams. Bonus: show (by calculation) that this leaves the correlation between wt and mpg unaffected.

  2. The same data set reports fuel consumption in miles per gallon. Write a function that converts this to kilometers per liter. Bonus: show that this conversion still leaves the correlation unaffected.

  3. R does not have a standard-error-of-the-mean function. Write one, then produce a table of means and standard errors for the variables in the mtcars data.

  4. When working with time-series data it is often useful to identify gaps in the series. Write a function that identifies gaps by indicating observations preceded by a gap. For example:

 1990  1991  1992  1993  1994  1997  1998  1999  2000 
   NA FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE 

Bonus: write a function that fills out the gap.

  1. In survey data, it is common for data to be coded with 8 = don't know and 9 = refused to answer. We need to convert these to NAs for most statistical work. Write a function that takes a vector or matrix as input, and returns the recoded vector/matrix. For example:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    2    9    5    6   NA    9    5    5     2
[2,]    2    6    2    6    3    7    7    9    9     4
[3,]   NA    4    7    4    4    7    9   NA    5     8
[4,]   NA   NA    8    2    9    8    9   NA    5     7
[5,]    6    3   NA    7    3   NA    7    4    7     3
[6,]   NA    5    3    3    9   NA    5   NA    6    NA
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    2   NA    5    6   NA   NA    5    5     2
[2,]    2    6    2    6    3    7    7   NA   NA     4
[3,]   NA    4    7    4    4    7   NA   NA    5    NA
[4,]   NA   NA   NA    2   NA   NA   NA   NA    5     7
[5,]    6    3   NA    7    3   NA    7    4    7     3
[6,]   NA    5    3    3   NA   NA    5   NA    6    NA

Note: This is a simplification of a couple of existing R functions. Functions such as the one you are asked to produce are often termed "convenience" functions or "wrappers". Depending on how you solve this problem, the function you simplify may itself be a convenience wrapper - take a look at the code inside the function you use!

  1. Average compounded growth. Given two vectors, one of starting values (say, starting salary) and another of ending values (current salary), we often want to characterize the growth rate that led from one to the other. If we additionally know how many growth periods there were between each pair of values, we can calculate an average growth rate that takes compounding into account.

Write a function that returns the average growth rate as a fraction. In other words, if y = ending value, x=starting value, t=number of time periods, and r = growth multiplier, return r-1. The fundamental relation is \[y=x*(r)^t\]

  1. Write a function returning "degree of consensus." 1 - (variance of respondents)/(max possible variance), where the respondents have given answers on some bounded scale (e.g. a Likert scale from 1 to 5). Total consensus = 1, maximum disagreement = 0.