SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4 Writing Functions - Refinements

4.0.1 Conditional Returns

Sometimes we want a function to return different sorts of values, conditionally. An example (after SPSS) is a function which calculates the mean of several values as long as there are not too many missing values. The user gets to specify how many missing values can be ignored.

Here, we want our function to sometimes return a numeric value, the mean calculated with NAs dropped, and sometimes to just return an NA.

mean.n <- function (x, max.na) {
  if (nmiss(x) <= max.na) {
    rv <- mean(x, na.rm=TRUE)
  } else {
    rv <- NA
  }
  return(rv)
}

mean.n(dm[1,], 2) # numeric return
[1] 5.222222
mean.n(dm[9,], 2) # NA return
[1] NA

Because the last expression evaluated is returned, this could be written more succinctly as

mean.n <- function (x, max.na) {
  if (nmiss(x) <= max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}

mean.n(dm[1,], 2) # numeric return
[1] 5.222222
mean.n(dm[9,], 2) # NA return
[1] NA
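
One detail worth knowing: a bare NA is a logical value, so the two branches above return different types. The following is a minimal sketch, assuming we want a numeric result in both cases, that uses NA_real_ instead. The nmiss helper defined here is a stand-in for the one used above (assumed to count missing values), and the vector v is made-up data.

```r
# Stand-in for the nmiss helper used above (assumed to count NAs)
nmiss <- function (x) sum(is.na(x))

mean.n <- function (x, max.na) {
  if (nmiss(x) <= max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA_real_   # numeric NA, so both branches return type "double"
  }
}

v <- c(2, 4, NA, 6)     # made-up data with one missing value
r1 <- mean.n(v, 1)      # one NA allowed: mean of 2, 4, 6
r2 <- mean.n(v, 0)      # no NAs allowed: NA
typeof(r1)              # "double"
typeof(r2)              # "double"; a bare NA would give "logical"
```

Keeping the return type consistent matters mostly when the results are collected by something that checks types, such as vapply.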

4.0.2 Multiple Returns

A function can only return one data object. To return multiple disparate objects, combine them in a list, and return the list. In this example, suppose we wanted to return the mean, the number of observations used, and the number of missing observations.

mean.n <- function (x, max.na) {
  nobs <- length(x)
  nm <- nmiss(x)
  if (nm <= max.na) {
    rl <- list(mean=mean(x, na.rm=TRUE),
               n=nobs-nm, missing=nm)
  } else {
    rl <- list(mean=NA, n=nobs-nm, missing=nm)
  }
  return(rl)
}

mean.n(dm[1,], 2) # numeric return
$mean
[1] 5.222222

$n
[1] 9

$missing
[1] 1
mean.n(dm[9,], 2) # NA return
$mean
[1] NA

$n
[1] 6

$missing
[1] 4
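
Once a function returns a list, the caller picks out the pieces by name. The sketch below is a compact variant of the function above, run on made-up data. The nmiss helper is again a stand-in assumed to count missing values.

```r
# Stand-in for the nmiss helper (assumed to count NAs)
nmiss <- function (x) sum(is.na(x))

mean.n <- function (x, max.na) {
  nobs <- length(x)
  nm <- nmiss(x)
  m <- if (nm <= max.na) mean(x, na.rm=TRUE) else NA
  list(mean=m, n=nobs-nm, missing=nm)
}

v <- c(1, NA, 3, 5)     # made-up data with one missing value
res <- mean.n(v, 1)

res$mean        # extract by name with $: 3
res[["n"]]      # or with [[ ]]: 3
res$missing     # 1
```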

Keep in mind that list returns are going to take more manipulation in an apply setting. See if you can unpack what goes on here!

as.matrix.numeric <- function (x, ...) { # need a method!
    stopifnot(is.vector(x))
    names <- unique(names(x))
    n <- length(names)
    m <- matrix(x, ...)
    if (nrow(m)==n) {
        rownames(m) <- names
    } else if (ncol(m)==n) {
        colnames(m) <- names
    }
    m
}

as.matrix(unlist(apply(dm, 2, mean.n, max.na=2)), ncol=3, byrow=TRUE)
          mean  n missing
 [1,]       NA  6       4
 [2,] 4.750000  8       2
 [3,] 5.875000  8       2
 [4,] 4.555556  9       1
 [5,] 4.800000 10       0
 [6,]       NA  6       4
 [7,] 7.800000 10       0
 [8,]       NA  4       6
 [9,] 6.250000  8       2
[10,] 5.250000  8       2

We will come back to the idea of a function as a method.
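
As a point of comparison, the same kind of summary table can be built without defining a method. This is only a sketch of one alternative, not the approach used above: each list is flattened with unlist inside the apply call, and the result is transposed. The matrix m2 is made-up data standing in for dm.

```r
mean.n <- function (x, max.na) {
  nm <- sum(is.na(x))
  m <- if (nm <= max.na) mean(x, na.rm=TRUE) else NA
  list(mean=m, n=length(x)-nm, missing=nm)
}

# Made-up 3 x 2 matrix: column 1 is complete, column 2 has two NAs
m2 <- matrix(c(1, 2, 3, NA, NA, 6), nrow=3)

# unlist() turns each list into a named numeric vector, so apply()
# returns a matrix with one column per column of m2; t() flips it
res <- t(apply(m2, 2, function (v) unlist(mean.n(v, max.na=1))))
res
colnames(res)   # "mean" "n" "missing"
```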

4.0.3 Scope and Reach

local versus global/parent
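
A minimal sketch of the distinction, using made-up names (total, add.local, use.global): an assignment inside a function creates a local variable, while a name that is not defined locally is looked up in the environment where the function was defined.

```r
total <- 10            # defined outside any function

add.local <- function (x) {
  total <- x + 1       # creates a local 'total'; the outer one is untouched
  total
}

use.global <- function (x) {
  x + total            # no local 'total', so R finds the outer one
}

a <- add.local(1)      # 2
total                  # still 10
b <- use.global(1)     # 11
```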

4.0.4 Setting parameter defaults

With many functions, there are sensible default values we can give to our parameters. These may be the most commonly specified values, so not having to specify them is a convenience. Or they may be boundary values, so specifying them may make our function revert to some "safe" algorithm.

mean.n <- function (x, max.na=0) {
  if (sum(is.na(x))<=max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}
mean.n(dm[,9]) # NA return: column 9 has 2 missing values, default=0
[1] NA
mean.n(dm[9,]) # NA return, default=0
[1] NA
mean.n(dm[9,], 3) # override the default; still NA, row 9 has 4 missing values
[1] NA
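
To see the default and its overrides produce different results, here is a small sketch on made-up data (the vector v is not part of dm):

```r
mean.n <- function (x, max.na=0) {
  if (sum(is.na(x)) <= max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}

v <- c(1, NA, 3)              # made-up data with one missing value

r.default <- mean.n(v)             # default max.na=0, one NA is too many: NA
r.pos     <- mean.n(v, 1)          # positional override: 2
r.named   <- mean.n(v, max.na=1)   # named override: 2
r.any     <- mean.n(max.na=1, x=v) # named arguments can come in any order: 2
```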

4.0.5 Argument Checking

At some point we have to consider the multitude of object types that co-exist within R, and the possibility that someone might try to use our function on an inappropriate object - and that someone might even be us if we have clumsy typing "skills" or a poor memory for detail!

It will be bad enough if our function melts down and returns an arcane error message, but even worse if our function returns some nonsense value and NO error!

Another good reason for argument checking is that we may have one perfectly good algorithm for vectors and another perfectly good algorithm for data frames, but we just need to decide which algorithm to use in a particular function call ... and we'll eventually discuss how to use functions as methods.

Checking for alternatives and errors can be an arduous task, but it is fundamental to well-designed software.

In our mean.n function we have two arguments, to which we want to apply three checks:

  • is x numeric? (note I'm excluding the possibility of means of logical values)
  • is max.na numeric? (we could require an integer value, instead)
  • is max.na a single value? (otherwise the if statement would use only the first value, or, from R 4.2.0 onward, stop with an error, and we could get meaningless results)

We'll use stopifnot to start with, because it has especially simple syntax.

# Add some code to check the arguments are allowable
mean.n <- function (x, max.na=0) {
  stopifnot(is.numeric(x), is.numeric(max.na),
            length(max.na)==1)
  if (sum(is.na(x))<=max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}

mean.n(c(1:3,"one"), 2)
Error in mean.n(c(1:3, "one"), 2): is.numeric(x) is not TRUE
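
The checks listed above mention that we could require an integer value for max.na. A sketch of that stricter variant follows. It is an assumption about what such a check might look like, not part of the function above. Note that is.integer(2) is FALSE, because 2 is stored as a double, so the sketch compares the value with its rounded self instead.

```r
mean.n <- function (x, max.na=0) {
  stopifnot(is.numeric(x), is.numeric(max.na),
            length(max.na)==1,
            !is.na(max.na),            # a missing limit makes no sense
            max.na >= 0,               # nor does a negative count
            max.na == round(max.na))   # whole number: 2 or 2L, but not 2.5
  if (sum(is.na(x))<=max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}

v <- c(1, NA, 3)         # made-up data
ok <- mean.n(v, 1)       # passes the checks: 2

# A fractional limit is rejected; tryCatch captures the message
msg <- tryCatch(mean.n(v, 1.5), error=function (e) conditionMessage(e))
msg
```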

4.0.6 Better error messages

In the previous example the error message was probably clear to you, in part because you've been looking at the code inside the function. You typically don't do that (or you typically forget the details of the function you wrote months/days ago), and Error: is.numeric(x) is not TRUE can be a little cryptic.

Using if and stop gives us the ability to write clearer error messages.

mean.n <- function (x, max.na=0) {
  if (is.matrix(x)) {stop("x is a matrix, should be a vector")}
  stopifnot(is.numeric(x), is.numeric(max.na),
            length(max.na)==1)
  if (sum(is.na(x))<=max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}
mean.n(dm, 2)
Error in mean.n(dm, 2): x is a matrix, should be a vector

A clearer-to-the-user version uses substitute.

mean.n <- function (x, max.na=0) {
  if (is.matrix(x)) {stop(substitute(x), " is a matrix, should be a vector")}
  stopifnot(is.numeric(x), is.numeric(max.na),
            length(max.na)==1)
  if (sum(is.na(x))<=max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}
mean.n(dm, 2)
Error in mean.n(dm, 2): dm is a matrix, should be a vector
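
The call above passes a plain name, dm. When the argument is an expression such as a subset, wrapping the result in deparse turns it into readable text. This is a sketch of that variation on made-up data (the matrix m2), not a change to the function above.

```r
mean.n <- function (x, max.na=0) {
  if (is.matrix(x)) {
    # deparse() converts the unevaluated argument into a character string
    stop(deparse(substitute(x)), " is a matrix, should be a vector")
  }
  stopifnot(is.numeric(x), is.numeric(max.na),
            length(max.na)==1)
  if (sum(is.na(x))<=max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}

m2 <- matrix(1:4, nrow=2)    # made-up matrix

# Capture the error message rather than halting
msg <- tryCatch(mean.n(m2[, 1:2]), error=function (e) conditionMessage(e))
msg   # "m2[, 1:2] is a matrix, should be a vector"
```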

4.0.7 Bomb-proofing

It is good practice to think up a variety of error-prone test cases, to make sure your function catches everything you've thought of.

mean.n(dm,2) # error, no matrices
Error in mean.n(dm, 2): dm is a matrix, should be a vector
mean.n(c("cat", "dog")) # error, data is not numeric
Error in mean.n(c("cat", "dog")): is.numeric(x) is not TRUE
mean.n(dm[1,], "two") # error, max.na is not numeric
Error in mean.n(dm[1, ], "two"): is.numeric(max.na) is not TRUE
mean.n(dm[1,], 1:2) # error, max.na is not a single value
Error in mean.n(dm[1, ], 1:2): length(max.na) == 1 is not TRUE
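
Test cases like these can also be collected so that they run together and report in one place. The sketch below assumes a small helper, fails, which is a made-up name and not a base R function. It returns TRUE when an expression signals an error. The data are made up as well.

```r
mean.n <- function (x, max.na=0) {
  if (is.matrix(x)) {stop(substitute(x), " is a matrix, should be a vector")}
  stopifnot(is.numeric(x), is.numeric(max.na),
            length(max.na)==1)
  if (sum(is.na(x))<=max.na) {
    mean(x, na.rm=TRUE)
  } else {
    NA
  }
}

# Hypothetical helper: TRUE if evaluating expr signals an error.
# expr is evaluated lazily, inside tryCatch, so the error is caught.
fails <- function (expr) {
  tryCatch({expr; FALSE}, error=function (e) TRUE)
}

v  <- c(1, NA, 3)             # made-up vector
m2 <- matrix(1:4, nrow=2)     # made-up matrix

checks <- c(matrix   = fails(mean.n(m2, 2)),             # no matrices
            char     = fails(mean.n(c("cat", "dog"))),   # data not numeric
            max.char = fails(mean.n(v, "two")),          # max.na not numeric
            max.len  = fails(mean.n(v, 1:2)),            # max.na not a single value
            valid    = !fails(mean.n(v, 1)))             # a good call must succeed
all(checks)   # TRUE when every case behaves as intended
```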

4.1 Making functions available automatically

There are a few options here.

  1. Save the workspace containing your functions as an .RData file. If you save it as just ".RData" (nothing in front of the dot), it will be automatically loaded when you start R with that working directory. Alternatively, give the file a name (e.g. "function.RData") and use load explicitly before you need to use one of your functions.

  2. Put your function definitions in an .r file (a script), and include a source call in an .Rprofile file, which R runs automatically whenever it starts up. (An .Renviron file will not work for this, because it only sets environment variables.)
    This call can be put in a function named .First. Alternatively, just run source when you actually need one of your functions. The advantage of the sourcing approach is that it does not depend (as much) on what directory you start up R in.

  3. Package your functions, and install your package. Although this requires learning how to build packages, it has the distinct advantages of making your functions available regardless of the working directory, and of not cluttering up your global environment.
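
The sourcing approach in option 2 can be tried out in a single session. The sketch below writes a function definition to a temporary script and sources it. In practice the script would be a file you maintain, and the path shown in the comment is purely hypothetical.

```r
# Write a function definition to a script file (temporary, for illustration)
script <- tempfile(fileext=".r")
writeLines(c(
  "mean.n <- function (x, max.na=0) {",
  "  if (sum(is.na(x)) <= max.na) mean(x, na.rm=TRUE) else NA",
  "}"), script)

# In .Rprofile one could write, with a real path in place of this
# hypothetical one:
#   .First <- function () source("~/R/my-functions.r")

source(script)                  # the call that .Rprofile (or .First) would make
found <- exists("mean.n")       # TRUE: the function is now defined
r <- mean.n(c(1, 2, 3))         # 2
```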


4.2 Exercises

  1. Modify mean.n so that we can specify a missing fraction, e.g. mean.n(x, max.na=0.1) would mean allow up to 10% of the data to be missing.
  2. Additionally, make max.na=-1 mean any number of missing values are allowed. Don't forget to include error checking!

Last revised: 06/30/2017