# Writing Functions

*June 2021*

# 1 Writing Functions - Basics

Eventually you will find that you want to write your own functions, and R is
designed to make this very easy. For many of us this first comes up when we
have several steps we want to do in sequence using one of the `apply`
functions. These steps might be a series of R statements, or they might even
be a nested sequence of functions that we want to refer to by a simple name.

As an example, consider a function to count the missing values (`NA`s) in a vector.
We might want to apply this to the columns of a data frame, to the rows, or within groups
specified by the values of some other vector.

```
# setup, a matrix with about 20% missing data
set.seed(20141117)
dm <- matrix(sample(0:9, 100, replace=TRUE), ncol=10)
dm[dm<2] <- NA
dm
```

```
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    4    2    9    5    6   NA    9    5    5     2
 [2,]    2    6    2    6    3    7    7    9    9     4
 [3,]   NA    4    7    4    4    7    9   NA    5     8
 [4,]   NA   NA    8    2    9    8    9   NA    5     7
 [5,]    6    3   NA    7    3   NA    7    4    7     3
 [6,]   NA    5    3    3    9   NA    5   NA    6    NA
 [7,]    3    8    3   NA    6   NA    7   NA   NA     3
 [8,]    7    7   NA    4    2    9    8   NA    8     9
 [9,]    8   NA    6    5    3    6    8   NA   NA    NA
[10,]   NA    3    9    5    3    2    9    5    5     6
```

We can count the total number of missing values in our matrix:

`sum(is.na(dm))`

`[1] 23`

But we have a problem using this code when we try to count NAs within each column:

`apply(dm, 2, sum(is.na))`

`Error in sum(is.na): invalid 'type' (builtin) of argument`

What we need is a function to use with `apply`.
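The error arises because `sum(is.na)` is itself a call: R evaluates it first, handing the function `is.na` to `sum` as if it were data. `apply` expects its third argument to be a function object that it can call on each row or column. A minimal sketch of the difference, using a small made-up matrix `m` (an assumption for this illustration, not the `dm` above):

```r
# a small complete matrix, invented for this illustration
m <- matrix(1:6, ncol = 2)

# `sum` is passed as an object (no parentheses); apply() calls it on each column
col_totals <- apply(m, 2, sum)
col_totals

# `sum(is.na)` is evaluated immediately, with a function as its "data";
# tryCatch() captures the resulting error message instead of stopping
msg <- tryCatch(sum(is.na), error = function(e) conditionMessage(e))
msg
```

The first call works because `sum` is a function waiting to be applied; the second fails before `apply` would ever see it.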

## 1.1 Defining a New Function

Defining a function is pretty simple, really. Typically a function has
a *name*, an *argument list* (parameters, or “formals”), a *body*
(the expressions that act on
the arguments), and a *return value*.

```
# schematic only: `expression` stands for whatever R code you write
name <- function(arg1, arg2, ...) {
  expression(arg1)
  ...
  value <- expression
  return(value)
}
```

(Basic documentation on writing functions is in Help in
*An Introduction to R*, “Chapter 10: Writing your own functions”.
See also `help("function")`.)

In our example, `sum(is.na())` will be our expression or *body*. As an *argument* we will
use `v` to stand for an arbitrary data object,
and we want to *return* a scalar count that we’ll call `rv`.

We’ll give this function the *name* `nmiss`.

```
nmiss <- function( v) {
  rv <- sum(is.na(v))
  return(rv)
}
```

Then use the function with the whole matrix, as before.

`nmiss(dm)`

`[1] 23`

And finally, use the function with `apply`:

`apply(dm,1,nmiss) # missing per row`

` [1] 1 0 2 3 2 4 4 2 4 1`
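The same function can also be applied to the columns of a data frame or within groups, as mentioned at the start of this section. A minimal sketch, assuming a small invented data frame `df` with a grouping column `g` (both hypothetical, not part of the example data above):

```r
nmiss <- function(v) {
  rv <- sum(is.na(v))
  return(rv)
}

# a small data frame invented for this illustration
df <- data.frame(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, NA, 3, NA, NA),
  y = c(NA, 2, 3, 4, 5)
)

# missing values per column: sapply() calls nmiss on each column in turn
col_counts <- sapply(df[, c("x", "y")], nmiss)
col_counts

# missing values of x within the groups defined by g
grp_counts <- tapply(df$x, df$g, nmiss)
grp_counts
```

Here `sapply` treats the data frame as a list of columns, while `tapply` splits `x` by the values of `g` before calling `nmiss` on each piece.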

We can store the results in another object in the usual way.

```
dm.missing <- nmiss(dm)
dm.missing
```

`[1] 23`

First notice that a new object, `nmiss`, has been added to our
workspace, the global environment.

Notice also that the objects `v` and `rv` (two arbitrary names
for data objects, i.e. two placeholders in our function definition)
do *not* appear in our workspace.
Think of them as *local* to the `nmiss` function,
or as objects within the `nmiss` *enclosure* or *environment*.
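One way to check this for yourself is to list the workspace after calling the function. The sketch below assumes a fresh session in which no object named `rv` already exists; the vector `x` and the name `n_missing` are invented for the illustration.

```r
nmiss <- function(v) {
  rv <- sum(is.na(v))
  return(rv)
}

x <- c(1, NA, 3, NA)      # example data, invented for this illustration
n_missing <- nmiss(x)
n_missing

# the function itself is an object in the workspace ...
"nmiss" %in% ls()

# ... but the local name `rv` is not (assuming you never created one yourself)
"rv" %in% ls()
```

The object `rv` exists only while `nmiss` is running; what survives is the value we chose to store in `n_missing`.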

As objects in our workspace, functions have a class, and they can be printed (type the name without the parentheses), which shows us the details of how they were defined. You can do this with any function, not just those you define yourself!

`class(nmiss) # functions have class`

`[1] "function"`

`nmiss # as an "object" it can be printed`

```
function( v) {
  rv <- sum(is.na(v))
  return(rv)
}
<environment: 0x0000022fe37d17f0>
```
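Besides printing the whole function, the parts named earlier (argument list and body) can be pulled out separately with the base R functions `formals()` and `body()`. A short sketch, which also prints a function we did not write, `sd`:

```r
nmiss <- function(v) {
  rv <- sum(is.na(v))
  return(rv)
}

arg_names <- names(formals(nmiss))   # the argument list
arg_names

body(nmiss)                          # the expressions inside the braces

# the same inspection works for functions supplied with R
class(sd)
sd
```

Reading the printed definition of a built-in function is often a quick way to see how it handles its arguments.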

### 1.1.1 More About Returns

A function returns either the object passed to `return()` or, if `return()`
is never called, the value produced by the last expression evaluated.

For example, this is a common way of specifying “rv” as the returned object:

```
nmiss <- function( v) {
  rv <- sum(is.na(v))
  rv
}
nmiss(dm)
```

`[1] 23`

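Notice that ending the function with an assignment does *not* behave the way
you might hope. Calling the function prints nothing at all, not even an error:

```
nmiss <- function( v) {
  rv <- sum(is.na(v))
}
nmiss(dm)
```

The last expression evaluated was the assignment, `<-`, a function used for its side effect. An assignment does have a value (here, the count), but R returns it *invisibly*, so nothing is printed at the console. The value is not `NULL`: storing the result with `x <- nmiss(dm)` or wrapping the call in `print()` still recovers the count, but a function written this way is easy to misread.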
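To see what a function ending in an assignment actually hands back, base R's `withVisible()` reports both the value and whether it would be printed. A minimal sketch, using an invented vector `x` and the hypothetical name `nmiss_quiet` so as not to overwrite `nmiss`:

```r
# a version of the function whose last expression is an assignment
nmiss_quiet <- function(v) {
  rv <- sum(is.na(v))
}

x <- c(4, NA, 7, NA, NA)   # example data, invented for this illustration

nmiss_quiet(x)             # prints nothing at the console

result <- nmiss_quiet(x)   # but the value can still be stored
result

vis <- withVisible(nmiss_quiet(x))
vis$value                  # the count
vis$visible                # FALSE: the value was returned invisibly

(nmiss_quiet(x))           # wrapping the call in parentheses makes it print
```

So the assignment form returns the count invisibly rather than returning nothing.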

But the following example *does* return what we want. We don't need to
name the object we want to return; we can simply return the value of the
last expression evaluated.

```
nmiss <- function( v) {
  sum(is.na(v))
}
nmiss(dm)
```

`[1] 23`

### 1.1.2 One-liners and Anonymous Functions

Notice the last example could be written on one line:

```
nmiss <- function( v) { sum(is.na(v))}
nmiss(dm)
```

`[1] 23`

And because the body is a single expression, we don't actually need the curly braces, which are only required to group multiple statements:

```
nmiss <- function( v) sum(is.na(v))
nmiss(dm)
```

`[1] 23`

It is not uncommon to see simple functions both defined and used within the same expression:

`apply(dm, 1, function( v) sum(is.na(v))) # number of NAs per row`

` [1] 1 0 2 3 2 4 4 2 4 1`

The last example is often called an "anonymous function".
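As a quick check that naming the function changes nothing about the result, the sketch below compares a named and an anonymous version on a small invented matrix `m`, then uses a second anonymous function for a one-off calculation (the proportion missing per column). The names `m` and `count_na` are assumptions made for the illustration.

```r
# a small matrix invented for this illustration
m <- matrix(c(1, NA, 3, NA, 5, 6), nrow = 2)

# named version
count_na <- function(v) sum(is.na(v))
by_name <- apply(m, 1, count_na)

# anonymous version: the same definition written in place
by_anon <- apply(m, 1, function(v) sum(is.na(v)))

identical(by_name, by_anon)

# a one-off calculation that may not deserve a name of its own
prop_missing <- apply(m, 2, function(v) mean(is.na(v)))
prop_missing
```

Anonymous functions tend to suit calculations used exactly once; a name becomes worthwhile as soon as the same definition is needed twice.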

Here is another anonymous function (as an exercise, trace the order in which the various objects and functions are evaluated):

`(function( v) sum(is.na(v)))(dm)`

`[1] 23`

### 1.1.3 Style

Arguably, skipping the `return` makes one-liners and anonymous functions easier to
read and debug, as long as they are short and sweet.
But as you move on to more complicated functions, especially those that may
return different sorts of values under different conditions, you will find them easier to read and debug
if you **do** include the `return` expressions.
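As an illustration of the second point, here is a sketch of a function with several exit points. The function name `describe_missing` and its messages are invented for this example; the explicit `return()` calls make each exit easy to spot.

```r
# hypothetical helper: summarize the missingness of a vector in words
describe_missing <- function(v) {
  if (length(v) == 0) {
    return(NA_character_)        # nothing to describe
  }
  n <- sum(is.na(v))
  if (n == 0) {
    return("complete")           # no missing values
  }
  return(paste(n, "missing"))    # otherwise report the count
}

describe_missing(numeric(0))
describe_missing(c(1, 2))
describe_missing(c(1, NA, NA))
```

Without the `return()` calls the same logic would need nested `if`/`else` branches, and the value returned in each case would be harder to pick out.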

## 1.2 Exercises

- In the Motor Trend car tests data, `mtcars`, vehicle weight, `wt`, is reported in thousands of pounds. Write a function that converts thousands-of-pounds to kilograms.

*Bonus*: Use the result to show (by calculation) that this leaves the correlation between `wt` in kilos and `mpg` unaffected.

- The same data set reports fuel consumption in miles per gallon. Write a function that converts this to kilometers per liter.

*Bonus*: show that this conversion leaves the correlation between the rescaled variables unaffected.

- R does not have a standard-error-of-the-mean function. Write one, then produce a table of means and standard errors for the variables in the `mtcars` data.

- When working with time-series data it is often useful to identify gaps in the series. Write a function that identifies gaps by indicating observations preceded by a gap. For example:

```
1990 1991 1992 1993 1994 1997 1998 1999 2000
NA FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
```

*Bonus*: write a function that fills out the gap.

- In survey data, it is common for data to be coded with 8 = don't know and 9 =
refused to answer. We need to convert these to `NA`s for most statistical work. Write a function that takes a vector or matrix as input, and returns the recoded vector/matrix. For example:

```
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    2    9    5    6   NA    9    5    5     2
[2,]    2    6    2    6    3    7    7    9    9     4
[3,]   NA    4    7    4    4    7    9   NA    5     8
[4,]   NA   NA    8    2    9    8    9   NA    5     7
[5,]    6    3   NA    7    3   NA    7    4    7     3
[6,]   NA    5    3    3    9   NA    5   NA    6    NA
```

```
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    2   NA    5    6   NA   NA    5    5     2
[2,]    2    6    2    6    3    7    7   NA   NA     4
[3,]   NA    4    7    4    4    7   NA   NA    5    NA
[4,]   NA   NA   NA    2   NA   NA   NA   NA    5     7
[5,]    6    3   NA    7    3   NA    7    4    7     3
[6,]   NA    5    3    3   NA   NA    5   NA    6    NA
```

*Note*: This is a simplification of a couple of existing R functions. Functions such
as the one you are asked to produce are often termed "convenience" functions or
"wrappers". Depending on how you
solve this problem, the function you simplify may itself be a convenience wrapper - take
a look at the code inside the function you use!

- Average compounded growth. Given two vectors, one of starting values (say, starting salary) and another of ending values (current salary), we often want to characterize the growth rate that led from one to the other. If we additionally know how many growth periods there were between each pair of values, we can calculate an average growth rate that takes compounding into account.

Write a function that returns the average growth rate as a fraction. In other words, if y = ending value, x = starting value, t = number of time periods, and r = growth multiplier, return r - 1. The fundamental relation is \[y = x \cdot r^{t}\]

- Write a function returning the "degree of consensus", defined as 1 - (variance of responses)/(maximum possible variance), where the respondents have given answers on some bounded scale (e.g. a Likert scale from 1 to 5). Complete consensus = 1, maximum disagreement = 0.