Supporting Statistical Analysis for Research

# 5 Writing a New Method

## 5.1 Methods and Generic Functions

Many functions are built as "generic" functions. The idea is that the function checks the class of the object it is asked to act upon, and then "dispatches" the appropriate "method" (the function that implements the right algorithm for that class). If you peek inside functions like plot or scale, you will see that they consist of a call to the function UseMethod:

scale
function (x, center = TRUE, scale = TRUE)
UseMethod("scale")
<bytecode: 0x00000000189a31d8>
<environment: namespace:base>
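
To see dispatch in isolation, here is a minimal sketch built around a toy generic. The generic `describe` and its three methods are made up for this illustration; they are not part of base R or of the original example.

```r
# Hypothetical generic, invented for illustration only
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named <generic>.<class>
describe.default    <- function(x, ...) "some other kind of object"
describe.numeric    <- function(x, ...) "a numeric vector"
describe.data.frame <- function(x, ...) "a data frame"

describe(1.5)    # dispatches to describe.numeric
describe(iris)   # dispatches to describe.data.frame
describe("a")    # no character method exists, so describe.default is used
```

The generic itself contains no algorithm at all. It only hands the call over to whichever method matches the class of its first argument, falling back to the default method when nothing more specific exists.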

However, being "generic" does not necessarily mean a function will work for all kinds of objects you might want. For instance, scale has a default algorithm, but does not handle all data.frames, only those that are entirely numeric.

methods(scale)
[1] scale.default
see '?methods' for accessing help and source code
head(scale(mtcars))
                         mpg        cyl        disp         hp       drat
Mazda RX4          0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
Mazda RX4 Wag      0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
Datsun 710         0.4495434 -1.2248578 -0.99018209 -0.7830405  0.4739996
Hornet 4 Drive     0.2172534 -0.1049878  0.22009369 -0.5350928 -0.9661175
Hornet Sportabout -0.2307345  1.0148821  1.04308123  0.4129422 -0.8351978
Valiant           -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
                            wt       qsec         vs         am       gear
Mazda RX4         -0.610399567 -0.7771651 -0.8680278  1.1899014  0.4235542
Mazda RX4 Wag     -0.349785269 -0.4637808 -0.8680278  1.1899014  0.4235542
Datsun 710        -0.917004624  0.4260068  1.1160357  1.1899014  0.4235542
Hornet 4 Drive    -0.002299538  0.8904872  1.1160357 -0.8141431 -0.9318192
Hornet Sportabout  0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
Valiant            0.248094592  1.3269868  1.1160357 -0.8141431 -0.9318192
                        carb
Mazda RX4          0.7352031
Mazda RX4 Wag      0.7352031
Datsun 710        -1.1221521
Hornet 4 Drive    -1.1221521
Hornet Sportabout -0.5030337
Valiant           -1.1221521
scale(iris)
## Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

The problem here is that data set iris contains a factor variable. I would like the scale function to just act on the numeric vectors within any data frame, ignoring factors, character vectors, and logical vectors. Because scale is a generic function, this is easy to do!
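
Which columns count as "numeric" can be checked with is.numeric(). The sketch below uses a small made-up data frame, `d`, which is an assumption for illustration and not part of the original example, to show which column types would be picked up and which would be ignored.

```r
# 'd' is a made-up data frame for illustration only
d <- data.frame(
  height = c(1.5, 1.7, 1.6),           # double:    numeric
  count  = c(2L, 5L, 3L),              # integer:   numeric
  group  = factor(c("a", "b", "a")),   # factor:    not numeric
  label  = c("x", "y", "z"),           # character: not numeric
  flag   = c(TRUE, FALSE, TRUE),       # logical:   not numeric
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per column
numeric_cols <- sapply(d, is.numeric)
numeric_cols

# Names of the columns that would be rescaled
names(d)[numeric_cols]
```

Note that integer columns also pass the is.numeric() test, so they would be rescaled along with the doubles.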

I'll go through these typical steps to write a function to scale (center/standardize) all the numeric variables in a data frame:

1. Write an example that works
2. Turn that into a function
3. Test and refine: bomb-proofing, make into method

I'll want my function to return the whole data frame, with just the appropriate variables (re)scaled. So my final step within the function will be to return a data.frame.

### 5.1.1 Step 1: A working example

Make a copy of the data frame and figure out which columns are scalable. Use scale() on those columns (the default method coerces them into a matrix), returning a matrix.
Use the matrix to write back to the data frame.

# we'll be given a data frame
x <- iris
cols <- sapply(iris, is.numeric)
scaledvars <- scale(iris[, cols])
x[, cols] <- scaledvars
# we'll return(x)

head(x)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
str(x)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  -0.898 -1.139 -1.381 -1.501 -1.018 ...
 $ Sepal.Width : num  1.0156 -0.1315 0.3273 0.0979 1.245 ...
 $ Petal.Length: num  -1.34 -1.34 -1.39 -1.28 -1.34 ...
 $ Petal.Width : num  -1.31 -1.31 -1.31 -1.31 -1.31 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Error in scale_df2(iris$Sepal.Length): dfr must be a data frame

## 5.3 Refinement two: make it a method

We'll choose a function name that makes this a "method" of the generic function, scale().

scale.data.frame <- function(dfr) {
  if (!is.data.frame(dfr)) {stop("dfr must be a data frame")}
  x <- dfr
  cols <- sapply(dfr, is.numeric)
  scaledvars <- scale.default(dfr[, cols]) # otherwise we get a recursive loop
  x[, cols] <- scaledvars
  return(x)
}
z <- scale.data.frame(iris)
head(z)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
scale.data.frame(iris$Sepal.Length)
Error in scale.data.frame(iris$Sepal.Length): dfr must be a data frame

### 5.3.1 Here is the magic!

z <- scale(iris)
head(z)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
str(z)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  -0.898 -1.139 -1.381 -1.501 -1.018 ...
 $ Sepal.Width : num  1.0156 -0.1315 0.3273 0.0979 1.245 ...
 $ Petal.Length: num  -1.34 -1.34 -1.39 -1.28 -1.34 ...
 $ Petal.Width : num  -1.31 -1.31 -1.31 -1.31 -1.31 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
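
The "magic" comes down to naming. As a rough sketch (a simplification of what UseMethod really does, which also handles implicit classes for objects without a class attribute), the candidate method names can be built by pasting the generic's name onto the object's class, with the default method as the fallback:

```r
# The class attribute that dispatch looks at
class(iris)

# Simplified picture of the method names tried for scale(iris),
# in order, ending with the default method
candidates <- paste("scale", c(class(iris), "default"), sep = ".")
candidates
```

Because a function called scale.data.frame now exists, it is found first for data frames, while plain vectors and matrices still end up at scale.default.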
head(scale(iris$Sepal.Length)) # this now works by the default method
           [,1]
[1,] -0.8976739
[2,] -1.1392005
[3,] -1.3807271
[4,] -1.5014904
[5,] -1.0184372
[6,] -0.5353840
attributes(scale(iris$Sepal.Length)) # notice the attributes at the end
$dim
[1] 150   1

$scaled:center
[1] 5.843333

$scaled:scale
[1] 0.8280661

## 5.4 Refinement three: better return

Better error message, keep attributes.

scale.data.frame <- function(dfr) {
  if (!is.data.frame(dfr)) {stop(paste(deparse(substitute(dfr)), "must be a data frame"))}
  x <- dfr
  cols <- sapply(dfr, is.numeric)
  scaledvars <- scale.default(dfr[, cols]) # otherwise we get a recursive loop
  x[, cols] <- scaledvars
  attr(x, "scaled:center") <- attr(scaledvars, "scaled:center")
  attr(x, "scaled:scale") <- attr(scaledvars, "scaled:scale")
  return(x)
}
z <- scale(iris)
head(z)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
str(z)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  -0.898 -1.139 -1.381 -1.501 -1.018 ...
 $ Sepal.Width : num  1.0156 -0.1315 0.3273 0.0979 1.245 ...
 $ Petal.Length: num  -1.34 -1.34 -1.39 -1.28 -1.34 ...
 $ Petal.Width : num  -1.31 -1.31 -1.31 -1.31 -1.31 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "scaled:center")= Named num [1:4] 5.84 3.06 3.76 1.2
  ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
 - attr(*, "scaled:scale")= Named num [1:4] 0.828 0.436 1.765 0.762
  ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
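
One reason to keep these attributes is that they let the scaling be undone later. The sketch below is an illustration under stated assumptions: the helper `unscale_df` is hypothetical (it is not part of base R or of the original example), and it assumes the data frame was both centered and scaled, so that both attributes are present.

```r
# Build a scaled copy of iris the same way the method does,
# keeping the centering and scaling constants as attributes
cols <- sapply(iris, is.numeric)
scaledvars <- scale(as.matrix(iris[, cols]))
z <- iris
z[, cols] <- scaledvars
attr(z, "scaled:center") <- attr(scaledvars, "scaled:center")
attr(z, "scaled:scale")  <- attr(scaledvars, "scaled:scale")

# Hypothetical helper: reverse the scaling using the stored attributes.
# Assumes both "scaled:center" and "scaled:scale" are present.
unscale_df <- function(dfr) {
  ctr <- attr(dfr, "scaled:center")
  scl <- attr(dfr, "scaled:scale")
  for (v in names(ctr)) {
    dfr[[v]] <- dfr[[v]] * scl[[v]] + ctr[[v]]
  }
  # The constants no longer describe the data, so drop them
  attr(dfr, "scaled:center") <- NULL
  attr(dfr, "scaled:scale")  <- NULL
  dfr
}

restored <- unscale_df(z)
all.equal(restored$Sepal.Length, iris$Sepal.Length)
```

Without the attributes, the original means and standard deviations would have to be stored separately or recomputed from the unscaled data.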

The error message will only be used if someone tries to bypass scale() and call scale.data.frame() directly:

scale.data.frame(iris$Sepal.Length)
Error in scale.data.frame(iris$Sepal.Length): iris$Sepal.Length must be a data frame

## 5.5 Refinement four: passing parameters

Pass scale and center options to scale()

scale.data.frame <- function(dfr, ...) {
  if (!is.data.frame(dfr)) {stop(paste(deparse(substitute(dfr)), "must be a data frame"))}
  x <- dfr
  cols <- sapply(dfr, is.numeric)
  scaledvars <- scale.default(dfr[, cols], ...) # otherwise we get a recursive loop
  x[, cols] <- scaledvars
  attr(x, "scaled:center") <- attr(scaledvars, "scaled:center")
  attr(x, "scaled:scale") <- attr(scaledvars, "scaled:scale")
  return(x)
}
z <- scale(iris, scale=FALSE)
str(z)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  -0.743 -0.943 -1.143 -1.243 -0.843 ...
 $ Sepal.Width : num  0.4427 -0.0573 0.1427 0.0427 0.5427 ...
 $ Petal.Length: num  -2.36 -2.36 -2.46 -2.26 -2.36 ...
 $ Petal.Width : num  -0.999 -0.999 -0.999 -0.999 -0.999 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "scaled:center")= Named num [1:4] 5.84 3.06 3.76 1.2
  ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
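
A possible further refinement, offered as a sketch rather than as part of the original example: the generic is defined as scale(x, center = TRUE, scale = TRUE), and it is generally recommended that a method take the same arguments as its generic. A call that names its first argument, such as scale(x = iris), may not work as intended with a method whose first argument is called dfr. The variant below mirrors the generic's signature. Note that it replaces the definition above if run in the same session.

```r
# Variant with the same argument names as the generic scale(x, center, scale).
# This is an illustrative alternative, not the version developed in the text.
scale.data.frame <- function(x, center = TRUE, scale = TRUE) {
  cols <- vapply(x, is.numeric, logical(1))
  if (!any(cols)) return(x)   # nothing to rescale

  # drop = FALSE keeps a data frame even when only one column is numeric;
  # calling scale.default directly avoids dispatching back to this method
  scaledvars <- scale.default(as.matrix(x[, cols, drop = FALSE]),
                              center = center, scale = scale)
  x[, cols] <- scaledvars

  # Carry over whichever constants scale.default recorded
  attr(x, "scaled:center") <- attr(scaledvars, "scaled:center")
  attr(x, "scaled:scale")  <- attr(scaledvars, "scaled:scale")
  x
}

# Arguments can now be supplied by name, exactly as for the generic
z <- scale(x = iris, scale = FALSE)   # center only
str(z)
```

The explicit check for a data frame is left out here on the assumption that the method is only reached through dispatch from scale(). Keep the check if the method might be called directly.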

## 5.6 Exercises

1. plot does not have specific methods for logical or character vectors. For logical vectors, it coerces them to numeric type and then plots; for character vectors, it just gives up. Write two functions that create bar charts for these types of vectors, and make them plot methods.

2. Write a mean method for data frames.

Last revised: 12/26/2014