SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5 Writing a New Method

5.1 Methods and Generic Functions

Many functions in R are built as "generic" functions. The idea is that the function checks what kind of object it has been asked to act upon, and then "dispatches" the appropriate "method" (the algorithm or function suited to that class of object). If you peek inside functions like plot or scale, you will see that they consist of a single call to the function UseMethod.

scale
function (x, center = TRUE, scale = TRUE) 
UseMethod("scale")
<bytecode: 0x00000000189a31d8>
<environment: namespace:base>
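As a minimal sketch of how dispatch works, the toy generic below calls UseMethod(), and R picks a method based on the class of its first argument. The names describe, describe.default, and describe.data.frame are hypothetical and used only for illustration; they are not part of base R.

```r
# A toy generic: 'describe' is a hypothetical name used only for illustration
describe <- function(x, ...) {
    UseMethod("describe")
}

# Fallback, used when no class-specific method exists
describe.default <- function(x, ...) {
    "some other kind of object"
}

# Method chosen when class(x) includes "data.frame"
describe.data.frame <- function(x, ...) {
    paste("a data frame with", ncol(x), "columns")
}

describe(iris)      # dispatches to describe.data.frame
describe(letters)   # falls back to describe.default
```

The method names follow the pattern generic.class, which is how UseMethod() finds them.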

However, being "generic" does not necessarily mean a function will work for all kinds of objects you might want. For instance, scale has only a default method, and that method handles a data frame only if every column is numeric.

methods(scale)
[1] scale.default
see '?methods' for accessing help and source code
head(scale(mtcars))
                         mpg        cyl        disp         hp       drat
Mazda RX4          0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
Mazda RX4 Wag      0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
Datsun 710         0.4495434 -1.2248578 -0.99018209 -0.7830405  0.4739996
Hornet 4 Drive     0.2172534 -0.1049878  0.22009369 -0.5350928 -0.9661175
Hornet Sportabout -0.2307345  1.0148821  1.04308123  0.4129422 -0.8351978
Valiant           -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
                            wt       qsec         vs         am       gear
Mazda RX4         -0.610399567 -0.7771651 -0.8680278  1.1899014  0.4235542
Mazda RX4 Wag     -0.349785269 -0.4637808 -0.8680278  1.1899014  0.4235542
Datsun 710        -0.917004624  0.4260068  1.1160357  1.1899014  0.4235542
Hornet 4 Drive    -0.002299538  0.8904872  1.1160357 -0.8141431 -0.9318192
Hornet Sportabout  0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
Valiant            0.248094592  1.3269868  1.1160357 -0.8141431 -0.9318192
                        carb
Mazda RX4          0.7352031
Mazda RX4 Wag      0.7352031
Datsun 710        -1.1221521
Hornet 4 Drive    -1.1221521
Hornet Sportabout -0.5030337
Valiant           -1.1221521
scale(iris)
## Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

The problem here is that data set iris contains a factor variable. I would like the scale function to just act on the numeric vectors within any data frame, ignoring factors, character vectors, and logical vectors. Because scale is a generic function, this is easy to do!
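To see why the default method fails here, it may help to look at what happens when a data frame is coerced to a matrix, which is what scale.default() does first. This is a small sketch; the object names m_all and m_num are just for illustration.

```r
# A data frame with a factor column becomes a character matrix
m_all <- as.matrix(iris)
is.numeric(m_all)    # FALSE, so colMeans() refuses it

# With only the numeric columns, the matrix is numeric
m_num <- as.matrix(iris[, 1:4])
is.numeric(m_num)    # TRUE
```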

I'll go through these typical steps to write a function to scale (center/standardize) all the numeric variables in a data frame:

  1. Write an example that works
  2. Turn that into a function
  3. Test and refine: bomb-proofing, make into method

I'll want my function to return the whole data frame, with just the appropriate variables (re)scaled. So my final step within the function will be to return a data.frame.

5.1.1 Step 1: A working example

Make a copy of the data frame and figure out which columns are scalable. Use scale() on those columns (the default method coerces them into a matrix and returns a matrix). Then write the matrix back into the matching columns of the data frame.

# we'll be given a data frame
x <- iris
cols <- sapply(iris, is.numeric)
scaledvars <- scale(iris[, cols])
x[, cols] <- scaledvars
# we'll return(x)

Check your results.

head(x)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
str(x)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  -0.898 -1.139 -1.381 -1.501 -1.018 ...
 $ Sepal.Width : num  1.0156 -0.1315 0.3273 0.0979 1.245 ...
 $ Petal.Length: num  -1.34 -1.34 -1.39 -1.28 -1.34 ...
 $ Petal.Width : num  -1.31 -1.31 -1.31 -1.31 -1.31 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
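Beyond eyeballing the output, one possible check is to confirm that each scaled column has mean 0 and standard deviation 1, and that the factor column is untouched. This is only a sketch; the tolerance 1e-10 is an arbitrary choice.

```r
# Repeat the working example so this check stands on its own
x <- iris
cols <- sapply(x, is.numeric)
x[, cols] <- scale(x[, cols])

means <- colMeans(x[, cols])
sds <- sapply(x[, cols], sd)

all(abs(means) < 1e-10)                      # means are zero up to rounding error
all.equal(unname(sds), rep(1, sum(cols)))    # standard deviations are 1
identical(x$Species, iris$Species)           # the factor is unchanged
```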

5.1.2 Step 2: Edit into a function

Here we just replace iris with the parameter dfr. In RStudio you can use the menus Code -- Extract Function.

scale_df <- function(dfr) {
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale(dfr[, cols])
    x[, cols] <- scaledvars
    return(x) # or just "x"
}

5.1.3 Step 3: Test it on something

Preferably a couple of things you expect to work, and a couple of things you expect to fail.

z <- scale_df(iris)
head(z)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
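A couple more tests along these lines might look like the sketch below: one small data frame that should work, and one input that should fail. The data frame small is made up for illustration, and tryCatch() is used so the failing case returns its error message instead of stopping the script.

```r
scale_df <- function(dfr) {
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale(dfr[, cols])
    x[, cols] <- scaledvars
    return(x)
}

# Expected to work: a data frame with a single numeric column
small <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))
scale_df(small)$a    # -1 0 1

# Expected to fail: a plain vector is not a data frame
result <- tryCatch(
    scale_df(iris$Sepal.Length),
    error = function(e) conditionMessage(e)
)
result    # an error message, though not a very helpful one
```

The unhelpful message in the failing case is the motivation for the first refinement.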

5.2 Refinement one: testing the input

Make sure dfr is a data frame!

scale_df2 <- function(dfr) {
    if (!is.data.frame(dfr)) {stop("dfr must be a data frame")}
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale(dfr[, cols])
    x[, cols] <- scaledvars
    return(x)
}
z <- scale_df2(iris)
head(z)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
scale_df2(iris$Sepal.Length)
Error in scale_df2(iris$Sepal.Length): dfr must be a data frame
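A more compact way to write the same check is stopifnot(), at the cost of a less friendly message. The function name scale_df2b below is hypothetical, chosen only to avoid overwriting scale_df2.

```r
# Hypothetical variant of scale_df2 that uses stopifnot() for the input check
scale_df2b <- function(dfr) {
    stopifnot(is.data.frame(dfr))
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    x[, cols] <- scale(dfr[, cols])
    x
}

# Capture the error message rather than halting
msg <- tryCatch(scale_df2b(1:10), error = function(e) conditionMessage(e))
msg    # "is.data.frame(dfr) is not TRUE"
```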

5.3 Refinement two: make it a method

We'll choose a function name of the form generic.class, here scale.data.frame, which makes this a "method" of the generic function scale() for data frames.

scale.data.frame <- function(dfr) {
    if (!is.data.frame(dfr)) {stop("dfr must be a data frame")}
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale.default(dfr[, cols]) # otherwise we get a recursive loop
    x[, cols] <- scaledvars
    return(x)
}

z <- scale.data.frame(iris)
head(z)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
scale.data.frame(iris$Sepal.Length)
Error in scale.data.frame(iris$Sepal.Length): dfr must be a data frame
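Because UseMethod() works through the whole class vector of an object, the new method should also be picked up for objects that inherit from data.frame. The sketch below assumes a made-up subclass name, my_df, purely for illustration.

```r
scale.data.frame <- function(dfr) {
    if (!is.data.frame(dfr)) {stop("dfr must be a data frame")}
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale.default(dfr[, cols]) # call the default method directly
    x[, cols] <- scaledvars
    return(x)
}

d <- iris
class(d) <- c("my_df", "data.frame")   # hypothetical subclass of data.frame

# There is no scale.my_df, so dispatch moves on to scale.data.frame
z <- scale(d)
is.data.frame(z)
head(z$Sepal.Length)
```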

5.3.1 Here is the magic!

z <- scale(iris)

head(z)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
str(z)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  -0.898 -1.139 -1.381 -1.501 -1.018 ...
 $ Sepal.Width : num  1.0156 -0.1315 0.3273 0.0979 1.245 ...
 $ Petal.Length: num  -1.34 -1.34 -1.39 -1.28 -1.34 ...
 $ Petal.Width : num  -1.31 -1.31 -1.31 -1.31 -1.31 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(scale(iris$Sepal.Length)) # this now works by the default method
           [,1]
[1,] -0.8976739
[2,] -1.1392005
[3,] -1.3807271
[4,] -1.5014904
[5,] -1.0184372
[6,] -0.5353840
attributes(scale(iris$Sepal.Length)) # notice the attributes at the end
$dim
[1] 150   1

$`scaled:center`
[1] 5.843333

$`scaled:scale`
[1] 0.8280661

5.4 Refinement three: better return

Give a better error message, and keep the "scaled:center" and "scaled:scale" attributes that scale.default() returns.

scale.data.frame <- function(dfr) {
    if (!is.data.frame(dfr)) {stop(paste(deparse(substitute(dfr)), "must be a data frame"))}
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale.default(dfr[, cols]) # otherwise we get a recursive loop
    x[, cols] <- scaledvars
    attr(x, "scaled:center") <- attr(scaledvars, "scaled:center")
    attr(x, "scaled:scale") <- attr(scaledvars, "scaled:scale")
    return(x)
}

z <- scale(iris)

head(z)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
str(z)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  -0.898 -1.139 -1.381 -1.501 -1.018 ...
 $ Sepal.Width : num  1.0156 -0.1315 0.3273 0.0979 1.245 ...
 $ Petal.Length: num  -1.34 -1.34 -1.39 -1.28 -1.34 ...
 $ Petal.Width : num  -1.31 -1.31 -1.31 -1.31 -1.31 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "scaled:center")= Named num [1:4] 5.84 3.06 3.76 1.2
  ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
 - attr(*, "scaled:scale")= Named num [1:4] 0.828 0.436 1.765 0.762
  ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
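One reason to keep these attributes is that they make it possible to undo the scaling later. The following is a sketch of that idea, not part of the method itself; the names ctr, scl, and orig are just for illustration.

```r
scale.data.frame <- function(dfr) {
    if (!is.data.frame(dfr)) {stop(paste(deparse(substitute(dfr)), "must be a data frame"))}
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale.default(dfr[, cols])
    x[, cols] <- scaledvars
    attr(x, "scaled:center") <- attr(scaledvars, "scaled:center")
    attr(x, "scaled:scale") <- attr(scaledvars, "scaled:scale")
    return(x)
}

z <- scale(iris)
ctr <- attr(z, "scaled:center")   # named vector of column means
scl <- attr(z, "scaled:scale")    # named vector of column standard deviations

# Reverse the transformation one column at a time
orig <- z
for (v in names(ctr)) {
    orig[[v]] <- z[[v]] * scl[[v]] + ctr[[v]]
}

all.equal(orig$Sepal.Length, iris$Sepal.Length)   # TRUE
```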

The error message will only be used if someone tries to bypass scale() and call scale.data.frame() directly, because scale() dispatches to this method only for data frames.

scale.data.frame(iris$Sepal.Length)
Error in scale.data.frame(iris$Sepal.Length): iris$Sepal.Length must be a data frame
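The deparse(substitute()) idiom used in the error message turns the expression the caller typed into a character string. A tiny sketch, using a hypothetical helper named show_arg:

```r
# 'show_arg' is a hypothetical helper used only to illustrate the idiom
show_arg <- function(arg) {
    # substitute() captures the unevaluated expression; deparse() makes it text
    deparse(substitute(arg))
}

show_arg(iris$Sepal.Length)   # "iris$Sepal.Length"
show_arg(mtcars)              # "mtcars"
```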

5.5 Refinement four: passing parameters

Pass the center and scale options through to scale.default() by adding a ... argument to the method.

scale.data.frame <- function(dfr, ...) {
    if (!is.data.frame(dfr)) {stop(paste(deparse(substitute(dfr)), "must be a data frame"))}
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale.default(dfr[, cols], ...) # otherwise we get a recursive loop
    x[, cols] <- scaledvars
    attr(x, "scaled:center") <- attr(scaledvars, "scaled:center")
    attr(x, "scaled:scale") <- attr(scaledvars, "scaled:scale")
    return(x)
}

z <- scale(iris, scale=FALSE)

str(z)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  -0.743 -0.943 -1.143 -1.243 -0.843 ...
 $ Sepal.Width : num  0.4427 -0.0573 0.1427 0.0427 0.5427 ...
 $ Petal.Length: num  -2.36 -2.36 -2.46 -2.26 -2.36 ...
 $ Petal.Width : num  -0.999 -0.999 -0.999 -0.999 -0.999 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "scaled:center")= Named num [1:4] 5.84 3.06 3.76 1.2
  ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
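Since everything in ... is handed to scale.default(), other forms of its arguments should pass through as well. As a sketch, the example below centers each numeric column on its median instead of its mean; the name med is just for illustration.

```r
scale.data.frame <- function(dfr, ...) {
    if (!is.data.frame(dfr)) {stop(paste(deparse(substitute(dfr)), "must be a data frame"))}
    x <- dfr
    cols <- sapply(dfr, is.numeric)
    scaledvars <- scale.default(dfr[, cols], ...)
    x[, cols] <- scaledvars
    attr(x, "scaled:center") <- attr(scaledvars, "scaled:center")
    attr(x, "scaled:scale") <- attr(scaledvars, "scaled:scale")
    return(x)
}

# One centering value per numeric column
med <- sapply(iris[, 1:4], median)

z <- scale(iris, center = med, scale = FALSE)

sapply(z[, 1:4], median)      # medians are now zero, up to rounding error
attr(z, "scaled:center")      # the medians that were subtracted
```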

5.6 Exercises

  1. plot does not have specific methods for logical or character vectors. It coerces logical vectors to numeric type and then plots them; for character vectors it just gives up. Write two functions that create bar charts for these types of vectors, and make them plot methods.

  2. Write a mean method for data frames.

Last revised: 12/26/2014