R For Researchers: Regression (OLS)

April 2015

Overview
Preliminaries
Formula specification
Fitting the model
Variable selection

This article is part of the R for Researchers series. For a list of topics covered by this series, see the Introduction article. If you're new to R we highly recommend reading the articles in order.

Overview

This article will introduce you to some of the commonly used functions for building ordinary least squares (OLS) models. Diagnostic tools for these models will be covered in the Regression Diagnostics article. These two aspects of modelling are done together in practice. They were separated in these articles to provide a focus on the each of these important areas of modelling.

Preliminaries

You will get the most from this article if you follow along with the examples in RStudio. Working the exercise will further enhance your skills with the material. The following steps will prepare your RStudio session to run this article's examples.

Start RStudio and open your RFR project.
Confirm that RFR (the name of your project) is displayed in the upper left corner of the RStudio window.
Open your SalAnalysis script.
Run all the commands in SalAnalysis script.

Formula specification

Regression models are specified as an R formula. The basic form of a formula is

\[response \sim term_1 + \cdots + term_p.\]

The $\sim$ is used to separate the response variable, on the left, from the terms of the model, which are on the right. A term is one of the following

Numeric variable: All numeric variable types result in a single continuous variable.
Logical variable: Results in a single indicator, also known as dummy variable.
Factor variable: Results in a set of indicator variables. The first level of the factor is the reference level and is not coded as an indicator variable. All other levels are encoded as indicator variables.
$term_1:term_2$: Creates a term from the interaction between terms $term_1$ and $term_2$. The encoding to variables is done based on the term type and the preceding terms rules. This interaction construction does not force the terms $term_1$ and $term_2$ as terms into the model.
$term_1*term_2$: This results in the same interaction term as $term_1:term_2$. This interaction form also forces terms $term_1$ and $term_2$ into the model.
$(term_1 + \cdots + term_j)$^$k$: Creates a term for each of the $j$ terms and all interactions up to order $k$ which can be formed from the $j$ terms.
I($expression$): The I() function is used when you need to use +, -, *, or ^ as math symbols to construct a term in the formula. This is commonly used to construct a quadratic term.
poly(x, degree=k): a term which is a jth order polynomial of the variable x. A poly term, like a factor, is a single term which translates into multiple variables in a model. Poly terms have some nice properties which result in several advantages when doing variable selection. These same nice properties make it more difficult to interpret the model after variable selection. Therefore it is common practice to use poly for model selection and then rerun the selected model using polynomial terms constructed from the variable after the variable selection process is complete. Poly() constructs k model variables which are orthogonal to each other. The model variable $c_j$ is the jth order term of the term. That is $c_j = (c_1)^j$, where $c_1$ is the linear contrast. This article does not make direct use of the orthogonality property of the $c_j$ model variables. You do not need to understand the implications of the orthogonal contrast since they will be removed from the model prior to interpreting the model. Here you only need to understand that poly() binds several model variables into a single term and that this will help us do variable selection.
-1: Removes the intercept from the model.

The following variables will be used in several examples.

A: factor with 3 levels "Level1", "Level2", and "Level3"
B: logical
C: numeric
D: numeric
E: poly(C, degree=2)
F: poly(D, degree=2)

Each of these examples shows the model variables which result from a formula.

A+E+D results in the following model variables
- intercept
- ALevel2: indicator variable
- ALevel3: indicator variable
- poly(C, degree=2)1: numeric variable
- poly(C, degree=2)2: numeric variable
- D: numeric variable
B*E results in the following model variables
- intercept
- BTRUE: indicator variable
- E1: numeric variable
- E2: numeric variable
- BTRUE:E1: numeric variable
- BTRUE:E2: numeric variable
A+B+C+A:B results in the following model variables
- intercept
- ALevel2: indicator variable
- ALevel3: indicator variable
- BTRUE: indicator variable
- C: numeric variable
- ALevel2:BTRUE: indicator variable
- ALevel3:BTRUE: indicator variable Note, (A+B)$^2$+C results in the same model variables. This is due to A:B being the second order interaction term of the variables A and B.
B*(C+D) results in the following model variables
- intercept
- BTRUE: indicator variable
- C: numeric variable
- D: numeric variable
- BTRUE:C: numeric variable
- BTRUE:D: numeric variable

Coding a logical variable as numeric with a value of 1 for TRUE and 0 for FALSE will result in the variable name being reported as the coefficient instead of its name with the word TRUE appended.

Exercise

Write formulas using the variables from above to produce models with the following variables

The variables A through C and the interactions between variable A and the other variables
The quadratic form of the variable D and the variable B

Fitting the model

The lm() function fits a model using Ordinary Least Squares (OLS.)

Syntax and use of the lm() function

lm(formula, weights=w, x=logical, data=dataFrame)

Returns a model object. This is a list of objects which result from fitting the model.

The formula parameter is of the form described above.

data is an optional parameter. dataFrame specifies the data.frame which contains the variables to be fit. R will look in the current environment for variables which are not found in the data.frame.

weights is an optional parameter. When present, a weighted fit is done using the w vector as the weights.

x is an optional parameter. When logical is TRUE, the x matrix (which is also know as the design matrix) is included the returned model object. The x matrix columns are the model variables generated by R from the formula. This can be useful when you need to see what variables R created from the formula terms.

We will use lm() to fit salary to the sex variable.

Enter the following command in your script and run it.
```
out <- lm(salary~sex,data=salary)
```
There are no console results from this command.

The summary() function provides a nice summary of a model object. You could also use the str() function to see the details of what is included in the model object.

Enter the following command in your script and run it.
```
summary(out)
```
The results of the above command are shown below.
```
Call:
lm(formula = salary ~ sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.290 -23.502  -6.828  19.710 116.455 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  101.002      4.809  21.001  < 2e-16 ***
sexMale       14.088      5.065   2.782  0.00567 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.03 on 395 degrees of freedom
Multiple R-squared:  0.01921,   Adjusted R-squared:  0.01673 
F-statistic: 7.738 on 1 and 395 DF,  p-value: 0.005667
```
The summary display starts with the call to lm which generated the model object.

The residual summary is the five number summary for the residuals. This can be used for as a quick check for skewed residuals.

The coefficients summary shows the value, standard error, and p-value for each coefficient. The p-values are from Wald tests of each coefficient being equal to zero. For OLS models this is equivalent to an F-test of nested models with the variable of interest being removed in the nested model.

The display ends with summary information on the model. This is the residual standard error, R squared of the model, and the F-test of the significance of the model verse the null model.

From the R output above, we see that R used female as the reference level for sex in the model. The intercept is for a female professor. The reference level is usually not an issue when there are only two levels. For rank there are three levels. When rank is included in a model, we will want AsstProf as the reference level. When we ordered the level of rank in the Data exploration article the AsstProf level was set as the reference level (listed first.) The model fit does not change with a change of the level used as the reference.

Variable selection

There are a number of methods for selecting an optimal model. These methods examine which variables to include and the form of the model. We will use the backward selection method without any intent to imply it is any better or worse than other methods. The backward selection method allows for the demonstration of commonly used functions to select a model. Our demonstration of this approach is also incomplete in that it only looks at a single possible optimal model. One would typically consider a number of possible optimal models before selecting an optimal model.

R has functions and parameters to support a number of criteria for selecting variables. Several common criteria are adjusted R squared, BIC, AIC, and the significance of model terms. Adjusted R squared is returned in the summary of the model object. The AIC() and BIC() functions are used to get these criteria values for a model.

Syntax for the AIC() and BIC() functions.

AIC(modelObj)
BIC(modelObj)

The criteria we will use is a test of the significance of a variable. For OLS this significance is determined with an F-test of the nested models. This is a test of the coefficients being equal to zero. The variables are retained if the coefficients are not likely to be zero.

A full model is needed to start the backwards selection. Our full model will fit salary to discipline, sex, rank, the quadratic form of yrs.service, quadratic form of yrs.since.phd, and the interactions between the discipline, sex, and rank variables and the yrs.service and yrs.since.phd variables. We do not include and interaction between the yrs.service and yrs.since.phd terms because they are correlated.

Recall the correlation between yrSer and yrSin.
```
cor(data.frame(salary$yrSer,salary$yrSin))
```

             salary.yrSer salary.yrSin
salary.yrSer    1.0000000    0.9096491
salary.yrSin    0.9096491    1.0000000

We will calculate the quadratic terms using poly() for use in variable selection. Calculating the poly variable before the call to lm(), produces easier to read results, simpler coefficient names are reported.

Enter the following commands in your script and run them.

yrSin2 <- poly(salary$yrSin, degree=2)
yrSer2 <- poly(salary$yrSer, degree=2)
out2 <- lm(salary~(dscpl+sex+rank)*yrSin2 +
                  (dscpl+sex+rank)*yrSer2,
           data=salary)

There are no console results for these commands.

We will begin the backwards selection by addressing the known collinearity issues by determining if yrs.service or yrs.since.phd is more important to the fit, by our criterion. We start the backwards selection process with this decision to remove the known collinearity issues in the data. We will build models with each of these variables removed. The anova() function can be used to do an F-test of the nested models.

Syntax and use of the anova() function

anova(modelObj,nestedModObj)

Returns the test results between two nested models.

The anova() function defaults to the F-test for lm model objects. The returned information for the F-test is the difference in the sum of squares between the models, the F-statistic for this difference, and the p-value for the F-statistic.

Enter the following commands in your script and run them.

outSer <- lm(salary~(dscpl+sex+rank)*yrSer2,
             data=salary)
outSin <- lm(salary~(dscpl+sex+rank)*yrSin2,
             data=salary)
anova(out2,outSer)
anova(out2,outSin)

The results of the above commands are shown below.

Analysis of Variance Table

Model 1: salary ~ (dscpl + sex + rank) * yrSin2 + (dscpl + sex + rank) * 
    yrSer2
Model 2: salary ~ (dscpl + sex + rank) * yrSer2
  Res.Df    RSS  Df Sum of Sq      F    Pr(>F)    
1    372 177823                                   
2    382 192779 -10    -14956 3.1287 0.0007476 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of Variance Table

Model 1: salary ~ (dscpl + sex + rank) * yrSin2 + (dscpl + sex + rank) * 
    yrSer2
Model 2: salary ~ (dscpl + sex + rank) * yrSin2
  Res.Df    RSS  Df Sum of Sq      F   Pr(>F)   
1    372 177823                                 
2    382 191253 -10    -13430 2.8096 0.002272 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dropping yrs.service (retaining yrs.since.phd) causes a smaller decrease in the residual sum of squares than dropping yrs.since.phd (retaining yrs.service) on the same degrees of freedom. That is we lose the least amount of fit dropping yrs.service. We will retain yrs.since.phd in the model and remove yrs.service from the model.

We will use the step() function to investigate if there are other variables which can be removed. The step function compares the current model with all nested models which have a single term removed. If none of the models with a single term removed are better than the current model, step is done. Otherwise step picks the best model from the one-term-dropped models and repeats the process until no further improvement in the model can be made by dropping a term.

syntax and use of the step() function.

step(modelObj, scope=maxFormula, test="criteria")

Returns an object containing information on the steps taken. Printing this object provides a display of the steps to the console.

The test parameter is optional, the default criteria is "AIC". It can also take the values "F" and "LRT".

The scope parameter is used in forward selection. Forward selection typically starts with null model as the modelObj and considers models bounded by the maxFormula, the largest model which is to be considered.

We will use step() with our model using years since phd.

Enter the following commands in your script and run them.
```
step(outSin, test="F")
```

The results of the above command are shown below.

Start:  AIC=2482.43
salary ~ (dscpl + sex + rank) * yrSin2

               Df Sum of Sq    RSS    AIC F value Pr(>F)
- sex:yrSin2    2     576.6 191830 2479.6  0.5759 0.5627
- dscpl:yrSin2  2     986.0 192239 2480.5  0.9847 0.3745
- rank:yrSin2   4    3239.1 194492 2481.1  1.6174 0.1691
<none>                      191253 2482.4               

Step:  AIC=2479.63
salary ~ dscpl + sex + rank + yrSin2 + dscpl:yrSin2 + rank:yrSin2

               Df Sum of Sq    RSS    AIC F value Pr(>F)
- dscpl:yrSin2  2     885.4 192715 2477.5  0.8862 0.4131
- sex           1     527.5 192357 2478.7  1.0559 0.3048
- rank:yrSin2   4    3795.7 195626 2479.4  1.8995 0.1098
<none>                      191830 2479.6               

Step:  AIC=2477.46
salary ~ dscpl + sex + rank + yrSin2 + rank:yrSin2

              Df Sum of Sq    RSS    AIC F value    Pr(>F)    
- sex          1     632.7 193348 2476.8  1.2673   0.26097    
<none>                     192715 2477.5                      
- rank:yrSin2  4    4602.9 197318 2478.8  2.3048   0.05783 .  
- dscpl        1   18675.7 211391 2512.2 37.4066 2.352e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=2476.76
salary ~ dscpl + rank + yrSin2 + rank:yrSin2

              Df Sum of Sq    RSS    AIC F value  Pr(>F)    
<none>                     193348 2476.8                    
- rank:yrSin2  4    4899.8 198248 2478.7  2.4518 0.04561 *  
- dscpl        1   18914.6 212263 2511.8 37.8590 1.9e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Call:
lm(formula = salary ~ dscpl + rank + yrSin2 + rank:yrSin2, data = salary)

Coefficients:
          (Intercept)                 dscplB          rankAssocProf  
                27.61                  14.26                  55.96  
             rankProf                yrSin21                yrSin22  
                85.44                -956.51                -345.36  
rankAssocProf:yrSin21       rankProf:yrSin21  rankAssocProf:yrSin22  
               888.07                1133.18                 353.31  
     rankProf:yrSin22  
               185.13

The final model from step is shown at the end of the step display. This is the best fit model using the the "F" test as the criteria.

The final model does not include sex. The data set was collected to examine the differences in salaries between genders. We want to retain sex in our model to be able to answer this question. We will use the smallest model which includes sex from the step function. From the step results we see that the model which included dscpl, sex, rank, yrSin2, and rank:yrSin2 is the best model from step which included the variable for sex. The step result for this model show that the p-value for the test of rank:yrSin2 is .05783. We will drop the rank:yrSin2 term and retain sex in our model.

We will fit this model to continue our backward selection process.

Enter the following command in your script and run it.

outSin2 <- lm(salary~dscpl + sex + rank + yrSin2,data=salary)

There are no console results from this command.

We need to test if there are additional variables which need to be removed from the model, while retaining the sex variable. We will use the drop1 function. The drop1, like the add1, function does one step of what the step() function does.

Syntax of the drop1() and add1() function

drop1(modelObj, test="criteria")
add1(modelObj, scope=maxFormula, test="criteria")

The returned object and parameters are as in the step() function.
Enter the following command in your script and run it.
```
drop1(outSin2, test="F")
```

The results of the above command are shown below.

Single term deletions

Model:
salary ~ dscpl + sex + rank + yrSin2
       Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>              197318 2478.8                      
dscpl   1   18774.4 216093 2512.9 37.1077 2.682e-09 ***
sex     1     929.6 198248 2478.7  1.8374   0.17604    
rank    2   28412.8 225731 2528.2 28.0790 4.049e-12 ***
yrSin2  2    3626.6 200945 2482.1  3.5840   0.02868 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are no additional terms we can drop from the model by our F-test criteria.

The use of the poly terms was useful for the variable selection process. We will now rerun the model using individual variables for the linear and quadratic terms of yrSin. These individual variables will be on their original scale, and will be easier to interpret. These individual variables will also be easier to work with for hypothesis testing and predicting.

Enter the following commands in your script and run them.

salary$yrSinSqr <- salary$yrSin^2  # this goes in data prep 
outSin3 <- lm(salary~dscpl + sex + rank + yrSin+yrSinSqr,data=salary)
summary(outSin3)

The results of the above commands are shown below.


Call:
lm(formula = salary ~ dscpl + sex + rank + yrSin + yrSinSqr, 
    data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.956 -13.315  -1.405   9.831  96.306 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   60.378762   5.329187  11.330  < 2e-16 ***
dscplB        14.199904   2.331061   6.092 2.68e-09 ***
sexMale        5.233598   3.860948   1.356  0.17604    
rankAssocProf  5.551081   5.033353   1.103  0.27077    
rankProf      34.100878   6.184019   5.514 6.38e-08 ***
yrSin          1.512625   0.565503   2.675  0.00779 ** 
yrSinSqr      -0.025037   0.009508  -2.633  0.00879 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.49 on 390 degrees of freedom
Multiple R-squared:  0.4569,    Adjusted R-squared:  0.4485 
F-statistic: 54.68 on 6 and 390 DF,  p-value: < 2.2e-16

The yrSinSqr can now be tested for significance. The results of an F-test are equivalent to the Wald based t-test of the coefficient. Using the drop1() function provides the same p-value as in the summary() function above

Enter the following commands in your script and run them.
```
drop1(outSin3, test="F")
```

The results of the above command are shown below.

Single term deletions

Model:
salary ~ dscpl + sex + rank + yrSin + yrSinSqr
         Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>                197318 2478.8                      
dscpl     1   18774.4 216093 2512.9 37.1077 2.682e-09 ***
sex       1     929.6 198248 2478.7  1.8374  0.176036    
rank      2   28412.8 225731 2528.2 28.0790 4.049e-12 ***
yrSin     1    3619.9 200938 2484.1  7.1547  0.007791 ** 
yrSinSqr  1    3508.1 200826 2483.8  6.9337  0.008795 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is the best model from our backwards selection process.

The sex term was significant when considered by itself. This is seen in the p-value of .00567 for the sex coefficient in the "out" model. When accounting for the variation associated with discipline, rank, and years of service, sex is not significant as is seen by the p-value of .17604 in the "outSin3" model. This is a reminder that a variable being significant is with respect to the total set of terms in a model.

This process of variable selection was designed for instruction on the functions used for model selection. Models should not be selected independent of diagnostics. Model diagnostic tools are covered in the Regression Diagnostics article. Diagnostics should be run in parallel to the steps of model selections. We will see in the Diagnostic article that our selected model violates assumptions for OLS models.

Commit your changes to SalAnalysis.

Exercise

These exercises use the alfalfa dataset and the work you started on the alfAnalysis script. Open the script and run all the commands in the script to prepare your session for these problems.

Note, we will use the shade and irrig variables as continuous variables for these exercises. They could also be considered as factor variables. Since both represent increasing levels we first try to use them as scale.

Set the the reference level of the inoc variable to cntrl.
Create a quadratic poly term for the shade variable.
Regress yield on the irrig, inoc, the quadratic shade term, and all their interactions.
Use the backward selection method to reduce the model. Use the significance of the term as the criterion, as was done in the lesson.
Commit your changes to AlfAnalysis.

Solutions

Next: Regression Diagnostics

Previous: Data presentation

Last Revised: 8/27/2015