Example: Outliers and High Leverage Points

Consider the fictitious data set of 19 points plus three additional points, labeled A, B, and C, given in Figure 2.6 and Table 2.5. Think of the first 19 points as “good” observations that represent some phenomenon of interest. We want to investigate the effect of adding a single aberrant point.

\begin{matrix}
\begin{array}{c}\text{Table 2.5. 19 Base Points Plus Three Types of Unusual Observations}\end{array}\\ \scriptsize
\begin{array}{c|cccccccccc|ccc}
\hline
\text{Variables} & \multicolumn{10}{c|}{\text{19 Base Points}} & A & B & C \\
\hline
x & 1.5 & 1.7 & 2.0 & 2.2 & 2.5 & 2.5 & 2.7 & 2.9 & 3.0 & 3.5 & 3.4 & 9.5 & 9.5 \\
y & 3.0 & 2.5 & 3.5 & 3.0 & 3.1 & 3.6 & 3.2 & 3.9 & 4.0 & 4.0 & 8.0 & 8.0 & 2.5 \\
\hline
x & 3.8 & 4.2 & 4.3 & 4.6 & 4.0 & 5.1 & 5.1 & 5.2 & 5.5 & & & & \\
y & 4.2 & 4.1 & 4.8 & 4.2 & 5.1 & 5.1 & 5.1 & 4.8 & 5.3 & & & & \\
\hline
\end{array}
\end{matrix}

Figure 2.6 Scatterplot of the 19 base points plus three unusual points, labeled A, B, and C.

R Code for Figure 2.6
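The posted code is linked above. For a self-contained reconstruction, the sketch below rebuilds the data in Table 2.5 and draws a plot along the lines of Figure 2.6; object names such as x_base and unusual are our own, not the book's.

```r
## Sketch: rebuild the Table 2.5 data and plot it (names are our own).
x_base <- c(1.5, 1.7, 2.0, 2.2, 2.5, 2.5, 2.7, 2.9, 3.0, 3.5,
            3.8, 4.2, 4.3, 4.6, 4.0, 5.1, 5.1, 5.2, 5.5)
y_base <- c(3.0, 2.5, 3.5, 3.0, 3.1, 3.6, 3.2, 3.9, 4.0, 4.0,
            4.2, 4.1, 4.8, 4.2, 5.1, 5.1, 5.1, 4.8, 5.3)
unusual <- data.frame(x = c(3.4, 9.5, 9.5),
                      y = c(8.0, 8.0, 2.5),
                      label = c("A", "B", "C"))

plot(x_base, y_base, xlim = c(0, 10), ylim = c(0, 9),
     xlab = "x", ylab = "y")                        # 19 base points
points(unusual$x, unusual$y, pch = 16)              # three unusual points
text(unusual$x, unusual$y, unusual$label, pos = 3)  # label A, B, C
```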

To investigate the effect of each type of aberrant point, Table 2.6 summarizes the results of four separate regressions. The first regression uses only the nineteen base points; the other three use the nineteen base points plus one unusual observation at a time.

\begin{matrix}
\begin{array}{c}\text{Table 2.6. Results from Four Regressions}\end{array}\\ \scriptsize
\begin{array}{l|rrrrr}
\hline
\text{Data} & b_0 & b_1 & s & R^2(\%) & t(b_1) \\
\hline
\text{19 Base Points} & 1.869 & 0.611 & 0.288 & 89.0 & 11.71 \\
\text{19 Base Points} ~+~ A & 1.750 & 0.693 & 0.846 & 53.7 & 4.57 \\
\text{19 Base Points} ~+~ B & 1.775 & 0.640 & 0.285 & 94.7 & 18.01 \\
\text{19 Base Points} ~+~ C & 3.356 & 0.155 & 0.865 & 10.3 & 1.44 \\
\hline
\end{array}
\end{matrix}

R Code and Output for Table 2.6
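For comparison with the linked output, here is a minimal sketch that reproduces the four rows of Table 2.6, reusing x_base and y_base from the earlier snippet; the helper fit_stats is our own.

```r
## Sketch reproducing Table 2.6 (helper name fit_stats is ours).
fit_stats <- function(x, y) {
  fit  <- lm(y ~ x)
  smry <- summary(fit)
  c(b0     = coef(fit)[[1]],
    b1     = coef(fit)[[2]],
    s      = smry$sigma,
    R2.pct = 100 * smry$r.squared,
    t.b1   = smry$coefficients["x", "t value"])
}
round(rbind(
  "19 Base Points"     = fit_stats(x_base, y_base),
  "19 Base Points + A" = fit_stats(c(x_base, 3.4), c(y_base, 8.0)),
  "19 Base Points + B" = fit_stats(c(x_base, 9.5), c(y_base, 8.0)),
  "19 Base Points + C" = fit_stats(c(x_base, 9.5), c(y_base, 2.5))), 3)
```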

Table 2.6 shows that a regression line provides a good fit for the nineteen base points. The coefficient of determination, \(R^2\), indicates that about 89% of the variability has been explained by the line. The size of the typical error, s, is about 0.29, small compared to the scatter in the y-values. Further, the t-ratio for the slope coefficient is large.

When the outlier point A is added to the nineteen base points, the situation deteriorates dramatically. \(R^2\) drops from 89% to 53.7%, and s increases from about 0.29 to about 0.85. The fitted regression line itself does not change much, even though our confidence in the estimates has decreased.

An outlier is unusual in the y-value, but whether a y-value is unusual depends on the corresponding x-value. To see this, keep the y-value of point A the same but increase the x-value, and call the resulting point B.

When point B is added to the nineteen base points, the regression line appears to provide a better fit. Point B is close to the line generated by the nineteen base points alone, so the fitted regression line and the size of the typical error, s, do not change much. However, \(R^2\) increases from 89% to nearly 95%. If we think of \(R^2\) as \(1-(\text{Error SS})/(\text{Total SS})\), then adding point B increases Total SS, the total squared deviation in the y-values, while leaving Error SS relatively unchanged. Point B is not an outlier, but it is a high leverage point.
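We can verify this decomposition numerically by comparing the two sums of squares directly; the sketch below reuses x_base and y_base from the earlier snippets, and the helper ss is our own.

```r
## Compare the ANOVA pieces with and without point B (helper ss is ours).
ss <- function(x, y) {
  fit <- lm(y ~ x)
  c(Error.SS = sum(resid(fit)^2),
    Total.SS = sum((y - mean(y))^2))
}
ss(x_base, y_base)                  # 19 base points
ss(c(x_base, 9.5), c(y_base, 8.0))  # 19 base points + B
```

Error SS stays near 1.4 in both fits, while Total SS more than doubles (from about 12.8 to about 27.8), so the ratio Error SS / Total SS falls and \(R^2\) rises.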

To show how influential this point is, keep its x-value but drop the y-value considerably, and call the new point C. When this point is added to the nineteen base points, the situation deteriorates dramatically. \(R^2\) drops from 89% to 10%, and s more than triples, from 0.29 to 0.87. Further, the regression line coefficients change dramatically.

Most users of regression are at first surprised that one point in twenty can have such a dramatic effect on the regression fit. The fit of a regression line can always be improved by removing an outlier. If the point is a high leverage point but not an outlier, however, it is not clear whether the fit will be improved when the point is removed.
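Although the leverage statistic itself is developed formally in Chapter 5, R already reports leverage values through hatvalues(), so we can peek ahead; the sketch below (our own helper, reusing the earlier vectors) confirms that B and C, with x = 9.5 far from the mean of the x's, have high leverage while A does not.

```r
## Leverage of the added 20th point in each augmented fit.
lev20 <- function(px, py) {
  fit <- lm(c(y_base, py) ~ c(x_base, px))
  hatvalues(fit)[20]
}
c(A = lev20(3.4, 8.0), B = lev20(9.5, 8.0), C = lev20(9.5, 2.5))
## A's leverage is small (below the average of 2/20 = 0.10 for a line
## fitted to 20 points); B's and C's are several times larger.
```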

Simply because you can dramatically improve a regression fit by omitting an observation does not mean you should always do so! The goal of data analysis is to understand the information in the data. Throughout the text, we will encounter many data sets where the unusual points provide some of the most interesting information about the data. The goal of this subsection is to recognize the effects of unusual points; Chapter 5 will provide options for handling unusual points in your analysis.

All quantitative disciplines, such as accounting, economics, and linear programming, practice the art of sensitivity analysis. A sensitivity analysis describes the global changes in a system that result from a small local change in one element of the system. Examining the effects of individual observations on the regression fit is a type of sensitivity analysis.
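One concrete form of this sensitivity analysis is a leave-one-out refit: delete each observation in turn and record how the slope moves. A sketch for the nineteen base points plus C, reusing the earlier vectors (R's dfbeta() automates the same computation):

```r
## Leave-one-out sensitivity of the slope for the 19 base points plus C.
x20 <- c(x_base, 9.5)
y20 <- c(y_base, 2.5)
slope_wo <- sapply(seq_along(x20), function(i) {
  coef(lm(y20[-i] ~ x20[-i]))[[2]]   # slope with observation i deleted
})
round(slope_wo, 3)
## Deleting observation 20 (point C) moves the slope from 0.155 back to
## 0.611, the base-point value in Table 2.6; no other deletion comes close.
```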
