Partitioning the Variability

The squared deviations, \(\left( y_i-\overline{y}\right)^2\), provide a basis for measuring the spread of the data. If we wish to estimate the \(i\)th dependent variable without knowledge of \(x\), then \(\overline{y}\) is an appropriate estimate and \(y_i-\overline{y}\) represents the deviation of the estimate. We use \(Total~SS=\sum_{i=1}^{n}\left( y_i-\overline{y}\right)^2\), the total sum of squares, to represent the variation in all of the responses.
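As a small numerical illustration (the data values and variable names below are hypothetical, chosen only for demonstration), the total sum of squares can be computed directly in R:

\begin{verbatim}
# Hypothetical responses, for illustration only
y <- c(5, 7, 6, 10, 12)

ybar <- mean(y)                 # estimate without knowledge of x
total_ss <- sum((y - ybar)^2)   # Total SS
total_ss
\end{verbatim}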

Suppose now that we also have knowledge of \(x\), an explanatory variable. Using the fitted regression line, for each observation we can compute the corresponding fitted value, \(\widehat{y}_i = b_0 + b_1 x_i\). The fitted value is our estimate with knowledge of the explanatory variable. As before, the difference between the response and the fitted value, \(y_i-\widehat{y}_i\), represents the deviation of this estimate. We now have two “estimates” of \(y_i\): \(\widehat{y}_i\) and \(\overline{y}\). Presumably, if the regression line is useful, then \(\widehat{y}_i\) is a more accurate estimate than \(\overline{y}\). To judge this usefulness, we algebraically decompose the total deviation as:
\begin{equation}\label{E2:deviationdecomp}
\begin{array}{ccccc}
\underbrace{y_i-\overline{y}} & = & \underbrace{y_i-\widehat{y}_i} & + & \underbrace{\widehat{y}_i-\overline{y}} \\
\text{total} & {\small =} & \text{unexplained} & {\small +} & \text{explained} \\
\text{deviation} & & \text{deviation} & & \text{deviation}
\end{array}
\end{equation}
Interpret this equation as “the deviation without knowledge of \(x\) equals the deviation with knowledge of \(x\) plus the deviation explained by \(x\).” Figure 2.4 is a geometric display of this decomposition. In the figure, an observation above the line was chosen, yielding a positive deviation from the fitted regression line, to make the graph easier to read. A good exercise is to draw a rough sketch corresponding to Figure 2.4 with an observation below the fitted regression line.

Figure 2.4 Geometric display of the deviation decomposition.

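As a sketch of this decomposition, the following R code fits a least squares line to a small hypothetical data set (the values and variable names are illustrative, not the text's example) and displays the three deviations for a single observation:

\begin{verbatim}
# Hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(5, 7, 6, 10, 12)

fit  <- lm(y ~ x)      # least squares fitted line
yhat <- fitted(fit)    # fitted values b0 + b1*x_i
ybar <- mean(y)

i <- 4                                # pick one observation
total       <- y[i] - ybar            # deviation without knowledge of x
unexplained <- y[i] - yhat[[i]]       # deviation from the fitted line
explained   <- yhat[[i]] - ybar       # deviation explained by x
c(total, unexplained, explained)      # total = unexplained + explained
\end{verbatim}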

Now, from the algebraic decomposition in equation (2.1), square each side of the equation and sum over all observations. After a little algebraic manipulation, this yields
\begin{equation}\label{E2:ANOVADecomposition}
\sum_{i=1}^{n}\left( y_i-\overline{y}\right)^2=\sum_{i=1}^{n}\left( y_i-\widehat{y}_i\right)^2+\sum_{i=1}^{n}\left( \widehat{y}_i-\overline{y}\right)^2 .
\end{equation}
We rewrite this as \(Total~SS=Error~SS+Regression~SS\), where \(SS\) stands for sum of squares. We interpret:

  • \(Total~SS\) as the total variation without knowledge of \(x\),
  • \(Error~SS\) as the total variation remaining after the introduction of \(x\), and
  • \(Regression~SS\) as the difference between the \(Total~SS\) and \(Error~SS\), or the total variation “explained” through knowledge of \(x\).

When squaring the right-hand side of equation (2.1), we have the cross-product term \(2\left( y_i-\widehat{y}_i\right) \left( \widehat{y}_i-\overline{y}\right)\). With the “algebraic manipulation,” one can check that the sum of these cross-products over all observations is zero; this follows from the least squares normal equations, which force the residuals \(y_i-\widehat{y}_i\) to sum to zero and to be uncorrelated with the explanatory variable, and hence with the fitted values. This result is not true for all fitted lines but is a special property of the least squares fitted line.
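To see this numerically, the sketch below (again with hypothetical data and variable names) checks both the sum of squares decomposition and the vanishing cross-product term:

\begin{verbatim}
# Hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(5, 7, 6, 10, 12)

fit  <- lm(y ~ x)
yhat <- fitted(fit)
ybar <- mean(y)

total_ss <- sum((y - ybar)^2)
error_ss <- sum((y - yhat)^2)
regr_ss  <- sum((yhat - ybar)^2)

total_ss - (error_ss + regr_ss)    # zero, up to rounding
sum((y - yhat) * (yhat - ybar))    # cross-product sum: zero, up to rounding
\end{verbatim}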

In many instances, the variability decomposition is reported through only a single statistic.


Definition. The coefficient of determination is denoted by the symbol \(R^2\), called “\(R\)-square,” and defined as
\begin{equation*}
R^2=\frac{Regression~SS}{Total~SS}.
\end{equation*}

We interpret \(R^2\) to be the proportion of variability explained by the regression line. In one extreme case, where the regression line fits the data perfectly, we have \(Error~SS=0\) and \(R^2=1\). In the other extreme case, where the regression line provides no information about the response, we have \(Regression~SS=0\) and \(R^2=0\). The coefficient of determination is constrained by the inequalities \(0 \leq R^2 \leq 1\), with larger values implying a better fit.
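Continuing the hypothetical numerical sketch from above, \(R^2\) can be computed directly from the sums of squares and compared with the value that R reports for the fitted model:

\begin{verbatim}
# Hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(5, 7, 6, 10, 12)

fit  <- lm(y ~ x)
yhat <- fitted(fit)
ybar <- mean(y)

r2 <- sum((yhat - ybar)^2) / sum((y - ybar)^2)   # Regression SS / Total SS
r2
summary(fit)$r.squared                           # same value, reported by R
\end{verbatim}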
