Method of Least Squares

Now we begin to explore the question, “Can knowledge of population help us understand sales?” To respond to this question, we identify sales as the response, or dependent, variable. The population variable, which is used to help understand sales, is called the explanatory, or independent, variable.

Suppose that we have available a sample of fifty sales, $\{y_1, \ldots, y_{50} \}$, and that our job is to predict the sales of a randomly selected ZIP code. Without knowledge of the population variable, a sensible predictor is simply $\overline{y}=6,495$, the average of the available sample. Naturally, we anticipate that areas with larger populations will have larger sales. That is, if we also have knowledge of population, can this estimate be improved? If so, by how much?

To answer these questions, the first step is to assume an approximate linear relationship between $x$ and $y$. To fit a line to our data set, we use the method of least squares. We need a general technique so that, if different analysts agree on the data and agree on the fitting technique, then they will agree on the line. If different analysts fit a line by eye, they will in general arrive at different lines, even for the same data set.

The method begins with the line $y=b_0^{\ast}+b_1^{\ast}x$, where the intercept and slope, $b_0^{\ast}$ and $b_1^{\ast}$, are merely generic values. For the $i$th observation, $y_i-\left( b_0^{\ast}+b_1^{\ast} x_i \right)$ represents the deviation of the observed value $y_i$ from the line at $x_i$. The quantity $SS(b_0^{\ast},b_1^{\ast}) = \sum_{i=1}^{n} \left( y_i- \left( b_0^{\ast}+b_1^{\ast}x_i \right) \right) ^{2}$ represents the sum of squared deviations for this candidate line. The least squares method consists of determining the values of $b_0^{\ast}$ and $b_1^{\ast}$ that minimize $SS(b_0^{\ast},b_1^{\ast})$. This is an easy problem that can be solved by calculus, as follows. Taking partial derivatives with respect to each argument yields $\frac{\partial }{\partial b_0^{\ast}} SS(b_0^{\ast},b_1^{\ast})$ $=\sum_{i=1}^{n}(-2)\left( y_i-\left( b_0^{\ast}+b_1^{\ast}x_i \right) \right)$ and $\frac{\partial }{\partial b_1^{\ast}} SS(b_0^{\ast},b_1^{\ast})$ $=\sum_{i=1}^{n}(-2x_i)\left( y_i-\left( b_0^{\ast}+b_1^{\ast}x_i \right) \right)$. The reader is invited to take second partial derivatives to ensure that we are minimizing, not maximizing, this function. Setting these quantities equal to zero and canceling constant terms yields $\sum_{i=1}^{n} \left( y_i- \left( b_0^{\ast}+b_1^{\ast}x_i \right) \right) =0$ and $\sum_{i=1}^{n} x_i \left( y_i-\left( b_0^{\ast}+b_1^{\ast} x_i \right) \right) =0,$ which are known as the normal equations. Solving these equations yields the values of $b_0^{\ast}$ and $b_1^{\ast}$ that minimize the sum of squares, as follows.
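The normal equations say that, at the least squares solution, the residuals sum to zero and are uncorrelated with $x$. A minimal numerical sketch, using a small hypothetical data set, computes the closed-form slope and intercept and then verifies both normal equations:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Deviations of observed values from the fitted line.
resid = y - (b0 + b1 * x)

# First normal equation: residuals sum to zero.
print(np.isclose(resid.sum(), 0.0))
# Second normal equation: residuals are orthogonal to x.
print(np.isclose(np.sum(x * resid), 0.0))
```

Both checks print `True`; any other candidate line leaves at least one of these sums nonzero.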


Definition. The least squares intercept and slope estimates are

$b_1=r\frac{s_y}{s_x}~~~~~\mathrm{and}~~~~~b_0=\overline{y}-b_1 \overline{x}.$ The line that they determine, $\widehat{y}=b_0 +b_1 x$, is called the fitted regression line.
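These summary-statistic formulas give the same line as direct minimization of the sum of squares. A brief sketch, on hypothetical data, computes the slope as $r \, s_y/s_x$ (the form used in the lottery illustration) and cross-checks against NumPy's least squares polynomial fit:

```python
import numpy as np

# Hypothetical data to illustrate the definition.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.2, 4.1, 6.3, 7.4])

r = np.corrcoef(x, y)[0, 1]                # sample correlation
b1 = r * y.std(ddof=1) / x.std(ddof=1)     # slope from summary statistics
b0 = y.mean() - b1 * x.mean()              # intercept

# Cross-check: numpy.polyfit minimizes the same sum of squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(np.isclose(b1, slope), np.isclose(b0, intercept))
```

The agreement holds because $r = s_{xy}/(s_x s_y)$, so $r \, s_y/s_x = s_{xy}/s_x^2$, the usual ratio of sample covariance to sample variance.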


We have dropped the asterisk, or star, notation because $b_0$ and $b_1$ are no longer “candidate” values.

Does this procedure yield a sensible line for our Wisconsin lottery sales? Earlier, we computed $r=0.886$. From this and the basic summary statistics in Table 2.1, we have $b_1 = 0.886(8{,}103)/11{,}098 = 0.647$ and $b_0 = 6{,}495-(0.647)9{,}311 = 469.7$. This yields the fitted regression line $\widehat{y} = 469.7 + (0.647)x$. The carat, or “hat,” on top of the $y$ reminds us that this $\widehat{y}$, or $\widehat{SALES}$, is a fitted value. One application of the regression line is to estimate sales for a specific population, say $x=10{,}000$. The estimate is the height of the regression line, which is $469.7 + (0.647)(10{,}000) = 6{,}939.7$.
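This arithmetic is easy to reproduce from the rounded summary statistics quoted above (small discrepancies from the text's intercept of 469.7 arise because the text works with unrounded values):

```python
# Rounded summary statistics for the lottery data, as quoted in the text.
r, s_y, s_x = 0.886, 8_103, 11_098
ybar, xbar = 6_495, 9_311

b1 = r * s_y / s_x        # slope; rounds to 0.647
b0 = ybar - b1 * xbar     # intercept; close to the text's 469.7

# Fitted value at x = 10,000, using the text's rounded coefficients.
yhat = 469.7 + 0.647 * 10_000
print(round(b1, 3), yhat)   # → 0.647 6939.7
```

Thus an area with a population of 10,000 is predicted to have sales of about 6,940, noticeably different from the no-information prediction $\overline{y}=6{,}495$.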
