Prediction Intervals

In Section 2.1, we showed how to use least squares estimators to predict the lottery sales for a zip code, outside of our sample, having a population of 10,000. Because prediction is such an important task for actuaries, we formalize the procedure so that it can be used on a regular basis.

To predict an additional observation, we assume that the level of the explanatory variable is known and is denoted by \(x_{\ast}\). For example, in our previous lottery sales example we used \(x_{\ast} = 10{,}000\). We also assume that the additional observation follows the same linear regression model as the observations in the sample.

Using our least squares estimators, our point prediction is \(\widehat{y}_{\ast} = b_0 + b_1 x_{\ast}\), the height of the fitted regression line at \(x_{\ast}\). We may decompose the prediction error into two parts:

\begin{equation*}
\begin{array}{ccccc}
\underbrace{y_{\ast} - \widehat{y}_{\ast}} & = & \underbrace{\beta_0 - b_0 + \left( \beta_1 - b_1 \right) x_{\ast}} & + & \underbrace{\varepsilon_{\ast}} \\
\text{prediction error} & {\small =} & \text{error in estimating the } & {\small +} & \text{deviation of the additional } \\
& & \text{regression line at }x_{\ast} & & \text{response from its mean}
\end{array}
\end{equation*}
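
As a brief sketch of how the two parts combine, assume (as in the basic linear regression model) that the errors are uncorrelated with a common variance; the symbol \(\sigma^2\) for that variance is notation assumed here rather than taken from this section. Because the additional deviation \(\varepsilon_{\ast}\) is uncorrelated with the sample used to compute \(b_0\) and \(b_1\), the variances of the two parts add:
\begin{equation*}
\mathrm{Var}\left( y_{\ast} - \widehat{y}_{\ast} \right) = \underbrace{\sigma^2 \left[ \frac{1}{n} + \frac{\left( x_{\ast} - \overline{x} \right)^2}{(n-1)s_x^2} \right]}_{\text{estimating the regression line at } x_{\ast}} + \underbrace{\sigma^2}_{\text{additional response}} = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{\left( x_{\ast} - \overline{x} \right)^2}{(n-1)s_x^2} \right].
\end{equation*}
Replacing \(\sigma^2\) by its estimate \(s^2\) and taking the square root gives a standard error for the prediction.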

It can be shown that the standard error of the prediction is \begin{equation*} se(pred) = s \sqrt{1+\frac{1}{n}+\frac{\left( x_{\ast}-\overline{x}\right)^2}{(n-1)s_x^2}}. \end{equation*} As with \(se(b_1)\), the terms \(n^{-1}\) and \(\left( x_{\ast}-\overline{x} \right)^2/\left[ (n-1)s_x^2\right]\) become close to zero as the sample size \(n\) becomes large. Thus, for large \(n\), we have that \(se(pred)\approx s\), reflecting that the error in estimating the regression line at a point becomes negligible and the deviation of the additional response from its mean becomes the entire source of uncertainty.
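
The large-sample behavior can be checked numerically. The following is a minimal Python sketch of the standard error formula; the function name `se_pred` and the input values are hypothetical, chosen only for illustration, with \(s\), \(\overline{x}\) and \(s_x\) held fixed while \(n\) grows.

```python
import math


def se_pred(s, n, x_star, x_bar, s_x):
    """Standard error of prediction at x_star for a one-variable regression.

    s      : residual standard deviation
    n      : sample size
    x_star : level of the explanatory variable for the new observation
    x_bar  : sample mean of the explanatory variable
    s_x    : sample standard deviation of the explanatory variable
    """
    return s * math.sqrt(
        1.0 + 1.0 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    )


# Hypothetical inputs (not from the lottery data).
s, x_bar, s_x, x_star = 100.0, 50.0, 10.0, 60.0

for n in (10, 100, 1000, 100000):
    # The printed standard error shrinks toward s = 100 as n grows.
    print(n, round(se_pred(s, n, x_star, x_bar, s_x), 3))
```

The standard error never falls below \(s\), because the "1" under the square root, the contribution of the additional response, does not depend on \(n\).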

Definition. A \(100(1-\alpha)\)% prediction interval at \(x_{\ast}\) is \begin{equation}\label{E2:predinteval} \widehat{y}_{\ast} \pm t_{n-2,1-\alpha /2} ~se(pred) \end{equation} where the \(t\)-value \(t_{n-2,1-\alpha /2}\) is the same as used for hypothesis testing and the confidence interval.
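
To make the procedure repeatable, the definition can be wrapped in a small routine. The sketch below is one possible Python version, not a definitive implementation: the function name `prediction_interval` and the five data points are hypothetical. The \(t\)-value is supplied by the caller (for example, from a table), because the Python standard library does not provide \(t\)-distribution quantiles.

```python
import math


def prediction_interval(xs, ys, x_star, t_value):
    """Fit y = b0 + b1*x by least squares and return
    (point prediction, se(pred), (lower, upper)) at x_star.

    t_value is the t-value t_{n-2, 1-alpha/2}, supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sum of squared deviations of x; this equals (n - 1) * s_x^2.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Least squares slope and intercept.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual standard deviation s, with n - 2 degrees of freedom.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x_star
    se = s * math.sqrt(1.0 + 1.0 / n + (x_star - x_bar) ** 2 / sxx)
    half_width = t_value * se
    return y_hat, se, (y_hat - half_width, y_hat + half_width)


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# With n = 5 there are 3 degrees of freedom; the 97.5th percentile is about 3.182.
y_hat, se, (lower, upper) = prediction_interval(xs, ys, x_star=6.0, t_value=3.182)
print(y_hat, se, lower, upper)
```

If SciPy is available, `scipy.stats.t.ppf(1 - alpha / 2, n - 2)` is one way to obtain the \(t\)-value rather than reading it from a table.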

For example, the point prediction at \(x_{\ast} = 10{,}000\) is \(\widehat{y}_{\ast} = 469.7 + 0.647 (10{,}000) = 6{,}939.7\). The standard error of this prediction is \begin{equation*} se(pred) = 3{,}792 \sqrt{1+\frac{1}{50} + \frac{\left( 10{,}000-9{,}311\right)^2}{(50-1)(11{,}098)^2}} = 3{,}829.6. \end{equation*} With a \(t\)-value equal to 2.011, this yields an approximate 95% prediction interval \begin{equation*} 6{,}939.7 \pm (2.011)(3{,}829.6) = 6{,}939.7 \pm 7{,}701.3 = (-761.6, ~14{,}641.0). \end{equation*} We interpret these results by first pointing out that our best estimate of lottery sales for a zip code with a population of 10,000 is $6,939.70. Our 95% prediction interval represents a range of reliability for this prediction. If we could see many zip codes, each with a population of 10,000, on average we expect about 19 out of 20, or 95%, would have lottery sales between 0 and $14,641. It is customary to truncate the lower bound of the prediction interval to zero if negative values of the response are deemed to be inappropriate.
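
The arithmetic in this example can be retraced with a few lines of Python. This is only a sketch using the rounded summary values quoted above; the variable names are illustrative. Because the inputs are rounded, the computed standard error and interval endpoints agree with the figures in the text only to within a few tenths, presumably because the text works from unrounded values.

```python
import math

# Rounded summary values quoted in the lottery sales example.
b0, b1 = 469.7, 0.647          # least squares intercept and slope
s, n = 3792.0, 50              # residual standard deviation, sample size
x_bar, s_x = 9311.0, 11098.0   # mean and standard deviation of population
x_star = 10000.0               # population of the zip code to predict
t_value = 2.011                # t-value with n - 2 = 48 degrees of freedom

# Point prediction: height of the fitted line at x_star.
y_hat = b0 + b1 * x_star

# Standard error of the prediction.
se = s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2))

# 95% prediction interval, then the lower bound truncated at zero
# because negative sales are not meaningful.
lower = y_hat - t_value * se
upper = y_hat + t_value * se
lower_truncated = max(0.0, lower)

print(round(y_hat, 1), round(se, 1))
print(round(lower, 1), round(upper, 1), lower_truncated)
```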
