Example: Data-Snooping in Stepwise Regression

The idea of this illustration is due to Rencher and Pun (1980). Consider \(n=100\) observations of \(y\) and fifty explanatory variables, \(x_1, x_2, \ldots, x_{50}\). The data we consider here were simulated using independent standard normal random variates. Because the variables were simulated independently, we are working under the null hypothesis of no relation between the response and the explanatory variables, that is, \(H_0: \beta_1=\beta_2=\ldots=\beta_{50}=0\). Indeed, when the model with all fifty explanatory variables was fit, it turned out that \(s=1.142\), \(R^2=46.2\%\) and F-ratio \(= (\text{Regression MS})/(\text{Error MS}) = 0.84\). Using an F-distribution with \(df_1=50\) and \(df_2=49\), the 95th percentile is 1.604. In fact, 0.84 is the 27th percentile of this distribution, indicating that the p-value is 0.73. Thus, as expected, the data are in congruence with \(H_0\).
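The full-model fit can be sketched in R along these lines. The seed below is an arbitrary choice of ours, not the one behind the published numbers, so the fitted statistics will differ somewhat from \(s=1.142\), \(R^2=46.2\%\), and F-ratio = 0.84, but the qualitative conclusion is the same.

```r
# Simulate n = 100 observations of y and fifty independent
# standard normal explanatory variables (so the null model holds).
set.seed(2020)  # arbitrary illustrative seed
n <- 100
X <- as.data.frame(matrix(rnorm(n * 50), nrow = n))
colnames(X) <- paste0("x", 1:50)
y <- rnorm(n)  # response simulated independently of every x_j

# Fit the model with all fifty explanatory variables
fit_full <- lm(y ~ ., data = cbind(y = y, X))
summ <- summary(fit_full)
summ$sigma       # s, the residual standard error
summ$r.squared   # R^2
summ$fstatistic  # F-ratio, with df1 = 50 and df2 = 49

# 95th percentile of the F(50, 49) reference distribution
qf(0.95, df1 = 50, df2 = 49)  # about 1.6
# p-value of the observed F-ratio
pf(summ$fstatistic["value"], df1 = 50, df2 = 49, lower.tail = FALSE)
```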

Next, a stepwise regression with cutoff \(t\)-value = 2 was performed. Two variables were retained by this procedure, yielding a model with \(s=1.05\), \(R^2=9.5\%\) and F-ratio = 5.09. For an F-distribution with \(df_1=2\) and \(df_2=97\), the 95th percentile is F-value = 3.09. This indicates that the two variables are statistically significant predictors of \(y\). At first glance, this result is surprising. The data were generated so that \(y\) is unrelated to the explanatory variables. However, because F-ratio \(>\) F-value, the F-test indicates that two explanatory variables are significantly related to \(y\). The reason is that stepwise regression has performed many hypothesis tests on the data. For example, in Step 1, fifty tests were performed to find significant variables. Recall that a 5% level means that we expect to make roughly one mistake in 20. Thus, with fifty tests, we expect to find \(50 \times 0.05 = 2.5\) “significant” variables, even under the null hypothesis of no relationship between \(y\) and the explanatory variables.
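Base R's `step()` selects by AIC rather than by a \(t\) cutoff, so a \(t\)-threshold forward selection of the kind described here can be sketched by hand. The function `forward_select` and the seed below are our own illustrative choices, a simplified version of the procedure rather than Rencher and Pun's exact algorithm.

```r
# Forward selection: at each step, add the candidate whose |t|-statistic
# is largest, provided it exceeds t_crit; stop when no candidate qualifies.
forward_select <- function(y, X, t_crit = 2) {
  selected <- character(0)
  remaining <- colnames(X)
  while (length(remaining) > 0) {
    tvals <- sapply(remaining, function(v) {
      dat <- data.frame(y = y, X[, c(selected, v), drop = FALSE])
      fit <- lm(y ~ ., data = dat)
      abs(summary(fit)$coefficients[v, "t value"])
    })
    if (max(tvals) < t_crit) break
    best <- remaining[which.max(tvals)]
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected
}

# Under the null model, a 5% level per candidate test means we expect
# about 50 * 0.05 = 2.5 spuriously "significant" variables in Step 1.
set.seed(2020)  # arbitrary illustrative seed
X <- as.data.frame(matrix(rnorm(100 * 50), nrow = 100))
colnames(X) <- paste0("x", 1:50)
y <- rnorm(100)
kept <- forward_select(y, X, t_crit = 2)
kept  # typically a few variables survive, despite no real relation
```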

To continue, a stepwise regression with cutoff \(t\)-value = 1.645 was performed. Six variables were retained by this procedure, yielding a model with \(s=0.99\), \(R^2=22.9\%\) and F-ratio = 4.61. As before, an F-test indicates a significant relationship between the response and these six explanatory variables.
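The same back-of-the-envelope count explains this result: a two-sided cutoff of 1.645 corresponds to roughly a 10% level per test, so under the null we would expect about five spurious variables among fifty candidates, in line with the six retained. A quick check, using the normal approximation to the \(t\) distribution:

```r
# Two-sided significance level implied by the cutoff 1.645
alpha <- 2 * (1 - pnorm(1.645))  # approximately 0.10
alpha

# Expected number of spuriously "significant" variables among 50 candidates
50 * alpha                       # approximately 5
```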

To summarize, using simulation we constructed a data set so that the explanatory variables have no relationship with the response. However, when using stepwise regression to examine the data, we “found” seemingly significant relationships between the response and certain subsets of the explanatory variables. This example illustrates a general caveat in model selection: when explanatory variables are selected using the data, t-ratios and F-ratios will be too large, thus overstating the importance of variables in the model.

Illustrative R Code

Stepwise regression and best subset regression are examples of automatic variable selection procedures. In your modeling work, you will find these procedures useful because they can quickly search through several candidate models. However, they ignore nonlinear alternatives as well as the effects of outliers and high leverage points. The main purpose of these procedures is to mechanize certain routine tasks. This automatic selection approach can be extended, and indeed a number of so-called “expert systems” are available on the market. For example, algorithms are available that “automatically” handle unusual points such as outliers and high leverage points. A model suggested by automatic variable selection procedures should be subject to the same careful diagnostic checking as a model arrived at by any other means.
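As one concrete automatic procedure, base R's `step()` performs stepwise selection by AIC (not by the \(t\)-cutoff used in the example above). A minimal sketch, again with an arbitrary seed of our own and data simulated under the null:

```r
# Automatic selection with base R's step(), which searches by AIC.
# Any model it proposes still needs the usual diagnostic checking.
set.seed(2020)  # arbitrary illustrative seed; results vary from draw to draw
dat <- as.data.frame(matrix(rnorm(100 * 51), nrow = 100))
colnames(dat) <- c("y", paste0("x", 1:50))

null_fit <- lm(y ~ 1, data = dat)  # intercept-only model
full_fit <- lm(y ~ ., data = dat)  # all fifty candidates

# Forward stepwise search from the intercept-only model
auto_fit <- step(null_fit, scope = formula(full_fit),
                 direction = "forward", trace = 0)
summary(auto_fit)  # even under the null, some variables are usually retained
```

Note that `step()` stops when no addition lowers the AIC, so it too can retain spurious variables under the null; the caveat about data-snooping applies just as much to AIC-based searches.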

