2.6 Building a Better Model: Residual Analysis

In this section, you learn how to:
  • Describe how diagnostic checking and residual analysis are used in a statistical analysis
  • Describe several model misspecifications commonly encountered in a regression analysis

Quantitative disciplines calibrate models with data. Statistics takes this one step further, using discrepancies between the assumptions and the data to improve model specification. We will examine the Section 2.2 modeling assumptions in light of the data and use any mismatch to specify a better model; this process is known as diagnostic checking, much as a physician runs diagnostic routines to check your health.

We will begin with the Section 2.2 error representation. Under this set of assumptions, the deviations $\{\varepsilon_i\}$ are independently and identically distributed (i.i.d.) and, under assumption F5, normally distributed. To assess the validity of these assumptions, one uses the (observed) residuals $\{e_i\}$ as approximations for the (unobserved) deviations $\{\varepsilon_i\}$. The basic theme is that if the residuals are related to a variable or display any other recognizable pattern, then we should be able to take advantage of this information and improve the model specification. Ideally, the residuals should contain little or no information, representing only natural variation from the sampling that cannot be attributed to any specific source. Residual analysis is the exercise of checking the residuals for such patterns.
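
To make the mechanics concrete, here is a minimal sketch of extracting residuals $e_i = y_i - \hat{y}_i$ from a simple linear regression fit. The data are simulated, and the simulated model, seed, and variable names are assumptions for illustration only, not from the text.

```python
# A minimal sketch of extracting residuals from a simple linear regression.
# The data are simulated; all names here are illustrative, not from the text.
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)   # true line plus i.i.d. noise

# Least-squares fit: b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

fitted = b0 + b1 * x
residuals = y - fitted   # e_i = y_i - yhat_i, stand-ins for the unobserved deviations

# With an intercept in the model, least squares forces the residuals to be
# uncorrelated with x, so patterns must be sought in plots against fitted
# values, other candidate variables, or time order.
print(np.corrcoef(residuals, x)[0, 1])   # essentially zero up to rounding
```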

There are five types of model discrepancies that analysts commonly look for. If detected, these discrepancies can be corrected with appropriate adjustments to the model specification.


Model Misspecification Issues
  1. Lack of Independence. There may exist relationships among the deviations $\{\varepsilon_i\}$ so that they are not independent.
  2. Heteroscedasticity. Assumption E3 indicates that all observations have a common (although unknown) variability, a condition known as homoscedasticity. Heteroscedasticity is the term used when the variability varies by observation (a numeric check appears in the sketch following this list).
  3. Relationships between Model Deviations and Explanatory Variables. If an explanatory variable has the ability to help explain the deviation $\varepsilon$, then one should be able to use this information to better predict $y$.
  4. Nonnormal Distributions. If the distribution of the deviation represents a serious departure from normality, then the usual inference procedures are no longer valid.
  5. Unusual Points. Individual observations may have a large effect on the regression model fit, meaning that the results may be sensitive to the impact of a single observation.
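
As a concrete illustration of item 2, the sketch below simulates data whose error standard deviation grows with the explanatory variable and checks whether the spread of the residuals is related to $x$. The simulated error model and the correlation check are assumptions for illustration; a formal treatment of heteroscedasticity waits until Section 5.7.

```python
# A hedged sketch of detecting heteroscedasticity (item 2): the simulated
# error standard deviation grows with x, so the residuals should fan out.
import numpy as np

rng = np.random.default_rng(seed=2)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # error sd proportional to x

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Under homoscedasticity, the absolute residuals |e_i| should be unrelated
# to x; a clearly positive correlation signals variability that grows with x.
print(np.corrcoef(np.abs(residuals), x)[0, 1])
```

In practice one would plot the residuals against $x$ or against the fitted values; the correlation of $|e_i|$ with $x$ is just a quick numeric stand-in for such a plot.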

This list will serve you throughout your study of regression analysis. Of course, with only an introduction to basic models, we have not yet seen the alternative models that might be used when we encounter these discrepancies. In this book's Part II on time series models, we will study lack of independence among data ordered over time. Section 5.7 will consider heteroscedasticity in further detail. The introduction to multiple linear regression in Chapter 3 will be our first look at handling relationships between $\{\varepsilon_i\}$ and additional explanatory variables. We have, however, already had an introduction to the effects of nonnormal distributions, seeing that a $qq$ plot can detect nonnormality and that transformations can help induce approximate normality. In this section, we discuss the effects of unusual points.

Much of residual analysis is done by examining a standardized residual, a residual divided by its standard error. An approximate standard error of the residual is $s$; in Chapter 3 we will give a precise mathematical definition. There are two reasons why we often examine standardized residuals in lieu of basic residuals. First, if responses are normally distributed, then standardized residuals are approximately realizations from a standard normal distribution. This provides a reference distribution against which to compare values of standardized residuals. For example, if a standardized residual exceeds two in absolute value, it is considered unusually large and the observation is called an outlier. Second, because standardized residuals are dimensionless, we get carryover of experience from one data set to another. This is true regardless of whether the normal reference distribution is applicable.
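
The following sketch, again with simulated data, computes standardized residuals using $s = \sqrt{SSE/(n-2)}$ as the approximate standard error and flags observations exceeding two in absolute value. The planted outlier and the simulated model are assumptions for illustration; the exact standard error is deferred to Chapter 3, so this is only an approximation.

```python
# A minimal sketch of standardized residuals: divide each residual by s,
# the residual standard deviation, as an approximate standard error.
import numpy as np

rng = np.random.default_rng(seed=3)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)
y[10] += 6.0   # plant one unusual point for the sketch

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

n = len(y)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # s^2 = SSE / (n - 2)
standardized = residuals / s

# Flag outliers: standardized residuals exceeding two in absolute value.
print(np.flatnonzero(np.abs(standardized) > 2))
```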
