1.5 Regression and Sampling Designs

In this section, you learn how to:
  • Explain the assumptions underpinning a basic data-generating process
  • Contrast the distribution of a parent population with a statistic’s sampling distribution
  • Interpret the quality of the normal approximation to a statistic’s distribution in terms of the symmetry of the parent population’s distribution
  • Identify alternative terms for dependent and explanatory variables
  • Contrast causality with statistical control
  • Describe the link between regression and adverse selection in pricing


Approximating normality is an important issue in practical applications of linear regression. Parts I and II of this book focus on linear regression, covering basic regression concepts and sampling design. Part III turns to nonlinear regression, involving binary, count, and fat-tailed responses, where the normal is not the most helpful reference distribution. The ideas on basic concepts and design will also be used in the nonlinear setting.
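To make the idea of a statistic’s sampling distribution concrete, the following sketch simulates sample means drawn from a right-skewed parent population. The exponential parent and the sample sizes are illustrative assumptions, not data from the text; the point is that the sampling distribution of the mean becomes more symmetric, and hence better approximated by a normal distribution, as the sample size grows.

```python
# Sketch: the sampling distribution of the mean approaches normality.
# The exponential (right-skewed) parent population is an assumption
# chosen for illustration only.
import numpy as np

rng = np.random.default_rng(seed=42)

def sample_mean_skewness(parent_draw, n, n_samples=10_000):
    """Empirical skewness of the sampling distribution of the mean
    for samples of size n from a given parent population."""
    means = parent_draw((n_samples, n)).mean(axis=1)
    centered = means - means.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

exponential = lambda size: rng.exponential(scale=1.0, size=size)

# The skewness of the sample mean shrinks as n grows (roughly like 1/sqrt(n)),
# so the normal approximation improves with larger samples.
skew_small = sample_mean_skewness(exponential, n=5)
skew_large = sample_mean_skewness(exponential, n=100)
```

The more symmetric the parent population, the smaller the sample size needed before the normal approximation is adequate, which is why the symmetry of the parent distribution matters in the learning objectives above.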

In regression analysis, we focus on one measurement of interest and call this the dependent variable. Other measurements are used as explanatory variables. A goal is to compare differences in the dependent variable in terms of differences in the explanatory variables. As noted in Section 1.1, regression is used extensively in many scientific fields. Table 1.3 lists alternative terms that you may encounter as you read regression applications.

\begin{matrix}
\begin{array}{c}
\text{Table 1.3. Terminology for Regression Variables}
\end{array}\\
\small
\begin{array}{ll}\hline \text{y-Variable} & \text{x-Variable} \\ \text{Outcome of interest} & \text{Explanatory Variable} \\ \text{Dependent Variable} & \text{Independent Variable} \\
\text{Endogenous Variable} & \text{Exogenous Variable} \\ \text{Response} & \text{Treatment} \\ \text{Regressand} & \text{Regressor} \\
\text{Left-hand side Variable} & \text{Right-hand side Variable} \\ \text{Explained Variable} & \text{Predictor Variable} \\ \text{Output} & \text{Input} \\
\hline \end{array}
\end{matrix}

In the latter part of the nineteenth century and the early part of the twentieth century, statistics began to make an important impact on the development of experimental science. Experimental sciences often use designed studies, where the data are under the control of an analyst. Designed studies are performed in laboratory settings, where there are tight physical restrictions on every variable that a researcher thinks may be important. Designed studies also occur in larger field experiments, where the mechanisms for control differ from those in laboratory settings. Agriculture and medicine use designed studies. Data from a designed study are said to be experimental data.

To illustrate, a classic example is the yield of a crop such as corn, where each of several parcels of land (the observations) is assigned one of various levels of fertilizer. The goal is to ascertain the effect of fertilizer (the explanatory variable) on the corn yield (the response variable). Although researchers attempt to make parcels of land as much alike as possible, differences inevitably arise. Agricultural researchers use randomization techniques to assign different levels of fertilizer to each parcel of land. In this way, analysts can explain the variation in corn yields in terms of the variation in fertilizer levels. Through the use of randomization techniques, researchers using designed studies can infer that the treatment has a causal effect on the response. Chapter 6 discusses causality further.
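The randomization step described above can be sketched in a few lines. The numbers here (20 parcels, four fertilizer levels) are illustrative assumptions; the essential point is that shuffling the parcels before dealing them out across treatment levels breaks any systematic link between parcel characteristics and the assigned treatment.

```python
# Sketch of a randomized design: fertilizer levels are assigned to parcels
# at random, so parcel-to-parcel differences average out across the
# treatment groups. All numbers (20 parcels, 4 levels) are illustrative.
import random

random.seed(0)
parcels = list(range(20))      # 20 parcels of land (the observations)
levels = [0, 50, 100, 150]     # hypothetical fertilizer levels

# Shuffle the parcels, then deal them out evenly across the levels.
random.shuffle(parcels)
assignment = {parcel: levels[i % len(levels)]
              for i, parcel in enumerate(parcels)}

# A balanced design: every fertilizer level is applied to the same
# number of parcels.
counts = {lvl: sum(1 for v in assignment.values() if v == lvl)
          for lvl in levels}
```

Because each parcel is equally likely to receive any level, differences in average yield across the groups can be attributed to the fertilizer rather than to pre-existing differences among parcels.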

For actuarial science and other social sciences, designed studies are the exception rather than the rule. For example, if we want to study the effects of smoking on mortality, it is highly unlikely that we could get study participants to agree to be randomly assigned to smoker/nonsmoker groups for several years just so that we could observe their mortality patterns! As with the Section 1.1 Galton study, social science researchers generally work with observational data. Observational data are not under the control of the analyst.

With observational data, we cannot infer causal relationships, but we can readily introduce measures of association. To illustrate, in the Galton data, it is apparent that “tall” parents are likely to have “tall” children and, conversely, “short” parents are likely to have “short” children. Section 2.1 will introduce correlation and other measures of association. However, we cannot infer causality from the data. For example, there may be another variable, such as family diet, that is related to both variables. Good family diet could be associated with tall heights of both parents and adult children, whereas poor diet stifles growth. If this were the case, we would call family diet a confounding variable.

In designed experiments such as the Rand Health Insurance Experiment, we can control for the effects of variables such as health status through random assignment methods. In observational studies, we use statistical control rather than experimental control. To illustrate, for the Galton data, we might split our observations into two groups, one for “good family diet” and one for “poor family diet,” and examine the relationship between parents’ and children’s heights within each subgroup. This is the essence of the regression method: comparing a $y$ and an $x$ while “controlling for” the effects of other explanatory variables.
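The subgroup comparison just described can be illustrated with a small simulation. The data below are simulated for illustration only (the Galton data set records no diet variable): diet is constructed to shift both parents’ and children’s heights, so the two height variables are associated overall even though, within each diet subgroup, they are independent. Statistical control, i.e., splitting by the confounder, reveals this.

```python
# Sketch of statistical control: split observations by a confounder
# ("family diet") and measure the parent-child height association within
# each subgroup. Data are simulated for illustration; all parameter
# values (means, shifts, spreads) are assumptions.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 2_000

diet = rng.integers(0, 2, size=n)   # 0 = poor family diet, 1 = good
# Diet shifts both heights, creating an association between them even
# though, given diet, the two heights here are independent.
parent = 165 + 5 * diet + rng.normal(0, 5, size=n)
child = 165 + 5 * diet + rng.normal(0, 5, size=n)

# Association ignoring the confounder versus within each subgroup:
overall_r = np.corrcoef(parent, child)[0, 1]
within_r = [np.corrcoef(parent[diet == d], child[diet == d])[0, 1]
            for d in (0, 1)]
```

In this constructed example the overall correlation is positive while the within-subgroup correlations are near zero, showing how an uncontrolled confounder can manufacture an association between two variables.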

Of course, to use statistical control and regression methods, one must record family diet and any other variables that may confound the effect of parents’ heights on the heights of their adult children. The difficulty in designing studies is trying to imagine all of the variables that could possibly affect a response variable, an impossible task in most social science problems of interest. To give some guidance on when “enough is enough,” Section 6.2 will discuss measures of an explanatory variable’s importance and its impact on model selection.
