1.4 Sampling and the Role of Normality

A statistic is a summary measure of data, such as a mean, median, or percentile. Collections of statistics are very useful for analysts, decision-makers, and everyday consumers seeking to understand massive amounts of data that represent complex situations. To this point, our focus has been on introducing sensible techniques for summarizing variables, techniques that will be used repeatedly throughout this text. However, the true usefulness of the discipline of statistics lies in its ability to say something about the unknown, not merely to summarize information already available. To this end, we need to make some fairly formal assumptions about the manner in which the data are observed. As a science, a strong feature of statistics is the ability to critique these assumptions and offer improved alternatives in specific situations.
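
As a small illustration, the sketch below (in Python with numpy; the data vector is invented for the example) computes the three summary measures just mentioned:

```python
import numpy as np

# A small hypothetical sample (values invented for illustration).
y = np.array([12.0, 15.5, 9.8, 22.1, 17.3, 11.4, 19.9, 14.2])

print("mean:           ", np.mean(y))            # arithmetic average
print("median:         ", np.median(y))          # middle value
print("90th percentile:", np.percentile(y, 90))  # value below which 90% of the data fall
```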

It is customary to assume that the data are drawn from a larger population that we are interested in describing. The process of drawing the data is known as the data-generating process. We denote this sample by $\{y_1,\ldots,y_n\}$. So that we may critique and modify these sampling assumptions, we list them in detail below:


$$
\begin{array}{l}
\hline
\textit{Basic Sampling Assumptions} \\
\hline
1.~ \mathrm{E}\,y_i = \mu \\
2.~ \mathrm{Var}\,y_i = \sigma^2 \\
3.~ y_i \text{ are independent} \\
4.~ y_i \text{ are normally distributed.} \\
\hline
\end{array}
$$

In this basic setup, $\mu$ and $\sigma^2$ serve as parameters that describe the location and scale of the parent population. The goal is to infer something sensible about them based on statistics such as $\overline{y}$ and $s_y^2$. The third assumption is independence among the draws; in a sampling scheme, this may be guaranteed by taking a simple random sample from the population. The fourth assumption is not required for many statistical inference procedures because central limit theorems provide approximate normality for many statistics of interest. However, a formal justification of some statistics, such as $t$-statistics, requires this additional assumption.
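
To make these assumptions and the estimation goal concrete, here is a minimal simulation sketch; the parameter values ($\mu = 10$, $\sigma = 2$) and the sample size ($n = 100$) are arbitrary choices for illustration, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility

# Hypothetical parent population parameters (arbitrary choices).
mu, sigma, n = 10.0, 2.0, 100

# Draws satisfying the basic sampling assumptions: independent,
# normally distributed, with common mean mu and variance sigma^2.
y = rng.normal(loc=mu, scale=sigma, size=n)

ybar = y.mean()      # sample mean, an estimate of mu
s2 = y.var(ddof=1)   # sample variance (n - 1 divisor), an estimate of sigma^2

print(f"ybar = {ybar:.3f}   (mu      = {mu})")
print(f"s^2  = {s2:.3f}   (sigma^2 = {sigma**2})")
```

Rerunning with a different seed changes $\overline{y}$ and $s_y^2$ but not the parameters they estimate, which is the sense in which statistics summarize a sample while parameters describe the population.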

Section 1.8 provides an explicit statement of one version of the central limit theorem, giving conditions under which $\overline{y}$ is approximately normally distributed. That section also discusses a related result, known as an Edgeworth approximation, which shows that the quality of the normal approximation is better for symmetric parent populations than for skewed ones.
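
This behavior is easy to see by simulation. The sketch below compares the skewness of the simulated sampling distribution of $\overline{y}$ for a symmetric and a skewed parent; the particular parents (uniform and exponential), the sample size, and the replication count are all arbitrary choices for illustration. Skewness near zero is consistent with a better normal approximation:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, n_sims = 30, 10_000  # sample size and number of replications (arbitrary)

# Symmetric parent: uniform on [0, 1]. Skewed parent: exponential with mean 1.
means_sym = rng.uniform(0.0, 1.0, size=(n_sims, n)).mean(axis=1)
means_skew = rng.exponential(1.0, size=(n_sims, n)).mean(axis=1)

def skewness(x):
    """Sample skewness: average cubed standardized deviation."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

print("skewness of ybar, symmetric parent:", round(skewness(means_sym), 3))
print("skewness of ybar, skewed parent:   ", round(skewness(means_skew), 3))
```

The second value is noticeably farther from zero, reflecting the slower convergence to normality when the parent population is skewed.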

How does this discussion apply to the study of regression analysis? After all, so far we have focused only on the simple arithmetic average, $\overline{y}$. In subsequent chapters, we will emphasize that linear regression is the study of weighted averages; specifically, many regression coefficients can be expressed as weighted averages with appropriately chosen weights. Central limit and Edgeworth approximation theorems are available for weighted averages; these results ensure the approximate normality of regression coefficients. To use normal curve approximations in a regression context, we will often transform variables to achieve approximate symmetry.
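
As a preview of that weighted-average representation, the sketch below uses a hypothetical simulated data set to verify numerically that the least-squares slope can be written as $b_1 = \sum_i w_i y_i$, with weights $w_i = (x_i - \overline{x}) / \sum_j (x_j - \overline{x})^2$:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 25
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)  # hypothetical linear model

xbar = x.mean()

# Least-squares slope in its usual ratio form ...
b1 = np.sum((x - xbar) * (y - y.mean())) / np.sum((x - xbar) ** 2)

# ... and as a weighted average of the responses y_i.
w = (x - xbar) / np.sum((x - xbar) ** 2)
b1_weighted = np.sum(w * y)

print(b1, b1_weighted)  # the two computations agree to rounding error
```

The equivalence holds because $\sum_i (x_i - \overline{x})\,\overline{y} = 0$, so the usual numerator $\sum_i (x_i - \overline{x})(y_i - \overline{y})$ reduces to $\sum_i (x_i - \overline{x})\,y_i$.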
