Spring 2015 Midterm

A chief executive officer (CEO) is the leader of a firm or organization. The CEO leads by developing and implementing strategic policy for the firm and is in charge of a management team responsible for daily operations, financial strength, and corporate social responsibilities.

The CEO also leads the firm in compensation. Generally, a CEO is the most highly paid person in a firm; CEO salaries sit at the top of the pyramid. Although in some industries certain employees, such as sales agents, earn more than the CEO, the broad rule is that CEO salaries form an effective upper bound for employee compensation. Thus, although very few managers ever become chief executive officers, there is a great deal of interest in CEO salaries: CEO compensation indirectly influences salaries for a large portion of the firm's workforce.

CEO salaries in the United States are of interest because of their relationship to salaries in international firms and to salaries of people outside Corporate America. Top managers in the United States have come under a great deal of criticism for being so highly paid compared to their international counterparts. Yet CEO compensation may not be out of line compared to that of top professionals in other fields. For example, Linden and Machan (1992, "Put Them at Risk!" Forbes Magazine, p. 158) compare CEO salaries with those of professionals such as actors, models, surgeons, and sports personalities, and find the compensation comparable.

Measuring annual compensation for a CEO is fraught with difficulties. Compensation clearly includes salary plus bonuses, that is, cash payments that may or may not be performance related. Other compensation is more difficult to measure and may include restricted stock awards and contributions to retirement, health insurance, and other employee benefit plans. Remuneration may also come in the form of stock gains based on the CEO’s stock ownership or exercise of stock options, although we did not consider this source of income.

The data for this study were drawn from the May 25, 1992 issue of Forbes Magazine entitled “What 800 Companies Paid for their Bosses.” This article provides several measures of CEO compensation, as well as characteristics of the CEO and measures of his firm’s performance. We say “his” because of the 800 CEOs studied in this article, only one was a woman. The goal of this report is to study CEO and firm characteristics to determine the important factors influencing CEO compensation.

To understand the determinants of CEO compensation, one hundred observations were randomly selected from the 800 listed in the Forbes article. Although the Forbes article did not cite the basis for a firm to be included in its survey, the 800 companies appear to be the largest publicly traded companies in the United States. Our sample of one hundred CEOs and their firms thus represents a cross-section of America's largest corporations. In our cross-section, the CEO and firm characteristics were based on 1991 measures.

Table 1 provides variable definitions.
$$
{\scriptsize
\begin{matrix}
{\large \text{Table 1. Variable Definitions} }\\
\begin{array}{ll} \hline
\text{Variable} & \text{Definition} \\
\hline
\text{COMP} & \text{Sum of salary, bonus, and other 1991 compensation, in thousands of dollars.} \\
 & ~~~\text{Other compensation does not include stock gains.} \\
\text{AGE} & \text{CEO's age, in years} \\
\text{SALES} & \text{1991 sales revenues, in millions of dollars} \\
\text{TENURE} & \text{Number of years employed by the firm} \\
\text{EXPER} & \text{Number of years as the firm's CEO} \\
\text{VAL} & \text{Market value of the CEO's stock, in thousands of dollars} \\
\text{PCTOWN} & \text{Percentage of the firm's market value owned by the CEO} \\
\text{PROF} & \text{1991 profits of the firm, before taxes, in millions of dollars} \\
\text{EDUCATN} & \text{Education level.} \\
 & 0 \text{ indicates that the CEO does not have an undergraduate degree} \\
 & 1 \text{ indicates that the CEO has only an undergraduate degree} \\
 & 2 \text{ indicates that the CEO has a graduate degree} \\
\text{BACKGRD} & \text{Categorical variable for the professional background of the CEO} \\ \hline
\end{array}
\end{matrix}
}
$$

Part I. Preliminary Summarization.

1. A preliminary examination of the data shows that the 51st observation had an unusually low compensation. This was Craig McCaw, CEO of McCaw Cellular, who reported a salary of $155,000 in 1991, despite a five-year total reported salary of over fifty-three million dollars. As founder of McCaw Cellular, Mr. McCaw received a substantial amount of remuneration outside of the figures reported for 1991. Omit him from the sample.
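A minimal R sketch of this step, assuming the data have been read into a data frame called ceo (an illustrative name, not one used in the exam) with the variables of Table 1 and Mr. McCaw in row 51:

ceo <- ceo[-51, ]   # drop the 51st observation (Craig McCaw, McCaw Cellular)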
Solution

2. Create the variables LOGCOMP, the natural logarithm of COMP, LOGSALES, the natural logarithm of SALES and LOGVAL, the natural logarithm of VAL.

  • 2a. Create histograms of COMP and LOGCOMP; compare the two distributions, commenting in particular on the effect that the logarithmic transformation has on the symmetry.
  • 2b. Do this also for SALES and VAL.
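A sketch of the transformations and the first pair of histograms, continuing with the hypothetical ceo data frame; analogous calls handle SALES and VAL for part 2b.

ceo$LOGCOMP  <- log(ceo$COMP)     # natural logarithm of compensation
ceo$LOGSALES <- log(ceo$SALES)    # natural logarithm of sales
ceo$LOGVAL   <- log(ceo$VAL)      # natural logarithm of stock value

par(mfrow = c(1, 2))
hist(ceo$COMP,    main = "COMP",    xlab = "Compensation (thousands of dollars)")
hist(ceo$LOGCOMP, main = "LOGCOMP", xlab = "Logarithmic compensation")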

Solution

3. Compute summary statistics of the continuous variables COMP, LOGCOMP, AGE, SALES, LOGSALES, TENURE, EXPER, VAL, LOGVAL, PCTOWN, and PROF. Identify the median value of each variable.
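A sketch of the summary statistics; summary() reports the median of each variable directly, and the sapply() line collects the medians alone.

cts <- c("COMP", "LOGCOMP", "AGE", "SALES", "LOGSALES", "TENURE",
         "EXPER", "VAL", "LOGVAL", "PCTOWN", "PROF")
summary(ceo[, cts])
sapply(ceo[, cts], median, na.rm = TRUE)   # medians only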
Solution

Part II. Basic Linear Regression.

1. Plot SALES versus COMP and then LOGSALES versus LOGCOMP. Discuss the difficulties in modeling the relationship between SALES and COMP that are not apparent in the relationship between LOGSALES and LOGCOMP.
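A sketch of the two scatter plots, using the hypothetical ceo data frame:

par(mfrow = c(1, 2))
plot(ceo$SALES,    ceo$COMP,    xlab = "SALES",    ylab = "COMP")
plot(ceo$LOGSALES, ceo$LOGCOMP, xlab = "LOGSALES", ylab = "LOGCOMP")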
Solution

2. Compute correlations among the continuous variables COMP, LOGCOMP, AGE, SALES, LOGSALES, TENURE, EXPER, VAL, LOGVAL, PCTOWN, and PROF. Identify the variable (excluding LOGCOMP) that seems to have the strongest relationship with COMP. Also, identify the variable (excluding COMP) that seems to have the strongest relationship with LOGCOMP.
Solution

3. Fit a basic linear model, using LOGCOMP as the outcome of interest and LOGSALES as the explanatory variable.

  • 3a. Interpret the coefficient associated with LOGSALES as an elasticity.
  • 3b. Provide 90% and 99% confidence intervals for your answer in 3a.
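A sketch of the fit and the two confidence intervals; the object name model1 is illustrative.

model1 <- lm(LOGCOMP ~ LOGSALES, data = ceo)
summary(model1)                             # the LOGSALES slope is the elasticity in 3a
confint(model1, "LOGSALES", level = 0.90)   # 90% confidence interval
confint(model1, "LOGSALES", level = 0.99)   # 99% confidence interval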

Solution

Part III. Multiple Linear Regression - I.

1. Create a binary variable, PERCENT5, that indicates whether the CEO owns more than five percent of the firm's stock. Create another binary variable, GRAD, that indicates EDUCATN=2.
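A sketch of the two binary variables, again using the hypothetical ceo data frame:

ceo$PERCENT5 <- as.numeric(ceo$PCTOWN > 5)     # 1 if the CEO owns more than five percent
ceo$GRAD     <- as.numeric(ceo$EDUCATN == 2)   # 1 if the CEO has a graduate degree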
Solution

2. Run a regression model using LOGCOMP as the outcome of interest and four explanatory variables, LOGSALES, GRAD, PERCENT5, and EXPER.

  • 2a. Interpret the sign of the coefficient associated with GRAD. Comment also on the statistical significance of this variable.
  • 2b. For this model fit, is EXPER a statistically significant variable? To respond to this question, use a formal test of hypothesis. State your null and alternative hypotheses, decision-making criterion, and decision-making rule. Use a 10% significance level.
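A sketch of the four-variable fit and the critical value for the two-sided 10% t-test on EXPER; the object name model_four is illustrative.

model_four <- lm(LOGCOMP ~ LOGSALES + GRAD + PERCENT5 + EXPER, data = ceo)
summary(model_four)
# For part 2b, compare the t-statistic on EXPER with the two-sided 10% cut-off:
qt(0.95, df = df.residual(model_four))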

Solution

Part III. Multiple Linear Regression - I (continued).

We run a regression model using LOGCOMP as the outcome of interest and five explanatory variables: LOGSALES, GRAD, PERCENT5, EXPER, and LOGVAL. Correlations and the fitted regression model appear below.

III.3

  • a. Determine the partial correlation coefficient between EXPER and LOGCOMP, controlling for other explanatory variables.
  • b. Compare the usual (marginal) correlation coefficient between EXPER and LOGCOMP to the partial correlation calculated in part (a). Contrast the different impressions these two coefficients give and describe why differences may arise for this data set.
Table. Correlation Coefficients
> round(cor(cbind(LOGCOMP,LOGSALES,GRAD,PERCENT5,EXPER,LOGVAL)),digits=3)
         LOGCOMP LOGSALES   GRAD PERCENT5  EXPER LOGVAL
LOGCOMP    1.000    0.496 -0.331   -0.181  0.216  0.366
LOGSALES   0.496    1.000 -0.159   -0.034 -0.062  0.114
GRAD      -0.331   -0.159  1.000   -0.256 -0.207 -0.402
PERCENT5  -0.181   -0.034 -0.256    1.000  0.247  0.530
EXPER      0.216   -0.062 -0.207    0.247  1.000  0.535
LOGVAL     0.366    0.114 -0.402    0.530  0.535  1.000

Fitted Regression Model
> model2 <- lm(LOGCOMP ~ LOGSALES+GRAD+PERCENT5+EXPER+LOGVAL)
> summary(model2)

Call:
lm(formula = LOGCOMP ~ LOGSALES + GRAD + PERCENT5 + EXPER + LOGVAL)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9347 -0.2800  0.0077  0.2019  1.2599 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.916566   0.375611  13.090  < 2e-16 ***
LOGSALES     0.246090   0.044939   5.476 3.68e-07 ***
GRAD        -0.239013   0.098382  -2.429    0.017 *  
PERCENT5    -1.011556   0.179882  -5.623 1.95e-07 ***
EXPER        0.005557   0.006291   0.883    0.379    
LOGVAL       0.132218   0.029558   4.473 2.18e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4314 on 93 degrees of freedom
Multiple R-squared:  0.5313,	Adjusted R-squared:  0.5061 
F-statistic: 21.09 on 5 and 93 DF,  p-value: 4.908e-14
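One way to compute the partial correlation in part (a) is sketched below (not the graded solution): correlate the residuals from regressing LOGCOMP and EXPER, respectively, on the remaining explanatory variables. The equivalent t-statistic identity in the final lines uses only the output above.

e_y <- resid(lm(LOGCOMP ~ LOGSALES + GRAD + PERCENT5 + LOGVAL, data = ceo))
e_x <- resid(lm(EXPER   ~ LOGSALES + GRAD + PERCENT5 + LOGVAL, data = ceo))
cor(e_x, e_y)                    # partial correlation of EXPER and LOGCOMP
# Equivalently, with t = 0.883 and 93 residual degrees of freedom:
0.883 / sqrt(0.883^2 + 93)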

Solution

Part IV. Multiple Linear Regression - II.

Professional background (BACKGRD) of the CEO contains eleven categories, such as marketing, finance, accounting, insurance and so on. We use this factor to explain logarithmic compensation (LOGCOMP).

IV.1. The number of observations and the mean of LOGCOMP for each BACKGRD category are given in the table below. A boxplot is given in Figure 1. Describe what we learn from the table and boxplot about the effect of BACKGRD on LOGCOMP.

> cbind(summarize(LOGCOMP,BACKGRD,length),
+      round(summarize(LOGCOMP,BACKGRD,mean,na.rm=TRUE),digits=3))
   BACKGRD LOGCOMP BACKGRD LOGCOMP
       0       1       0   6.690
       1      17       1   7.064
       2       3       2   7.103
       3      13       3   6.916
       4      13       4   6.679
       5      12       5   6.553
       6       7       6   6.764
       7      12       7   7.067
       8       6       8   6.914
       9      14       9   6.621
      10       1      10   5.956
> boxplot(LOGCOMP~BACKGRD,ylab="LOGCOMP",xlab="BACKGRD")

Figure 1. Box plot of logarithmic compensation, by professional background

Solution

IV.2. Consider a regression model using only the factor, BACKGRD; the fitted output is below.

  • a. Provide an expression for the regression function for this model, defining each term.
  • b. Provide an expression for the fitted regression function, using the fitted output. Further, give the fitted value for an observation with BACKGRD = 0 and with BACKGRD = 1, both in logarithmic units as well as dollars.
  • c. Is BACKGRD a statistically significant determinant of LOGCOMP? State your null and alternative hypotheses, decision-making criterion, and your decision-making rule. (Hint: Use the \(R^2\) statistic to compute an F-statistic.)
> summary(lm(LOGCOMP ~ factor(BACKGRD)))

Call:
lm(formula = LOGCOMP ~ factor(BACKGRD))

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        6.68960    0.60605  11.038   <2e-16 ***
factor(BACKGRD)1   0.37441    0.62362   0.600    0.550
factor(BACKGRD)2   0.41388    0.69980   0.591    0.556
factor(BACKGRD)3   0.22644    0.62892   0.360    0.720
factor(BACKGRD)4  -0.01012    0.62892  -0.016    0.987
factor(BACKGRD)5  -0.13649    0.63079  -0.216    0.829
factor(BACKGRD)6   0.07449    0.64789   0.115    0.909
factor(BACKGRD)7   0.37771    0.63079   0.599    0.551
factor(BACKGRD)8   0.22431    0.65461   0.343    0.733
factor(BACKGRD)9  -0.06845    0.62732  -0.109    0.913
factor(BACKGRD)10 -0.73376    0.85708  -0.856    0.394
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.606 on 88 degrees of freedom
Multiple R-squared: 0.1248,     Adjusted R-squared: 0.0253
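A sketch of the F-statistic computed from the \(R^2\) statistic, as the hint suggests; the 5% significance level used here is an assumption for illustration. With eleven categories there are ten indicator variables and 99 observations.

R2 <- 0.1248
n  <- 99     # 100 CEOs less the omitted observation
k  <- 10     # indicator variables for the eleven BACKGRD categories
Fstat <- (R2 / k) / ((1 - R2) / (n - k - 1))
Fstat
qf(0.95, df1 = k, df2 = n - k - 1)   # critical value at the assumed 5% level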

Solution

Part V. Variable Selection.

V.1 We run a regression model using LOGCOMP as the outcome of interest and four explanatory variables, LOGSALES, GRAD, PERCENT5, and EXPER. Figure 2 shows a set of four diagnostic plots for this model.

  • a. In the upper left-hand panel is a plot of residuals versus fitted values. What type of model misspecification does this type of plot help detect?
  • b. Does the plot of residuals versus fitted values in Figure 2 reveal a serious model misspecification?
  • c. In the upper right-hand panel is a normal qq-plot. Describe this plot and say what type of model misspecification it helps to detect.
  • d. Does the normal qq-plot in Figure 2 reveal a serious model misspecification?
  • e. In the lower right-hand panel is a plot of standardized residuals versus leverages. Describe this plot and say what type of model misspecification it helps to detect.
  • f. Observation 87 appears in Figure 2. Is it a high leverage point? Describe the average leverage for this data set and give a rule-of-thumb cut-off for a point to be a high leverage point. (A sketch of these calculations follows Figure 2.)
  • g. Observation 87 appears in Figure 2. Is it an outlier? Give a rule-of-thumb cut-off for a point to be an outlier.

Figure 2. Diagnostic Plots of a Model of Logarithmic Compensation
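A sketch of the leverage and outlier quantities for parts (f) and (g), reusing the hypothetical model_four fit from the Part III sketch:

h <- hatvalues(model_four)
mean(h)                                # average leverage; equals (k + 1)/n = 5/99 here
3 * mean(h)                            # rule-of-thumb cut-off for a high leverage point
r <- rstandard(model_four)
sort(abs(r), decreasing = TRUE)[1:3]   # largest standardized residuals; compare with 2 (or 3)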

Solution

V.2 We run a regression model using LOGCOMP as the outcome of interest and four explanatory variables, LOGSALES, GRAD, PERCENT5, and EXPER. Figure 3 shows a plot of LOGVAL versus the standardized residuals from this model; the correlation between these two variables is 0.292. (A sketch of how such a plot is produced follows Figure 3.)

  • a. What do we hope to learn from a plot of a potential explanatory variable versus residuals from a model fit?
  • b. What new model does the information in Figure 3 suggest that we specify?

Figure 3. Plot of LOGVAL versus Standardized Residuals from a Model of Logarithmic Compensation
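A sketch of how such a plot and correlation can be produced from the four-variable fit:

plot(ceo$LOGVAL, rstandard(model_four),
     xlab = "LOGVAL", ylab = "Standardized residuals")
cor(ceo$LOGVAL, rstandard(model_four))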

Solution

Part VI. Some Algebra Problems.

VI.1 Regression through the origin. Consider the model \(y_i = \beta_1 z_i^2 + \varepsilon_i\), a quadratic model passing through the origin.

  • a. Determine the least squares estimate of \(\beta_1\).
  • b. Using the following set of \(n=5\) observations, give a numerical result for the least squares estimate of \(\beta_1\) determined in part (a).

\begin{equation*}
\begin{array}{l|rrrrr}
\hline
i & 1 & 2 & 3 & 4 & 5 \\
z_i & -2 & -1 & 0 & 1 & 2 \\
y_i & 4 & 0 & 0 & 1 & 4 \\ \hline
\end{array}
\end{equation*}
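A numerical check for part (b), using the closed form \(\widehat{\beta}_1 = \sum_i z_i^2 y_i / \sum_i z_i^4\) that follows from setting the derivative of the residual sum of squares to zero; the no-intercept lm() call gives the same value.

z <- c(-2, -1, 0, 1, 2)
y <- c( 4,  0, 0, 1, 4)
sum(z^2 * y) / sum(z^4)      # closed-form least squares estimate of beta_1
coef(lm(y ~ I(z^2) - 1))     # regression through the origin agrees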
Solution

VI.2 You are doing regression with one explanatory variable and so consider the basic linear regression model \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\).

  • a. Show that the \(i\)th leverage can be simplified to
    \begin{equation*}
    h_{ii} = \frac{1}{n} + \frac{(x_i - \overline{x})^2}{(n-1) s_x^2}.
    \end{equation*}
  • b. Show that \(\overline{h} = 2/n\).
  • c. Suppose that \(h_{ii} = 6/n\). How many standard deviations is \(x_i\) away (either above or below) from the mean?
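A small numerical check of the leverage identity in part (a), on arbitrary illustrative data; hatvalues() returns the diagonal of the hat matrix.

x <- c(1, 3, 4, 7, 10)
y <- c(2, 3, 5, 8,  9)   # arbitrary responses; any values work for this check
n <- length(x)
h_lm      <- hatvalues(lm(y ~ x))
h_formula <- 1/n + (x - mean(x))^2 / ((n - 1) * var(x))
all.equal(as.numeric(h_lm), h_formula)   # TRUE
mean(h_lm)                               # equals 2/n, as in part (b)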

Solution
