Example: Term Life Insurance – Continued – Actuarial Science and Analytics Resources

We now return to the marital status of respondents from the Survey of Consumer Finances (SCF). Recall that marital status is not measured continuously but rather takes on values that fall into distinct groups that we treat as unordered. In Chapter 3, we grouped survey respondents according to whether or not they are “single,” where being single includes those who are never married, separated, divorced, widowed, or not married and living with a partner. We now supplement this by considering the categorical variable MARSTAT, which represents the marital status of the survey respondent. This may be:

1, for married
2, for living with partner
0, for other (SCF further breaks down this category into separated, divorced, widowed, never married and inapplicable, persons age 17 or less, no further persons).

As before, the dependent variable is y = LNFACE, the amount that the company will pay in the event of the death of the named insured (in logarithmic dollars). Table 4.1 summarizes the dependent variable by level of the categorical variable. This table shows that the marital status “married” is the most prevalent in the sample and that those married choose to have the most life insurance coverage. Figure 4.1 gives a more complete picture of the distribution of LNFACE for each of the three types of marital status. The table and figure also suggest that those living together have less life insurance coverage than those in the other two categories.
\begin{matrix}\begin{array}{c}
\text{Table 4.1 Summary Statistics of Logarithmic Face By Marital Status}
\end{array}\\ \small
\begin{array}{lcccc} \hline
 & & & & \text{Standard} \\
 & \text{MARSTAT} & \text{Number} & \text{Mean} & \text{deviation} \\ \hline
\text{Other} & 0 & 57 & 10.958 & 1.566 \\
\text{Married} & 1 & 208 & 12.329 & 1.822 \\
\text{Living together} & 2 & 10 & 10.825 & 2.001 \\ \hline
\text{Total} & & 275 & 11.990 & 1.871 \\ \hline
\end{array} \end{matrix}
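
As a quick consistency check on Table 4.1, the overall mean should equal the count-weighted average of the three group means. Using the rounded entries from the table:

```latex
\begin{eqnarray*}
\bar{y} &=& \frac{57(10.958) + 208(12.329) + 10(10.825)}{275} \\
        &=& \frac{3297.288}{275} \approx 11.990 .
\end{eqnarray*}
```

This agrees with the Total row, and it shows why the overall mean sits close to the married group mean: that group contributes 208 of the 275 observations.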

R Code for Table 4.1

R-Code
Term <- read.table('http://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/CSVData/TermLife.csv', header=TRUE, sep=",")

#  PICK THE SUBSET OF THE DATA CORRESPONDING TO TERM PURCHASE
Term2 <- subset(Term, subset=FACE > 0)
names(Term2)

R-Code Output
 [1] "GENDER"             "AGE"                "MARSTAT"            "EDUCATION"          "ETHNICITY"         
 [6] "SMARSTAT"           "SGENDER"            "SAGE"               "SEDUCATION"         "NUMHH"             
[11] "INCOME"             "TOTINCOME"          "CHARITY"            "FACE"               "FACECVLIFEPOLICIES"
[16] "CASHCVLIFEPOLICIES" "BORROWCVLIFEPOL"    "NETVALUE"

R-Code
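#  attach() EXPOSES THE ORIGINAL COLUMNS OF Term2 BY NAME;
#  COLUMNS CREATED BELOW ARE REFERENCED THROUGH Term2 EXPLICITLY
attach(Term2)
#  LOG-TRANSFORM FACE AND INCOME; TREAT MARSTAT AS A CATEGORICAL FACTOR
Term2$LNFACE <- with(Term2, log(FACE))
Term2$LNINCOME <- with(Term2, log(INCOME))
Term2$MARSTAT <- as.factor(Term2$MARSTAT)
table(Term2$MARSTAT)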

R-Code Output
  0   1   2 
 57 208  10

R-Code
#  SUMMARY BY LEVEL OF MARSTAT
library(Rcmdr)
library(abind)
numSummary(Term2[, "LNFACE"], groups=Term2$MARSTAT, statistics=c("mean", "sd"))

R-Code Output
      mean       sd data:n
0 10.95842 1.566224     57
1 12.32909 1.822243    208
2 10.82507 2.000644     10

R-Code
numSummary(Term2[, "LNFACE"], statistics=c("mean", "sd"))

R-Code Output
     mean       sd   n
 11.99029 1.870728 275
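
For readers who prefer not to load Rcmdr, grouped summaries of this kind can also be produced with base R. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not the SCF data.

```r
# Hypothetical toy data (NOT the SCF data): a three-level factor and a response
toy <- data.frame(
  MARSTAT = factor(c(0, 0, 1, 1, 1, 2, 2)),
  LNFACE  = c(10, 12, 11, 13, 15, 9, 11)
)

# Count, mean and standard deviation of LNFACE within each level of MARSTAT
grp_n    <- tapply(toy$LNFACE, toy$MARSTAT, length)
grp_mean <- tapply(toy$LNFACE, toy$MARSTAT, mean)
grp_sd   <- tapply(toy$LNFACE, toy$MARSTAT, sd)

summary_tab <- data.frame(
  n    = as.vector(grp_n),
  mean = as.vector(grp_mean),
  sd   = as.vector(grp_sd),
  row.names = names(grp_mean)
)
print(summary_tab)

# Overall summary, analogous to the Total row of Table 4.1
overall <- c(n = nrow(toy), mean = mean(toy$LNFACE), sd = sd(toy$LNFACE))
print(overall)
```

Applied to the real data, the same calls with `Term2$LNFACE` and `Term2$MARSTAT` in place of the toy columns should give the quantities that numSummary reports.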

Figure 4.1 Box Plots of Logarithmic Face, by Level of Marital Status

R Code for Figure 4.1

R-Code
boxplot(LNFACE ~ MARSTAT, ylab="LNFACE", xlab="MARSTAT", data=Term2)

Are the continuous and categorical variables jointly important determinants of response? To answer this, a regression was run using LNFACE as the response and five explanatory variables, three continuous and two binary (for marital status). Recall that our three continuous explanatory variables are: LNINCOME (logarithmic annual income), the number of years of EDUCATION of the survey respondent and the number of household members, NUMHH.

For the binary variables, first define MAR0 to be the binary variable that is one if MARSTAT=0 and zero otherwise. Similarly, define MAR1 and MAR2 to be binary variables that indicate MARSTAT=1 and MARSTAT=2, respectively. There is a perfect linear dependency among these three binary variables in that MAR0 + MAR1 + MAR2 = 1 for any survey respondent. Thus, we need only two of the three. However, there is not a perfect dependency among any two of the three. It turns out that Corr(MAR0,MAR1) = -0.90, Corr(MAR0,MAR2) = -0.10 and Corr(MAR1,MAR2) = -0.34.

R Code to Compute Correlation

R-Code
#  MAKE BINARY VARIABLES
Term2$MAR0 <- with(Term2, 1*(MARSTAT == 0))
Term2$MAR1 <- with(Term2, 1*(MARSTAT == 1))
Term2$MAR2 <- with(Term2, 1*(MARSTAT == 2))

#  SIDE-BY-SIDE VIEW OF MARSTAT AND ITS BINARY VARIABLES (FOR VISUAL CHECKING)
Check1 <- data.frame(Term2$MARSTAT, Term2$MAR0, Term2$MAR1, Term2$MAR2)
#fix(Check1)

#  CHECK THE DEPENDENCIES AMONG MAR0, MAR1, MAR2
cor(Term2[, c("MAR0", "MAR1", "MAR2")], use="complete.obs")

R-Code Output
            MAR0       MAR1        MAR2
MAR0  1.00000000 -0.9009557 -0.09933133
MAR1 -0.90095572  1.0000000 -0.34227197
MAR2 -0.09933133 -0.3422720  1.00000000
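
Because the three binary variables indicate mutually exclusive categories, their correlations depend only on the group counts in Table 4.1. The sketch below rebuilds the indicators from the counts alone (no survey data are needed) and compares the result with the closed form Corr = -sqrt(p_a p_b / ((1 - p_a)(1 - p_b))), where p_a and p_b are sample proportions. The names `counts`, `ind` and `closed_form` are illustrative only.

```r
# Group counts taken from Table 4.1
counts <- c(MAR0 = 57, MAR1 = 208, MAR2 = 10)
n <- sum(counts)

# Rebuild a MARSTAT vector with those counts, then the three indicators
marstat <- rep(c(0, 1, 2), times = counts)
ind <- cbind(
  MAR0 = 1 * (marstat == 0),
  MAR1 = 1 * (marstat == 1),
  MAR2 = 1 * (marstat == 2)
)

# Perfect linear dependency: the indicators sum to one for every respondent
stopifnot(all(rowSums(ind) == 1))

# Pairwise correlations among the indicators
cor_ind <- cor(ind)
print(round(cor_ind, 4))

# Closed form for two indicators of mutually exclusive categories
p <- counts / n
closed_form <- function(a, b) {
  unname(-sqrt(p[a] * p[b] / ((1 - p[a]) * (1 - p[b]))))
}
print(c(closed_form("MAR0", "MAR1"),
        closed_form("MAR0", "MAR2"),
        closed_form("MAR1", "MAR2")))
```

The strong negative correlation between MAR0 and MAR1 reflects that these two categories together cover 265 of the 275 respondents, so a respondent who is not in one is almost always in the other.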

A regression model was run using LNINCOME, EDUCATION, NUMHH, MAR0 and MAR2 as explanatory variables. The fitted regression equation turns out to be \begin{eqnarray*} \widehat{y} &=& 3.395 + 0.452 \textrm{ LNINCOME} + 0.205 \textrm{ EDUCATION} + 0.248 \textrm{ NUMHH} \\ & & ~~ - 0.557 \textrm{ MAR0} - 0.789 \textrm{ MAR2}. \end{eqnarray*}

R Code for Regression

R-Code
summary(lm(LNFACE ~ LNINCOME+EDUCATION+NUMHH +MAR0+MAR2, data=Term2))

R-Code Output
Call:
lm(formula = LNFACE ~ LNINCOME + EDUCATION + NUMHH + MAR0 + MAR2, 
    data = Term2)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8875 -0.8505  0.1124  0.8468  4.5173 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.39477    0.90019   3.771 0.000200 ***
LNINCOME     0.45151    0.07872   5.736 2.61e-08 ***
EDUCATION    0.20467    0.03862   5.299 2.42e-07 ***
NUMHH        0.24770    0.06940   3.569 0.000424 ***
MAR0        -0.55707    0.25929  -2.148 0.032574 *  
MAR2        -0.78941    0.49532  -1.594 0.112169    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.513 on 269 degrees of freedom
Multiple R-squared:  0.358,	Adjusted R-squared:  0.3461 
F-statistic:    30 on 5 and 269 DF,  p-value: < 2.2e-16

To interpret the regression coefficients associated with marital status, consider a respondent who is married. In this case, MAR0=0, MAR1=1 and MAR2=0, so that \begin{eqnarray*} \widehat{y}_m &=& 3.395 + 0.452 \textrm{ LNINCOME} + 0.205 \textrm{ EDUCATION} + 0.248 \textrm{ NUMHH} . \end{eqnarray*} Similarly, if the respondent is coded as living together, then MAR0=0, MAR1=0 and MAR2=1, and
\begin{align} \widehat{y}_{lt} &= 3.395 + 0.452 \textrm{ LNINCOME} + 0.205 \textrm{ EDUCATION} + 0.248 \textrm{ NUMHH} \\ &\quad - 0.789. \end{align} The difference between \(\widehat{y}_m\) and \(\widehat{y}_{lt}\) is \(0.789\). Thus, we may interpret the regression coefficient associated with MAR2, -0.789, to be the difference in fitted values for someone living together compared to a similar person who is married (the omitted category).

Similarly, we can interpret -0.557 to be the difference between the “other” category and the married category, holding other explanatory variables fixed. For the difference in fitted values between the “other” and the “living together” categories, we may use \(-0.557 - (-0.789) = 0.232\).

Although the regression was run using MAR0 and MAR2, any two out of the three would produce the same ANOVA Table 4.2. However, the choice of binary variables does impact the regression coefficients. Table 4.3 shows three models, omitting MAR1, MAR2 and MAR0, respectively. For each fit, the coefficients associated with the continuous variables remain the same. As we have seen, the binary variable interpretations are with respect to the omitted category, known as the reference level. Although the binary variable coefficients change from model to model, their overall interpretation remains the same. That is, if we would like to estimate the difference in coverage between the “other” and the “living together” category, the estimate would be 0.232, regardless of the model.
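
The invariance described above can be illustrated on simulated data. The following is a sketch only: the data frame `sim`, the single continuous variable `x` and the coefficients used to generate `y` are all made up for illustration and have nothing to do with the SCF sample.

```r
set.seed(2010)

# Simulated (hypothetical) data: a three-level category and one continuous variable
marstat <- rep(c(0, 1, 2), times = c(40, 140, 20))
x <- rnorm(length(marstat))
y <- 3 + 0.5 * x - 0.6 * (marstat == 0) - 0.8 * (marstat == 2) +
  rnorm(length(marstat))
sim <- data.frame(
  y = y, x = x,
  MAR0 = 1 * (marstat == 0),
  MAR1 = 1 * (marstat == 1),
  MAR2 = 1 * (marstat == 2)
)

# Three parameterizations of the same model, each omitting one binary variable
m1 <- lm(y ~ x + MAR0 + MAR2, data = sim)  # reference level: married
m2 <- lm(y ~ x + MAR0 + MAR1, data = sim)  # reference level: living together
m3 <- lm(y ~ x + MAR1 + MAR2, data = sim)  # reference level: other

# Fitted values (and hence the ANOVA decomposition) are identical
same_fit <- isTRUE(all.equal(fitted(m1), fitted(m2))) &&
  isTRUE(all.equal(fitted(m1), fitted(m3)))
print(same_fit)

# The slope for the continuous variable does not depend on the reference level
print(c(coef(m1)["x"], coef(m2)["x"], coef(m3)["x"]))

# Estimated "other" minus "living together" difference under each model
diff_m1 <- unname(coef(m1)["MAR0"] - coef(m1)["MAR2"])
diff_m2 <- unname(coef(m2)["MAR0"])
diff_m3 <- unname(-coef(m3)["MAR2"])
print(c(diff_m1, diff_m2, diff_m3))
```

The three printed differences coincide, which mirrors how 0.232 can be read from any of the three models in Table 4.3: as a difference of two coefficients in Model 1, directly in Model 2, and with the sign reversed in Model 3.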

\begin{matrix}\begin{array}{c}
\text{Table 4.2 Term Life with Marital Status ANOVA Table}
\end{array}\\ \small
\begin{array}{lrrr} \hline
\text{Source} & \text{Sum of Squares} & df & \text{Mean Square} \\ \hline
\text{Regression} & 343.28 & 5 & 68.66 \\
\text{Error} & 615.62 & 269 & 2.29 \\
\text{Total} & 958.90 & 274 & \\ \hline
\end{array}\\ \scriptsize
\begin{array}{l}
\text{Residual standard error } s = 1.513, \ R^2 = 35.8\%, \ R_a^2 = 34.6\%
\end{array} \end{matrix}

R Code for Table 4.2

R-Code
anova(lm(LNFACE ~ LNINCOME+EDUCATION+NUMHH +MAR0+MAR2, data=Term2))

R-Code Output
Analysis of Variance Table

Response: LNFACE
           Df Sum Sq Mean Sq F value    Pr(>F)    
LNINCOME    1 222.63 222.629 97.2800 < 2.2e-16 ***
EDUCATION   1  51.50  51.502 22.5044 3.407e-06 ***
NUMHH       1  54.34  54.336 23.7426 1.883e-06 ***
MAR0        1   9.00   8.999  3.9321   0.04839 *  
MAR2        1   5.81   5.813  2.5400   0.11217    
Residuals 269 615.62   2.289                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
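
The summary measures reported with Table 4.2 can be recovered from the R output. The regression sum of squares is the sum of the five sequential sums of squares, 222.63 + 51.50 + 54.34 + 9.00 + 5.81 = 343.28, and the remaining quantities follow from the usual definitions (values are rounded):

```latex
\begin{eqnarray*}
\text{Total SS} &=& 343.28 + 615.62 = 958.90, \\
R^2 &=& \frac{343.28}{958.90} \approx 0.358, \\
s &=& \sqrt{\frac{615.62}{269}} \approx \sqrt{2.289} \approx 1.513, \\
R_a^2 &=& 1 - \frac{615.62/269}{958.90/274} \approx 1 - \frac{2.289}{3.500} \approx 0.346, \\
F &=& \frac{343.28/5}{615.62/269} \approx \frac{68.66}{2.289} \approx 30.0 .
\end{eqnarray*}
```

These agree with the residual standard error, multiple R-squared, adjusted R-squared and F-statistic printed by summary() for the same model.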

Although the three models in Table 4.3 are the same except for different choices of parameters, they do appear different. In particular, the \(t\)-ratios differ and give different appearances of statistical significance. For example, both of the \(t\)-ratios associated with marital status in Model 2 are less than 2 in absolute value, suggesting that marital status is unimportant. In contrast, Models 1 and 3 each have at least one marital status binary variable whose \(t\)-ratio exceeds 2 in absolute value, suggesting statistical significance. Thus, you can influence the appearance of statistical significance by altering the choice of the reference level. To assess the overall importance of marital status (not just each binary variable), Section 4.2 will introduce tests of sets of regression coefficients.

\begin{matrix}\begin{array}{c}
\text{Table 4.3 Term Life Regression Coefficients with Marital Status}
\end{array}\\ \scriptsize
\begin{array}{llll}
\hline \phantom{XXXXXXXXXXX} & \text{Model 1}\phantom{XXXXX} & \phantom{XX}\text{Model 2}\phantom{XXXXX} & \phantom{XX}\text{Model 3}\phantom{XXXXX} \\
\end{array}\\ \scriptsize
\begin{array}{l|rr|rr|rr} \hline
\text{Explanatory} \\
\text{Variable} & \text{Coefficient} & t\text{-ratio} & \text{Coefficient} & t\text{-ratio} & \text{Coefficient} & t\text{-ratio} \\ \hline
\text{LNINCOME} & 0.452 & 5.74 & 0.452 & 5.74 & 0.452 & 5.74 \\
\text{EDUCATION} & 0.205 & 5.30 & 0.205 & 5.30 & 0.205 & 5.30 \\
\text{NUMHH} & 0.248 & 3.57 & 0.248 & 3.57 & 0.248 & 3.57 \\ \hline
\text{Intercept} & 3.395 & 3.77 & 2.605 & 2.74 & 2.838 & 3.34 \\
\text{MAR0} & -0.557 & -2.15 & 0.232 & 0.44 & & \\
\text{MAR1} & & & 0.789 & 1.59 & 0.557 & 2.15 \\
\text{MAR2} & -0.789 & -1.59 & & & -0.232 & -0.44 \\ \hline
\end{array}
\end{matrix}

R Code for Table 4.3

R-Code
summary(lm(LNFACE ~ LNINCOME+EDUCATION+NUMHH +MAR0+MAR2, data=Term2))

R-Code Output
Call:
lm(formula = LNFACE ~ LNINCOME + EDUCATION + NUMHH + MAR0 + MAR2, 
    data = Term2)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8875 -0.8505  0.1124  0.8468  4.5173 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.39477    0.90019   3.771 0.000200 ***
LNINCOME     0.45151    0.07872   5.736 2.61e-08 ***
EDUCATION    0.20467    0.03862   5.299 2.42e-07 ***
NUMHH        0.24770    0.06940   3.569 0.000424 ***
MAR0        -0.55707    0.25929  -2.148 0.032574 *  
MAR2        -0.78941    0.49532  -1.594 0.112169    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.513 on 269 degrees of freedom
Multiple R-squared:  0.358,	Adjusted R-squared:  0.3461 
F-statistic:    30 on 5 and 269 DF,  p-value: < 2.2e-16

R-Code
summary(lm(LNFACE ~ LNINCOME+EDUCATION+NUMHH +MAR1+MAR2, data=Term2))

R-Code Output
Call:
lm(formula = LNFACE ~ LNINCOME + EDUCATION + NUMHH + MAR1 + MAR2, 
    data = Term2)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8875 -0.8505  0.1124  0.8468  4.5173 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.83770    0.84882   3.343 0.000946 ***
LNINCOME     0.45151    0.07872   5.736 2.61e-08 ***
EDUCATION    0.20467    0.03862   5.299 2.42e-07 ***
NUMHH        0.24770    0.06940   3.569 0.000424 ***
MAR1         0.55707    0.25929   2.148 0.032574 *  
MAR2        -0.23234    0.53283  -0.436 0.663155    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.513 on 269 degrees of freedom
Multiple R-squared:  0.358,	Adjusted R-squared:  0.3461 
F-statistic:    30 on 5 and 269 DF,  p-value: < 2.2e-16
