* Coarsened measure

* If the effect is linear, {&beta}{sub:1} is generally unbiased by coarsening
* (the lumpy normal case below is the exception), but {&beta}{sub:0} is
* biased. The direction of the bias depends on which direction the measure
* shifts, on average.

* Uniformly distributed data is the easy case.
* If coarsening lumps the data in bands/categories/classes, and the
* data values are recorded as one of the boundaries of the band
* (whether the floor or the ceiling), the linear effect is unchanged
* but the intercept shifts.
* If these data values are recorded as the midpoint of each band,
* then both the linear effect and the intercept are left unbiased.

* Suppose the independent variable is lumpy with a recurring normal
* distribution; perhaps there are population booms at 5-year intervals.
* Suppose further that the data are coarsened at these 5-year spikes.
* Then the estimated linear effect becomes biased. The more pronounced
* the spikes in the original data, the greater the bias due to coarsening
* (coarsening produces leverage?). Changing the data point used to
* represent each band merely shifts the intercept. Coding data at the
* interval midpoint minimizes the effect of the bias, as the true line
* and the biased line intersect at the data mean.

* Suppose the data are lumpy with a recurring skewed distribution
* (exponential), and suppose the data are coarsened to the lower band
* boundary. Here the linear effect is unbiased, but the intercept has
* shifted. Shifting to the midpoint reduces the intercept bias, while
* shifting to the band mean eliminates it (but given coarse data, you
* won't generally know what the band mean was). Sharpness of the spike
* makes no difference.

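* A short worked version of the argument above, as a sketch. The symbols
* d and m are introduced here for illustration and are not used elsewhere
* in this file.
*   True model:       inc = b0 + b1*age
*   Coarsened value:  age = age5 + d, where d is the within-band offset
*   Substituting:     inc = b0 + b1*age5 + b1*d
* Assuming d is uncorrelated with age5 and has mean m, regressing inc on
* age5 gives a slope of b1 and an intercept of b0 + b1*m.
*   Floor coding:     d >= 0, so m > 0 and the intercept shifts up.
*   Ceiling coding:   d <= 0, so m < 0 and the intercept shifts down.
*   Midpoint coding:  m = 0 when d is symmetric around the midpoint
*                     (e.g. uniform data), so neither estimate is biased.
* If d is correlated with age5, the slope picks up an extra term,
* b1*Cov(d, age5)/Var(age5), which is when the linear effect is biased.
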
postfile results sample measure b0 b1 using results, replace

forvalues i = 1/250 {
    clear
    quietly set obs 250
    // lumpy age distribution (uncomment exactly one)
    *generate age = 20 + 5*ceil(_n/50) + runiform(0,5)
    *generate age = 20 + 5*ceil(_n/50) + runiform(-5,5)
    generate age = 20 + 5*ceil(_n/50) + rexponential(5)
    *generate age = 20 + 5*ceil(_n/50) + rnormal(0, 2.5)
    *generate age = 20 + 5*ceil(_n/50) + abs(rnormal(0, 2.5))
    // linear effect
    generate inc = 1000 + 100*age //+ rnormal(0, 500)
    quietly regress inc age
    post results (`i') (0) (_b[_cons]) (_b[age])

    // coarsen age measure, 5 year intervals (uncomment exactly one)
    *generate age5 = 5*ceil(age/5) // shift age up, shift _cons down
    generate age5 = 5*floor(age/5) // shift age down, shift _cons up
    *generate age5 = 5*floor(age/5) + 2.5 // shift to midpoint
    *replace age5 = 20 if age5 < 20
    *generate age5 = 5*ceil(age/5) - 2.5 // shift to midpoint
    *generate age5 = 5*floor(age/5) + 1.8 // shift exp to band mean
    quietly regress inc age5
    post results (`i') (1) (_b[_cons]) (_b[age5])
}

postclose results
use results, clear
// one row per sample: suffix 0 = original age, suffix 1 = coarsened age
reshape wide b0 b1, i(sample) j(measure)
summarize b00 b10 b01 b11
gen b0shift = b01 - b00
label variable b0shift "{&Delta}{&beta}{sub:0}"
gen b1shift = b11 - b10
label variable b1shift "{&Delta}{&beta}{sub:1}"
ttest b0shift==0
ttest b1shift==0
*histogram b0shift, name(b0, replace)
*histogram b1shift, name(b1, replace)
*graph combine b0 b1

// quadratic effect
postfile results sample measure b0 b1 b2 using results, replace

forvalues i = 1/100 {
    clear
    quietly set obs 500
    generate age = 20 + 5*ceil(_n/50) + rexponential(3)
    generate inc = 1000 + 100*age - 0.5*age^2 + rnormal(0, 500)
    quietly regress inc c.age##c.age
    post results (`i') (0) (_b[_cons]) (_b[age]) (_b[c.age#c.age])

    generate age5 = 5*ceil(age/5) - 3
    quietly regress inc c.age5##c.age5
    post results (`i') (1) (_b[_cons]) (_b[age5]) (_b[c.age5#c.age5])
}

postclose results
use results, clear
reshape wide b0 b1 b2, i(sample) j(measure)
summarize b00 b10 b20 b01 b11 b21
gen b0shift = b01 - b00
gen b1shift = b11 - b10
gen b2shift = b21 - b20
summarize b0shift b1shift b2shift
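
// A minimal sketch, not part of the simulations above: inspect the
// within-band offset directly for the exponential case. The variable name
// "offset", the seed, and the sample size are assumptions chosen for
// illustration. The mean of offset is the average distance between the
// true age and its band floor, and the correlation shows whether the
// offset varies systematically with the coarsened value (which is what
// would bias the slope).
clear
set seed 12345
quietly set obs 100000
generate age = 20 + 5*ceil(_n/20000) + rexponential(5)
generate age5 = 5*floor(age/5)
generate offset = age - age5
summarize offset
correlate offset age5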