Supporting Statistical Analysis for Research
4.5 Dropping unneeded observations
These exercises use the PSID.csv data set
that was imported in the prior section.
Import the PSID.csv data set.

```python
from pathlib import Path
import pandas as pd

# Build the path to the data file and read it in.
psid_path = Path('..') / 'datasets' / 'PSID.csv'
psid_in = pd.read_csv(psid_path)

# Give the columns more descriptive names.
psid_in = (
    psid_in
    .rename(
        columns={
            'Unnamed: 0': 'obs_num',
            'intnum': 'intvw_num',
            'persnum': 'person_id',
            'married': 'marital_status'}))

# Work on a copy so the imported data frame stays untouched.
psid = psid_in.copy(deep=True)
psid = psid.drop(columns='obs_num')

print(psid.dtypes)
```

```
intvw_num           int64
person_id           int64
age                 int64
educatn           float64
earnings            int64
hours               int64
kids                int64
marital_status     object
dtype: object
```

Display some of the observations where there are more than 90 kids in the household. Choose several of the pertinent variables to display.
```python
(psid
    .query('kids > 90')
    .loc[:, ['person_id', 'age', 'educatn', 'kids', 'marital_status']]
    .head(n=15)
    .pipe(print))
```

```
      person_id  age  educatn  kids  marital_status
10            3   48     13.0    98        divorced
150         186   41     12.0    98         married
323         178   49     12.0    98         married
357           5   34     99.0    99    no histories
447           3   34     12.0    98        divorced
544           2   47     12.0    98        divorced
590         182   49     12.0    99    no histories
739           3   48      3.0    98   never married
749          21   49      0.0    99    no histories
857         177   40      0.0    98         married
1027          3   45     12.0    98         married
1076          2   50      0.0    99    no histories
1167        171   49      0.0    98        divorced
1174        173   40      9.0    98        divorced
1187        175   37      0.0    98        divorced
```

Create a copy of the data frame that removes the observations where
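`married` (renamed `marital_status` above) was `no history` or `NA/DF`. You may have combined these categories into a missing category in the preparatory exercises.

```python
psid_copy = (
    psid.query(
        'marital_status == "no history" | marital_status == "NA/DF"'))

(psid_copy
    .loc[:, ['person_id', 'age', 'educatn', 'kids', 'marital_status']]
    .head(n=15)
    .pipe(print))
```

```
      person_id  age  educatn  kids  marital_status
1665          3   45     17.0     0           NA/DF
1843          3   38     12.0     1           NA/DF
2240        174   36     17.0     0           NA/DF
2244        177   32     14.0     1           NA/DF
2840          4   46     14.0     0           NA/DF
2971          9   31     14.0     2           NA/DF
3563          2   46     12.0     0           NA/DF
3643          4   30     11.0     2           NA/DF
3818        174   41     99.0     0           NA/DF
```

Note that this query selects the rows in those categories rather than dropping them. The label `"no history"` also matches nothing here, because the value stored in the data is `no histories`, so every row shown is `NA/DF`.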
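One way to drop those rows instead of keeping them is to negate a membership test. The sketch below is illustrative only. It uses a small made-up data frame (`toy` and `drop_labels` are hypothetical names, not part of the PSID exercises) and assumes the category labels are spelled `no histories` and `NA/DF`, as they appear in the printed output above.

```python
import pandas as pd

# Hypothetical toy data standing in for the PSID data frame.
toy = pd.DataFrame({
    'person_id': [1, 2, 3, 4, 5],
    'marital_status': [
        'married', 'no histories', 'NA/DF', 'divorced', 'never married'],
})

# Labels to drop; spellings are assumed from the printed output.
drop_labels = ['no histories', 'NA/DF']

# Keep only the rows whose marital_status is NOT one of drop_labels.
toy_kept = (
    toy
    .loc[lambda df: ~df['marital_status'].isin(drop_labels)]
    .copy(deep=True))

print(toy_kept)
```

The same pattern should carry over to `psid` by replacing `toy` with `psid`. Comparing the row counts before and after the filter is a quick check that the expected number of observations was removed.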