Supporting Statistical Analysis for Research
5.4 Factors and Indicators
These exercises use the mtcars.csv data set.
Import the mtcars.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np

    mtcars_path = Path('..') / 'datasets' / 'mtcars.csv'
    mtcars_in = pd.read_csv(mtcars_path)
    mtcars_in = mtcars_in.rename(columns={'Unnamed: 0': 'make_model'})
    mtcars = mtcars_in.copy(deep=True)

    print(mtcars.dtypes)

    make_model     object
    mpg           float64
    cyl             int64
    disp          float64
    hp              int64
    drat          float64
    wt            float64
    qsec          float64
    vs              int64
    am              int64
    gear            int64
    carb            int64
    dtype: object

Factor the cyl, gear and carb variables.

    mtcars = (
        mtcars
            .apply(
                func=lambda x: x.astype('category')
                if x.name in ['cyl', 'gear', 'carb'] else x))

    print(mtcars.dtypes)

    make_model      object
    mpg            float64
    cyl           category
    disp           float64
    hp               int64
    drat           float64
    wt             float64
    qsec           float64
    vs               int64
    am               int64
    gear          category
    carb          category
    dtype: object

or
    mtcars = mtcars_in.copy(deep=True)
    cyl_lev = pd.Series(mtcars['cyl'].unique()).sort_values()
    gear_lev = pd.Series(mtcars['gear'].unique()).sort_values()
    carb_lev = pd.Series(mtcars['carb'].unique()).sort_values()

    mtcars = (
        mtcars
            .assign(
                cyl = lambda df: pd.Categorical(df['cyl'], categories=cyl_lev),
                gear = lambda df: pd.Categorical(df['gear'], categories=gear_lev),
                carb = lambda df: pd.Categorical(df['carb'], categories=carb_lev)))

    print(mtcars.dtypes)

    make_model      object
    mpg            float64
    cyl           category
    disp           float64
    hp               int64
    drat           float64
    wt             float64
    qsec           float64
    vs               int64
    am               int64
    gear          category
    carb          category
    dtype: object

Create a variable that identifies the observations that are in the top 25 percent of miles per gallon. Display a few of these vehicles.
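Hint: you will need to find a function that identifies the percentage points of a variable.
Note that the quantile function returns a Series when it is given a list of quantiles, so the cut-off value has to be extracted from that Series before it is compared with mpg.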
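As a rough illustration of the note above, the sketch below uses a small made-up Series (mpg_demo and its values are hypothetical, not taken from mtcars.csv). It shows that passing a single number to quantile gives back a scalar, while passing a list gives back a Series indexed by the requested quantiles.

```python
import pandas as pd

# Hypothetical toy data, not from mtcars.csv.
mpg_demo = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# A scalar q returns a single number.
q_scalar = mpg_demo.quantile(0.75)

# A list-like q returns a Series indexed by the requested quantiles.
q_series = mpg_demo.quantile([0.75])

print(q_scalar)                  # 40.0
print(q_series.index.tolist())   # [0.75]

# The Series form needs a scalar pulled out before comparing,
# either by label (.at) or by position (.iloc).
top_quarter = mpg_demo >= q_series.at[0.75]
print(top_quarter.tolist())      # [False, False, False, True, True]
```

Under these assumptions, either form gives the same cut-off; the list form is mainly convenient when several quantiles are needed at once.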
    mtcars = (
        mtcars
            .assign(
                efficient = lambda df: np.where(
                    df['mpg'] >= df['mpg'].quantile([0.75]).at[0.75],
                    True,
                    False)))

    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient']]
        .head()
        .pipe(print))

              make_model   mpg  efficient
    0          Mazda RX4  21.0      False
    1      Mazda RX4 Wag  21.0      False
    2         Datsun 710  22.8       True
    3     Hornet 4 Drive  21.4      False
    4  Hornet Sportabout  18.7      False

or
    mtcars = (
        mtcars
            .assign(
                efficient = lambda df: np.where(
                    df['mpg'] >= df['mpg'].quantile([0.75]).iloc[0],
                    True,
                    False)))

    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient']]
        .head()
        .pipe(print))

              make_model   mpg  efficient
    0          Mazda RX4  21.0      False
    1      Mazda RX4 Wag  21.0      False
    2         Datsun 710  22.8       True
    3     Hornet 4 Drive  21.4      False
    4  Hornet Sportabout  18.7      False

Create a variable that bins the values of hp using the following cut points: 100, 170, 240, and 300.

    # The infinite outer edges catch every value below 100 and above 300.
    mtcars = (
        mtcars
            .assign(
                power = lambda df: pd.cut(
                    df['hp'],
                    bins=[-np.inf, 100, 170, 240, 300, np.inf],
                    labels=['gocart', 'slow', 'typical', 'fast', 'beast'])))

    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient', 'power']]
        .head()
        .pipe(print))

              make_model   mpg  efficient    power
    0          Mazda RX4  21.0      False     slow
    1      Mazda RX4 Wag  21.0      False     slow
    2         Datsun 710  22.8       True   gocart
    3     Hornet 4 Drive  21.4      False     slow
    4  Hornet Sportabout  18.7      False  typical
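One detail of the binning that is easy to miss: by default pd.cut closes each bin on the right, so a value that sits exactly on a cut point falls into the lower bin. The sketch below checks this with made-up hp values placed on and around the cut points (hp_demo and power_demo are hypothetical names, and the values are not from mtcars.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical hp values chosen to sit on or near the cut points.
hp_demo = pd.Series([100, 101, 170, 240, 300, 335])

power_demo = pd.cut(
    hp_demo,
    bins=[-np.inf, 100, 170, 240, 300, np.inf],
    labels=['gocart', 'slow', 'typical', 'fast', 'beast'])

# With the default right=True, 100 lands in 'gocart' and 170 in 'slow'.
print(power_demo.tolist())
# ['gocart', 'slow', 'slow', 'typical', 'fast', 'beast']

# pd.cut returns an ordered categorical, so order comparisons work.
print((power_demo >= 'typical').tolist())
# [False, False, False, True, True, True]

# Count the observations in each bin.
print(power_demo.value_counts(sort=False))
```

If the cut points should instead belong to the upper bin, passing right=False to pd.cut would close the bins on the left; which convention is intended is not stated in the exercise.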