16 Subsetting

Subsetting involves making a dataset smaller by selecting a portion of the data that we are interested in.

We can take subsets of a dataset either by columns or by rows. For example, we might have a dataset with extra columns we do not care about, or we might want to look at rows where a certain condition is true (such as age greater than or equal to 18).

16.1 Extractors

We have previously seen that we can index vectors by position, name, or condition to extract and replace elements. Now we extend this to data frames. In addition to the square bracket extractors, [ ], we also have two “list” extractors, the dollar sign , $, and the doubled square brackets, [[ ]]. Here it is relevant that a data frame is a special kind of list.

  • The square brackets, [ ], allow us to index both the rows and the columns of a data frame. We can index by position, name, or condition.

    When used with data frames (or matrices), we index by two positions, rows and columns. The square brackets should contain two objects, a vector indexing the rows and a vector indexing the columns. To select all the elements along one dimension, omit that vector but include the comma.

    Try these examples:

    mtcars["Valiant", "mpg"]  # a single element
    mtcars[ , c("mpg", "wt")] # all rows, two columns
    mtcars[1:6, ]             # rows 1-6, all columns
    mtcars[ , ]               # everything

    The next example illustrates indexing rows by position and columns by name.

    mtcars[1:6, c("mpg", "wt")] # [rows, columns]
                       mpg    wt
    Mazda RX4         21.0 2.620
    Mazda RX4 Wag     21.0 2.875
    Datsun 710        22.8 2.320
    Hornet 4 Drive    21.4 3.215
    Hornet Sportabout 18.7 3.440
    Valiant           18.1 3.460
  • The dollar sign extractor, $, allows us to index a single named element of a list. In a data frame, that would be a column, a single variable. (We’ll use the head function to show just the first few values.)

    head( mtcars$mpg )
    [1] 21.0 21.0 22.8 21.4 18.7 18.1
  • The double brackets, [[ ]], allow us to index a single element of a list by either name or position. This can be useful for programming repeated tasks.

    head( mtcars[[3]] )      # by position, displacement
    [1] 160 160 108 258 360 225
    head( mtcars[["disp"]] ) # by name
    [1] 160 160 108 258 360 225
  • In addition to these extractors, we also have a subset function that provides a very useful variation on the square bracket extractor.

    subset(mtcars, 
           subset=carb >=6,       # rows
           select=c("mpg", "wt")) # columns
                   mpg   wt
    Ferrari Dino  19.7 2.77
    Maserati Bora 15.0 3.57

16.2 Subsetting by Columns

Now, load the dplyr package, which contains the functions we need to subset.

library(dplyr)

The select() function from dplyr gives us a number of ways to take a subset of columns. Four ways to select columns are covered below: By Name, By Pattern, By Range, and By Type. For a full list of the ways to select a subset of columns, run ?tidyselect::language.

16.2.1 By Name

The simplest and most straightforward way to select columns is to pass their names to select(). The output of select() will be our original dataframe, but it will only include the columns we name, and they will be in the specified order. In the mtcars dataset, cyl is before hp, but if we specify hp before cyl in select(), they will appear in that order.

(Note that the code below uses head() to only show the first six rows for space reasons.)

colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
mtcars %>% 
  select(mpg, hp, cyl) %>% 
  head()
                   mpg  hp cyl
Mazda RX4         21.0 110   6
Mazda RX4 Wag     21.0 110   6
Datsun 710        22.8  93   4
Hornet 4 Drive    21.4 110   6
Hornet Sportabout 18.7 175   8
Valiant           18.1 105   6

As it was used above, select() only keeps the columns we name. We can also have it only drop the columns we name by using a - in front of the column name.

mtcars %>% 
  select(-disp, -drat, -wt, -qsec, -vs, -am, -gear, -carb) %>% 
  head()
                   mpg cyl  hp
Mazda RX4         21.0   6 110
Mazda RX4 Wag     21.0   6 110
Datsun 710        22.8   4  93
Hornet 4 Drive    21.4   6 110
Hornet Sportabout 18.7   8 175
Valiant           18.1   6 105

In some cases, we might want to select a certain subset of columns multiple times, and we do not want to have to retype them every time. In that case, we can pass a character vector to select(). We can also add another column, hp, without adding it to our list when running the command.

myCols <- c("mpg", "vs", "carb")

mtcars %>% 
  select(myCols, hp) %>% 
  head()
Note: Using an external vector in selections is ambiguous.
i Use `all_of(myCols)` instead of `myCols` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
                   mpg vs carb  hp
Mazda RX4         21.0  0    4 110
Mazda RX4 Wag     21.0  0    4 110
Datsun 710        22.8  1    1  93
Hornet 4 Drive    21.4  1    1 110
Hornet Sportabout 18.7  0    2 175
Valiant           18.1  1    1 105

We can also simply reorder the columns by giving select() the columns we would like first and then everything().

mtcars %>% 
  select(myCols, hp, everything()) %>% 
  head()
                   mpg vs carb  hp cyl disp drat    wt  qsec am gear
Mazda RX4         21.0  0    4 110   6  160 3.90 2.620 16.46  1    4
Mazda RX4 Wag     21.0  0    4 110   6  160 3.90 2.875 17.02  1    4
Datsun 710        22.8  1    1  93   4  108 3.85 2.320 18.61  1    4
Hornet 4 Drive    21.4  1    1 110   6  258 3.08 3.215 19.44  0    3
Hornet Sportabout 18.7  0    2 175   8  360 3.15 3.440 17.02  0    3
Valiant           18.1  1    1 105   6  225 2.76 3.460 20.22  0    3

select() includes the functionality of rename(), so we can do in one line (second example) what normally takes two (first example). The pattern is new_name = old_name.

mtcars %>% 
  rename(horsepower = hp) %>% 
  select(horsepower, mpg) %>% 
  head()
                  horsepower  mpg
Mazda RX4                110 21.0
Mazda RX4 Wag            110 21.0
Datsun 710                93 22.8
Hornet 4 Drive           110 21.4
Hornet Sportabout        175 18.7
Valiant                  105 18.1
mtcars %>% 
  select(horsepower = hp, mpg) %>% 
  head()
                  horsepower  mpg
Mazda RX4                110 21.0
Mazda RX4 Wag            110 21.0
Datsun 710                93 22.8
Hornet 4 Drive           110 21.4
Hornet Sportabout        175 18.7
Valiant                  105 18.1

16.2.2 By Pattern

Columns can be selected by a pattern, such as the letters they start with or contain. To do this, we need to use other functions inside of select().

starts_with() returns columns that start with a character string. Both “disp” and “drat” start with “d”.

mtcars %>% 
  select(starts_with("d")) %>% 
  head()
                  disp drat
Mazda RX4          160 3.90
Mazda RX4 Wag      160 3.90
Datsun 710         108 3.85
Hornet 4 Drive     258 3.08
Hornet Sportabout  360 3.15
Valiant            225 2.76

ends_with() does the opposite, and looks for a character string at the end of the column names. “disp” and “hp” end with “p”.

mtcars %>% 
  select(ends_with("p")) %>% 
  head()
                  disp  hp
Mazda RX4          160 110
Mazda RX4 Wag      160 110
Datsun 710         108  93
Hornet 4 Drive     258 110
Hornet Sportabout  360 175
Valiant            225 105

contains() is more general than both starts_with() and ends_with() since it looks for a character string anywhere in the column name. “ar” is found in both “gear” and “carb”.

mtcars %>% 
  select(contains("ar")) %>% 
  head()
                  gear carb
Mazda RX4            4    4
Mazda RX4 Wag        4    4
Datsun 710           4    1
Hornet 4 Drive       3    1
Hornet Sportabout    3    2
Valiant              3    1

The functions above all use literal character strings. dplyr also allows for the use of regular expressions with the matches() function. The code below, "^[a-g].a", looks for a string that starts with (^) the letters a-g ([a-g]), then has any character (.), and then the letter a (a). Both “drat” and “gear” fulfill these matching criteria.

mtcars %>% 
  select(matches("^[a-g].a"))
                    drat gear
Mazda RX4           3.90    4
Mazda RX4 Wag       3.90    4
Datsun 710          3.85    4
Hornet 4 Drive      3.08    3
Hornet Sportabout   3.15    3
Valiant             2.76    3
Duster 360          3.21    3
Merc 240D           3.69    4
Merc 230            3.92    4
Merc 280            3.92    4
Merc 280C           3.92    4
Merc 450SE          3.07    3
Merc 450SL          3.07    3
Merc 450SLC         3.07    3
Cadillac Fleetwood  2.93    3
Lincoln Continental 3.00    3
Chrysler Imperial   3.23    3
Fiat 128            4.08    4
Honda Civic         4.93    4
Toyota Corolla      4.22    4
Toyota Corona       3.70    3
Dodge Challenger    2.76    3
AMC Javelin         3.15    3
Camaro Z28          3.73    3
Pontiac Firebird    3.08    3
Fiat X1-9           4.08    4
Porsche 914-2       4.43    5
Lotus Europa        3.77    5
Ford Pantera L      4.22    5
Ferrari Dino        3.62    5
Maserati Bora       3.54    5
Volvo 142E          4.11    4

For more on regular expressions, see the chapter on Character Vectors or RStudio’s Regular Expressions cheatsheet.

16.2.3 By Range

The : operator can be used to select a range of variables, either by position or by number. Be careful to check that your columns are in the order you think they are!

mtcars %>% 
  select(2:4) %>% 
  head()
                  cyl disp  hp
Mazda RX4           6  160 110
Mazda RX4 Wag       6  160 110
Datsun 710          4  108  93
Hornet 4 Drive      6  258 110
Hornet Sportabout   8  360 175
Valiant             6  225 105
mtcars %>% 
  select(cyl:hp) %>% 
  head()
                  cyl disp  hp
Mazda RX4           6  160 110
Mazda RX4 Wag       6  160 110
Datsun 710          4  108  93
Hornet 4 Drive      6  258 110
Hornet Sportabout   8  360 175
Valiant             6  225 105

If we have multiple columns with a common prefix and a numeric suffix, we can select a range with num_range(). First, let’s change the column names of mtcars to create a fictional dataset with individuals’ incomes in different years, where each row is an individual and each column is their reported income for a different year.

mtcars_income <- mtcars
colnames(mtcars_income) <- paste0("income", 1960:1970)
colnames(mtcars_income)
 [1] "income1960" "income1961" "income1962" "income1963" "income1964" "income1965"
 [7] "income1966" "income1967" "income1968" "income1969" "income1970"
mtcars_income %>% 
  select(num_range("income", 1962:1964)) %>% 
  head()
                  income1962 income1963 income1964
Mazda RX4                160        110       3.90
Mazda RX4 Wag            160        110       3.90
Datsun 710               108         93       3.85
Hornet 4 Drive           258        110       3.08
Hornet Sportabout        360        175       3.15
Valiant                  225        105       2.76
mtcars_income %>% 
  select(num_range("income", seq(1960, 1970, 2))) %>% 
  head()
                  income1960 income1962 income1964 income1966 income1968 income1970
Mazda RX4               21.0        160       3.90      16.46          1          4
Mazda RX4 Wag           21.0        160       3.90      17.02          1          4
Datsun 710              22.8        108       3.85      18.61          1          1
Hornet 4 Drive          21.4        258       3.08      19.44          0          1
Hornet Sportabout       18.7        360       3.15      17.02          0          2
Valiant                 18.1        225       2.76      20.22          0          1

To use num_range(), give it the prefix ("income") and a numeric vector with the desired numeric suffixes. The first example specifies the range 1962:1964, giving us 1962, 1963, and 1964. The second example uses seq(1960, 1970, 2), so we will get every two years from 1960 to 1970.

mtcars_income %>% 
  select(num_range("income", 1962:1964)) %>% 
  head()
                  income1962 income1963 income1964
Mazda RX4                160        110       3.90
Mazda RX4 Wag            160        110       3.90
Datsun 710               108         93       3.85
Hornet 4 Drive           258        110       3.08
Hornet Sportabout        360        175       3.15
Valiant                  225        105       2.76
mtcars_income %>% 
  select(num_range("income", seq(1960, 1970, 2))) %>% 
  head()
                  income1960 income1962 income1964 income1966 income1968 income1970
Mazda RX4               21.0        160       3.90      16.46          1          4
Mazda RX4 Wag           21.0        160       3.90      17.02          1          4
Datsun 710              22.8        108       3.85      18.61          1          1
Hornet 4 Drive          21.4        258       3.08      19.44          0          1
Hornet Sportabout       18.7        360       3.15      17.02          0          2
Valiant                 18.1        225       2.76      20.22          0          1

If the numbers are out of order in the dataframe, this is not an issue, unlike with :. The examples below first select the columns income1962, income1964, and income1963, in that order. The second example with : fails to select income1963 because it selects columns by position, not by name.

mtcars_income %>% 
  select(income1962, income1964, income1963) %>% 
  select(num_range("income", 1962:1964)) %>% 
  head()
                  income1962 income1963 income1964
Mazda RX4                160        110       3.90
Mazda RX4 Wag            160        110       3.90
Datsun 710               108         93       3.85
Hornet 4 Drive           258        110       3.08
Hornet Sportabout        360        175       3.15
Valiant                  225        105       2.76
mtcars_income %>% 
  select(income1962, income1964, income1963) %>% 
  select(income1962:income1964) %>% 
  head()
                  income1962 income1964
Mazda RX4                160       3.90
Mazda RX4 Wag            160       3.90
Datsun 710               108       3.85
Hornet 4 Drive           258       3.08
Hornet Sportabout        360       3.15
Valiant                  225       2.76

It is also okay if some columns are not in the dataset. Of course, select() will not return these non-existent columns, but it will also not have an error. To see this, we can drop income1963.

mtcars_income %>% 
  select(-income1963) %>% 
  select(num_range("income", 1962:1964)) %>% 
  head()
                  income1962 income1964
Mazda RX4                160       3.90
Mazda RX4 Wag            160       3.90
Datsun 710               108       3.85
Hornet 4 Drive           258       3.08
Hornet Sportabout        360       3.15
Valiant                  225       2.76

16.2.4 By Type

which() can help us select columns that are of a certain type, such as numeric or character. chickwts has one numeric column and one character column.

chickwts %>% 
  head()
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean
chickwts %>% 
  select(where(is.numeric)) %>% 
  head()
  weight
1    179
2    160
3    136
4    227
5    217
6    168

16.2.5 Multiple Criteria

For more complex tasks, multiple criteria for selecting columns can be combined with logical operators (&, |, !).

mtcars %>% 
  select((starts_with("d") & !where(is.character)) | contains("ar"), mpg) %>% 
  head()
                  disp drat gear carb  mpg
Mazda RX4          160 3.90    4    4 21.0
Mazda RX4 Wag      160 3.90    4    4 21.0
Datsun 710         108 3.85    4    1 22.8
Hornet 4 Drive     258 3.08    3    1 21.4
Hornet Sportabout  360 3.15    3    2 18.7
Valiant            225 2.76    3    1 18.1

16.3 Subsetting by Rows

To select a subset of rows from a dataframe, we can provide one or more conditions to the filter() function.

If we want to see rows from mtcars with mpg greater than 25, we can do this as follows:

mtcars %>% 
  filter(mpg > 25)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Multiple conditions can be given to filter(), and these are separated by commas. If we give multiple conditions to filter(), it will only return rows that meet all of our conditions. In other words, we can think of each , like &, and the rows in the output are those for which all conditions are TRUE. Note how the two blocks of code below return the same output.

mtcars %>% 
  filter(mpg > 25,
         hp < 100,
         vs == 1)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4 79.0 66 4.08 1.935 18.90  1  1    4    1
mtcars %>% 
  filter(mpg > 25 &
         hp < 100 &
         vs == 1)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4 79.0 66 4.08 1.935 18.90  1  1    4    1

Because filter() runs logical tests on the rows of our dataset, we should make use of R’s logical operators (&, |, !), relational operators (<, >, <=, >=, ==, !=), and other functions that return logical values (%in%, is.na(), etc.). If it has been a while, or if you jumped to this chapter, you will find the chapter on Logical Vectors helpful.

To illustrate the use of is.na(), we first need some missing data. Since mtcars has no missing data, we can use the mutate() function to temporarily add a column with NA values. The mutate() code below gives us a new column called missingdata that repeats the vector c(1, 2, 3, 4, 5, 6, 7, NA) four times. (mtcars has 32 rows, so we need to repeat a vector with 8 numbers 4 times.) !is.na() will then return rows that do not (!) have NA in missingdata.

mtcars %>% 
  mutate(missingdata = rep(c(1:7, NA), 4)) %>% 
  filter(cyl %in% c(4, 6),
         mpg > 17 & mpg <= 21,
         am != 0 | gear == 4,
         !is.na(missingdata))
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb missingdata
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4           1
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4           2
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4           2
Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4           3
Ferrari Dino  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6           6

dplyr has two other functions that are useful when using filter(): between() and near().

between() is a shortcut for variable >= number & variable <= number. It takes three arguments: the variable name, the lower boundary, and the upper boundary. The two blocks of code below are equivalent.

mtcars %>% 
  filter(between(mpg, 25, 30))
               mpg cyl  disp hp drat    wt qsec vs am gear carb
Fiat X1-9     27.3   4  79.0 66 4.08 1.935 18.9  1  1    4    1
Porsche 914-2 26.0   4 120.3 91 4.43 2.140 16.7  0  1    5    2
mtcars %>% 
  filter(mpg >= 25 & mpg <= 30)
               mpg cyl  disp hp drat    wt qsec vs am gear carb
Fiat X1-9     27.3   4  79.0 66 4.08 1.935 18.9  1  1    4    1
Porsche 914-2 26.0   4 120.3 91 4.43 2.140 16.7  0  1    5    2

near() is an alternative to == that takes into account computer precision. Sometimes numbers that should be equal actually differ by a very small amount because of the way computers store data. near() takes this into account, but == tests for exact equality.

sqrt(2) ^ 2
[1] 2
(sqrt(2) ^ 2) == 2
[1] FALSE
near(sqrt(2) ^ 2, 2)
[1] TRUE

The tolerance, or how much error we can accept, when testing equality is specified by the tol argument. Its default is sqrt(.Machine$double.eps), or about 0.000000015. Changing the tolerance allows us to do more with the near() function. Instead of specifying lower and upper boundaries like with between(), with near(), we can specify the middle value and how much lower or higher we want. Again, the two blocks of code below are equivalent.

mtcars %>% 
  filter(near(mpg, 27.5, 2.5))
               mpg cyl  disp hp drat    wt qsec vs am gear carb
Fiat X1-9     27.3   4  79.0 66 4.08 1.935 18.9  1  1    4    1
Porsche 914-2 26.0   4 120.3 91 4.43 2.140 16.7  0  1    5    2
mtcars %>% 
  filter(mpg >= (27.5 - 2.5) & mpg <= (27.5 + 2.5))
               mpg cyl  disp hp drat    wt qsec vs am gear carb
Fiat X1-9     27.3   4  79.0 66 4.08 1.935 18.9  1  1    4    1
Porsche 914-2 26.0   4 120.3 91 4.43 2.140 16.7  0  1    5    2

Instead of specifying the values directly, we can compute them from the data. The code below filters to get rows with mpg within one standard deviation of the mean.

mtcars %>% 
  filter(near(mpg, mean(mpg), sd(mpg)))
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC       15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Toyota Corona     21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin       15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Ford Pantera L    15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E        21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Any calculations we have filter() do are done on whatever grouping exists in the dataframe. This can cause problems if we are forget our dataframe is grouped, but we can also use it for more advanced filtering. The code below is similar to the code above, except the mean and standard deviation are calculated separately for each value of cyl (4, 6, 8). (Four-cylinder cars have their own mean and standard deviation for mpg, six-cylinder cars have theirs, and so do eight-cylinder cars. See the chapter on Aggregating for more on group_by().)

mtcars %>% 
  group_by(cyl) %>% 
  filter(near(mpg, mean(mpg), sd(mpg)))
# A tibble: 21 x 11
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 5  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 6  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
 7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
 8  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
 9  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
10  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
# ... with 11 more rows

16.4 Subsetting Exercises

Use the state.x77 dataset for 1 and 2.

  1. Keep all columns except Population. (Hint: If you received an error message about UseMethod, what kind of object does select() require?)

  2. Drop columns with spaces in their names.

Use the iris dataset for 3 and 4.

  1. Drop any columns that are not numeric or do not start with “S”.

  2. Keep rows where Sepal.Length is divisible by 1.0. (Hint: see %% in ?Arithmetic.)

Use the airquality dataset for 5 and 6.

  1. Find days where the value of Ozone is in the range 30-50.

  2. Find days where the value of Ozone is more than two standard deviations away from the mean, with mean and standard deviation calculated by month. (Hint: If you got zero rows in the output, did you check if Ozone is fully observed?)

Bonus:

  • Use the mtcars dataset. Create a 3x3 scatterplot matrix (pass a dataframe to plot()) of the mpg, weight, and horsepower of cars that have four or six cylinders, at least four forward gears, and manual transmission. See help(mtcars) to figure out which column is which.