# 14 Aggregating

For some analysis and graphing tasks you will need to work
with summary data in a data frame. Occasionally this might be
the summary
statistics of an entire data set, but more often it
will be measures that summarize groups of observations.
The task of producing such a data frame may be called
*summarizing*,
*aggregating*, or *collapsing* data.

For some tasks, the summary data set itself may be a suitable endpoint of your data wrangling, but for other problems you will want to merge the summary data back onto the original data set. In base R, this is often a three step process:

- produce summary statistics
- ensure the result is a data frame
- merge

While it is useful to understand how to set this up
in base R, functions in the tidyverse package `dplyr`

can greatly simplify all this, especially where you
want to merge summary statistics back onto your
original data set. (See Chapter 18.)

What follows here is base R.

## 14.1 Summary Statistics

Let’s begin with a review of how to produce various summary statistics, but with the added emphasis of turning the results into data.frames.

Let’s work with a subset of the `mtcars`

data.

`cars <- mtcars[,1:5]`

### 14.1.1 Grand Summaries

The usual summary statistics applied to vectors - mean, minimum, maximum, length (i.e. counts) - produce results that are essentially scalars. The only hard part here is remembering to consider missing values!

Applied to multiple columns in a data.frame using the
`apply`

or `sapply`

functions, our summary
statistics are returned as vectors. Here, it may not be
obvious how to turn this into a data.frame. The key is
to realize that vectors are considered column vectors by
default. The matrix transpose function, `t()`

, fixes
this for us.

```
means <- sapply(cars, mean)
as.data.frame(t(means))
```

```
mpg cyl disp hp drat
1 20.09062 6.1875 230.7219 146.6875 3.596563
```

### 14.1.2 Group Summaries

More often than grand summaries we will be interested in group
summaries. To produce these summaries with a vector and a grouping
variable, you might typically use the `tapply`

function, returning
a vector of results.

```
mpgmin <- tapply(cars$mpg, cars$cyl, min)
mpgmin
```

```
4 6 8
21.4 17.8 10.4
```

And we can turn this into a data frame, with the grouping data values as row names.

`as.data.frame(mpgmin)`

```
mpgmin
4 21.4
6 17.8
8 10.4
```

Another useful function, which produces a data frame directly, is
`aggregate`

. For summaries of a single variable, the formula
method of specification is perhaps easiest to write and labels
the result most nicely.

`aggregate(mpg ~ cyl, cars, min)`

```
cyl mpg
1 4 21.4
2 6 17.8
3 8 10.4
```

The aggregate function is also useful where you have multiple variables to summarize by the same groups, which tapply will not handle.

`aggregate(cars, list(cars$cyl), min)`

```
Group.1 mpg cyl disp hp drat
1 4 21.4 4 71.1 52 3.69
2 6 17.8 6 145.0 105 2.76
3 8 10.4 8 275.8 150 2.76
```

## 14.2 Merging

Often we want to include summary data alongside the original data. This entails merging the summary data with the original data frame.

### 14.2.1 Merging a Named Vector

Where we have a grouped summary statistic in a named vector, we can merge this with the original data directly, skipping the data framing step. We simply use the “by” variable from the original data as a vector of category names (after coercing to character) to construct a new vector in the data frame.

```
mpgmin <- tapply(cars$mpg, cars$cyl, min)
cars$mpgmin <- mpgmin[as.character(cars$cyl)]
head(cars)
```

```
mpg cyl disp hp drat mpgmin
Mazda RX4 21.0 6 160 110 3.90 17.8
Mazda RX4 Wag 21.0 6 160 110 3.90 17.8
Datsun 710 22.8 4 108 93 3.85 21.4
Hornet 4 Drive 21.4 6 258 110 3.08 17.8
Hornet Sportabout 18.7 8 360 175 3.15 10.4
Valiant 18.1 6 225 105 2.76 17.8
```

### 14.2.2 Merging a Data Frame

See Chapter 13 for more on the basics of merging data frames. In this case, we especially need to think about how our summary statistic columns are names in order to avoid confusion in the merged data.

```
cars <- mtcars[,1:3] # simple example
maxcars <- aggregate(cars, list(cars$cyl), max) # summarize
# cleanup prior to merging
names(maxcars) <- c("cyl", paste(names(cars), "max", sep="_"))
maxcars$cyl_max <- NULL # drop
carsplus <- merge(cars, maxcars, all=TRUE)
carsplus[11:20,]
```

```
cyl mpg disp mpg_max disp_max
11 4 27.3 79.0 33.9 146.7
12 6 21.0 160.0 21.4 258.0
13 6 21.0 160.0 21.4 258.0
14 6 17.8 167.6 21.4 258.0
15 6 21.4 258.0 21.4 258.0
16 6 18.1 225.0 21.4 258.0
17 6 19.2 167.6 21.4 258.0
18 6 19.7 145.0 21.4 258.0
19 8 18.7 360.0 19.2 472.0
20 8 17.3 275.8 19.2 472.0
```

## 14.3 Aggregating Exercises

Use the `chickwts`

dataset.

Is this a balanced experiment (are the same number of chickens given each feed)?

Which feed was associated with the largest variation (standard deviation) in weight?

Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.