Chapter 10 Data Transformation

10.1 Introduction

10.1.1 tidyverse

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)

For most applied researchers, data preparation usually involves 3 main steps.

  1. Transforming data frames, e.g. filtering, summarizing, and conducting calculations across groups.
  2. Tidying data into the appropriate format.
  3. Merging or linking several datasets to create a bigger dataset.

The tidyverse is a suite of packages designed specifically to help with these steps. These are by no means the only packages out there for data wrangling, but they are increasingly popular for their readable, straightforward syntax and sensible default behaviors.

In this chapter, we’re going to focus on how to use the dplyr package for data transformation tasks.

10.1.2 Gapminder

For this unit, we’ll be working with the “Gapminder” dataset again.

gap <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors = TRUE)
kable(head(gap))
country year pop continent lifeExp gdpPercap
Afghanistan 1952 8425333 Asia 28.8 779
Afghanistan 1957 9240934 Asia 30.3 821
Afghanistan 1962 10267083 Asia 32.0 853
Afghanistan 1967 11537966 Asia 34.0 836
Afghanistan 1972 13079460 Asia 36.1 740
Afghanistan 1977 14880372 Asia 38.4 786

10.1.3 Why dplyr?

So far, you’ve seen the basics of manipulating data frames, e.g. subsetting and basic calculations. For instance, we can use base R functions to calculate summary statistics across groups of observations, e.g., the mean GDP per capita within each region:

mean(gap[gap$continent == "Africa", "gdpPercap"])
#> [1] 2194
mean(gap[gap$continent == "Americas", "gdpPercap"])
#> [1] 7136
mean(gap[gap$continent == "Asia", "gdpPercap"])
#> [1] 7902

But this isn’t ideal because it involves a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition. As an added bonus, you might even find the dplyr grammar easier to read.

Here, we’re going to cover 6 of the most commonly used dplyr functions. We’ll also cover pipes (%>%), which are used to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()
  6. arrange()

If you have have not installed this package earlier, please do so now:

# not run
# install.packages('dplyr')

Now let’s load the package:

library(dplyr)

10.2 dplyr Functions

10.2.1 Select Columns with select

Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the select() function to keep only the variables we select.

year_country_gdp <- select(gap, year, country, gdpPercap)
kable(head(year_country_gdp))
year country gdpPercap
1952 Afghanistan 779
1957 Afghanistan 821
1962 Afghanistan 853
1967 Afghanistan 836
1972 Afghanistan 740
1977 Afghanistan 786
knitr::include_graphics(path = "img/dplyr-fig1.png")

If we open up year_country_gdp, we’ll see that it only contains the year, country and gdpPercap. This is equivalent to the base R subsetting function:

year_country_gdp_base <- gap[,c("year", "country", "gdpPercap")]
kable(head(year_country_gdp))
year country gdpPercap
1952 Afghanistan 779
1957 Afghanistan 821
1962 Afghanistan 853
1967 Afghanistan 836
1972 Afghanistan 740
1977 Afghanistan 786

But, as we will see, dplyr makes for much more readible, efficient code because of its pipe operator.

10.2.2 The Pipe

knitr::include_graphics(path = "img/pipe.jpg")

Above, we used what’s called ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

In typical base R code, a simple operation might be written like:

# NOT run
cupcakes <- bake(pour(mix(ingredients)))

A computer has no trouble understanding this and your cupcakes will be made just fine, but a person has to read right to left to understand the order of operations - the opposite of how most western languages are read - making it harder to understand what is being done!

To be more readable without pipes, we might break up this code into intermediate objects:

## NOT run
batter <- mix(ingredients)
muffin_tin <- pour(batter)
cupcakes <- bake(muffin_tin)

But, this can clutter our environment with a lot of variables that aren’t very useful to us. Plus, these variables are often are named very similar things (e.g. step, step1, step2…) which can lead to confusion and the creation of hard-to-track-down bugs.

Enter the Pipe…

The pipe makes it easier to read code by laying out operations from left to right – each line can be read like a line of a recipe for the perfect data frame!

Pipes take the input on the left side of the %>% symbol and pass it in as the first argument to the function on the right side.

With pipes, our cupcake example might be written like:

## NOT run
cupcakes <- ingredients %>% 
  mix() %>% 
  pour() %>% 
  bake()

Tips for Piping

  1. Remember that you don’t assign anything within the pipes – that is, you should not use <- inside the piped operation. Only use this at the beginning of your code if you want to save the output.

  2. Remember to add the pipe %>% at the end of each line involved in the piped operation. A good rule of thumb: since RStudio will automatically indent lines of code that are part of a piped operation, if the line isn’t indented, it probably hasn’t been added to the pipe. If you have an error in a piped operation, always check to make sure the pipe is connected as you expect.

  3. In RStudio, the hotkey for the pipe is Ctrl + Shift + M.

select & Pipe (%>%)

Since the pipe grammar is unlike anything we’ve seen in R before, let’s repeat what we did above with the gapminder dataset using pipes:

year_country_gdp <- gap %>% select(year, country, gdpPercap)

Let’s walk through it step by step.

First, we summon the gapminder data frame and pass it on to the next step using the pipe symbol %>%.

The second step is the select() function. In this case, we don’t specify which data object we use in the call to select() since we’ve piped it in from the previous line.

Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is |. But the concept is the same!

10.2.3 Filter Rows with filter

Now let’s say we’re only interested in African countries. We can combine select and filter to select only the observations where continent is Africa.

year_country_gdp_africa <- gap %>%
    filter(continent == "Africa") %>%
    select(year, country, gdpPercap)

As with last time, first we pass the gapminder dataframe to the filter() function, then we pass the filtered version of the gapminder dataframe to the select() function.

To clarify, both the select and filter functions subset the data frame. The difference is that select extracts certain columns, while filter extracts certain rows.

Note: The order of operations is very important in this case. If we used select first, filter would not be able to find the variable continent since we would have removed it in the previous step.

10.2.4 Calculate Across Groups with group_by

A common task you’ll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculate the mean GDP per capita for each continent?

In base R, you would have to run the mean() function for each subset of data:

mean(gap$gdpPercap[gap$continent == "Africa"])
#> [1] 2194
mean(gap$gdpPercap[gap$continent == "Americas"])
#> [1] 7136
mean(gap$gdpPercap[gap$continent == "Asia"])
#> [1] 7902
mean(gap$gdpPercap[gap$continent == "Europe"])
#> [1] 14469
mean(gap$gdpPercap[gap$continent == "Oceania"])
#> [1] 18622

That’s a lot of repetition! To make matters worse, what if we wanted to add these values to our original data frame as a new column? We would have to write something like this:

gap$mean.continent.GDP <- NA
gap$mean.continent.GDP[gap$continent == "Africa"] <- mean(gap$gdpPercap[gap$continent == "Africa"])
gap$mean.continent.GDP[gap$continent == "Americas"] <- mean(gap$gdpPercap[gap$continent == "Americas"])
gap$mean.continent.GDP[gap$continent == "Asia"] <- mean(gap$gdpPercap[gap$continent == "Asia"])
gap$mean.continent.GDP[gap$continent == "Europe"] <- mean(gap$gdpPercap[gap$continent == "Europe"])
gap$mean.continent.GDP[gap$continent == "Oceania"] <- mean(gap$gdpPercap[gap$continent == "Oceania"])

You can see how this can get pretty tedious, especially if we want to calculate more complicated or refined statistics. We could use loops or apply functions, but these can be difficult, slow, and error-prone.

split-apply-combine

The abstract problem we’re encountering here is know as “split-apply-combine”:

knitr::include_graphics(path = "img/splitapply.png")

We want to split our data into groups (in this case continents), apply some calculations on that group, then combine the results together afterwards.

Luckily, dplyr offers a much cleaner, straight-forward solution to this problem.

First, let’s remove the column we just made.

gap <- gap %>% select(-mean.continent.GDP) # drop a column with - 
# OR
gap$mean.continent.GDP <- NULL

10.2.4.1 group_by

We’ve already seen how filter() can help us select observations that meet certain criteria (in the above: continent == "Africa"). More helpful, however, is the group_by() function, which will essentially use every unique criteria that we could have used in filter().

A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value for continent (at least in the example above).

knitr::include_graphics(path = "img/dplyr-fig2.png")

10.2.5 Summarize Across Groups with summarize

group_by() on its own is not particularly interesting. It’s much more exciting used in conjunction with the summarize() function.

This will allow us to create new variable(s) by applying transformations to variables in each of our groups (continent-specific data frames).

In other words, using the group_by() function, we split our original data frame into multiple pieces, to which we then apply summary functions (e.g. mean() or sd()) within summarize().

The output is a new data frame reduced in size, with one row per group.

gdp_bycontinents <- gap %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
kable(head(gdp_bycontinents))
continent mean_gdpPercap
Africa 2194
Americas 7136
Asia 7902
Europe 14469
Oceania 18622
knitr::include_graphics(path = "img/dplyr-fig3.png")

That allowed us to calculate the mean gdpPercap for each continent.

But it gets even better – the function group_by() allows us to group by multiple variables. Let’s group by year and continent.

gdp_bycontinents_byyear <- gap %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
kable(head(gdp_bycontinents_byyear))
continent year mean_gdpPercap
Africa 1952 1253
Africa 1957 1385
Africa 1962 1598
Africa 1967 2050
Africa 1972 2340
Africa 1977 2586

That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize().

gdp_pop_bycontinents_byyear <- gap %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))
kable(head(gdp_pop_bycontinents_byyear))
continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop
Africa 1952 1253 983 4570010 6317450
Africa 1957 1385 1135 5093033 7076042
Africa 1962 1598 1462 5702247 7957545
Africa 1967 2050 2848 6447875 8985505
Africa 1972 2340 3287 7305376 10130833
Africa 1977 2586 4142 8328097 11585184

10.2.6 Add New Variables with mutate

What if we wanted to add these values to our original data frame instead of creating a new object?

For this, we can use the mutate() function, which is similar to summarize() except that it creates new variables in the same dataframe that you pass into it.

gapminder_with_extra_vars <- gap %>%
    group_by(continent, year) %>%
    mutate(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))
kable(head(gapminder_with_extra_vars))
country year pop continent lifeExp gdpPercap mean_gdpPercap sd_gdpPercap mean_pop sd_pop
Afghanistan 1952 8425333 Asia 28.8 779 5195 18635 42283556 1.13e+08
Afghanistan 1957 9240934 Asia 30.3 821 5788 19507 47356988 1.28e+08
Afghanistan 1962 10267083 Asia 32.0 853 5729 16416 51404763 1.36e+08
Afghanistan 1967 11537966 Asia 34.0 836 5971 14063 57747361 1.53e+08
Afghanistan 1972 13079460 Asia 36.1 740 8187 19088 65180977 1.74e+08
Afghanistan 1977 14880372 Asia 38.4 786 7791 11816 72257987 1.92e+08

We can use also use mutate() to create new variables prior to (or even after) summarizing the information.

gdp_pop_bycontinents_byyear <- gap %>%
    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop),
              mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))
kable(head(gdp_pop_bycontinents_byyear))
continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop mean_gdp_billion sd_gdp_billion
Africa 1952 1253 983 4570010 6317450 5.99 11.4
Africa 1957 1385 1135 5093033 7076042 7.36 14.5
Africa 1962 1598 1462 5702247 7957545 8.79 17.2
Africa 1967 2050 2848 6447875 8985505 11.44 23.2
Africa 1972 2340 3287 7305376 10130833 15.07 30.4
Africa 1977 2586 4142 8328097 11585184 18.70 38.1

mutate vs. summarize

It can be confusing to decide whether to use mutate or summarize. The key distinction is whether you want the output to have one row for each group or one row for each row in the original data frame:

  • mutate: creates new columns with as many rows as the original data frame
  • summarize: creates a data frame with as many rows as groups

Note that if you use an aggregation function such as mean() within mutate() without using group_by(), you’ll simply do the summary over all the rows of the input data frame.

And if you use an aggregation function such as mean() within summarize() without using group_by(), you’ll simply create an output data frame with one row (i.e., the whole input data frame is a single group).

10.2.7 Arrange Rows with arrange

As a last step, let’s say we want to sort the rows in our data frame according to values in a certain column. We can use the arrange() function to do this. For instance, let’s organize our rows by year (recent first), and then by continent.

gapminder_with_extra_vars <- gap %>%
    group_by(continent, year) %>%
    mutate(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop)) %>%
    arrange(desc(year), continent)
kable(head(gapminder_with_extra_vars))
country year pop continent lifeExp gdpPercap mean_gdpPercap sd_gdpPercap mean_pop sd_pop
Algeria 2007 33333216 Africa 72.3 6223 3089 3618 17875763 24917726
Angola 2007 12420476 Africa 42.7 4797 3089 3618 17875763 24917726
Benin 2007 8078314 Africa 56.7 1441 3089 3618 17875763 24917726
Botswana 2007 1639131 Africa 50.7 12570 3089 3618 17875763 24917726
Burkina Faso 2007 14326203 Africa 52.3 1217 3089 3618 17875763 24917726
Burundi 2007 8390505 Africa 49.6 430 3089 3618 17875763 24917726

10.3 dplyr and “Non-Standard Evaluation”

You may run across the term “non-standard evaluation”. The use of data frame variables without quotes around them is an example of this.

Why is this strange?

gap %>% select(continent, year) %>% tail()

Compare it to:

gap[ , c('continent', 'year')]
gap[ , continent]

Because continent and year are not variables our current environment, dplyr does some fancy stuff behind the scenes to save us from typing the quotes.

This is fine if you have a data analysis workflow, but if you want to write a function that, for example, selects an arbitrary set of columns, you’ll run into trouble.

## here's a helper function that computes the mean of a variable, stratifying by a grouping variable
grouped_mean <- function(data, group_var, summary_var) {
  data %>%
    group_by(group_var) %>%
    summarise(mean = mean(summary_var))
}
gap %>% grouped_mean(continent, lifeExp)
gap %>% grouped_mean('continent', 'lifeExp')

See the rlang or seplyr packages for how one can deal with this problem in this context of using functions.

10.4 Challenges

Challenge 1.

Use dplyr to create a data frame containing the median lifeExp for each continent.

Challenge 2.

Use dplyr to add a column to the gapminder dataset that contains the total population of the continent of each observation in a given year. For example, if the first observation is Afghanistan in 1952, the new column would contain the population of Asia in 1952.

Challenge 3.

Use dplyr to: (a) add a column called gdpPercap_diff that contains the difference between the observation’s gdpPercap and the mean gdpPercap of the continent in that year, (b) arrange the dataframe by the column you just created, in descending order (so that the relatively richest country-years are listed first).

hint: You might have to ungroup() before you arrange().

Acknowledgments

Some of these materials in this module were adapted from: