Chapter 9 Tidying Data

Even before we conduct analyses or calculations, we need to put our data into the correct format. The goal here is to rearrange a messy dataset into one that is tidy.

The two most important properties of tidy data are:

  1. Each column is a variable.
  2. Each row is an observation.

Tidy data is easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). The data then becomes easier to manipulate, visualize, and model.

For more on the concept of tidy data, read Hadley Wickham's paper here.

9.1 Wide vs. Long Formats

Tidy datasets are all alike, but every messy dataset is messy in its own way.

Hadley Wickham

Tabular datasets can be arranged in many ways. For instance, consider the data below. Each dataset displays information on heart rates observed in individuals across three different time periods, but the data are organized differently in each table.

wide <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  time1 = c(67, 80, 64),
  time2 = c(56, 90, 50),
  time3 = c(70, 67, 101)
)
wide
#>      name time1 time2 time3
#> 1  Wilbur    67    56    70
#> 2 Petunia    80    90    67
#> 3 Gregory    64    50   101

long <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory"),
  time = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  heartrate = c(67, 80, 64, 56, 90, 50, 70, 67, 10)
)
long
#>      name time heartrate
#> 1  Wilbur    1        67
#> 2 Petunia    1        80
#> 3 Gregory    1        64
#> 4  Wilbur    2        56
#> 5 Petunia    2        90
#> 6 Gregory    2        50
#> 7  Wilbur    3        70
#> 8 Petunia    3        67
#> 9 Gregory    3        10

Question: Which one of these do you think is the tidy format?

Answer: The first dataframe (the "wide" one) would not be considered tidy because values (i.e., heart rate) are spread across multiple columns.

We often refer to these different structures as "long" vs. "wide" formats:

  • In the "long" format, you usually have one column for the observed variable, and the other columns are ID variables.

  • In the "wide" format, each row is often a site/subject/patient, and you have multiple observation variables containing the same type of data. These can be either repeated observations over time or observations of multiple variables (or a mix of both). In the case above, we had the same kind of data (heart rate) entered across three different columns, corresponding to three different time periods.

You may find data input in the "wide" format to be simpler, and some other applications may prefer "wide"-format data. However, many of R’s functions have been designed assuming you have "long"-format data.

9.2 Tidying the Gapminder Data

Let's look at the structure of our original gapminder dataframe:

library(gapminder)

gap <- gapminder
head(gap)
#> # A tibble: 6 x 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.
#> 2 Afghanistan Asia       1957    30.3  9240934      821.
#> 3 Afghanistan Asia       1962    32.0 10267083      853.
#> 4 Afghanistan Asia       1967    34.0 11537966      836.
#> 5 Afghanistan Asia       1972    36.1 13079460      740.
#> 6 Afghanistan Asia       1977    38.4 14880372      786.

Question: Is this data frame wide or long?

Answer: This data frame is somewhere in between the purely 'long' and 'wide' formats. We have three "ID variables" (continent, country, year) and three "observation variables" (pop, lifeExp, gdpPercap).

Despite not having all observations in one column, this intermediate format makes sense given that all three observation variables have different units. As we have seen, many of the functions in R are often vector-based, and you usually do not want to do mathematical operations on values with different units.

On the other hand, there are some instances in which a purely long or wide format is ideal (e.g., plotting). Likewise, sometimes you will get data on your desk that is poorly organized, and you will need to reshape it.

9.3 tidyr Functions

Thankfully, the tidyr package will help you efficiently transform your data regardless of their original format.

# Load the "tidyverse" package (necessary every new R session):
require(tidyverse)

9.3.1 gather

Until now, we have been using the nicely formatted original gapminder dataset. This dataset is not quite wide and not quite long -- it is something in the middle -- but 'real' data (i.e., our own research data) will never be so well organized. Here let's start with the wide-format version of the gapminder dataset.

gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
head(gap_wide)
#>   continent      country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
#> 1    Africa      Algeria           2449           3014           2551
#> 2    Africa       Angola           3521           3828           4269
#> 3    Africa        Benin           1063            960            949
#> 4    Africa     Botswana            851            918            984
#> 5    Africa Burkina Faso            543            617            723
#> 6    Africa      Burundi            339            380            355
#>   gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987
#> 1           3247           4183           4910           5745           5681
#> 2           5523           5473           3009           2757           2430
#> 3           1036           1086           1029           1278           1226
#> 4           1215           2264           3215           4551           6206
#> 5            795            855            743            807            912
#> 6            413            464            556            560            622
#>   gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007 lifeExp_1952
#> 1           5023           4797           5288           6223         43.1
#> 2           2628           2277           2773           4797         30.0
#> 3           1191           1233           1373           1441         38.2
#> 4           7954           8647          11004          12570         47.6
#> 5            932            946           1038           1217         32.0
#> 6            632            463            446            430         39.0
#>   lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972 lifeExp_1977 lifeExp_1982
#> 1         45.7         48.3         51.4         54.5         58.0         61.4
#> 2         32.0         34.0         36.0         37.9         39.5         39.9
#> 3         40.4         42.6         44.9         47.0         49.2         50.9
#> 4         49.6         51.5         53.3         56.0         59.3         61.5
#> 5         34.9         37.8         40.7         43.6         46.1         48.1
#> 6         40.5         42.0         43.5         44.1         45.9         47.5
#>   lifeExp_1987 lifeExp_1992 lifeExp_1997 lifeExp_2002 lifeExp_2007 pop_1952
#> 1         65.8         67.7         69.2         71.0         72.3  9279525
#> 2         39.9         40.6         41.0         41.0         42.7  4232095
#> 3         52.3         53.9         54.8         54.4         56.7  1738315
#> 4         63.6         62.7         52.6         46.6         50.7   442308
#> 5         49.6         50.3         50.3         50.6         52.3  4469979
#> 6         48.2         44.7         45.3         47.4         49.6  2445618
#>   pop_1957 pop_1962 pop_1967 pop_1972 pop_1977 pop_1982 pop_1987 pop_1992
#> 1 10270856 11000948 12760499 14760787 17152804 20033753 23254956 26298373
#> 2  4561361  4826015  5247469  5894858  6162675  7016384  7874230  8735988
#> 3  1925173  2151895  2427334  2761407  3168267  3641603  4243788  4981671
#> 4   474639   512764   553541   619351   781472   970347  1151184  1342614
#> 5  4713416  4919632  5127935  5433886  5889574  6634596  7586551  8878303
#> 6  2667518  2961915  3330989  3529983  3834415  4580410  5126023  5809236
#>   pop_1997 pop_2002 pop_2007
#> 1 29072015 31287142 33333216
#> 2  9875024 10866106 12420476
#> 3  6066080  7026113  8078314
#> 4  1536536  1630347  1639131
#> 5 10352843 12251209 14326203
#> 6  6121610  7021078  8390505

The first step towards getting our nice intermediate data format is to first convert from the wide to the long format.

The function gather() will 'gather' the observation variables into a single variable. This is sometimes called "melting" your data, because it melts the table from wide to long. Those data will be melted into two variables: one for the variable names and the other for the variable values.

gap_long <- gap_wide %>%
    gather(obstype_year, obs_values, 3:38)
head(gap_long)
#>   continent      country   obstype_year obs_values
#> 1    Africa      Algeria gdpPercap_1952       2449
#> 2    Africa       Angola gdpPercap_1952       3521
#> 3    Africa        Benin gdpPercap_1952       1063
#> 4    Africa     Botswana gdpPercap_1952        851
#> 5    Africa Burkina Faso gdpPercap_1952        543
#> 6    Africa      Burundi gdpPercap_1952        339

Notice that we put three arguments into the gather() function:

  1. The name for the new ID variable (obstype_year).
  2. The name for the new amalgamated observation variable (obs_value).
  3. The indices of the old observation variables (3:38, signalling columns 3 through 38) that we want to gather into one variable. Notice that we do not want to melt down columns 1 and 2, as these are considered ID variables.

We can select observation variables using:

  • Variable indices.
  • Variable names (without quotes).
  • x:z to select all variables between x and z.
  • -y to exclude y.
  • starts_with(x, ignore.case = TRUE): All names that start with x.
  • ends_with(x, ignore.case = TRUE): All names that end with x.
  • contains(x, ignore.case = TRUE): All names that contain x.

See the select() function in dplyr for more options.

For instance, here we do the same thing with (1) the starts_with function and (2) the - operator:

# 1. With the starts_with() function:
gap_long <- gap_wide %>%
    gather(obstype_year, obs_values, starts_with('pop'),
           starts_with('lifeExp'), starts_with('gdpPercap'))
head(gap_long)
#>   continent      country obstype_year obs_values
#> 1    Africa      Algeria     pop_1952    9279525
#> 2    Africa       Angola     pop_1952    4232095
#> 3    Africa        Benin     pop_1952    1738315
#> 4    Africa     Botswana     pop_1952     442308
#> 5    Africa Burkina Faso     pop_1952    4469979
#> 6    Africa      Burundi     pop_1952    2445618

# 2. With the - operator:
gap_long <- gap_wide %>% 
  gather(obstype_year, obs_values, -continent, -country)
head(gap_long)
#>   continent      country   obstype_year obs_values
#> 1    Africa      Algeria gdpPercap_1952       2449
#> 2    Africa       Angola gdpPercap_1952       3521
#> 3    Africa        Benin gdpPercap_1952       1063
#> 4    Africa     Botswana gdpPercap_1952        851
#> 5    Africa Burkina Faso gdpPercap_1952        543
#> 6    Africa      Burundi gdpPercap_1952        339

However you choose to do it, notice that the output collapses all of the measured variables into two columns: one containing the new ID variable, the other containing the observation value for that row.

9.3.2 separate

You will notice that, in our long dataset, obstype_year actually contains two pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year.

We can use the separate() function to split the character strings into multiple variables.

gap_long_sep <- gap_long %>% 
  separate(obstype_year, into = c('obs_type','year'), sep = "_") %>% 
  mutate(year = as.integer(year))
head(gap_long_sep)
#>   continent      country  obs_type year obs_values
#> 1    Africa      Algeria gdpPercap 1952       2449
#> 2    Africa       Angola gdpPercap 1952       3521
#> 3    Africa        Benin gdpPercap 1952       1063
#> 4    Africa     Botswana gdpPercap 1952        851
#> 5    Africa Burkina Faso gdpPercap 1952        543
#> 6    Africa      Burundi gdpPercap 1952        339

9.3.3 spread

The opposite of gather() is spread(). It spreads our observation variables back out to make a wider table. We can use this function to spread our gap_long() to the original "medium" format.

gap_medium <- gap_long_sep %>% 
  spread(obs_type, obs_values)
head(gap_medium)
#>   continent country year gdpPercap lifeExp      pop
#> 1    Africa Algeria 1952      2449    43.1  9279525
#> 2    Africa Algeria 1957      3014    45.7 10270856
#> 3    Africa Algeria 1962      2551    48.3 11000948
#> 4    Africa Algeria 1967      3247    51.4 12760499
#> 5    Africa Algeria 1972      4183    54.5 14760787
#> 6    Africa Algeria 1977      4910    58.0 17152804

All we need is some quick fixes to make this dataset identical to the original gapminder dataset:

gap <- gapminder
head(gap)
#> # A tibble: 6 x 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.
#> 2 Afghanistan Asia       1957    30.3  9240934      821.
#> 3 Afghanistan Asia       1962    32.0 10267083      853.
#> 4 Afghanistan Asia       1967    34.0 11537966      836.
#> 5 Afghanistan Asia       1972    36.1 13079460      740.
#> 6 Afghanistan Asia       1977    38.4 14880372      786.
# Rearrange columns:
gap_medium <- gap_medium %>%
  select(country, continent, year, lifeExp, pop, gdpPercap)
head(gap_medium)
#>   country continent year lifeExp      pop gdpPercap
#> 1 Algeria    Africa 1952    43.1  9279525      2449
#> 2 Algeria    Africa 1957    45.7 10270856      3014
#> 3 Algeria    Africa 1962    48.3 11000948      2551
#> 4 Algeria    Africa 1967    51.4 12760499      3247
#> 5 Algeria    Africa 1972    54.5 14760787      4183
#> 6 Algeria    Africa 1977    58.0 17152804      4910
# Arrange by country, continent, and year:
gap_medium <- gap_medium %>% 
  arrange(country,continent,year)
head(gap_medium)
#>       country continent year lifeExp      pop gdpPercap
#> 1 Afghanistan      Asia 1952    28.8  8425333       779
#> 2 Afghanistan      Asia 1957    30.3  9240934       821
#> 3 Afghanistan      Asia 1962    32.0 10267083       853
#> 4 Afghanistan      Asia 1967    34.0 11537966       836
#> 5 Afghanistan      Asia 1972    36.1 13079460       740
#> 6 Afghanistan      Asia 1977    38.4 14880372       786

What we just told you will become obsolete...

gather and spread are being replaced by pivot_longer and pivot_wider in tidyr 1.0.0, which uses ideas from the cdata package to make reshaping easier to think about. In future classes, we will migrate to those functions.

9.4 Dealing with Missing Data

A common challenge in applying quantitative tools to social science problems is dealing with missing data. You'll see a variety of ways the creators of data sets designate that a piece of data is missing - for example, NA, -99, or -77 are sometimes used to denote a missing piece of data. We recommend using NA, which has a variety of associated functions that are useful when transforming missing data.

9.4.1 na_if

You can use na_if to replace certain pieces of data with NA. Consider the case where lifeExp values below 35 are missing and simply filled in with "unknown":

gap_medium$lifeExp[gap_medium$lifeExp < 35] <- "unknown"

This is problematic for many reason, including that we can't perform simple mathematical functions on columns with both number and character values:

mean(gap_medium$lifeExp, na.rm = TRUE)
#> Warning in mean.default(gap_medium$lifeExp, na.rm = TRUE): argument is not
#> numeric or logical: returning NA
#> [1] NA

However, NAs are different in the sense that they can exist in a numeric vector, and therefore you still can perform math functions (R will omit those observations with NA). Below, we replace the "unknown" values with NA, and we can then calculate the mean lifeExp.

gap_medium <- gap_medium %>%
  mutate(lifeExp = na_if(lifeExp, "unknown"),
         lifeExp = as.double(lifeExp))

mean(gap_medium$lifeExp, na.rm = TRUE)
#> [1] 60

9.4.2 Replace NA values with replace_na

Sometimes, you'll want to replace all NA values in your data (for instance, maybe you know that the true value of anything coded as NA is actually 30). replace_na is a simple command that will replace all NAs with a new value.

gap_na_replaced <- gap_medium %>%
  mutate(lifeExp = replace_na(lifeExp, 30))

head(gap_na_replaced)
#>       country continent year lifeExp      pop gdpPercap
#> 1 Afghanistan      Asia 1952    30.0  8425333       779
#> 2 Afghanistan      Asia 1957    30.0  9240934       821
#> 3 Afghanistan      Asia 1962    30.0 10267083       853
#> 4 Afghanistan      Asia 1967    30.0 11537966       836
#> 5 Afghanistan      Asia 1972    36.1 13079460       740
#> 6 Afghanistan      Asia 1977    38.4 14880372       786

9.5 More tidyverse

dplyr and tidyr have many more functions to help you wrangle and manipulate your data. See the Data Wrangling Cheatsheet for more.

There are some other useful packages in the tidyverse:

  • ggplot2 for plotting (we will cover this in the Visualization module).
  • readr and haven for reading in data.
  • purrr for working iterations.
  • stringr, lubridate, and forcats for manipulating strings, dates, and factors, respectively.
  • Many many more! Take a peak at the tidyverse github page...

9.6 Challenges

Challenge 1.

Subset the results from Challenge #3 (of the previous chapter) to select only the country, year, and gdpPercap_diff columns. Use tidyr to put it in wide format so that countries are rows and years are columns.

Challenge 2.

Now turn the dataframe above back into the long format with three columns: country, year, and gdpPercap_diff.

Acknowledgments

Some of the materials in this module were adapted from:

library(tidyverse)
library(gapminder)