Chapter 10 Relational Data

It is rare that data analysis involves only a single table of data. Typically, you have many tables of data, and you must combine them to answer the questions that you are interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

Note that when we say relational database here, we are referring to how the data are structured, not to the use of any fancy software.

10.1 Why Relational Data

As social scientists, we are often working with data across different levels of analysis. The main principle of relational data is that each table is structured around the same observational unit.

Why is this important? Check out the following data.

messy <- data.frame(
  county = c(36037, 36038, 36039, 36040, NA , 37001, 37002, 37003),
  state = c('NY', 'NY', 'NY', NA, NA, 'VA', 'VA', 'VA'),
  cnty_pop = c(3817735, 422999, 324920, 143432, NA, 3228290, 449499, 383888),
  state_pop = c(43320903, 43320903, NA, 43320903, 43320903, 7173000, 7173000, 7173000),
  region = c(1, 1, 1, 1, 1, 3, 3, 4)
)

messy
#>   county state cnty_pop state_pop region
#> 1  36037    NY  3817735  43320903      1
#> 2  36038    NY   422999  43320903      1
#> 3  36039    NY   324920        NA      1
#> 4  36040  <NA>   143432  43320903      1
#> 5     NA  <NA>       NA  43320903      1
#> 6  37001    VA  3228290   7173000      3
#> 7  37002    VA   449499   7173000      3
#> 8  37003    VA   383888   7173000      4

What a mess! How can the population of the state of New York be 43 million for one county but "missing" for another? If this is a dataset of counties, what does it mean when the "county" field is missing? If region is something like Census region, how can two counties in the same state be in different regions? And why is it that all the counties whose codes start with 36 are in New York except for one, where the state is unknown?

If we follow the principles of relational data, each type of observational unit should form a table:

  • counties contains data on counties.
  • states contains data on states.

So our data should look like this:

counties <- data.frame(
  county = c(36037, 36038, 36039, 36040, 37001, 37002, 37003),
  state = c('NY', 'NY', 'NY', 'NY', 'VA', 'VA', 'VA'),
  county_pop = c(3817735, 422999, 324920, 143432, 3228290, 449499, 383888), stringsAsFactors = F
)
counties
#>   county state county_pop
#> 1  36037    NY    3817735
#> 2  36038    NY     422999
#> 3  36039    NY     324920
#> 4  36040    NY     143432
#> 5  37001    VA    3228290
#> 6  37002    VA     449499
#> 7  37003    VA     383888

states <- data.frame(
  state = c("NY", "VA"),
  state_pop = c(43320903, 7173000),
  region = c(1, 3), stringsAsFactors = F
)

states
#>   state state_pop region
#> 1    NY  43320903      1
#> 2    VA   7173000      3

County population is a property of a county, so it lives in the county table. State population is a property of a state, so it cannot live in the county table. If we had panel data on counties, we would need separate tables for things that vary at the county level (like state) and things that vary at the county-year level (like population).

Now the ambiguity is gone. Every county has a population and a state. Every state has a population and a region. There are no missing states, no missing counties, and no conflicting definitions. The database is self-documenting.

10.2 Keys

The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation; it can also be called a unique identifier.

  • Keys are complete. They never take on missing values.
  • Keys are unique. They are never duplicated across rows of a table.

In simple cases, a single variable is sufficient to identify an observation. In the example above, each county is identified with county (a numeric identifier); each state is identified with state (a two-letter string).

There are two types of keys:

  • A primary key uniquely identifies an observation in its own table. For example, counties$county is a primary key because it uniquely identifies each county in the counties table.

  • A foreign key uniquely identifies an observation in another table. For example, counties$state is a foreign key because it appears in the counties table where it matches each county to a unique state.

A primary key and the corresponding foreign key in another table form a relation.

Sometimes a table does not have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. If a table lacks a primary key, it is useful to add one with mutate() and row_number(). This is called a surrogate key.

10.3 Joins

Data stored in the form we have outlined above is considered normalized. In general, we should try to keep data normalized as far into the code pipeline as we can. Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes.

At some point, however, we are going to have to merge (or join) the tables together to produce and analyze a single dataframe.

Let's say we wanted to merge tables x and y. join allows us to combine variables from the two tables. It first matches observations by their keys, then copies across variables from one table to the other.

There are five join options:

  1. An inner join keeps observations that appear in both tables.
  2. A left join keeps all observations in x.
  3. A right join keeps all observations in y.
  4. A full join keeps all observations in x and all observations in y.
  5. An anti join keeps all observations in x that do not have a match in y.

The most commonly used join is the left_join(): you use this whenever you look up additional data from another table, because it preserves the original observations even when there is not a match. For example, a left_join() on x and y pulls in variables from y while preserving all the observations in x.

Let's say we want to combine the countries and states tables we created earlier.

counties_states <- counties %>%
  left_join(states, by = "state")

counties_states
#>   county state county_pop state_pop region
#> 1  36037    NY    3817735  43320903      1
#> 2  36038    NY     422999  43320903      1
#> 3  36039    NY     324920  43320903      1
#> 4  36040    NY     143432  43320903      1
#> 5  37001    VA    3228290   7173000      3
#> 6  37002    VA     449499   7173000      3
#> 7  37003    VA     383888   7173000      3

Notice there are two new columns: state_pop and region.

The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

10.4 Defining Keys

In the example above, the two tables were joined by a single variable, and that variable has the same name in both tables. That constraint was encoded by by = "key".

You can use other values for by to connect the tables in other ways:

  1. The default, by = NULL, uses all variables that appear in both tables, what we might call a "natural join."

For example, let's say we wanted to add a column to the gapminder dataset that encodes the regime type of each country-year observation. We will get that data from the polityIV dataset.

gap <- gapminder

polity <- read.csv("data/polity_sub.csv", stringsAsFactors = F)
head(polity)
#>       country year polity2
#> 1 Afghanistan 1800      -6
#> 2 Afghanistan 1801      -6
#> 3 Afghanistan 1802      -6
#> 4 Afghanistan 1803      -6
#> 5 Afghanistan 1804      -6
#> 6 Afghanistan 1805      -6

We are now ready to join the tables. The common keys between them are country and year:

gap1 <- gapminder %>%
  left_join(polity)
#> Joining, by = c("country", "year")

head(gap1)
#> # A tibble: 6 x 7
#>   country     continent  year lifeExp      pop gdpPercap polity2
#>   <chr>       <fct>     <int>   <dbl>    <int>     <dbl>   <int>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.     -10
#> 2 Afghanistan Asia       1957    30.3  9240934      821.     -10
#> 3 Afghanistan Asia       1962    32.0 10267083      853.     -10
#> 4 Afghanistan Asia       1967    34.0 11537966      836.      -7
#> 5 Afghanistan Asia       1972    36.1 13079460      740.      -7
#> 6 Afghanistan Asia       1977    38.4 14880372      786.      -7
  1. A character vector, by = c("x", "y"). This is like a natural join, but it uses only some of the common variables.

  2. A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

For example, let's add another variable to our gapminder dataset -- physical integrity rights -- from the CIRI dataset.

ciri <- read.csv("data/ciri_sub.csv", stringsAsFactors = F)
head(ciri)
#>          CTRY YEAR PHYSINT
#> 1 Afghanistan 1981       0
#> 2 Afghanistan 1982       0
#> 3 Afghanistan 1983       0
#> 4 Afghanistan 1984       0
#> 5 Afghanistan 1985       0
#> 6 Afghanistan 1986       0

Both datasets have country and year columns, but they are named differently.

gap2 <- gap1 %>%
  left_join(ciri, by = c("country" = "CTRY", "year" = "YEAR"))

head(gap2)
#> # A tibble: 6 x 8
#>   country     continent  year lifeExp      pop gdpPercap polity2 PHYSINT
#>   <chr>       <fct>     <int>   <dbl>    <int>     <dbl>   <int>   <int>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.     -10      NA
#> 2 Afghanistan Asia       1957    30.3  9240934      821.     -10      NA
#> 3 Afghanistan Asia       1962    32.0 10267083      853.     -10      NA
#> 4 Afghanistan Asia       1967    34.0 11537966      836.      -7      NA
#> 5 Afghanistan Asia       1972    36.1 13079460      740.      -7      NA
#> 6 Afghanistan Asia       1977    38.4 14880372      786.      -7      NA

Notice that PHYSINT is NA in the first 6 rows because the ciri dataset does not contain observations for Afghanistan in these years. But since we used left_join(), all observations in gapminder were preserved.

We can see some values for PHYSINT if we peak at the bottom of the dataset:

tail(gap2)
#> # A tibble: 6 x 8
#>   country  continent  year lifeExp      pop gdpPercap polity2 PHYSINT
#>   <chr>    <fct>     <int>   <dbl>    <int>     <dbl>   <int>   <int>
#> 1 Zimbabwe Africa     1982    60.4  7636524      789.       4       5
#> 2 Zimbabwe Africa     1987    62.4  9216418      706.      -6       5
#> 3 Zimbabwe Africa     1992    60.4 10704340      693.      -6       5
#> 4 Zimbabwe Africa     1997    46.8 11404948      792.      -6       6
#> 5 Zimbabwe Africa     2002    40.0 11926563      672.      -4       2
#> 6 Zimbabwe Africa     2007    43.5 12311143      470.      -4       1

10.5 Duplicate Keys

So far we have assumed that the keys are unique, but that is not always the case. For example,

x <- data.frame(key = c(1, 2),
               val_y = c("x1", "x2"))

y <- data.frame(key = c(1, 2, 2, 1),
               val_x = c("y1", "y2", "y3", "y4"))

left_join(x, y, by = "key")
#>   key val_y val_x
#> 1   1    x1    y1
#> 2   1    x1    y4
#> 3   2    x2    y2
#> 4   2    x2    y3

Notice that this can sometimes cause unintended duplicates.

10.6 Challenges

Challenge 1.

Merge the polityVI and CIRI datasets, keeping all observations in polityVI. Save this merged dataframe as p1. How many observations does p1 have? Why?

Challenge 2.

Merge the gap1 dataset we created above with the ciri dataset, this time keeping all observations in ciri. Save this as gap2. How many observations does it have? What is the major problem with merging the datasets this way?