Chapter 12 Relational Data

It’s rare that data analysis involves only a single table of data. Typically, you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

Note that when we say relational database here, we are referring to how the data are structured, not to the use of any fancy software.

12.1 Why Relational Data

As social scientists, we’re often working with data across different levels of analysis. The main principle of relational data is that each table is structured around the same observational unit.

Why is this important? Check out the following data.

messy <- data.frame(
  county = c(36037, 36038, 36039, 36040, NA , 37001, 37002, 37003),
  state = c('NY', 'NY', 'NY', NA, NA, 'VA', 'VA', 'VA'),
  cnty_pop = c(3817735, 422999, 324920, 143432, NA, 3228290, 449499, 383888),
  state_pop = c(43320903, 43320903, NA, 43320903, 43320903, 7173000, 7173000, 7173000),
  region = c(1, 1, 1, 1, 1, 3, 3, 4)
)

kable(messy)
county state cnty_pop state_pop region
36037 NY 3817735 43320903 1
36038 NY 422999 43320903 1
36039 NY 324920 NA 1
36040 NA 143432 43320903 1
NA NA NA 43320903 1
37001 VA 3228290 7173000 3
37002 VA 449499 7173000 3
37003 VA 383888 7173000 4

What a mess! How can the population of the state of New York be 43 million for one county but “missing” for another? If this is a dataset of counties, what does it mean when the “county” field is missing? If region is something like Census region, how can two counties in the same state be in different regions? And why is it that all the counties whose codes start with 36 are in New York except for one, where the state is unknown?

If we follow the principles of relational data, each type of observational unit should form a table.

  • counties contains data on counties.
  • states contains data on states

So our data should look like:

counties <- data.frame(
  county = c(36037, 36038, 36039, 36040, 37001, 37002, 37003),
  state = c('NY', 'NY', 'NY', 'NY', 'VA', 'VA', 'VA'),
  county_pop = c(3817735, 422999, 324920, 143432, 3228290, 449499, 383888), stringsAsFactors = F
)
kable(counties)
county state county_pop
36037 NY 3817735
36038 NY 422999
36039 NY 324920
36040 NY 143432
37001 VA 3228290
37002 VA 449499
37003 VA 383888

states <- data.frame(
  state = c("NY", "VA"),
  state_pop = c(43320903, 7173000),
  region = c(1, 3), stringsAsFactors = F
)
kable(states)
state state_pop region
NY 43320903 1
VA 7173000 3

County population is a property of a county, so it lives in the county table. State population is a property of a state, so it cannot live in the county table. If we had panel data on counties, we would need separate tables for things that vary at the county level (like state) and things that vary at the county-year level (like population).

Now the ambiguity is gone. Every county has a population and a state. Every state has a population and a region. There are no missing states, no missing counties, and no conflicting definitions. The database is self-documenting.

12.2 Keys

The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation. Also called a unique identifier.

  • Keys are complete. They never take on missing values.
  • Keys are unique. They are neer duplicated across rows of a table.

In simple cases, a single variable is sufficient to identify an observation. In the example above, each county is identified with county (a numeric identifier); each state is identified with state (a two-letter string).

There are two types of keys:

  • A primary key uniquely identifies an observation in its own table. For example, counties$county is a primary key because it uniquely identifies each county in the counties table.

  • A foreign key uniquely identifies an observation in another table. For example, the counties$state is a foreign key because it appears in the counties table where it matches each county to a unique state.

A primary key and the corresponding foreign key in another table form a relation.

Sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. If a table lacks a primary key, it’s useful to add one with mutate() and row_number(). This is called a surrogate key.

12.3 Joins

Data stored in the form we have outlined above is considered normalized. In general, we should try to keep data normalized as far into the code pipeline as we can. Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes.

At some point, however, we’re going to have to merge (or join) the tables together to produce a single dataframe, and conduct analysis on that dataframe.

Let’s say we wanted to merge tables x and y. A join allows you to combine variables from the two tables. It first matches observations by their keys, then copies across variables from one table to the other.

There are four join options:

  1. An inner join keeps observations that appear in both tables.
  2. A left join keeps all observations in x.
  3. A right join keeps all observations in y.
  4. A full join keeps all observations in x and all observations in y.

The most commonly used join is the left_join(): you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. For example, a left_join() on x and y pulls in variables form y while preserving all the observations on x.

Let’s say we want to combine the countries and states tables we created earlier.

counties_states <- counties %>%
  left_join(states, by = "state")

kable(counties_states)
county state county_pop state_pop region
36037 NY 3817735 43320903 1
36038 NY 422999 43320903 1
36039 NY 324920 43320903 1
36040 NY 143432 43320903 1
37001 VA 3228290 7173000 3
37002 VA 449499 7173000 3
37003 VA 383888 7173000 3

Notice there are two new columns: state_pop and region.

The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

12.4 Defining Keys

In the example above, the two tables were joined by a single variable, and that variable has the same name in both tables. That constraint was encoded by by = "key".

You can use other values for by to connect the tables in other ways:

  1. The default, by = NULL, uses all variables that appear in both tables, what we might call a “natural join”.

For example, let’s say we wanted to add a column to the gapminder dataset that encodes the regime type of each country-year observation. We’ll get that data from the polityVI dataset.

gapminder <- read.csv("data/gapminder.csv", stringsAsFactors = F)
polity <- read.csv("data/polity_sub.csv", stringsAsFactors = F)
kable(head(polity))
country year polity2
Afghanistan 1800 -6
Afghanistan 1801 -6
Afghanistan 1802 -6
Afghanistan 1803 -6
Afghanistan 1804 -6
Afghanistan 1805 -6

We’re now ready to join the tables. The common keys between them are country and year:

gap1 <- gapminder %>%
  left_join(polity)
#> Joining, by = c("country", "year")

kable(head(gap1))
country year pop continent lifeExp gdpPercap polity2
Afghanistan 1952 8425333 Asia 28.8 779 -10
Afghanistan 1957 9240934 Asia 30.3 821 -10
Afghanistan 1962 10267083 Asia 32.0 853 -10
Afghanistan 1967 11537966 Asia 34.0 836 -7
Afghanistan 1972 13079460 Asia 36.1 740 -7
Afghanistan 1977 14880372 Asia 38.4 786 -7
  1. A character vector, by = c("x", "y"). This is like a natural join, but uses only some of the common variables.

  2. A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

For example, let’s add another variable to our gapminder dataset – physical integrity rights – from the CIRI dataset.

ciri <- read.csv("data/ciri_sub.csv", stringsAsFactors = F)
kable(head(ciri))
CTRY YEAR PHYSINT
Afghanistan 1981 0
Afghanistan 1982 0
Afghanistan 1983 0
Afghanistan 1984 0
Afghanistan 1985 0
Afghanistan 1986 0

Both datasets have country and year columns, but they are named differently.

gap2 <- gap1 %>%
  left_join(ciri, by = c("country" = "CTRY", "year" = "YEAR"))

kable(head(gap2))
country year pop continent lifeExp gdpPercap polity2 PHYSINT
Afghanistan 1952 8425333 Asia 28.8 779 -10 NA
Afghanistan 1957 9240934 Asia 30.3 821 -10 NA
Afghanistan 1962 10267083 Asia 32.0 853 -10 NA
Afghanistan 1967 11537966 Asia 34.0 836 -7 NA
Afghanistan 1972 13079460 Asia 36.1 740 -7 NA
Afghanistan 1977 14880372 Asia 38.4 786 -7 NA

Notice that PHYSINT is NA in the first 6 rows because the ciri dataset does not contain observations for Afghanistan in these years. But because we used left_join(), all observations in gapminder were preserved.

We can see some values for PHYSINT if we peak at the bottom of the dataset:

kable(tail(gap2))
country year pop continent lifeExp gdpPercap polity2 PHYSINT
1699 Zimbabwe 1982 7636524 Africa 60.4 789 4 5
1700 Zimbabwe 1987 9216418 Africa 62.4 706 -6 5
1701 Zimbabwe 1992 10704340 Africa 60.4 693 -6 5
1702 Zimbabwe 1997 11404948 Africa 46.8 792 -6 6
1703 Zimbabwe 2002 11926563 Africa 40.0 672 -4 2
1704 Zimbabwe 2007 12311143 Africa 43.5 470 -4 1

12.5 Duplicate Keys

So far we have assumed that the keys are unique. But that’s not always the case. For example,

x <- data.frame(key = c(1, 2),
               val_y = c("x1", "x2"))

y <- data.frame(key = c(1, 2, 2, 1),
               val_x = c("y1", "y2", "y3", "y4"))

left_join(x, y, by = "key")
#>   key val_y val_x
#> 1   1    x1    y1
#> 2   1    x1    y4
#> 3   2    x2    y2
#> 4   2    x2    y3

Notice that this can sometimes cause unintended duplicates.