Chapter 11 Relational Data
It is rare that data analysis involves only a single table of data. Typically, you have many tables of data, and you must combine them to answer the questions that you are interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Note that when we say relational database here, we are referring to how the data are structured, not to the use of any fancy software.
11.1 Why Relational Data
As social scientists, we are often working with data across different levels of analysis. The main principle of relational data is that each table is structured around the same observational unit.
Why is this important? Check out the following data.
messy <- data.frame( county = c(36037, 36038, 36039, 36040, NA , 37001, 37002, 37003), state = c('NY', 'NY', 'NY', NA, NA, 'VA', 'VA', 'VA'), cnty_pop = c(3817735, 422999, 324920, 143432, NA, 3228290, 449499, 383888), state_pop = c(43320903, 43320903, NA, 43320903, 43320903, 7173000, 7173000, 7173000), region = c(1, 1, 1, 1, 1, 3, 3, 4) ) messy #> county state cnty_pop state_pop region #> 1 36037 NY 3817735 43320903 1 #> 2 36038 NY 422999 43320903 1 #> 3 36039 NY 324920 NA 1 #> 4 36040 <NA> 143432 43320903 1 #> 5 NA <NA> NA 43320903 1 #> 6 37001 VA 3228290 7173000 3 #> 7 37002 VA 449499 7173000 3 #> 8 37003 VA 383888 7173000 4
What a mess! How can the population of the state of New York be 43 million for one county but “missing” for another? If this is a dataset of counties, what does it mean when the “county” field is missing? If region is something like Census region, how can two counties in the same state be in different regions? And why is it that all the counties whose codes start with 36 are in New York except for one, where the state is unknown?
If we follow the principles of relational data, each type of observational unit should form a table:
countiescontains data on counties.
statescontains data on states.
So our data should look like this:
counties <- data.frame( county = c(36037, 36038, 36039, 36040, 37001, 37002, 37003), state = c('NY', 'NY', 'NY', 'NY', 'VA', 'VA', 'VA'), county_pop = c(3817735, 422999, 324920, 143432, 3228290, 449499, 383888), stringsAsFactors = F ) counties #> county state county_pop #> 1 36037 NY 3817735 #> 2 36038 NY 422999 #> 3 36039 NY 324920 #> 4 36040 NY 143432 #> 5 37001 VA 3228290 #> 6 37002 VA 449499 #> 7 37003 VA 383888 states <- data.frame( state = c("NY", "VA"), state_pop = c(43320903, 7173000), region = c(1, 3), stringsAsFactors = F ) states #> state state_pop region #> 1 NY 43320903 1 #> 2 VA 7173000 3
County population is a property of a county, so it lives in the county table. State population is a property of a state, so it cannot live in the county table. If we had panel data on counties, we would need separate tables for things that vary at the county level (like state) and things that vary at the county-year level (like population).
Now the ambiguity is gone. Every county has a population and a state. Every state has a population and a region. There are no missing states, no missing counties, and no conflicting definitions. The database is self-documenting.
The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation; it can also be called a unique identifier.
- Keys are complete. They never take on missing values.
- Keys are unique. They are never duplicated across rows of a table.
In simple cases, a single variable is sufficient to identify an observation. In the example above, each county is identified with county (a numeric identifier); each state is identified with state (a two-letter string).
There are two types of keys:
A primary key uniquely identifies an observation in its own table. For example,
counties$countyis a primary key because it uniquely identifies each county in the counties table.
A foreign key uniquely identifies an observation in another table. For example,
counties$stateis a foreign key because it appears in the counties table where it matches each county to a unique state.
A primary key and the corresponding foreign key in another table form a relation.
Sometimes a table does not have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. If a table lacks a primary key, it is useful to add one with
row_number(). This is called a surrogate key.
Data stored in the form we have outlined above is considered normalized. In general, we should try to keep data normalized as far into the code pipeline as we can. Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes.
At some point, however, we are going to have to merge (or join) the tables together to produce and analyze a single dataframe.
Let’s say we wanted to merge tables
y. join allows us to combine variables from the two tables. It first matches observations by their keys, then copies across variables from one table to the other.
There are five join options:
- An inner join keeps observations that appear in both tables.
- A left join keeps all observations in
- A right join keeps all observations in
- A full join keeps all observations in
xand all observations in
- An anti join keeps all observations in
xthat do not have a match in
The most commonly used join is the
left_join(): you use this whenever you look up additional data from another table, because it preserves the original observations even when there is not a match. For example, a
y pulls in variables from
y while preserving all the observations in
Let’s say we want to combine the
states tables we created earlier.
counties_states <- counties %>% left_join(states, by = "state") counties_states #> county state county_pop state_pop region #> 1 36037 NY 3817735 43320903 1 #> 2 36038 NY 422999 43320903 1 #> 3 36039 NY 324920 43320903 1 #> 4 36040 NY 143432 43320903 1 #> 5 37001 VA 3228290 7173000 3 #> 6 37002 VA 449499 7173000 3 #> 7 37003 VA 383888 7173000 3
Notice there are two new columns:
The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
11.4 Defining Keys
In the example above, the two tables were joined by a single variable, and that variable has the same name in both tables. That constraint was encoded by
by = "key".
You can use other values for
by to connect the tables in other ways:
- The default,
by = NULL, uses all variables that appear in both tables, what we might call a “natural join.”
For example, let’s say we wanted to add a column to the
gapminder dataset that encodes the regime type of each country-year observation. We will get that data from the Polity IV dataset.
We are now ready to join the tables. The common keys between them are
gap1 <- gapminder %>% left_join(polity) #> Joining, by = c("country", "year") head(gap1) #> # A tibble: 6 x 7 #> country continent year lifeExp pop gdpPercap polity2 #> <chr> <fct> <int> <dbl> <int> <dbl> <int> #> 1 Afghanistan Asia 1952 28.8 8425333 779. -10 #> 2 Afghanistan Asia 1957 30.3 9240934 821. -10 #> 3 Afghanistan Asia 1962 32.0 10267083 853. -10 #> 4 Afghanistan Asia 1967 34.0 11537966 836. -7 #> 5 Afghanistan Asia 1972 36.1 13079460 740. -7 #> 6 Afghanistan Asia 1977 38.4 14880372 786. -7
A character vector,
by = c("x", "y"). This is like a natural join, but it uses only some of the common variables.
A named character vector:
by = c("a" = "b"). This will match variable
y. The variables from
xwill be used in the output.
For example, let’s add another variable to our
gapminder dataset – physical integrity rights – from the CIRI dataset.
Both datasets have country and year columns, but they are named differently.
gap2 <- gap1 %>% left_join(ciri, by = c("country" = "CTRY", "year" = "YEAR")) head(gap2) #> # A tibble: 6 x 8 #> country continent year lifeExp pop gdpPercap polity2 PHYSINT #> <chr> <fct> <int> <dbl> <int> <dbl> <int> <int> #> 1 Afghanistan Asia 1952 28.8 8425333 779. -10 NA #> 2 Afghanistan Asia 1957 30.3 9240934 821. -10 NA #> 3 Afghanistan Asia 1962 32.0 10267083 853. -10 NA #> 4 Afghanistan Asia 1967 34.0 11537966 836. -7 NA #> 5 Afghanistan Asia 1972 36.1 13079460 740. -7 NA #> 6 Afghanistan Asia 1977 38.4 14880372 786. -7 NA
NA in the first 6 rows because the
ciri dataset does not contain observations for Afghanistan in these years. But since we used
left_join(), all observations in
gapminder were preserved.
We can see some values for
PHYSINT if we peak at the bottom of the dataset:
tail(gap2) #> # A tibble: 6 x 8 #> country continent year lifeExp pop gdpPercap polity2 PHYSINT #> <chr> <fct> <int> <dbl> <int> <dbl> <int> <int> #> 1 Zimbabwe Africa 1982 60.4 7636524 789. 4 5 #> 2 Zimbabwe Africa 1987 62.4 9216418 706. -6 5 #> 3 Zimbabwe Africa 1992 60.4 10704340 693. -6 5 #> 4 Zimbabwe Africa 1997 46.8 11404948 792. -6 6 #> 5 Zimbabwe Africa 2002 40.0 11926563 672. -4 2 #> 6 Zimbabwe Africa 2007 43.5 12311143 470. -4 1
11.5 Duplicate Keys
So far we have assumed that the keys are unique, but that is not always the case. For example,
Notice that this can sometimes cause unintended duplicates.
Merge the Polity IV and CIRI datasets, keeping all observations in Polity IV. Save this merged dataframe as
p1. How many observations does
p1 have? Why?
gap1 dataset we created above with the
ciri dataset, this time keeping all observations in
ciri. Save this as
gap2. How many observations does it have? What is the major problem with merging the datasets this way?
This page is in part derived from the following sources: