Chapter 11 Relational Data
It is rare that data analysis involves only a single table of data. Typically, you have many tables of data, and you must combine them to answer the questions that you are interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Note that when we say relational database here, we are referring to how the data are structured, not to the use of any fancy software.
11.1 Why Relational Data
As social scientists, we are often working with data across different levels of analysis. The main principle of relational data is that each table is structured around the same observational unit.
Why is this important? Check out the following data.
messy <- data.frame(
county = c(36037, 36038, 36039, 36040, NA , 37001, 37002, 37003),
state = c('NY', 'NY', 'NY', NA, NA, 'VA', 'VA', 'VA'),
cnty_pop = c(3817735, 422999, 324920, 143432, NA, 3228290, 449499, 383888),
state_pop = c(43320903, 43320903, NA, 43320903, 43320903, 7173000, 7173000, 7173000),
region = c(1, 1, 1, 1, 1, 3, 3, 4)
)
messy
#> county state cnty_pop state_pop region
#> 1 36037 NY 3817735 43320903 1
#> 2 36038 NY 422999 43320903 1
#> 3 36039 NY 324920 NA 1
#> 4 36040 <NA> 143432 43320903 1
#> 5 NA <NA> NA 43320903 1
#> 6 37001 VA 3228290 7173000 3
#> 7 37002 VA 449499 7173000 3
#> 8 37003 VA 383888 7173000 4
What a mess! How can the population of the state of New York be 43 million for one county but “missing” for another? If this is a dataset of counties, what does it mean when the “county” field is missing? If region is something like Census region, how can two counties in the same state be in different regions? And why is it that all the counties whose codes start with 36 are in New York except for one, where the state is unknown?
If we follow the principles of relational data, each type of observational unit should form a table:
counties
contains data on counties.states
contains data on states.
So our data should look like this:
counties <- data.frame(
county = c(36037, 36038, 36039, 36040, 37001, 37002, 37003),
state = c('NY', 'NY', 'NY', 'NY', 'VA', 'VA', 'VA'),
county_pop = c(3817735, 422999, 324920, 143432, 3228290, 449499, 383888), stringsAsFactors = F
)
counties
#> county state county_pop
#> 1 36037 NY 3817735
#> 2 36038 NY 422999
#> 3 36039 NY 324920
#> 4 36040 NY 143432
#> 5 37001 VA 3228290
#> 6 37002 VA 449499
#> 7 37003 VA 383888
states <- data.frame(
state = c("NY", "VA"),
state_pop = c(43320903, 7173000),
region = c(1, 3), stringsAsFactors = F
)
states
#> state state_pop region
#> 1 NY 43320903 1
#> 2 VA 7173000 3
County population is a property of a county, so it lives in the county table. State population is a property of a state, so it cannot live in the county table. If we had panel data on counties, we would need separate tables for things that vary at the county level (like state) and things that vary at the county-year level (like population).
Now the ambiguity is gone. Every county has a population and a state. Every state has a population and a region. There are no missing states, no missing counties, and no conflicting definitions. The database is self-documenting.
11.2 Keys
The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation; it can also be called a unique identifier.
- Keys are complete. They never take on missing values.
- Keys are unique. They are never duplicated across rows of a table.
In simple cases, a single variable is sufficient to identify an observation. In the example above, each county is identified with county (a numeric identifier); each state is identified with state (a two-letter string).
There are two types of keys:
A primary key uniquely identifies an observation in its own table. For example,
counties$county
is a primary key because it uniquely identifies each county in the counties table.A foreign key uniquely identifies an observation in another table. For example,
counties$state
is a foreign key because it appears in the counties table where it matches each county to a unique state.
A primary key and the corresponding foreign key in another table form a relation.
Sometimes a table does not have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. If a table lacks a primary key, it is useful to add one with mutate()
and row_number()
. This is called a surrogate key.
11.3 Joins
Data stored in the form we have outlined above is considered normalized. In general, we should try to keep data normalized as far into the code pipeline as we can. Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes.
At some point, however, we are going to have to merge (or join) the tables together to produce and analyze a single dataframe.
Let’s say we wanted to merge tables x
and y
. join allows us to combine variables from the two tables. It first matches observations by their keys, then copies across variables from one table to the other.
There are five join options:
- An inner join keeps observations that appear in both tables.
- A left join keeps all observations in
x
. - A right join keeps all observations in
y
. - A full join keeps all observations in
x
and all observations iny
. - An anti join keeps all observations in
x
that do not have a match iny
.
The most commonly used join is the left_join()
: you use this whenever you look up additional data from another table, because it preserves the original observations even when there is not a match. For example, a left_join()
on x
and y
pulls in variables from y
while preserving all the observations in x
.
Let’s say we want to combine the countries
and states
tables we created earlier.
counties_states <- counties %>%
left_join(states, by = "state")
counties_states
#> county state county_pop state_pop region
#> 1 36037 NY 3817735 43320903 1
#> 2 36038 NY 422999 43320903 1
#> 3 36039 NY 324920 43320903 1
#> 4 36040 NY 143432 43320903 1
#> 5 37001 VA 3228290 7173000 3
#> 6 37002 VA 449499 7173000 3
#> 7 37003 VA 383888 7173000 3
Notice there are two new columns: state_pop
and region
.
The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
11.4 Defining Keys
In the example above, the two tables were joined by a single variable, and that variable has the same name in both tables. That constraint was encoded by by = "key"
.
You can use other values for by
to connect the tables in other ways:
- The default,
by = NULL
, uses all variables that appear in both tables, what we might call a “natural join.”
For example, let’s say we wanted to add a column to the gapminder
dataset that encodes the regime type of each country-year observation. We will get that data from the Polity IV dataset.
gap <- gapminder
polity <- read.csv("data/polity_sub.csv", stringsAsFactors = F)
head(polity)
#> country year polity2
#> 1 Afghanistan 1800 -6
#> 2 Afghanistan 1801 -6
#> 3 Afghanistan 1802 -6
#> 4 Afghanistan 1803 -6
#> 5 Afghanistan 1804 -6
#> 6 Afghanistan 1805 -6
We are now ready to join the tables. The common keys between them are country
and year
:
gap1 <- gapminder %>%
left_join(polity)
#> Joining, by = c("country", "year")
head(gap1)
#> # A tibble: 6 x 7
#> country continent year lifeExp pop gdpPercap polity2
#> <chr> <fct> <int> <dbl> <int> <dbl> <int>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. -10
#> 2 Afghanistan Asia 1957 30.3 9240934 821. -10
#> 3 Afghanistan Asia 1962 32.0 10267083 853. -10
#> 4 Afghanistan Asia 1967 34.0 11537966 836. -7
#> 5 Afghanistan Asia 1972 36.1 13079460 740. -7
#> 6 Afghanistan Asia 1977 38.4 14880372 786. -7
A character vector,
by = c("x", "y")
. This is like a natural join, but it uses only some of the common variables.A named character vector:
by = c("a" = "b")
. This will match variablea
in tablex
to variableb
in tabley
. The variables fromx
will be used in the output.
For example, let’s add another variable to our gapminder
dataset – physical integrity rights – from the CIRI dataset.
ciri <- read.csv("data/ciri_sub.csv", stringsAsFactors = F)
head(ciri)
#> CTRY YEAR PHYSINT
#> 1 Afghanistan 1981 0
#> 2 Afghanistan 1982 0
#> 3 Afghanistan 1983 0
#> 4 Afghanistan 1984 0
#> 5 Afghanistan 1985 0
#> 6 Afghanistan 1986 0
Both datasets have country and year columns, but they are named differently.
gap2 <- gap1 %>%
left_join(ciri, by = c("country" = "CTRY", "year" = "YEAR"))
head(gap2)
#> # A tibble: 6 x 8
#> country continent year lifeExp pop gdpPercap polity2 PHYSINT
#> <chr> <fct> <int> <dbl> <int> <dbl> <int> <int>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. -10 NA
#> 2 Afghanistan Asia 1957 30.3 9240934 821. -10 NA
#> 3 Afghanistan Asia 1962 32.0 10267083 853. -10 NA
#> 4 Afghanistan Asia 1967 34.0 11537966 836. -7 NA
#> 5 Afghanistan Asia 1972 36.1 13079460 740. -7 NA
#> 6 Afghanistan Asia 1977 38.4 14880372 786. -7 NA
Notice that PHYSINT
is NA
in the first 6 rows because the ciri
dataset does not contain observations for Afghanistan in these years. But since we used left_join()
, all observations in gapminder
were preserved.
We can see some values for PHYSINT
if we peak at the bottom of the dataset:
tail(gap2)
#> # A tibble: 6 x 8
#> country continent year lifeExp pop gdpPercap polity2 PHYSINT
#> <chr> <fct> <int> <dbl> <int> <dbl> <int> <int>
#> 1 Zimbabwe Africa 1982 60.4 7636524 789. 4 5
#> 2 Zimbabwe Africa 1987 62.4 9216418 706. -6 5
#> 3 Zimbabwe Africa 1992 60.4 10704340 693. -6 5
#> 4 Zimbabwe Africa 1997 46.8 11404948 792. -6 6
#> 5 Zimbabwe Africa 2002 40.0 11926563 672. -4 2
#> 6 Zimbabwe Africa 2007 43.5 12311143 470. -4 1
11.5 Duplicate Keys
So far we have assumed that the keys are unique, but that is not always the case. For example,
x <- data.frame(key = c(1, 2),
val_y = c("x1", "x2"))
y <- data.frame(key = c(1, 2, 2, 1),
val_x = c("y1", "y2", "y3", "y4"))
left_join(x, y, by = "key")
#> key val_y val_x
#> 1 1 x1 y1
#> 2 1 x1 y4
#> 3 2 x2 y2
#> 4 2 x2 y3
Notice that this can sometimes cause unintended duplicates.
11.6 Challenges
Challenge 1.
Merge the Polity IV and CIRI datasets, keeping all observations in Polity IV. Save this merged dataframe as p1
. How many observations does p1
have? Why?
Challenge 2.
Merge the gap1
dataset we created above with the ciri
dataset, this time keeping all observations in ciri
. Save this as gap2
. How many observations does it have? What is the major problem with merging the datasets this way?
Acknowledgements
This page is in part derived from the following sources: