Chapter 20 Assignments

20.1 Assignment 1 Solutions

  • Assigned: Oct 3, 2019.
  • Due: Oct 10, 2019 at 12:29pm.

For this assignment, you will confirm that everything is installed and set up correctly and that you understand how to interact with RStudio and R Markdown.

Your answers (to this assignment only) will be posted on our course website.

1. Using R Markdown

In the space below, insert a picture of yourself, and complete the following information:

  1. Name: Daenerys Targaryen
  2. Department and degree program: Queen of the Andals and the First Men, Protector of the Seven Kingdoms, the Mother of Dragons, the Khaleesi of the Great Grass Sea, the Unburnt, the Breaker of Chains.
  3. Year in the program: First.
  4. One-sentence description of academic interests: I am interested in slavery, intercontinental conflict, and pyrology.
  5. Some non-academic interests: Dragons, Jon Snow, eating raw hearts.
  6. R version installed on your computer (Open a command line window (‘terminal’ or, on Windows, ‘git bash’) and enter the command R --version): 3.6.1
  7. RStudio version installed on your computer (Open RStudio and, in the navigation menu, click RStudio > About RStudio): 1.1.456
  8. Primary computer operating system (Mac OS, Windows, Linux, etc): Mac OS 10.13.6.
  9. Programming experience (How would you describe your previous programming experience?): None.

2. Checking packages

Create an R chunk below, where you load the tidyverse library.

library(tidyverse)

3. Knit and submit.

Knit the R Markdown file to PDF. Submit BOTH the .Rmd file and the PDF file to Canvas.

If you get an error when trying to knit, read the error message and make sure that your R code is correct. If that doesn’t work, confirm that you’ve correctly installed the requisite packages (knitr, rmarkdown). If you still can’t get it to work, paste the error on Canvas.

20.2 Assignment 2 Solutions

  • Assigned: Oct 10, 2019.
  • Due: Oct 17, 2019 at 12:29pm.

For this assignment, you’ll use what you know about R syntax and data structures to perform some common data operations.

1. Basics

1.1 Fix the following syntax errors. Enter your corrected code in the second chunk.

# 1
states <- ("California", "Illinois", "Ohio")

# 2
countries <- c("Iran", "Indonesia," "India", "Italy")

# 3
df <- data.frame(age = c(21, 66, 35)
                 party = c('rep', 'dem', 'rep'))

# 4
my-vector <- c("apples", "oranges", "kiwis")

# 5
artists <- list(names = c("Picasso", "Kahlo",
                genre = c("cubist", "surrealist"))
# PUT YOUR CORRECTED CODE HERE

# 1
states <- c("California", "Illinois", "Ohio")

# 2
countries <- c("Iran", "Indonesia", "India", "Italy")

# 3
df <- data.frame(age = c(21, 66, 35),
                 party = c('rep', 'dem', 'rep'))

# 4
my_vector <- c("apples", "oranges", "kiwis")

# 5
artists <- list(names = c("Picasso", "Kahlo"),
                genre = c("cubist", "surrealist"))

1.2 How many arguments does the order() function take? What are they?
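
As of R 3.6, order() takes four formal arguments: ... (the vectors to sort by), na.last, decreasing, and method. One way to check (a quick sketch, not part of the original solution) is to inspect the function's formals directly:

# Not part of the original solution: list order()'s formal arguments
args(order)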

2. Vectors and Lists

2.1 Create three vectors:

  • a character vector, titles, that contains the names of 3 of your favorite movies

  • a numeric vector, year, that contains the years in which those movies were produced

  • a logical (boolean) vector, bechdel, that is TRUE or FALSE according to whether each of those movies passes the Bechdel test

titles <- c("Dog Day Afternoon", "The Graduate", "Breakfast Club")
year <- c(1975, 1967, 1985)
bechdel <- c(TRUE, FALSE, TRUE)

2.2 Put those three vectors in a list, called movies.

movies <- list(titles, year, bechdel)

2.3 Print the structure of the list movies.

str(movies)
#> List of 3
#>  $ : chr [1:3] "Dog Day Afternoon" "The Graduate" "Breakfast Club"
#>  $ : num [1:3] 1975 1967 1985
#>  $ : logi [1:3] TRUE FALSE TRUE

3. Factors

3.1 Here’s some code that prints a simple barplot:

f <- factor(c("low","medium","high","medium","high","medium"))
table(f)
#> f
#>   high    low medium 
#>      2      1      3
barplot(table(f))

How would you relevel f to be in the correct order?

f <- factor(f, levels = c("low", "medium", "high"))

# Test your code
barplot(table(f))

4. Dataframes

4.1 Coerce the movies object you made above from a list to a dataframe. Call it movies_df.

movies_df <- as.data.frame(movies)

4.2 Add appropriate column names to movies_df.

names(movies_df) <- c("film", "year", "bechdel")

20.3 Assignment 3 Solutions

  • Assigned: Oct 17, 2019.
  • Due: Oct 24, 2019 at 12:29pm.

For this assignment, you’ll be working with some real-life data! I’ve prepared for you a basic country-year dataset, with the following variables:

  • Country name
  • Country numerical code
  • Year
  • UN Ideal point
  • Polity2 score of regime type (from Polity IV)
  • Physical Integrity Rights score (from CIRI dataset)
  • Speech Rights score (from CIRI)
  • GDP per capita (from World Bank)
  • Population (from World Bank)
  • Political Terror Scale using Amnesty International reports (from Political Terror Scale project)
  • Composite Index of Military Capabilities (Correlates of War)
  • Region

1. R Projects and Importing

1.1 Using getwd(), print your working directory below.

getwd()
#> [1] "/Users/rochelleterman/Desktop/course-site"

1.2 Read country-year.csv into R, using a relative path. Store it in an object called dat.

dat <- read.csv("data/country-year.csv")

2. Dimensions and Names

2.1 How many rows and columns are in the dataset?

dim(dat)
#> [1] 6416   13

2.2 Print the column names.

names(dat)
#>  [1] "X"          "year"       "ccode"      "country"    "idealpoint"
#>  [6] "polity2"    "physint"    "speech"     "gdp.pc.wdi" "pop.wdi"   
#> [11] "amnesty"    "cinc"       "region"

2.3 Remove the X column from the dataset.

dat$X <- NULL

2.4 One of the variables is called “gdp.pc.wdi”. This stands for “Gross Domestic Product per Capita, from the World Bank Development Indicators”. Change this variable name in the dataset from “gdp.pc.wdi” to “GDP”.

names(dat)[8] <- "GDP"
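
Renaming by position assumes that gdp.pc.wdi is the 8th column once X has been dropped; a name-based version (a sketch, same result) is less fragile:

# Rename by matching the column name instead of its position
names(dat)[names(dat) == "gdp.pc.wdi"] <- "GDP"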

3. Summarizing

3.1 How many years are covered in the dataset?

length(unique(dat$year))
#> [1] 36

3.2 How many unique countries are covered in the dataset?

length(unique(dat$country))
#> [1] 196

3.3 What is the range of polity2? How many NAs are in this column?

summary(dat$polity2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>     -10      -6       5       2       9      10    1214
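
The range and the NA count can also be pulled out directly rather than read off the summary (a minimal sketch):

range(dat$polity2, na.rm = TRUE)  # smallest and largest polity2 scores
#> [1] -10  10
sum(is.na(dat$polity2))           # number of missing values
#> [1] 1214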

4. Subsetting

4.1 Subset dat so that it returns the third column AS A VECTOR (Do not print the object; store it in a variable.)

sub <- dat[[3]]
#OR
sub <- dat[,3]
#OR
names(dat)[3]
#> [1] "country"
sub <- dat$country

4.2 Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1980
dat[dat$year = 1980,]

# Corrected
dat[dat$year == 1980,]
  2. Extract all columns except 1 through 4
dat[,-1:4]

# Corrected
dat[,-c(1:4)]
  3. Extract the rows where the polity2 score is greater than 5
dat[dat$polity2 > 5]

# corrected
dat[dat$polity2 > 5, ]
  4. Extract the first row, and the third and fourth columns (country and idealpoint).
dat[1, 3, 4]

# Corrected
dat[1, c(3, 4)]
  5. Extract rows that contain information for the years 2002 and 2007
dat[dat$year == 2002 | 2007,]

# Corrected
dat[dat$year == 2002 | dat$year == 2007,]

4.3 What does summary(dat$polity2[dat$region =="Africa"]) do? Explain below in your own words.

It subsets the polity2 column to the observations whose region is Africa, then returns summary statistics (minimum, quartiles, mean, maximum, and the number of NAs) for those values.
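
In other words, it is equivalent to this two-step version (a sketch):

# Subset polity2 to observations where region is "Africa", then summarize
africa_polity <- dat$polity2[dat$region == "Africa"]
summary(africa_polity)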

4.4 Subset the data to include only observations from years 1990-2000 (inclusive). Put the subsetted data in a new variable called dat.1990.2000

dat.1990.2000 <- dat[dat$year >= 1990 & dat$year<=2000,]

4.5 Using the mean() function, tell me the average GDP of observations from 1990 to 2000.

mean(dat.1990.2000$GDP, na.rm = T)
#> [1] 6611

4.6 You just calculated the average GDP for years 1990-2000. Now calculate the average GDP from 2001 onwards. Tell me how much larger it is (in percentage).

dat.2001.plus <- dat[dat$year > 2000,]
mean1 <- mean(dat.1990.2000$GDP, na.rm = T)
mean2 <- mean(dat.2001.plus$GDP, na.rm = T)
(mean2 - mean1) / mean1
#> [1] 0.825
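
The value above is a proportion; multiplying by 100 expresses it as a percentage (roughly 82.5% here):

# Same difference expressed as a percentage
(mean2 - mean1) / mean1 * 100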

4.7 Look up the helpfile for the function is.na(). Using this function, replace all the NA values in the polity2 column of dat with 0.

?is.na
dat$polity2[is.na(dat$polity2)] <- 0
summary(dat$polity2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -10.00   -5.00    0.00    1.46    8.00   10.00

20.4 Assignment 4 Solutions

  • Assigned: Oct 24, 2019
  • Due: Nov 5, 2019 at 12:29pm.

For this problem set, we’ll be working with the country-year data introduced in the last assignment. As a reminder, the dataset contains the following variables:

  • year: Year.
  • ccode: Country numerical code.
  • country: Country name.
  • idealpoint: UN Ideal point.
  • polity2: Polity2 score of regime type (from Polity IV).
  • physint: Physical Integrity Rights score (from CIRI dataset).
  • speech: Speech Rights score (from CIRI).
  • gdp.pc.wdi: GDP per capita (from World Bank).
  • pop.wdi: Population (from World Bank).
  • amnesty: Political Terror Scale using Amnesty International reports (from Political Terror Scale project).
  • cinc: Composite Index of Military Capabilities (Correlates of War).
  • region: Geographic region.

We’ll be merging this country_year data with new data about U.S. news coverage of women around the world (excluding the United States). In this new dataset, the unit of observation is article. That is, each row represents an individual article, with columns for:

  • publication: NYT or Washington Post.
  • year: Year article was published.
  • title: Title of the article.
  • country: Country the article is mainly about.
  • region: Region where country is located.
  • ccode: Numerical code for country.

1. Loading, subsetting, summarizing

1.1 Load the csv found in data/articles.csv into R. Be sure to set stringsAsFactors to FALSE. Store the data-frame to an object called articles and tell me the variable names.

library(tidyverse)
library(knitr)   # for kable(), used below
articles <- read.csv("data/articles.csv", stringsAsFactors = F)
names(articles)
#> [1] "publication" "year"        "title"       "country"     "region"     
#> [6] "ccode"

1.2 How many countries are covered in the dataset?

length(unique(articles$country))
#> [1] 147

1.3 The variable ccode reports a numerical ID corresponding to a given country. Print the names of the country or countries without a ccode (i.e. those countries where the ccode is NA.)

unique(articles$country[is.na(articles$ccode)])
#> [1] "Palestine"

1.4 Remove all articles where the ccode variable is NA. How many observations are left?

articles_no_na <- articles[!is.na(articles$ccode), ]
nrow(articles_no_na)
#> [1] 4494

2. Counting Frequencies and Merging

2.1 Create a new data frame called articles_country_year that tells us the number of articles per ccode (i.e. country code), per year.

The final data frame articles_country_year should contain three columns: ccode, year, and number_articles.

Print the first 6 rows of the articles_country_year.

Hint: The count function – part of the plyr package – might be helpful.

articles_country_year <- articles_no_na %>% 
  dplyr::count(ccode, year) %>%
  select(ccode, year, number_articles = n)

kable(head(articles_country_year))
| ccode | year | number_articles |
|------:|-----:|----------------:|
|    20 | 1980 |               4 |
|    20 | 1981 |               9 |
|    20 | 1982 |               4 |
|    20 | 1983 |               4 |
|    20 | 1984 |               6 |
|    20 | 1985 |               1 |

2.2. Load data/country-year.csv (this is the country-year data we worked with during the last assignment.)

country_year <- read.csv("data/country-year.csv", stringsAsFactors = F)

2.3 Subset country_year such that it has the same year range as articles_country_year.

range(articles_country_year$year)
#> [1] 1980 2014
range(country_year$year)
#> [1] 1979 2014

country_year <- country_year %>% 
  filter(year > 1979)

2.4 Merge (i.e. join) articles_country_year and country_year into a new dataframe called merged.

When you’re done, merged should have all the rows and columns of the country_year dataset, along with a new column called number_articles.

Print the first 6 rows of this new dataframe merged.

merged <- country_year %>% 
  left_join(articles_country_year)
#> Joining, by = c("year", "ccode")

kable(head(merged))
| X   | year | ccode | country              | idealpoint | polity2 | physint | speech | gdp.pc.wdi | pop.wdi  | amnesty | cinc  | region | number_articles |
|-----|------|-------|----------------------|------------|---------|---------|--------|------------|----------|---------|-------|--------|-----------------|
| 156 | 1980 | 700   | Afghanistan          | -1.560     | NA      | NA      | NA     | 276        | 13180431 | 5       | 0.001 | MENA   | NA              |
| 157 | 1980 | 540   | Angola               | -1.176     | -7      | NA      | NA     | NA         | 7637141  | 3       | 0.001 | Africa | NA              |
| 158 | 1980 | 339   | Albania              | -1.564     | -9      | NA      | NA     | NA         | 2671997  | 3       | 0.001 | EECA   | NA              |
| 159 | 1980 | 696   | United Arab Emirates | -0.315     | -8      | NA      | NA     | 42962      | 1014825  | NA      | 0.001 | MENA   | NA              |
| 160 | 1980 | 160   | Argentina            | 0.128      | -9      | NA      | NA     | 2737       | 28120135 | 5       | 0.007 | LA     | NA              |
| 161 | 1980 | 900   | Australia            | 1.423      | 10      | NA      | NA     | 10188      | 14692000 | NA      | 0.007 | West   | 3               |
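
The message above shows that dplyr matched on year and ccode automatically. Spelling the keys out (a sketch, same result) silences the message and makes the join explicit:

merged <- country_year %>% 
  left_join(articles_country_year, by = c("year", "ccode"))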

2.5 In merged, replace all instances of NA in the number_articles column to 0.

# solution 1 - base R
merged$number_articles[is.na(merged$number_articles)] <- 0

# solution 2 - tidyr
merged$number_articles <- replace_na(merged$number_articles, 0)

# solution 3 - dplyr
merged <- merged %>% 
  mutate(number_articles = ifelse(is.na(number_articles), 0, number_articles))

# test
summary(merged$number_articles)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     0.0     0.0     0.0     0.7     0.0    99.0

2.6 Which country-year observation has the greatest number of articles? Write code that prints the year, country name, and number of articles for this observation.

# solution #1 -- base R
merged[which.max(merged$number_articles),c("year", "country", "number_articles")]
#>      year country number_articles
#> 5950 2013   India              99

# solution #2 -- tidyverse
merged %>% 
  top_n(1, number_articles) %>%
  select(year, country, number_articles)
#>   year country number_articles
#> 1 2013   India              99

3. Group-wise Operations

3.1 Using the merged data and our split-apply-combine strategies, print the total number of articles about women per region.

n_region <- merged %>%
  group_by(region) %>%
  summarise(count = sum(number_articles, na.rm = T))

n_region
#> # A tibble: 7 x 2
#>   region count
#>   <chr>  <dbl>
#> 1 Africa   464
#> 2 Asia    1288
#> 3 EECA     251
#> 4 LA       328
#> 5 MENA     940
#> 6 West    1159
#> # … with 1 more row
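
The tibble print method truncates the output above; to see all seven regions, one can print with n = Inf:

# Show every row of the regional summary
print(n_region, n = Inf)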

4. Long v. wide formats

4.1 Create a piped operation on merged that does the following:

  1. Subsets the dataframe to select the year, country, and number_articles columns.
  2. Filters the dataframe to select only observations in the MENA region.
  3. Spreads the dataframe so that each country is a column and the cells represent number_articles.

Print the first 6 rows of this transformed data frame.

wide <- merged %>%
  filter(region == "MENA") %>%
  select(year, country, number_articles) %>%
  spread(country, number_articles, fill = 0)

kable(head(wide))
year Afghanistan Algeria Bahrain Egypt Iran Iraq Israel Jordan Kuwait Lebanon Libya Morocco Oman Palestine Qatar Saudi Arabia South Sudan Sudan Syria Tunisia Turkey United Arab Emirates Yemen Arab Republic Yemen People’s Republic
1980 0 0 0 0 4 0 9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
1981 1 0 0 2 1 0 2 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
1982 0 0 0 1 2 1 6 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
1983 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1984 0 0 0 3 0 0 3 0 1 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0
1985 0 0 0 6 0 0 1 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4.2 Transform the dataset you created above back into long format, with three variables: year, country, and number_articles

Print the first 6 rows of this transformed data frame.

long <- wide %>% 
  gather(country, number_articles, -year)

kable(head(long))
| year | country     | number_articles |
|-----:|:------------|----------------:|
| 1980 | Afghanistan |               0 |
| 1981 | Afghanistan |               1 |
| 1982 | Afghanistan |               0 |
| 1983 | Afghanistan |               0 |
| 1984 | Afghanistan |               0 |
| 1985 | Afghanistan |               0 |
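
For reference, the same reshaping can be written with tidyr's newer verbs, pivot_wider() and pivot_longer() (a sketch; assumes tidyr >= 1.1, where values_fill accepts a scalar):

# Wide: one column per country, cells filled with number_articles
wide2 <- merged %>%
  filter(region == "MENA") %>%
  select(year, country, number_articles) %>%
  pivot_wider(names_from = country, values_from = number_articles, values_fill = 0)

# Long: back to year / country / number_articles
long2 <- wide2 %>%
  pivot_longer(-year, names_to = "country", values_to = "number_articles")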

20.4.0.1 Extra Credit

This question is not required. But if you want an extra challenge….

Transform the country_year data into an undirected dyadic dataset. Here, the unit of observation should be the dyad-year, with six columns:

  1. ccode_1: Country 1 ccode
  2. country_1: Country 1 name
  3. ccode_2: Country 2 ccode
  4. country_2: Country 2 name
  5. year: Year of observation
  6. gdp_diff: Absolute difference of gdp between dyad.

This should be an undirected dyadic dataset, meaning USA-Canada-1980 is the same as Canada-USA-1980, and we shouldn’t have duplicate rows for the same dyad.

Try to do it all in 1 piped sequence. Then tell me the dyad-year with the greatest wealth disparity.

dyad <- country_year %>% 
  expand(ccode_1=ccode, ccode_2=ccode) %>% # make two columns of states
  filter(ccode_1 > ccode_2) %>% # from directed to undirected dyads
  left_join(., country_year, by=c("ccode_1"="ccode")) %>% # get state1 info
  left_join(., country_year, by=c("year", "ccode_2"="ccode")) %>% # get state2 info 
  mutate(gdp_diff = abs(gdp.pc.wdi.x - gdp.pc.wdi.y)) %>% # take absolute difference in gdp
  select(ccode_1, country_1 = country.x, ccode_2, country_2 = country.y, year, gdp_diff) %>%
  arrange(desc(gdp_diff))

kable(head(dyad))
| ccode_1 | country_1                        | ccode_2 | country_2 | year | gdp_diff |
|--------:|:---------------------------------|--------:|:----------|-----:|---------:|
|     516 | Burundi                          |     221 | Monaco    | 2008 |   193705 |
|     450 | Liberia                          |     221 | Monaco    | 2008 |   193661 |
|     531 | Eritrea                          |     221 | Monaco    | 2008 |   193636 |
|     553 | Malawi                           |     221 | Monaco    | 2008 |   193590 |
|     490 | Democratic Republic of the Congo |     221 | Monaco    | 2008 |   193566 |
|     530 | Ethiopia                         |     221 | Monaco    | 2008 |   193565 |

20.5 Assignment 5 Solutions

  • Assigned: Nov 5, 2019
  • Due: Nov 12, 2019 at 12:29pm.

For this problem set, we’ll be working with the country-year data introduced in the last assignment. As a reminder, the dataset contains the following variables:

  • year: Year.
  • ccode: Country numerical code.
  • country: Country name.
  • idealpoint: UN Ideal point.
  • polity2: Polity2 score of regime type (from Polity IV).
  • physint: Physical Integrity Rights score (from CIRI dataset).
  • speech: Speech Rights score (from CIRI).
  • gdp.pc.wdi: GDP per capita (from World Bank).
  • pop.wdi: Population (from World Bank).
  • amnesty: Political Terror Scale using Amnesty International reports (from Political Terror Scale project).
  • cinc: Composite Index of Military Capabilities (Correlates of War).
  • region: Geographic region.

1. Getting Started

1.1 Read data/country-year.csv into R, using a relative path. Store it in an object called dat.

library(tidyverse)
library(stargazer)
dat <- read.csv("Data/country-year.csv", stringsAsFactors = F)

2. Plotting

2.1 Write code that reproduces “plots/Plot_1.jpeg”. (No need to write the file.)

# Density of population
d <- density(log(dat$pop.wdi), na.rm = T, bw = .2)
plot(d, main = "Summary of Population (Logged)") 
abline(h = max(d$y), v = 16.09, lty = 2) 
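
The vertical line is hard-coded at 16.09; that appears to be the x-position of the density peak, which can be computed rather than typed (a sketch):

# x-value at which the estimated density reaches its maximum
d$x[which.max(d$y)]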

2.2 Write code that reproduces “plots/Plot_2.jpeg”. (No need to write the file.)

# get summary data
country_means <- dat %>%
  filter(!is.na(region)) %>%
  group_by(country) %>%
  summarise(gdp = mean(gdp.pc.wdi, na.rm = T),
            polity = mean(polity2, na.rm = T),
            cinc = mean(cinc, na.rm = T),
            region = region[1])

# plot
ggplot(country_means, aes(x = polity, y = log10(gdp))) +
  geom_point(aes(color = region)) +
  #scale_y_log10() +
  geom_smooth(color="red", fill="red") + 
  ylab("Mean GDP (Logged) ") +
  xlab("Mean Polity Score") +
  ggtitle("Average Polity Score by GDP, 1979-2014") 
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#> Warning: Removed 33 rows containing non-finite values (stat_smooth).
#> Warning: Removed 33 rows containing missing values (geom_point).

2.3 Write code that reproduces “plots/Plot_3.jpeg”. (No need to write the file.)

Hint: The fall-inspired colors are #771C19, #E25033, #F27314, #F8A31B

# military capabilities
top_cinc <- country_means %>%
  top_n(10, cinc)

# Fall theme
rhg_cols = c("#771C19","#E25033","#F27314", "#F8A31B")

# plot
ggplot(top_cinc, aes(reorder(country, cinc), cinc, fill = region)) +
  geom_col() +
  theme(axis.text.x=element_text(size = 7, angle=45, hjust=1)) +
  ylab("Composite Index of Military Capabilities") +
  xlab("Country") +
  ggtitle("Top 10 Military Capabilities, in Fall Colors") +
  scale_fill_manual(values = rhg_cols)

2.4 Write code that reproduces “plots/Plot_4.jpeg”. (No need to write the file.)

# prepare data
year_means <- dat %>%
  filter(!is.na(region)) %>%
  group_by(year, region) %>%
  summarise(gdp = mean(gdp.pc.wdi, na.rm = T),
            polity = mean(polity2, na.rm = T),
            physint = mean(physint, na.rm = T)) 

# plot
ggplot(year_means, aes(x = year, y = polity, color = region)) +
  geom_line(size=2) +
  ylab("Average Polity Score") +
  xlab("Year") +
  ggtitle("Average Polity Score Over Time")
#> Warning: Removed 6 rows containing missing values (geom_path).

3. Models

3.1 Write code that reproduces the model summary table “reg_table.txt” (and writes the file).

mod.1 <- lm(physint ~ polity2, data = dat)
mod.2 <- lm(physint ~ polity2 + log(gdp.pc.wdi), data = dat)
mod.3 <- lm(physint ~ polity2 + log(gdp.pc.wdi) + region, data = dat)

stargazer(mod.1, mod.2, mod.3, title = "Regression Results", type = "text", 
          covariate.labels  = c("Polity2", "GDP per capita, logged", "Asia", "Eastern Europe", "Latin America", "MENA", "West", "Constant"), 
          dep.var.labels = "DV: Physical Integrity",
          omit = "Constant", 
          keep.stat="n", style = "ajps",
          out = "reg_table.txt")
#> 
#> Regression Results
#> --------------------------------------------------
#>                          DV: Physical Integrity   
#>                        Model 1  Model 2   Model 3 
#> --------------------------------------------------
#> Polity2                0.124*** 0.064*** 0.033*** 
#>                        (0.004)  (0.005)   (0.005) 
#> GDP per capita, logged          0.503*** 0.493*** 
#>                                 (0.021)   (0.028) 
#> Asia                                     -1.170***
#>                                           (0.101) 
#> Eastern Europe                            -0.033  
#>                                           (0.109) 
#> Latin America                            -0.769***
#>                                           (0.097) 
#> MENA                                     -1.460***
#>                                           (0.116) 
#> West                                     0.604*** 
#>                                           (0.138) 
#> N                        4336     4141     4141   
#> --------------------------------------------------
#> ***p < .01; **p < .05; *p < .1

20.6 Assignment 6 Solutions

  • Assigned: Nov 12, 2019
  • Due: Nov 21, 2019 at 12:29pm.

In this unit, we’ll use R to turn a bunch of loose text documents into a real-life database. (Note: This database was created for a project by R. Terman and E. Voeten, and was processed using much the same process as you’ll be learning here.)

The problem set will leverage your new R skills, especially working with strings; writing functions; using iteration; and thinking like a programmer.

Important: The code has been scaffolded for you, meaning that you have to fill in the blanks. Once you’re ready to submit the assignment, you have to remove the eval = F from the R chunk header. If you don’t, the chunk won’t execute when you knit the Rmarkdown file.

About the Data

We’ll be creating a database from Universal Periodic Review outcome reports.

The Universal Periodic Review (UPR) is a process run by the United Nations Human Rights Council, which involves a periodic review of the human rights records of all 193 UN Member States.

Reviews take place through an interactive discussion between the State under review and other UN Member States. During this discussion any UN Member State can pose questions, comments and/or make recommendations to the States under review. States under review can then respond, stating which recommendations they reject, accept, will consider, etc. Reports are then drawn up detailing this discussion.

We will be analyzing outcome reports from the 2014 Universal Periodic Reviews of 42 countries, which we retrieved here and formatted as text documents.

The goal is to convert these semi-structured texts to a tabular dataset of recommendations with the following variables:

  1. Text of recommendation (text)
  2. Country to which the recommendation is directed (to)
  3. Country that is making the recommendation (from)
  4. The year when the review took place (year)
  5. The response to the recommendation, i.e. whether the reviewed country rejects, accepts, etc (decision)

In other words, we want to turn each semi-structured report into a tidy table of recommendations, one row per recommendation.

Take a few minutes to look at the files, which are located in data/txts, and get a sense for how they’re structured.

Then run the following code to get started.

library(readtext)
library(stringr)
library(tidyverse)

# read all texts
all_texts <- readtext("data/txts/")

1. Extract One Document

We’re going to start off working with just one document. We’ll then use that code to iterate over all the documents.

task:

  • Extract one document.
  • Collect information on the country and year.
  • Extract the section we’re interested in.
  • Turn each line (i.e. recommendation) into an item in a vector.

Let’s start off working with cotedivoire2014.txt (the third file).

text <- all_texts$text[3]
file_name <- all_texts$doc_id[3]

1.1 Assign country and year variables.

You’ll notice that the file_name consists of the name of the reviewed country and the year. Slice file_name to create 2 new variables, country, and year.

Be careful! Remember that we are going to apply this to the other file names later. However you slice “cotedivoire2014.txt”, it needs to work for the other files in the data/txts directory.

country <- str_sub(file_name, 1, -9)
year <-str_sub(file_name, -8, -5)

1.2 Get the Recommendations Section

Note that the section we want starts with "II. Conclusions and/or recommendations\n". What function would you use to get everything after this substring? Fill in the blank below and assign the value to a new variable called rec_text.

sections = str_split(text, "II. Conclusions and/or recommendations\n")[[1]]
rec_text = sections[2] # get second item -- everything after.

1.3 Turn it into a vector

Using a stringr function, turn the string above into a vector of lines, and store it in a variable called recs. Remember that a new line is represented by \n.

recs <- str_split(rec_text, "\n")[[1]]
recs[1:5]
#> [1] "127. The recommendations listed below enjoy the support of C™te dÕIvoire: "                                                                                                  
#> [2] "127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); "
#> [3] "127.2 Make efforts towards the ratification of the OP-CAT (Chile); "                                                                                                         
#> [4] "127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); "      
#> [5] "127.4 Accede to the OP-CAT as soon as possible (Uruguay); "

2. Chunk Recommendations

These texts have three sections each:

  1. The first section contains those recommendations the country supports.
  2. The second section contains recommendations the country will examine.
  3. The third section contains recommendations the country explicitly rejects.

task:

  • parse recommendations into three piles, corresponding to accepted recs, examined recs, and rejected recs.
  • combine these piles back into a dataframe, containing the text of the recommendation and its corresponding decision.
  • add additional columns for the country under review (to) and the year.

2.1: Find the paragraph numbers

Each section starts with a main paragraph number (e.g. 127). The individual recommendations are then noted as subparagraphs (e.g. 127.1, 127.2 etc.).

All the accepted recommendations have the same main paragraph number (127). Next come the recommendations which will be examined, whose main paragraph number is just the next integer (128). After that are the rejected recommendations, with the next integer as their main paragraph number (129).

We can’t know the paragraph numbers beforehand, because each file is different. But we can leverage our knowledge of the structure of the documents to get them.

Fill in the blanks below to create 3 variables containing the 3 paragraph numbers.

para1 = str_extract(recs[1], "\\d+")
para1 = as.numeric(para1)
para2 = para1 + 1
para3 = para2 + 1

2.2 Parse the text

Now create 3 new vectors: accept_recs, examine_recs, reject_recs. Each vector should contain the recommendations assigned to its corresponding section.

Hint: How do you know if a line belongs to a section? It starts with the main paragraph number for that section, so use the str_starts() function.

# subset recommendations
accept_recs = recs[str_starts(recs, as.character(para1))]
# accept_recs = str_subset(recs, str_c("^", as.character(para1)))
examine_recs = recs[str_starts(recs, as.character(para2))]
reject_recs = recs[str_starts(recs, as.character(para3))]

# remove the first item from each list, which just demarcates the sections
accept_recs = accept_recs[-1]
examine_recs = examine_recs[-1]
reject_recs = reject_recs[-1]

2.3 Transform to Dataframe

The following code combines the three vectors back into a dataframe with two columns: text (the text of the recommendation) and decision (whether the recommendation was accepted, examined, or rejected).

recs_df <- list(accept = accept_recs,
                    examine = examine_recs,
                    reject = reject_recs)

recs_df <- stack(recs_df) %>%
  select("text" = values, "decision" = ind)

Your job is to add 2 new columns to this dataframe: to should contain the country under review, and year should contain the year under review. Note that we already created these variables above, in question 1.1

recs_df <- recs_df %>%
  mutate(to = country,
         year = year)

head(recs_df)
#>                                                                                                                                                                           text
#> 1 127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); 
#> 2                                                                                                          127.2 Make efforts towards the ratification of the OP-CAT (Chile); 
#> 3       127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); 
#> 4                                                                                                                   127.4 Accede to the OP-CAT as soon as possible (Uruguay); 
#> 5                                                                                                                             127.5 Consider ratifying OP-CAT (Burkina Faso); 
#> 6                             127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); 
#>   decision          to year
#> 1   accept cotedivoire 2014
#> 2   accept cotedivoire 2014
#> 3   accept cotedivoire 2014
#> 4   accept cotedivoire 2014
#> 5   accept cotedivoire 2014
#> 6   accept cotedivoire 2014

3. Get Recommending Country

task:

  • extract the substring representing the recommending country.
  • add this information to our dataframe.

3.1 Extract recommending country

Take a look at several recommendation texts to get an idea of their format.

head(recs_df$text)
#> [1] "127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); "
#> [2] "127.2 Make efforts towards the ratification of the OP-CAT (Chile); "                                                                                                         
#> [3] "127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); "      
#> [4] "127.4 Accede to the OP-CAT as soon as possible (Uruguay); "                                                                                                                  
#> [5] "127.5 Consider ratifying OP-CAT (Burkina Faso); "                                                                                                                            
#> [6] "127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); "

Notice that they’re all formatted the same way, with the recommending country in parentheses at the end.

Using your string skills, find a way to pull out the recommending country from the first recommendation (stored in first_rec below).

first_rec = recs_df$text[1]
rec_after_paran <- str_split(first_rec, "\\(")[[1]]
rec_after_paran <- tail(rec_after_paran, 1)
first_rec_country = str_split(rec_after_paran, "\\)")[[1]]
first_rec_country <- first_rec_country[1]

# this should be 'Philippines'.
first_rec_country
#> [1] "Philippines"

3.2 Create a Function

Create a function called get_country that takes an individual recommendation text and returns the recommending country.

get_country <- function(rec){
  rec_after_paran <- str_split(rec, "\\(")[[1]]
  rec_after_paran <- tail(rec_after_paran, 1) # get last item
  first_rec_country = str_split(rec_after_paran, "\\)")[[1]]
  first_rec_country <- first_rec_country[1] # get first item
  
  return(first_rec_country)
}

# test your code
get_country(first_rec)
#> [1] "Philippines"

3.3 Add from column

Using your map and dplyr skills, add a column to recs_df that contains the country issuing each recommendation.

recs_df <- recs_df %>% mutate(
  from = map_chr(text, get_country)
)

4. Repeat for all documents

We just wrote code that takes one document and turns it into a dataset!

The problem is we have 11 documents!

task

  • combine the code we’ve written so far to create a function
  • apply that function to all files to create a single dataset.

4.1 Make a function

Combine the functions that you wrote above to create a single function that takes a row number of all_texts (i.e., an integer) and returns a dataframe of fully parsed recommendations from that file.

parse_file <- function(i){
    # get filename and text
    text <- all_texts$text[i]
    file_name <- all_texts$doc_id[i]
    
    # get to country and year
    country <- str_sub(file_name, 1, -9)
    year <-str_sub(file_name, -8, -5)
    
    # get vector of recs
    sections = str_split(text, "II. Conclusions and/or recommendations\n")[[1]]
    rec_text = sections[2] # get second item -- everything after.
    recs = str_split(rec_text, "\n")[[1]]
    
    # get paragraph numbers
    para1 = str_extract(recs[1], "\\d+")
    para1 = as.numeric(para1)
    para2 = para1 + 1
    para3 = para2 + 1
    
    # chunk recommendations
    accept_recs = recs[str_starts(recs, as.character(para1))]
    examine_recs = recs[str_starts(recs, as.character(para2))]
    reject_recs = recs[str_starts(recs, as.character(para3))]
    
    # remove the first item from each list, which just demarcates the sections
    accept_recs = accept_recs[-1]
    examine_recs = examine_recs[-1]
    reject_recs = reject_recs[-1]
    
    # transform to dataframe
    recs_df <- list(accept = accept_recs,
                        examine = examine_recs,
                        reject = reject_recs)
    
    recs_df <- stack(recs_df) %>%
      select("text" = values, "decision" = ind)
    
    recs_df <- recs_df %>%
      mutate(to = country,
             year = year)
    
    # add from column
    recs_df <- recs_df %>% mutate(
      from = map_chr(text, get_country)
    )

    return(recs_df)
}
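
Before mapping over every file, it can be worth parsing a single document as a sanity check (not required by the assignment):

# Parse the third file (cotedivoire2014.txt) and peek at the result
head(parse_file(3))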

4.2 Map the function

Apply the function you created above to all rows in all_texts using your map_ skills. The final output should contain a dataframe of all the recommendations from all the files.

all_recs <- map_dfr(1:nrow(all_texts), parse_file)

4.3 Print Dimensions and Write a csv

Print the dimensions and export the full dataframe into a csv. You’re done!

dim(all_recs) # should be 1709 x 5
# write.csv(all_recs, "upr-recs.csv")

20.7 Assignment 7 Solutions

  • Assigned: Nov 21, 2019
  • Due: Dec 5, 2019 at 12:29pm.

In this week’s lecture, we introduced some tools to collect pieces of data from individual presidential documents. For this assignment, we will be looking at all documents in the database that contain the string “space exploration.” Our goals in this problem set are:

  1. To scrape all documents returned from this search query
  2. To organize this data into a dataframe and ultimately output a CSV file.

Below, I’ve given you the code for a function that takes the URL of an individual document, scrapes the information from that document, and returns that information in a list.

But this is all I will be providing for you. You must complete the rest of the task yourself. Specifically, you should:

  1. Write code that scrapes all documents, organizes the information in a dataframe, and writes a csv file.
  2. The end goal should be a dataset identical to the one I’ve provided for you in data/space.csv.
  3. Split the code up into discrete steps, each with their own corresponding Rmarkdown chunk.
  4. Document (i.e. describe) each step in clear but concise Rmarkdown prose.
  5. The final chunk should:
  • print the structure (str) of the final data frame.
  • write the dataframe to a csv file.

Good luck!

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(lubridate)

scrape_docs <- function(URL){
  doc <- read_html(URL)

  speaker <- html_nodes(doc, ".diet-title a") %>% 
    html_text()
  
  date <- html_nodes(doc, ".date-display-single") %>%
    html_text() %>%
    mdy()
  
  title <- html_nodes(doc, "h1") %>%
    html_text()
  
  text <- html_nodes(doc, "div.field-docs-content") %>%
    html_text()
  
  all_info <- list(speaker = speaker, date = date, title = title, text = text)
  
  return(all_info)
}

Solution

There are likely many ways to achieve this task. Here’s one solution:

1. Write function scrape_urls to scrape URLs of individual search results.

The following function takes the URL of a page of search results and returns a vector of URLs, each corresponding to an individual document.

scrape_urls <- function(path) {
  
  html <- read_html(path) #Download HTML of webpage
  
  links <- html_nodes(html, ".views-field-title a") %>% #select element with document URLs
                html_attr("href")
  
  return(links) #output results
}

scrape_test <- scrape_urls("https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space+exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100")

scrape_test[1:10]
#>  [1] "/documents/special-message-the-congress-relative-space-science-and-exploration"                    
#>  [2] "/documents/statement-the-president-support-the-administration-bill-relative-space-science-and"     
#>  [3] "/documents/statement-the-president-upon-signing-the-national-aeronautics-and-space-act-1958"       
#>  [4] "/documents/annual-budget-message-the-congress-fiscal-year-1960"                                    
#>  [5] "/documents/letter-t-keith-glennan-administrator-national-aeronautics-and-space-administration"     
#>  [6] "/documents/the-presidents-news-conference-225"                                                     
#>  [7] "/documents/the-presidents-news-conference-augusta-georgia-0"                                       
#>  [8] "/documents/annual-message-the-congress-the-state-the-union-6"                                      
#>  [9] "/documents/special-message-the-congress-recommending-amendments-the-national-aeronautics-and-space"
#> [10] "/documents/special-message-the-congress-transfers-from-the-department-defense-the-national"

2. Iterate over results pager to collect all URLs

scrape_urls collects all of the relative URLs from the first page of our search results (100 documents). While this is a good start, we have 4 pages of search results (310 results total) and need to collect the URLs of ALL results, from ALL result pages.

First, let’s grab the path of all 4 result pages, and store that result in an object called all_pages:

all_pages <- str_c("https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=", 0:3)

Now, we can use scrape_urls to collect the URLs from all the pages of search results. We store the results as a character vector called all_urls.

all_urls <- map(all_pages, scrape_urls) %>%
  unlist

# test -- should be 310
length(all_urls)
#> [1] 310

3. Modify to Full Path

The HREF we got above is what’s called a relative URL: i.e., it looks like this:

/documents/special-message-the-congress-relative-space-science-and-exploration

as opposed to having a full path, like:

http://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration

The following code converts the relative paths to full paths, and saves them in an object called all_full_urls.

all_full_urls <- str_c("https://www.presidency.ucsb.edu", all_urls)
all_full_urls[1:10]
#>  [1] "https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration"                    
#>  [2] "https://www.presidency.ucsb.edu/documents/statement-the-president-support-the-administration-bill-relative-space-science-and"     
#>  [3] "https://www.presidency.ucsb.edu/documents/statement-the-president-upon-signing-the-national-aeronautics-and-space-act-1958"       
#>  [4] "https://www.presidency.ucsb.edu/documents/annual-budget-message-the-congress-fiscal-year-1960"                                    
#>  [5] "https://www.presidency.ucsb.edu/documents/letter-t-keith-glennan-administrator-national-aeronautics-and-space-administration"     
#>  [6] "https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-225"                                                     
#>  [7] "https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-augusta-georgia-0"                                       
#>  [8] "https://www.presidency.ucsb.edu/documents/annual-message-the-congress-the-state-the-union-6"                                      
#>  [9] "https://www.presidency.ucsb.edu/documents/special-message-the-congress-recommending-amendments-the-national-aeronautics-and-space"
#> [10] "https://www.presidency.ucsb.edu/documents/special-message-the-congress-transfers-from-the-department-defense-the-national"

4. Scrape Documents

Now that we have the full paths to each document, we’re ready to scrape each document.

We’ll use the scrape_docs function (given above), which accepts a URL of an individual record, scrapes the page, and returns a list containing the document’s date, speaker, title, and full text.

Using this function, we’ll iterate over all_full_urls to collect information on all the documents. We save the result as a dataframe, with each row representing a document.

Note: This might take a few minutes.

final_df <- map(all_full_urls, scrape_docs) %>%
  bind_rows()
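
Finally, per the assignment instructions, the last chunk should print the structure of the final data frame and write it to a csv. A minimal version (the output file name here is an assumption):

# Print the structure of the scraped dataframe
str(final_df)

# Write the result to disk (file name assumed; adjust to match data/space.csv if desired)
write.csv(final_df, "space.csv", row.names = FALSE)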