Chapter 20 Assignments

20.1 Assignment 1 Solutions

Assigned: Oct 3, 2019.
Due: Oct 10, 2019 at 12:29pm.

For this assignment, you will confirm that everything is installed and set up correctly, and that you understand how to interact with RStudio and R Markdown.

Your answers (to this assignment only) will be posted on our course website.

1. Using R Markdown

In the space below, insert a picture of yourself, and complete the following information:

Name: Daenerys Targaryen
Department and degree program: Queen of the Andals and the First Men, Protector of the Seven Kingdoms, the Mother of Dragons, the Khaleesi of the Great Grass Sea, the Unburnt, the Breaker of Chains.
Year in the program: First.
One-sentence description of academic interests: I am interested in slavery, intercontinental conflict, and pyrology.
Some non-academic interests: Dragons, Jon Snow, eating raw hearts.
R version installed on your computer (open a command-line window (‘Terminal’ or, on Windows, ‘Git Bash’) and enter the command R --version): 3.6.1
RStudio version installed on your computer (open RStudio and, in the navigation menu, click RStudio –> About RStudio): 1.1.456
Primary computer operating system (Mac OS, Windows, Linux, etc): Mac OS 10.13.6.
Programming experience (How would you describe your previous programming experience?): None.

2. Checking packages

Create an R chunk below, where you load the tidyverse library.

library(tidyverse)

3. Knit and submit.

Knit the R Markdown file to PDF. Submit BOTH the .Rmd file and the PDF file to Canvas.

If you get an error trying to knit, read the error and make sure that your R code is correct. If that doesn’t work, confirm you’ve correctly installed the requisite packages (knitr, rmarkdown). If you still can’t get it to work, paste the error on Canvas.

20.2 Assignment 2 Solutions

Assigned: Oct 10, 2019.
Due: Oct 17, 2019 at 12:29pm.

For this assignment, you’ll use what you know about R syntax and data structures to perform some common data operations.

1. Basics

1.1 Fix the following syntax errors. Enter your corrected code in the second chunk.

# 1
states <- ("California", "Illinois", "Ohio")

# 2
countries <- c("Iran", "Indonesia," "India", "Italy")

# 3
df <- data.frame(age = c(21, 66, 35)
                 party = c('rep', 'dem', 'rep'))

# 4
my-vector <- c("apples", "oranges", "kiwis")

# 5
artists <- list(names = c("Picasso", "Kahlo",
                genre = c("cubist", "surrealist"))

# PUT YOUR CORRECTED CODE HERE

# 1
states <- c("California", "Illinois", "Ohio")

# 2
countries <- c("Iran", "Indonesia", "India", "Italy")

# 3
df <- data.frame(age = c(21, 66, 35),
                 party = c('rep', 'dem', 'rep'))

# 4
my_vector <- c("apples", "oranges", "kiwis")

# 5
artists <- list(names = c("Picasso", "Kahlo"),
                genre = c("cubist", "surrealist"))

1.2 How many arguments does the order() function take? What are they?
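
One way to approach this question is to inspect the function's formal arguments from the console instead of counting them on the help page. The snippet below is a minimal sketch using base R's formals(); the variable name order_args is only for illustration, and the exact list may differ between R versions, so check it on your own installation.

```r
# List the formal arguments of order() as defined in the installed version of R
order_args <- names(formals(order))
print(order_args)

# Count them; "..." stands for the vector(s) to sort by
length(order_args)
```

The help page (?order) explains what each argument does.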

2. Vectors and Lists

2.1 Create three vectors:

a character vector, titles, that contains the names of 3 of your favorite movies
a numeric vector, year, that contains the years in which those movies were produced
a boolean vector, bechdel, that is TRUE/FALSE according to whether those movies pass the Bechdel test

titles <- c("Dog Day Afternoon", "The Graduate", "Breakfast Club")
year <- c(1975, 1967, 1985)
bechdel <- c(TRUE, FALSE, TRUE)

2.2 Put those three vectors in a list, called movies.

movies <- list(titles, year, bechdel)

2.3 Print the structure of the list movies.

str(movies)
#> List of 3
#>  $ : chr [1:3] "Dog Day Afternoon" "The Graduate" "Breakfast Club"
#>  $ : num [1:3] 1975 1967 1985
#>  $ : logi [1:3] TRUE FALSE TRUE
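
The structure printed above shows unnamed components ($ :) because the list was built without names. As a hedged variant (movies_named is a hypothetical object, not part of the assignment), naming the elements at creation time makes the structure easier to read and lets you extract components by name:

```r
titles <- c("Dog Day Afternoon", "The Graduate", "Breakfast Club")
year <- c(1975, 1967, 1985)
bechdel <- c(TRUE, FALSE, TRUE)

# Name each element when the list is created
movies_named <- list(titles = titles, year = year, bechdel = bechdel)

str(movies_named)     # components now appear as $ titles, $ year, $ bechdel
movies_named$year     # elements can be pulled out by name
```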

3. Factors

3.1 Here’s some code that prints a simple barplot:

f <- factor(c("low","medium","high","medium","high","medium"))
table(f)
#> f
#>   high    low medium 
#>      2      1      3
barplot(table(f))

How would you relevel f to be in the correct order?

f <- factor(f, levels = c("low", "medium", "high"))

# Test your code
barplot(table(f))
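
If you prefer to check the releveling without a plot, a small sketch like the following compares the level order before and after. It assumes only base R and rebuilds f from scratch so it can run on its own.

```r
f <- factor(c("low", "medium", "high", "medium", "high", "medium"))
levels(f)    # default level order is alphabetical

f <- factor(f, levels = c("low", "medium", "high"))
levels(f)    # levels now follow the order we supplied
table(f)     # counts are reported in the new level order
```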

4. Dataframes

4.1 Coerce the movies object you made above from a list to a dataframe. Call it movies_df.

movies_df <- as.data.frame(movies)

4.2 Add appropriate column names to movies_df.

names(movies_df) <- c("film", "year", "bechdel")

20.3 Assignment 3 Solutions

Assigned: Oct 17, 2019.
Due: Oct 24, 2019 at 12:29pm.

For this assignment, you’ll be working on some real-life data! I’ve prepared for you a basic country-year dataset, with the following variables:

Country name
Country numerical code
Year
UN Ideal point
Polity2 score of regime type (from Polity IV)
Physical Integrity Rights score (from CIRI dataset)
Speech Rights score (from CIRI)
GDP per capita (from World Bank)
Population (from World Bank)
Political Terror Scale using Amnesty International reports (from Political Terror Scale project)
Composite Index of Military Capabilities (Correlates of War)
Region

1. R Projects and Importing

1.1 Using getwd(), print your working directory below.

getwd()
#> [1] "/Users/rochelleterman/Desktop/course-site"

1.2 Read country-year.csv into R, using a relative path. Store it in an object called dat.

dat <- read.csv("data/country-year.csv")

2. Dimensions and Names

2.1 How many rows and columns are in the dataset?

dim(dat)
#> [1] 6416   13

2.2 Print the column names.

names(dat)
#>  [1] "X"          "year"       "ccode"      "country"    "idealpoint"
#>  [6] "polity2"    "physint"    "speech"     "gdp.pc.wdi" "pop.wdi"   
#> [11] "amnesty"    "cinc"       "region"

2.3 Remove the X column from the dataset.

dat$X <- NULL

2.4 One of the variables is called “gdp.pc.wdi”. This stands for “Gross Domestic Product Per Capita, from the World Bank Development Indicators”. Change this variable name in the dataset from “gdp.pc.wdi” to “GDP”.

names(dat)[8] <- "GDP"
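
Renaming by position works here, but it depends on the column order staying the same (position 8 only after the X column has been removed). A possible alternative, sketched on a toy data frame that stands in for dat, is to match on the old name:

```r
# Toy stand-in for dat; only the column names matter here
toy <- data.frame(year = 1980, gdp.pc.wdi = 276, pop.wdi = 13180431)

# Rename by matching the old name instead of hard-coding a position
names(toy)[names(toy) == "gdp.pc.wdi"] <- "GDP"
names(toy)
```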

3. Summarizing

3.1 How many years are covered in the dataset?

length(unique(dat$year))
#> [1] 36

3.2 How many unique countries are covered in the dataset?

length(unique(dat$country))
#> [1] 196

3.3 What is the range of polity2? How many NAs are in this column?

summary(dat$polity2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>     -10      -6       5       2       9      10    1214

4. Subsetting

4.1 Subset dat so that it returns the third column AS A VECTOR (Do not print the object; store it in a variable.)

sub <- dat[[3]]
#OR
sub <- dat[,3]
#OR
names(dat)[3]
#> [1] "country"
sub <- dat$country

4.2 Fix each of the following common data frame subsetting errors:

Extract observations collected for the year 1980

dat[dat$year = 1980,]

# Corrected
dat[dat$year == 1980,]

Extract all columns except 1 through to 4

dat[,-1:4]

# Corrected
dat[,-c(1:4)]

Extract the rows where the polity2 score is greater than 5

dat[dat$polity2 > 5]

# corrected
dat[dat$polity2 > 5, ]
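
One caveat worth knowing: polity2 contains NAs, and a logical index with NA in it returns a row of NAs for each such position. The sketch below illustrates this on a toy data frame (the values are made up) and shows which() as one way to keep only the rows where the test is TRUE.

```r
toy <- data.frame(country = c("A", "B", "C"), polity2 = c(9, NA, 2))

# The NA in the logical index produces an extra all-NA row
with_na <- toy[toy$polity2 > 5, ]
nrow(with_na)

# which() keeps only the positions where the comparison is TRUE
without_na <- toy[which(toy$polity2 > 5), ]
nrow(without_na)
```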

Extract the first row, and the third and fourth columns (country and idealpoint).

dat[1, 3, 4]

# Corrected
dat[1, c(3, 4)]

Extract rows that contain information for the years 2002 and 2007

dat[dat$year == 2002 | 2007,]

# Corrected
dat[dat$year == 2002 | dat$year == 2007,]
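
To see why the uncorrected version misbehaves, and as an alternative way to write the correction, here is a small sketch on made-up years. The %in% operator tests membership in a set of values, which scales better than chaining several == comparisons.

```r
toy <- data.frame(year = c(2001, 2002, 2007, 2008))

# 2007 on its own is coerced to TRUE, so every row passes the test
toy$year == 2002 | 2007

# %in% checks each year against the set of wanted values
sub_years <- toy[toy$year %in% c(2002, 2007), , drop = FALSE]
sub_years$year
```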

4.3 What does summary(dat$polity2[dat$region =="Africa"]) do? Explain below in your own words.

It calculates some summary statistics for polity2 scores from observations in Africa.

4.4 Subset the data to include only observations from years 1990-2000 (inclusive). Put the subsetted data in a new variable called dat.1990.2000

dat.1990.2000 <- dat[dat$year >= 1990 & dat$year<=2000,]

4.5 Using mean() function, tell me the average GDP of observations from 1990 to 2000.

mean(dat.1990.2000$GDP, na.rm = T)
#> [1] 6611

4.6 You just calculated the average GDP for years 1990-2000. Now calculate the average GDP from 2001 onwards. Tell me how much larger it is (in percentage).

dat.2001.plus <- dat[dat$year > 2000,]
mean1 <- mean(dat.1990.2000$GDP, na.rm = T)
mean2 <- mean(dat.2001.plus$GDP, na.rm = T)
(mean2 - mean1) / mean1
#> [1] 0.825

4.7 Look up the helpfile for the function is.na(). Using this function, replace all the NA values in the polity2 column of dat with 0.

?is.na
dat$polity2[is.na(dat$polity2)] <- 0
summary(dat$polity2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -10.00   -5.00    0.00    1.46    8.00   10.00

20.4 Assignment 4 Solutions

Assigned: Oct 24, 2019
Due: Nov 5, 2019 at 12:29pm.

For this problem set, we’ll be working with the country-year data introduced in the last assignment. As a reminder, the dataset contains the following variables:

year: Year.
ccode: Country numerical code.
country: Country name.
idealpoint: UN Ideal point.
polity2: Polity2 score of regime type (from Polity IV).
physint: Physical Integrity Rights score (from CIRI dataset).
speech: Speech Rights score (from CIRI).
gdp.pc.wdi: GDP per capita (from World Bank).
pop.wdi: Population (from World Bank).
amnesty: Political Terror Scale using Amnesty International reports (from Political Terror Scale project).
cinc: Composite Index of Military Capabilities (Correlates of War).
region: Geographic region.

We’ll be merging this country_year data with new data about U.S. news coverage of women around the world (excluding the United States). In this new dataset, the unit of observation is article. That is, each row represents an individual article, with columns for:

publication: NYT or Washington Post.
year: Year article was published.
title: Title of the article.
country: Country the article is mainly about.
region: Region where country is located.
ccode: Numerical code for country.

1. Loading, subsetting, summarizing

1.1 Load the csv found in data/articles.csv into R. Be sure to set stringsAsFactors to FALSE. Store the data-frame to an object called articles and tell me the variable names.

library(tidyverse)
library(knitr) # provides kable(), used below to print tables
articles <- read.csv("data/articles.csv", stringsAsFactors = F)
names(articles)
#> [1] "publication" "year"        "title"       "country"     "region"     
#> [6] "ccode"

1.2 How many countries are covered in the dataset?

length(unique(articles$country))
#> [1] 147

1.3 The variable ccode reports a numerical ID corresponding to a given country. Print the names of the country or countries without a ccode (i.e. those countries where the ccode is NA.)

unique(articles$country[is.na(articles$ccode)])
#> [1] "Palestine"

1.4 Remove all articles where the ccode variable is NA. How many observations are left?

articles_no_na <- articles[!is.na(articles$ccode), ]
nrow(articles_no_na)
#> [1] 4494

2. Counting Frequencies and Merging

2.1 Create a new data frame called articles_country_year that tells us the number of articles per ccode (i.e. country code), per year.

The final data frame articles_country_year should contain three columns: ccode, year, and number_articles.

Print the first 6 rows of the articles_country_year.

Hint: The count function – part of the dplyr package – might be helpful.

articles_country_year <- articles_no_na %>% 
  dplyr::count(ccode, year) %>%
  select(ccode, year, number_articles = n)

kable(head(articles_country_year))

ccode	year	number_articles
20	1980	4
20	1981	9
20	1982	4
20	1983	4
20	1984	6
20	1985	1
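
For comparison, the same kind of frequency table can be sketched in base R with aggregate(). The toy_articles data below is invented for illustration, and the helper column n is an assumption of this sketch, not part of the assignment data.

```r
# Toy article-level data: one row per article
toy_articles <- data.frame(ccode = c(20, 20, 20, 70, 70),
                           year  = c(1980, 1980, 1981, 1980, 1980))

# Give every article a weight of 1, then sum the weights within each group
toy_articles$n <- 1
counts <- aggregate(n ~ ccode + year, data = toy_articles, FUN = sum)

# Match the column name requested in the question
names(counts)[names(counts) == "n"] <- "number_articles"
counts
```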

2.2 Load data/country-year.csv (this is the country-year data we worked with during the last assignment.)

country_year <- read.csv("data/country-year.csv", stringsAsFactors = F)

2.3 Subset country_year such that it has the same year range as articles_country_year.

range(articles_country_year$year)
#> [1] 1980 2014
range(country_year$year)
#> [1] 1979 2014

country_year <- country_year %>% 
  filter(year > 1979)

2.4 Merge (i.e. join) articles_country_year and country_year into a new dataframe called merged.

When you’re done, merged should have all the rows and columns of the country_year dataset, along with a new column called number_articles.

Print the first 6 rows of this new dataframe merged.

merged <- country_year %>% 
  left_join(articles_country_year)
#> Joining, by = c("year", "ccode")

kable(head(merged))

X	year	ccode	country	idealpoint	polity2	physint	speech	gdp.pc.wdi	pop.wdi	amnesty	cinc	region	number_articles
156	1980	700	Afghanistan	-1.560	NA	NA	NA	276	13180431	5	0.001	MENA	NA
157	1980	540	Angola	-1.176	-7	NA	NA	NA	7637141	3	0.001	Africa	NA
158	1980	339	Albania	-1.564	-9	NA	NA	NA	2671997	3	0.001	EECA	NA
159	1980	696	United Arab Emirates	-0.315	-8	NA	NA	42962	1014825	NA	0.001	MENA	NA
160	1980	160	Argentina	0.128	-9	NA	NA	2737	28120135	5	0.007	LA	NA
161	1980	900	Australia	1.423	10	NA	NA	10188	14692000	NA	0.007	West	3
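
The NA values in number_articles come from the left join: every row of country_year is kept, and rows with no match in articles_country_year are filled with NA. A base R sketch of the same behaviour, using merge() with all.x = TRUE on three toy rows (cy, art, and toy_merged are hypothetical names):

```r
# Toy country-year rows and a single matching article count
cy <- data.frame(ccode = c(700, 900, 160), year = 1980,
                 country = c("Afghanistan", "Australia", "Argentina"))
art <- data.frame(ccode = 900, year = 1980, number_articles = 3)

# all.x = TRUE keeps every row of cy, as left_join(cy, art) would
toy_merged <- merge(cy, art, by = c("ccode", "year"), all.x = TRUE)
toy_merged
```

Only Australia has a matching row in art, so the other two countries get NA.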

2.5 In merged, replace all instances of NA in the number_articles column with 0.

# solution 1 - base R
merged$number_articles[is.na(merged$number_articles)] <- 0

# solution 2 - tidyr
merged$number_articles <- replace_na(merged$number_articles, 0)

# solution 3 - dplyr
merged <- merged %>% 
  mutate(number_articles = ifelse(is.na(number_articles), 0, number_articles))

# test
summary(merged$number_articles)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     0.0     0.0     0.0     0.7     0.0    99.0

2.6 Which country-year observation has the largest number of articles? Write code that prints the year, country name, and number of articles for this observation.

# solution #1 -- base R
merged[which.max(merged$number_articles),c("year", "country", "number_articles")]
#>      year country number_articles
#> 5950 2013   India              99

# solution #2 -- tidyverse
merged %>% 
  top_n(1, number_articles) %>%
  select(year, country, number_articles)
#>   year country number_articles
#> 1 2013   India              99

3. Group-wise Operations

3.1 Using the merged data and our split-apply-combine strategies, print the total number of articles about women per region.

n_region <- merged %>%
  group_by(region) %>%
  summarise(count = sum(number_articles, na.rm = T))

n_region
#> # A tibble: 7 x 2
#>   region count
#>   <chr>  <dbl>
#> 1 Africa   464
#> 2 Asia    1288
#> 3 EECA     251
#> 4 LA       328
#> 5 MENA     940
#> 6 West    1159
#> # … with 1 more row

4. Long v. wide formats

4.1 Create a piped operation on merged that does the following:

Subsets the dataframe to select the year, country, and number_articles columns.
Filters the dataframe to select only observations in the MENA region.
Spreads the dataframe so that each country is a column, and the cells represent number_articles.

Print the first 6 rows of this transformed data frame.

wide <- merged %>%
  filter(region == "MENA") %>%
  select(year, country, number_articles) %>%
  spread(country, number_articles, fill = 0)

kable(head(wide))

year	Afghanistan	Egypt	Iran	Iraq	Israel	Kuwait	Lebanon	Morocco	Saudi Arabia	Syria	Turkey
1980	0	0	4	0	9	0	0	0	1	0	0
1981	1	2	1	0	2	0	0	1	1	0	1
1982	0	1	2	1	6	1	0	0	1	0	1
1983	0	0	0	0	2	0	0	0	0	0	0
1984	0	3	0	0	3	1	2	0	0	1	0
1985	0	6	0	0	1	1	3	0	0	0	0

4.2 Transform the dataset you created above back into long format, with three variables: year, country, and number_articles

Print the first 6 rows of this transformed data frame.

long <- wide %>% 
  gather(country, number_articles, -year)

kable(head(long))

year	country	number_articles
1980	Afghanistan	0
1981	Afghanistan	1
1982	Afghanistan	0
1983	Afghanistan	0
1984	Afghanistan	0
1985	Afghanistan	0
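
In tidyr 1.0.0 and later, spread() and gather() are superseded by pivot_wider() and pivot_longer(). The round trip above could be sketched with the newer functions as follows; the toy data reuses four values from the table above, and the object names are hypothetical.

```r
library(tidyr)

toy_long <- data.frame(year = c(1980, 1980, 1981, 1981),
                       country = c("Iran", "Israel", "Iran", "Israel"),
                       number_articles = c(4, 9, 1, 2))

# long -> wide: one column per country (counterpart of spread())
toy_wide <- pivot_wider(toy_long, names_from = country,
                        values_from = number_articles)

# wide -> long again (counterpart of gather())
toy_back <- pivot_longer(toy_wide, cols = -year,
                         names_to = "country", values_to = "number_articles")
toy_back
```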

20.4.0.1 Extra Credit

This question is not required. But if you want an extra challenge….

Transform the country_year data into an undirected dyadic dataset. Here, the unit of observation should be the dyad-year, with six columns:

ccode_1: Country 1 ccode
country_1: Country 1 name
ccode_2: Country 2 ccode
country_2: Country 2 name
year: Year of observation
gdp_diff: Absolute difference of gdp between dyad.

This should be an undirected dyadic dataset, meaning USA-Canada-1980 is the same as Canada-USA-1980, and we shouldn’t have duplicate rows for the same dyad.

Try to do it all in 1 piped sequence. Then tell me the dyad-year with the greatest wealth disparity.

dyad <- country_year %>% 
  expand(ccode_1=ccode, ccode_2=ccode) %>% # make two columns of states
  filter(ccode_1 > ccode_2) %>% # from directed to undirected dyads
  left_join(., country_year, by=c("ccode_1"="ccode")) %>% # get state1 info
  left_join(., country_year, by=c("year", "ccode_2"="ccode")) %>% # get state2 info 
  mutate(gdp_diff = abs(gdp.pc.wdi.x - gdp.pc.wdi.y)) %>% # take absolute difference in gdp
  select(ccode_1, country_1 = country.x, ccode_2, country_2 = country.y, year, gdp_diff) %>%
  arrange(desc(gdp_diff))

kable(head(dyad))

ccode_1	country_1	ccode_2	country_2	year	gdp_diff
516	Burundi	221	Monaco	2008	193705
450	Liberia	221	Monaco	2008	193661
531	Eritrea	221	Monaco	2008	193636
553	Malawi	221	Monaco	2008	193590
490	Democratic Republic of the Congo	221	Monaco	2008	193566
530	Ethiopia	221	Monaco	2008	193565
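
The filter(ccode_1 > ccode_2) step is what turns directed pairs into undirected ones: of the two orderings of each pair, only one survives, and self-pairs are dropped. A base R sketch of the same idea uses combn(), which enumerates each unordered pair exactly once. The three country codes here are arbitrary examples.

```r
codes <- c(2, 20, 70)   # arbitrary example country codes

# combn() returns every unordered pair once, one pair per column
pairs <- t(combn(codes, 2))
dyads <- data.frame(ccode_1 = pairs[, 1], ccode_2 = pairs[, 2])
dyads

# The number of undirected dyads among n states is choose(n, 2)
nrow(dyads) == choose(length(codes), 2)
```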

20.5 Assignment 5 Solutions

Assigned: Nov 5, 2019
Due: Nov 12, 2019 at 12:29pm.

For this problem set, we’ll be working with the country-year data introduced in the last assignment. As a reminder, the dataset contains the following variables:

year: Year.
ccode: Country numerical code.
country: Country name.
idealpoint: UN Ideal point.
polity2: Polity2 score of regime type (from Polity IV).
physint: Physical Integrity Rights score (from CIRI dataset).
speech: Speech Rights score (from CIRI).
gdp.pc.wdi: GDP per capita (from World Bank).
pop.wdi: Population (from World Bank).
amnesty: Political Terror Scale using Amnesty International reports (from Political Terror Scale project).
cinc: Composite Index of Military Capabilities (Correlates of War).
region: Geographic region.

1. Getting Started

1.1 Read data/country-year.csv into R, using a relative path. Store it in an object called dat.

library(tidyverse)
library(stargazer)
dat <- read.csv("data/country-year.csv", stringsAsFactors = F)

2. Plotting

2.1 Write code that reproduces “plots/Plot_1.jpeg”. (No need to write the file.)

# Density of population
d <- density(log(dat$pop.wdi), na.rm = T, bw = .2)
plot(d, main = "Summary of Population (Logged)") 
abline(h = max(d$y), v = 16.09, lty = 2)

2.2 Write code that reproduces “plots/Plot_2.jpeg”. (No need to write the file.)

# get summary data
country_means <- dat %>%
  filter(!is.na(region)) %>%
  group_by(country) %>%
  summarise(gdp = mean(gdp.pc.wdi, na.rm = T),
            polity = mean(polity2, na.rm = T),
            cinc = mean(cinc, na.rm = T),
            region = region[1])

# plot
ggplot(country_means, aes(x = polity, y = log10(gdp))) +
  geom_point(aes(color = region)) +
  #scale_y_log10() +
  geom_smooth(color="red", fill="red") + 
  ylab("Mean GDP (Logged) ") +
  xlab("Mean Polity Score") +
  ggtitle("Average Polity Score by GDP, 1979-2014") 
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#> Warning: Removed 33 rows containing non-finite values (stat_smooth).
#> Warning: Removed 33 rows containing missing values (geom_point).

2.3 Write code that reproduces “plots/Plot_3.jpeg”. (No need to write the file.)

Hint: The fall-inspired colors are #771C19, #E25033, #F27314, #F8A31B

# military capabilities
top_cinc <- country_means %>%
  top_n(10, cinc)

# Fall theme
rhg_cols = c("#771C19","#E25033","#F27314", "#F8A31B")

# plot
ggplot(top_cinc, aes(reorder(country, cinc), cinc, fill = region)) +
  geom_col() +
  theme(axis.text.x=element_text(size = 7, angle=45, hjust=1)) +
  ylab("Composite Index of Military Capabilities") +
  xlab("Country") +
  ggtitle("Top 10 Military Capabilities, in Fall Colors") +
  scale_fill_manual(values = rhg_cols)

2.4 Write code that reproduces “plots/Plot_4.jpeg”. (No need to write the file.)

# prepare data
year_means <- dat %>%
  filter(!is.na(region)) %>%
  group_by(year, region) %>%
  summarise(gdp = mean(gdp.pc.wdi, na.rm = T),
            polity = mean(polity2, na.rm = T),
            physint = mean(physint, na.rm = T)) 

# plot
ggplot(year_means, aes(x = year, y = polity, color = region)) +
  geom_line(size=2) +
  ylab("Average Polity Score") +
  xlab("Year") +
  ggtitle("Average Polity Score Over Time")
#> Warning: Removed 6 rows containing missing values (geom_path).

3. Models

3.1 Write code that reproduces the model summary table “reg_table.txt” (and writes the file).

mod.1 <- lm(physint ~ polity2, data = dat)
mod.2 <- lm(physint ~ polity2 + log(gdp.pc.wdi), data = dat)
mod.3 <- lm(physint ~ polity2 + log(gdp.pc.wdi) + region, data = dat)

stargazer(mod.1, mod.2, mod.3, title = "Regression Results", type = "text", 
          covariate.labels  = c("Polity2", "GDP per capita, logged", "Asia", "Eastern Europe", "Latin America", "MENA", "West", "Constant"), 
          dep.var.labels = "DV: Physical Integrity",
          omit = "Constant", 
          keep.stat="n", style = "ajps",
          out = "reg_table.txt")
#> 
#> Regression Results
#> --------------------------------------------------
#>                          DV: Physical Integrity   
#>                        Model 1  Model 2   Model 3 
#> --------------------------------------------------
#> Polity2                0.124*** 0.064*** 0.033*** 
#>                        (0.004)  (0.005)   (0.005) 
#> GDP per capita, logged          0.503*** 0.493*** 
#>                                 (0.021)   (0.028) 
#> Asia                                     -1.170***
#>                                           (0.101) 
#> Eastern Europe                            -0.033  
#>                                           (0.109) 
#> Latin America                            -0.769***
#>                                           (0.097) 
#> MENA                                     -1.460***
#>                                           (0.116) 
#> West                                     0.604*** 
#>                                           (0.138) 
#> N                        4336     4141     4141   
#> --------------------------------------------------
#> ***p < .01; **p < .05; *p < .1

20.6 Assignment 6 Solutions

Assigned: Nov 12, 2019
Due: Nov 21, 2019 at 12:29pm.

In this unit, we’ll use R to turn a bunch of loose text documents into a real-life database. (Note: This database was created for a project by R. Terman and E. Voeten, and was processed using much the same process as you’ll be learning here.)

The problem set will leverage your new R skills, especially working with strings; writing functions; using iteration; and thinking like a programmer.

Important: The code has been scaffolded for you, meaning that you have to fill in the blanks. Once you’re ready to submit the assignment, you have to remove the eval = F from the R chunk header. If you don’t, the chunk won’t execute when you knit the Rmarkdown file.

About the Data

We’ll be creating a database from Universal Periodic Review outcome reports.

The Universal Periodic Review (UPR) is a process run by the United Nations Human Rights Council, which involves a periodic review of the human rights records of all 193 UN Member States.

Reviews take place through an interactive discussion between the State under review and other UN Member States. During this discussion any UN Member State can pose questions, comments and/or make recommendations to the States under review. States under review can then respond, stating which recommendations they reject, accept, will consider, etc. Reports are then drawn up detailing this discussion.

We will be analyzing outcome reports from the 2014 Universal Periodic Reviews of 42 countries, which we retrieved here and formatted as text documents.

The goal is to convert these semi-structured texts to a tabular dataset of recommendations with the following variables:

Text of recommendation (text)
Country to which the recommendation is directed (to)
Country that is making the recommendation (from)
The year when the review took place (year)
The response to the recommendation, i.e. whether the reviewed country rejects, accepts, etc (decision)

In other words, we want to turn this:

into this:

Take a few minutes to look at the files, which are located in data/txts, and get a sense for how they’re structured.

Then run the following code to get started.

library(readtext)
library(stringr)
library(tidyverse)

# read all texts
all_texts <- readtext("data/txts/")

1. Extract One Document

We’re going to start off working with just one document. We’ll then use that code to iterate over all the documents.

task:

Extract one document.
Collect information on the country and year.
Extract the section we’re interested in.
Turn each line (i.e. recommendation) into an item in a vector.

Let’s start off working with cotedivoire2014.txt (the third file).

text <- all_texts$text[3]
file_name <- all_texts$doc_id[3]

1.1 Assign country and year variables.

You’ll notice that the file_name consists of the name of the reviewed country and the year. Slice file_name to create 2 new variables, country, and year.

Be careful! Remember that we are going to apply this to the other file names later. However you slice “cotedivoire2014.txt”, it needs to work for the other files in the data/txts directory.

country <- str_sub(file_name, 1, -9)
year <- str_sub(file_name, -8, -5)
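
The negative indices count from the end of the string, which is why the slice works regardless of how long the country name is. The sketch below reproduces the same logic in base R so the positions are explicit. It assumes every file name ends in a four-digit year followed by .txt; the second file name is hypothetical.

```r
file_names <- c("cotedivoire2014.txt", "newzealand2014.txt")  # second name is hypothetical

n <- nchar(file_names)
country <- substr(file_names, 1, n - 8)      # everything before the last 8 characters
year    <- substr(file_names, n - 7, n - 4)  # the 4 characters just before ".txt"

country
year
```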

1.2 Get the Recommendations Section

Note that the section we want starts with "II. Conclusions and/or recommendations\n". What function would you use to get everything after this substring? Fill in the blank below and assign the value to a new variable called rec_text.

sections = str_split(text, "II. Conclusions and/or recommendations\n")[[1]]
rec_text = sections[2] # get second item -- everything after.

1.3 Turn it into a vector

Using a stringr function, turn the string above into a vector of lines, and store it in a variable called recs. Remember that a new line is represented by \n.

recs <- str_split(rec_text, "\n")[[1]]
recs[1:5]
#> [1] "127. The recommendations listed below enjoy the support of C™te dÕIvoire: "                                                                                                  
#> [2] "127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); "
#> [3] "127.2 Make efforts towards the ratification of the OP-CAT (Chile); "                                                                                                         
#> [4] "127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); "      
#> [5] "127.4 Accede to the OP-CAT as soon as possible (Uruguay); "
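
The two splitting steps can be tried on a small made-up string before running them on a full document. This sketch uses base R's strsplit() with fixed = TRUE, which treats the section header as a literal string so that the "." characters are not interpreted as regular-expression wildcards; toy_text is invented for illustration.

```r
toy_text <- paste0("I. Summary\nSome text\n",
                   "II. Conclusions and/or recommendations\n",
                   "127. Intro\n127.1 First rec (Chile); ")

# Split on the literal header and keep everything after it
sections <- strsplit(toy_text, "II. Conclusions and/or recommendations\n",
                     fixed = TRUE)[[1]]
rec_text <- sections[2]

# Split the remaining text into one element per line
recs <- strsplit(rec_text, "\n", fixed = TRUE)[[1]]
recs
```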

2. Chunk Recommendations

These texts have 3 sections each:

The first section contains the recommendations the country supports.
The second section contains the recommendations the country will examine.
The third section contains the recommendations the country explicitly rejects.

task:

parse recommendations into three piles, corresponding to accepted recs, examined recs, and rejected recs.
combine these piles back into a dataframe, containing the text of the recommendation and its corresponding decision.
add additional columns for to country and year.

2.1 Find the paragraph numbers

Each section starts with a main paragraph number (e.g. 127). The individual recommendations are then noted as subparagraphs (e.g. 127.1, 127.2 etc.).

All the accepted recommendations have the same main paragraph number (127). Next come the recommendations which will be examined, whose main paragraph number is just the next integer (128). After that are the rejected recommendations, with the next integer as their main paragraph number (129).

We can’t know the paragraph numbers beforehand, because each file is different. But we can leverage our knowledge of the structure of the documents to get them.

Fill in the blanks below to create 3 variables containing the 3 paragraph numbers.

para1 = str_extract(recs[1], "\\d+")
para1 = as.numeric(para1)
para2 = para1 + 1
para3 = para2 + 1
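
As a cross-check, the same extraction can be sketched without stringr. Here first_line is a shortened, made-up version of the first element of recs; regexpr() locates the first run of digits and regmatches() returns it.

```r
first_line <- "127. The recommendations listed below enjoy the support of the State: "

# Pull out the first run of digits and convert it to a number
para1 <- as.numeric(regmatches(first_line, regexpr("[0-9]+", first_line)))
para2 <- para1 + 1
para3 <- para2 + 1

c(para1, para2, para3)
```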

2.2 Parse the text

Now create 3 new vectors: accept_recs, examine_recs, reject_recs. Each vector should contain the recommendations assigned to its corresponding section.

hint: How do you know if a line belongs to a section? It starts with the main paragraph number for that section. So use the str_starts function.

# subset recommendations
accept_recs = recs[str_starts(recs, as.character(para1))]
# accept_recs = str_subset(recs, str_c("^", as.character(para1)))
examine_recs = recs[str_starts(recs, as.character(para2))]
reject_recs = recs[str_starts(recs, as.character(para3))]

# remove the first item from each vector, which just demarcates the sections
accept_recs = accept_recs[-1]
examine_recs = examine_recs[-1]
reject_recs = reject_recs[-1]
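
To see the sorting logic in isolation, here is a sketch on six made-up lines. It uses base R's startsWith(), which behaves like str_starts() for a plain prefix, and the recommendation texts are invented.

```r
recs <- c("127. Supported:", "127.1 Rec A (Chile); ",
          "128. To be examined:", "128.1 Rec B (Ghana); ",
          "129. Not supported:", "129.1 Rec C (Peru); ")

# Keep the lines with each prefix, then drop the section header with [-1]
accept_recs  <- recs[startsWith(recs, "127")][-1]
examine_recs <- recs[startsWith(recs, "128")][-1]
reject_recs  <- recs[startsWith(recs, "129")][-1]

accept_recs
```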

2.3 Transform to Dataframe

The following code combines the three vectors back into a dataframe with two columns: text (of the recommendation), and decision (whether the recommendation was accepted, examined, or rejected)

recs_df <- list(accept = accept_recs,
                    examine = examine_recs,
                    reject = reject_recs)

recs_df <- stack(recs_df) %>%
  select("text" = values, "decision" = ind)
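
If stack() is unfamiliar, this small sketch shows what it returns for a named list of character vectors: a data frame with a values column holding the elements and an ind column holding the name of the list element each value came from. The toy list is invented.

```r
toy_list <- list(accept = c("rec A", "rec B"), reject = "rec C")

# One row per value; ind records which list element it came from
stacked <- stack(toy_list)
stacked
```

That is why the code above renames values to text and ind to decision.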

Your job is to add 2 new columns to this dataframe: to should contain the country under review, and year should contain the year under review. Note that we already created these variables above, in question 1.1

recs_df <- recs_df %>%
  mutate(to = country,
         year = year)

head(recs_df)
#>                                                                                                                                                                           text
#> 1 127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); 
#> 2                                                                                                          127.2 Make efforts towards the ratification of the OP-CAT (Chile); 
#> 3       127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); 
#> 4                                                                                                                   127.4 Accede to the OP-CAT as soon as possible (Uruguay); 
#> 5                                                                                                                             127.5 Consider ratifying OP-CAT (Burkina Faso); 
#> 6                             127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); 
#>   decision          to year
#> 1   accept cotedivoire 2014
#> 2   accept cotedivoire 2014
#> 3   accept cotedivoire 2014
#> 4   accept cotedivoire 2014
#> 5   accept cotedivoire 2014
#> 6   accept cotedivoire 2014

3. Get Recommending Country

task:

extract the substring representing the recommending country.
add this information to our dataframe.

3.1 Extract recommending country

Take a look at several recommendation texts to get an idea of their format.

head(recs_df$text)
#> [1] "127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); "
#> [2] "127.2 Make efforts towards the ratification of the OP-CAT (Chile); "                                                                                                         
#> [3] "127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); "      
#> [4] "127.4 Accede to the OP-CAT as soon as possible (Uruguay); "                                                                                                                  
#> [5] "127.5 Consider ratifying OP-CAT (Burkina Faso); "                                                                                                                            
#> [6] "127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); "

Notice that they’re all formatted the same way, with the recommending country in parentheses at the end. When a recommendation names several countries, we want the last one.

Using your string skills, find a way to pull out the recommending country from the first recommendation (stored in first_rec below).

first_rec = recs_df$text[1]

rec_after_paran <- str_split(first_rec, "\\(")[[1]]
rec_after_paran <- tail(rec_after_paran, 1)
first_rec_country = str_split(rec_after_paran, "\\)")[[1]]
first_rec_country <- first_rec_country[1]

# this should be 'Philippines'.
first_rec_country
#> [1] "Philippines"

3.2 Create a Function

Create a function called get_country that takes an individual recommendation text and returns the recommending country.

get_country <- function(rec){
  rec_after_paran <- str_split(rec, "\\(")[[1]]
  rec_after_paran <- tail(rec_after_paran, 1) # get last item
  first_rec_country = str_split(rec_after_paran, "\\)")[[1]]
  first_rec_country <- first_rec_country[1] # get first item
  
  return(first_rec_country)
}

# test your code
get_country(first_rec)
#> [1] "Philippines"
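The same extraction can also be written as a single regular expression. The sketch below is an alternative, not the approach used above; the name get_country_regex is hypothetical. It captures whatever sits inside the last pair of parentheses in the string.

```r
library(stringr)

# Hypothetical alternative to get_country(): capture the text inside the
# last pair of parentheses in a recommendation.
get_country_regex <- function(rec) {
  # \\(([^()]*)\\) matches one parenthesized group and captures its contents;
  # [^()]*$ requires that no further parentheses follow before the end.
  str_match(rec, "\\(([^()]*)\\)[^()]*$")[, 2]
}

example_rec <- "127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); "
get_country_regex(example_rec)
#> [1] "Estonia"
```

Unlike the split-based version, this sketch returns NA when a recommendation contains no parentheses, which can make malformed lines easier to spot.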

3.3 Add from column

Using your map and dplyr skills, add a column to recs_df that contains the country issuing each recommendation.

recs_df <- recs_df %>% mutate(
  from = map_chr(text, get_country)
)

4. Repeat for all documents

We just wrote code that takes one document and turns it into a dataset!

The problem is we have 11 documents!

task

combine the code we’ve written so far to create a function
apply that function to all files to create a single dataset.

4.1 Make a function

Combine the functions that you wrote above to create a single function that takes a row number of all_texts (i.e. an integer) and returns a dataframe of fully parsed recommendations in that file.

parse_file <- function(i){
    # get filename and text
    text <- all_texts$text[i]
    file_name <- all_texts$doc_id[i]
    
    # get to country and year
    country <- str_sub(file_name, 1, -9)
    year <- str_sub(file_name, -8, -5)
    
    # get vector of recs
    sections = str_split(text, "II. Conclusions and/or recommendations\n")[[1]]
    rec_text = sections[2] # get second item -- everything after.
    recs = str_split(rec_text, "\n")[[1]]
    
    # get paragraph numbers
    para1 = str_extract(recs[1], "\\d+")
    para1 = as.numeric(para1)
    para2 = para1 + 1
    para3 = para2 + 1
    
    # chunk recommendations
    accept_recs = recs[str_starts(recs, as.character(para1))]
    examine_recs = recs[str_starts(recs, as.character(para2))]
    reject_recs = recs[str_starts(recs, as.character(para3))]
    
    # remove the first item from each list, which just demarcates the sections
    accept_recs = accept_recs[-1]
    examine_recs = examine_recs[-1]
    reject_recs = reject_recs[-1]
    
    # transform to dataframe
    recs_df <- list(accept = accept_recs,
                        examine = examine_recs,
                        reject = reject_recs)
    
    recs_df <- stack(recs_df) %>%
      select("text" = values, "decision" = ind)
    
    recs_df <- recs_df %>%
      mutate(to = country,
             year = year)
    
    # add from column
    recs_df <- recs_df %>% mutate(
      from = map_chr(text, get_country)
    )
    
    # return the parsed dataframe explicitly
    return(recs_df)
}
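Two steps inside parse_file are easy to misread: the negative indices passed to str_sub, and the call to stack. The sketch below walks through both on toy inputs. The file name format (country, four-digit year, ".txt") is an assumption based on the output shown earlier, and the toy recommendations are made up.

```r
library(stringr)

# Assumed file name format: country name, four-digit year, ".txt"
file_name <- "cotedivoire2014.txt"

# Negative indices count from the end of the string
country <- str_sub(file_name, 1, -9)  # everything before the last 8 characters
year <- str_sub(file_name, -8, -5)    # the 4 characters just before ".txt"
country
#> [1] "cotedivoire"
year
#> [1] "2014"

# stack() turns a named list of vectors into a long dataframe with the
# columns `values` (the items) and `ind` (the list name each item came from)
toy <- stack(list(accept = c("rec a", "rec b"), reject = "rec c"))
names(toy)
#> [1] "values" "ind"
```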

4.2 Map the function

Apply the function you created above to all rows in all_texts using your map_ skills. The final output should contain a dataframe of all the recommendations from all the files.

all_recs <- map_dfr(1:nrow(all_texts), parse_file)
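To see what map_dfr is doing here, the following sketch applies a toy function to a few indices. make_rows is a hypothetical stand-in for parse_file: it returns one small dataframe per index, and map_dfr row-binds the pieces into a single dataframe.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for parse_file(): index i yields a dataframe with i rows
make_rows <- function(i) {
  tibble(doc = i, rec = paste0("rec ", seq_len(i)))
}

# map_dfr() calls make_rows() on each index and stacks the results by row
toy_all <- map_dfr(1:3, make_rows)
dim(toy_all)
#> [1] 6 2
```

When the number of rows might be zero, seq_len(nrow(all_texts)) is a safer index than 1:nrow(all_texts), because 1:0 counts down to c(1, 0) rather than producing an empty vector.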

4.3 Print Dimensions and Write a csv

Print the dimensions and export the full dataframe into a csv. You’re done!

dim(all_recs) # should be 1709 x 5
# write.csv(all_recs, "upr-recs.csv")

20.7 Assignment 7 Solutions

Assigned: Nov 21, 2019
Due: Dec 5, 2019 at 12:29pm.

In this week’s lecture, we introduced some tools to collect pieces of data from individual presidential documents. For this assignment, we will be looking at all documents in the database that contain the string “space exploration.” Our goals in this problem set are:

To scrape all documents returned from this search query
To organize this data into a dataframe and ultimately output a CSV file.

Below, I’ve given you the code for a function that takes the URL of an individual document, scrapes the information from that document, and returns that information in a list.

But this is all I will be providing for you. You must complete the rest of the task yourself. Specifically, you should:

Write code that scrapes all documents, organizes the information in a dataframe, and writes a csv file.
The end goal should be a dataset identical to the one I’ve provided for you in data/space.csv.
Split the code up into discrete steps, each with their own corresponding Rmarkdown chunk.
Document (i.e. describe) each step in clear but concise Rmarkdown prose.
The final chunk should:

print the structure (str) of the final data frame.
write the dataframe to a csv file.

Good luck!

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(lubridate)

scrape_docs <- function(URL){
  doc <- read_html(URL)

  speaker <- html_nodes(doc, ".diet-title a") %>% 
    html_text()
  
  date <- html_nodes(doc, ".date-display-single") %>%
    html_text() %>%
    mdy()
  
  title <- html_nodes(doc, "h1") %>%
    html_text()
  
  text <- html_nodes(doc, "div.field-docs-content") %>%
    html_text()
  
  all_info <- list(speaker = speaker, date = date, title = title, text = text)
  
  return(all_info)
}

Solution

There are likely many ways to achieve this task. Here’s one solution:

1. Write function `scrape_urls` to scrape URLs of individual search results.

The following function takes the URL of a page of search results and returns a vector of URLs, each corresponding to an individual document.

scrape_urls <- function(path) {
  
  html <- read_html(path) #Download HTML of webpage
  
  links <- html_nodes(html, ".views-field-title a") %>% #select element with document URLs
                html_attr("href")
  
  return(links) #output results
}

scrape_test <- scrape_urls("https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space+exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100")

scrape_test[1:10]
#>  [1] "/documents/special-message-the-congress-relative-space-science-and-exploration"                    
#>  [2] "/documents/statement-the-president-support-the-administration-bill-relative-space-science-and"     
#>  [3] "/documents/statement-the-president-upon-signing-the-national-aeronautics-and-space-act-1958"       
#>  [4] "/documents/annual-budget-message-the-congress-fiscal-year-1960"                                    
#>  [5] "/documents/letter-t-keith-glennan-administrator-national-aeronautics-and-space-administration"     
#>  [6] "/documents/the-presidents-news-conference-225"                                                     
#>  [7] "/documents/the-presidents-news-conference-augusta-georgia-0"                                       
#>  [8] "/documents/annual-message-the-congress-the-state-the-union-6"                                      
#>  [9] "/documents/special-message-the-congress-recommending-amendments-the-national-aeronautics-and-space"
#> [10] "/documents/special-message-the-congress-transfers-from-the-department-defense-the-national"
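The selector logic can be checked without hitting the website. The sketch below runs the same html_nodes and html_attr calls on a small, made-up HTML fragment; the fragment only mimics the structure the selector targets and is not copied from the real page.

```r
library(rvest)

# Hypothetical HTML fragment: two result links plus one unrelated link
toy_html <- read_html('<html><body>
  <div class="views-field-title"><a href="/documents/doc-one">Doc one</a></div>
  <div class="views-field-title"><a href="/documents/doc-two">Doc two</a></div>
  <a href="/about">About</a>
</body></html>')

# Select only <a> elements nested inside .views-field-title, then pull href
toy_links <- html_nodes(toy_html, ".views-field-title a") %>%
  html_attr("href")
toy_links
#> [1] "/documents/doc-one" "/documents/doc-two"
```

The unrelated "/about" link is skipped because it does not sit inside an element with the views-field-title class.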

2. Iterate over result pages to collect all URLs

scrape_urls collects all of the relative URLs from the first page of our search results (100 documents). While this is a good start, we have 4 pages of search results (310 results total) and need to collect the URLs of ALL results, from ALL result pages.

First, let’s grab the path of all 4 result pages, and store that result in an object called all_pages:

all_pages <- str_c("https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=", 0:3)

Now, we can use scrape_urls to collect the URLs from all the pages of search results. We store the results as a character vector called all_urls.

all_urls <- map(all_pages, scrape_urls) %>%
  unlist

# test -- should be 310
length(all_urls)
#> [1] 310
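The pattern above relies on two things: str_c is vectorized, so one base string combined with 0:3 yields four URLs, and map returns a list that unlist flattens into one character vector. A minimal offline sketch, using a made-up base URL and a hypothetical stand-in for scrape_urls:

```r
library(stringr)
library(purrr)

# Hypothetical base URL, used only for illustration
toy_pages <- str_c("https://example.com/search?page=", 0:3)
toy_pages
#> [1] "https://example.com/search?page=0" "https://example.com/search?page=1"
#> [3] "https://example.com/search?page=2" "https://example.com/search?page=3"

# Hypothetical stand-in for scrape_urls(): pretend each page holds two links
fake_links <- function(page) str_c(page, c("/a", "/b"))

# map() returns a list with one element per page; unlist() flattens it
toy_urls <- map(toy_pages, fake_links) %>%
  unlist()
length(toy_urls)
#> [1] 8
```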

3. Modify to Full Path

The href values we collected above are relative URLs; i.e., they look like this:

/documents/special-message-the-congress-relative-space-science-and-exploration

as opposed to having a full path, like:

https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration

The following code converts the relative paths to full paths, and saves them in an object called all_full_urls.

all_full_urls <- str_c("https://www.presidency.ucsb.edu", all_urls)
all_full_urls[1:10]
#>  [1] "https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration"                    
#>  [2] "https://www.presidency.ucsb.edu/documents/statement-the-president-support-the-administration-bill-relative-space-science-and"     
#>  [3] "https://www.presidency.ucsb.edu/documents/statement-the-president-upon-signing-the-national-aeronautics-and-space-act-1958"       
#>  [4] "https://www.presidency.ucsb.edu/documents/annual-budget-message-the-congress-fiscal-year-1960"                                    
#>  [5] "https://www.presidency.ucsb.edu/documents/letter-t-keith-glennan-administrator-national-aeronautics-and-space-administration"     
#>  [6] "https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-225"                                                     
#>  [7] "https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-augusta-georgia-0"                                       
#>  [8] "https://www.presidency.ucsb.edu/documents/annual-message-the-congress-the-state-the-union-6"                                      
#>  [9] "https://www.presidency.ucsb.edu/documents/special-message-the-congress-recommending-amendments-the-national-aeronautics-and-space"
#> [10] "https://www.presidency.ucsb.edu/documents/special-message-the-congress-transfers-from-the-department-defense-the-national"

4. Scrape Documents

Now that we have the full paths to each document, we’re ready to scrape each document.

We’ll use the scrape_docs function (given above), which accepts a URL of an individual record, scrapes the page, and returns a list containing the document’s date, speaker, title, and full text.

Using this function, we’ll iterate over all_full_urls to collect information on all the documents. We save the result as a dataframe, with each row representing a document.

Note: This might take a few minutes.

final_df <- map(all_full_urls, scrape_docs) %>%
  bind_rows()
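With a few hundred requests, a single failed page would stop the whole run. One possible way to make the loop more forgiving is sketched below; it is not part of the solution above. It wraps the scraper in purrr's possibly and pauses between requests. fake_scrape is a hypothetical stand-in for scrape_docs so the sketch runs offline, and the pause length is arbitrary.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for scrape_docs(): fails on one "URL"
fake_scrape <- function(url) {
  if (url == "bad") stop("could not read page")
  list(speaker = "Someone", title = url)
}

# possibly() returns NULL instead of raising an error when the scrape fails
safe_scrape <- possibly(fake_scrape, otherwise = NULL)

# Pause before each request to be gentle on the server
slow_scrape <- function(url, pause = 0.1) {
  Sys.sleep(pause)
  safe_scrape(url)
}

toy_results <- map(c("a", "bad", "b"), slow_scrape)

# compact() drops the NULL left by the failed page before binding rows
toy_df <- bind_rows(compact(toy_results))
nrow(toy_df)
#> [1] 2
```

In real use, fake_scrape would be replaced by scrape_docs and the toy vector by all_full_urls. Comparing the number of rows against length(all_full_urls) would show how many pages failed.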

5. Print and write

We’ll print the dataframe’s structure, write the csv, and we’re done!

str(final_df)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    310 obs. of  4 variables:
#>  $ speaker: chr  "Dwight D. Eisenhower" "Dwight D. Eisenhower" "Dwight D. Eisenhower" "Dwight D. Eisenhower" ...
#>  $ date   : Date, format: "1958-04-02" "1958-05-14" ...
#>  $ title  : chr  "Special Message to the Congress Relative to Space Science and Exploration." "Statement by the President in Support of the Administration Bill Relative to Space Science and Exploration." "Statement by the President Upon Signing the National Aeronautics and Space Act of 1958." "Annual Budget Message to the Congress: Fiscal Year 1960." ...
#>  $ text   : chr  "\n    To the Congress of the United States:\nRecent developments in long-range rockets for military purposes ha"| __truncated__ "\n    IN MY MESSAGE to Congress on space science and exploration I recommended that space science activities sp"| __truncated__ "\n    I HAVE TODAY signed H. R. 12575, the National Aeronautics and Space Act of 1958.\nThe enactment of this l"| __truncated__ "\n    To the Congress of the United States:\nThe situation we face today as a Nation differs significantly from"| __truncated__ ...
# write.csv(final_df, "data/space.csv", row.names = FALSE)