Chapter 17 Collecting Data from the Web

17.1 Introduction

There’s a ton of web data that’s useful to social scientists, including:

  • social media
  • news media
  • government publications
  • organizational records

There are two ways to get data off the web:

  1. Web APIs - i.e. application-facing, for computers
  2. Webscraping - i.e. user-facing websites for humans

Rule of Thumb: Check for an API first. If one isn’t available, scrape.

17.2 Web APIs

API stands for Application Programming Interface. Broadly defined, an API is a set of rules and procedures that facilitate interactions between computers and their applications.

A very common type of API is the Web API, which (among other things) allows users to query a remote database over the internet.

Web APIs take on a variety of formats, but the vast majority adhere to a particular style known as Representational State Transfer, or REST. What makes these “RESTful” APIs so convenient is that we can use them to query databases using URLs.

RESTful Web APIs are All Around You…

Consider a simple Google search:

Ever wonder what all that extra stuff in the address bar was all about? In this case, the full address is Google’s way of sending a query to its databases requesting information related to the search term “golden state warriors”.

In fact, it looks like Google makes its query by taking the search terms, separating each of them with a “+”, and appending them to the link “https://www.google.com/#q=”. Therefore, we should be able to actually change our Google search by adding some terms to the URL and following the general format…

Learning how to use RESTful APIs is all about learning how to format these URLs so that you can get the response you want.
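For instance, here is a minimal sketch in base R of how you might assemble such a query URL yourself, following the Google pattern described above (the search terms are just an illustration):

# hypothetical search terms
search_terms <- c("golden", "state", "warriors")

# join the terms with "+" and append them to the base link
query_url <- paste0("https://www.google.com/#q=",
                    paste(search_terms, collapse = "+"))
query_url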

17.2.1 Some Basic Terminology

Let’s get on the same page with some basic terminology:

  • Uniform Resource Locator (URL): a string of characters that, when interpreted via the Hypertext Transfer Protocol (HTTP), points to a data resource, notably files written in Hypertext Markup Language (HTML) or a subset of a database. This is often referred to as a “call”.

  • HTTP Methods/Verbs:

    • GET: requests a representation of a data resource corresponding to a particular URL. The process of executing the GET method is often referred to as a “GET request” and is the main method used for querying RESTful databases.

    • HEAD, POST, PUT, DELETE: other common methods, though rarely used for database querying.
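To make this concrete, the raw text of a GET request for the Wikipedia homepage looks roughly like the sketch below (simplified; real requests carry additional headers):

GET / HTTP/1.1
Host: www.wikipedia.org
Accept: text/html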

17.2.2 How Do GET Requests Work?

A Web Browsing Example

As you might suspect from the example above, surfing the web is basically equivalent to sending a bunch of GET requests to different servers and asking for different files written in HTML.

Suppose, for instance, I wanted to look something up on Wikipedia. My first step would be to open my web browser and type in http://www.wikipedia.org. Once I hit return, I’d see the page below.

Several different processes occurred, however, between me hitting “return” and the page finally being rendered. In order:

  1. The web browser took the entered character string, used it to construct a properly formatted HTTP GET request, and submitted it to the server that hosts the Wikipedia homepage.

  2. After receiving this request, the server sent back an HTTP response, from which the browser extracted the HTML code for the page (partially shown below).

  3. The raw HTML code was parsed and then executed by the web browser, rendering the page as seen in the window.
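The raw HTML printed below was obtained by making this same request from R. A minimal sketch of how one might reproduce it with the httr package (an assumption, since the original code isn’t shown; substr() just truncates the very long response):

library(httr)

# send a GET request to the server hosting the Wikipedia homepage
wiki <- GET("http://www.wikipedia.org")

# extract the body of the response as a character string of HTML
wiki_html <- content(wiki, "text")

# peek at the beginning of the HTML
substr(wiki_html, 1, 1000)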

#> No encoding supplied: defaulting to UTF-8.
#> [1] "<!DOCTYPE html>\n<html lang=\"mul\" class=\"no-js\">\n<head>\n<meta charset=\"utf-8\">\n<title>Wikipedia</title>\n<meta name=\"description\" content=\"Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.\">\n<![if gt IE 7]>\n<script>\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)no-js(\\s|$)/, \"$1js-enabled$2\" );\n</script>\n<![endif]>\n<!--[if lt IE 7]><meta http-equiv=\"imagetoolbar\" content=\"no\"><![endif]-->\n<meta name=\"viewport\" content=\"initial-scale=1,user-scalable=yes\">\n<link rel=\"apple-touch-icon\" href=\"/static/apple-touch/wikipedia.png\">\n<link rel=\"shortcut icon\" href=\"/static/favicon/wikipedia.ico\">\n<link rel=\"license\" href=\"//creativecommons.org/licenses/by-sa/3.0/\">\n<style>\n.sprite{background-image:url(portal/wikipedia.org/assets/img/sprite-81a290a5.png);background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-81a290a5.svg);background"

Web Browsing as a Template for RESTful Database Querying

The process of web browsing described above is a close analogue for the process of database querying via RESTful APIs, with only a few adjustments:

  1. While an HTTP client (in R, typically the httr package, which wraps the curl tool) will still be used to send GET requests to the servers hosting our databases of interest, the character string that we supply must be constructed so that the resulting request can be interpreted and successfully acted upon by the server. In particular, it is likely that the character string must encode search terms and/or filtering parameters, as well as one or more authentication codes. While the terms are often similar across APIs, most are API-specific.

  2. Unlike with web browsing, the content of the server’s response is unlikely to be HTML code. Rather, it will likely be raw text that can be parsed into one of a few formats commonly used for data storage. The usual suspects include .csv, .xml, and .json files.

  3. Whereas the web browser capably parsed and executed the HTML code, one or more facilities in R, Python, or other programming languages will be necessary for parsing the server response and converting it into a format for local storage (e.g. matrices, dataframes, databases, lists, etc.).

17.2.3 Finding APIs

More and more APIs pop up every day. Programmable Web offers a running list of APIs, and this list collects APIs that may be useful to political scientists.

Here are some APIs that may be useful to you:

  • NYT Article API: Provides metadata (title, summary, date, etc.) from all New York Times articles in their archive.
  • GeoNames geographical database: Provides lots of geographical information for all countries and other locations. The geonames package provides a wrapper for R.
  • The Manifesto Project: Provides text and other information on political party manifestos from around the world. It currently covers over 1000 parties from 1945 until today in over 50 countries on five continents. The manifestoR package provides a wrapper for R.
  • The Census Bureau: Provides datasets from the US Census Bureau. The tidycensus package allows users to interface with the US Census Bureau’s decennial Census and five-year American Community Survey APIs.

17.2.4 Getting API Access

Most APIs require a key or other user credentials before you can query their database.

Getting credentialed with an API requires that you register with the organization. Most APIs are set up for developers, so you’ll likely be asked to register an “application”. All this really entails is coming up with a name for your app/bot/project and providing your real name, organization, and email. Note that some more popular APIs (e.g. Twitter, Facebook) will require additional information, such as a web address or mobile number.

Once you’ve successfully registered, you will be assigned one or more keys, tokens, or other credentials that must be supplied to the server as part of any API call you make. To make sure that users aren’t abusing their data access privileges (e.g. by making many rapid queries), each set of keys will be given rate limits governing the total number of calls that can be made over certain intervals of time.

For example, the NYT Article API has relatively generous rate limits — 4,000 requests per day and 10 requests per minute. So we need to “sleep” 6 seconds between calls to avoid hitting the per-minute rate limit.
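A minimal sketch of how one might respect that limit in R (the loop body is a placeholder for whatever call you are making); later in this chapter we use purrr’s slowly() to do the same thing more elegantly:

for (i in 1:20) {
  # ... make one API call here ...
  Sys.sleep(6)  # pause 6 seconds so we stay under 10 requests per minute
}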

17.2.5 Using APIs in R

There are two ways to collect data through APIs in R.

  1. Plug-n-play packages

Many common APIs are available through user-written R Packages. These packages offer functions that “wrap” API queries and format the response. These packages are usually much more convenient than writing our own query, so it’s worth searching around for a package that works with the API we need.

  2. Writing our own API request

If no wrapper function is available, we have to write our own API request, and format the response ourselves using R. This is trickier, but definitely do-able.

17.3 Collecting Twitter Data with RTweet

Twitter actually has two separate APIs:

  1. The REST API allows you to read and write Twitter data. For research purposes, this allows you to search the recent history of tweets and look up specific users.
  2. The Streaming API allows you to access public data flowing through Twitter in real-time. It requires your R session to be running continuously, but allows you to capture a much larger sample of tweets while avoiding rate limits for the REST API.

There are several packages for R for accessing and searching Twitter. In this unit, we’ll practice using the RTweet library, which allows us to easily collect data from Twitter’s REST and stream APIs.
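As a preview, here is a minimal sketch of what a call to each API looks like with rtweet (assuming you have already authorized the package as described below; the query and timeout values are just illustrations):

library(rtweet)

# REST API: search the recent history of tweets
recent <- search_tweets(q = "political science", n = 100, include_rts = FALSE)

# Streaming API: capture tweets in real time for 30 seconds
live <- stream_tweets(q = "political science", timeout = 30)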

17.3.1 Setting up RTweet

To use RTweet, follow these steps:

  1. If you don’t have a Twitter account, create one here.
  2. Install the RTweet package from CRAN.
  3. Load the package into R.
  4. Send a request to Twitter’s API by calling any of the package’s functions, like search_tweets or get_timeline.
  5. Approve the browser popup (to authorize the rstats2twitter app).
  6. Now, you’re ready to use RTweet!

Let’s go ahead and load RTweet along with some other helpful functions:

library(tidyverse)
library(rtweet)
library(lubridate)
library(kableExtra)

17.3.2 UChicago Political Science Prof Tweets

Let’s explore the RTweet package to see what we can learn about the tweeting habits of UChicago Political Science faculty.

The function get_timeline pulls the n most recent tweets from one or more given handles. To pull tweets from multiple handles, pass a vector of handles to the user argument.

Let’s pull tweets from five faculty members in the department.

profs <- get_timeline(
  user = c("carsonaust", "profpaulpoast", "pstanpolitics", "rochelleterman", "bobbygulotty"),
  n = 1000
)
kable(head(profs))
(Table output omitted: each tweet is returned as a row with roughly 90 columns, including user_id, status_id, created_at, screen_name, text, source, favorite_count, retweet_count, hashtags, urls_expanded_url, mentions_screen_name, and extensive quote, retweet, geo, and user-profile metadata. The rendered table is far too wide to reproduce legibly here.)

Now, let’s visualize which professors are tweeting the most, by week.

profs %>%
  group_by(screen_name) %>%
  mutate(created_at = as.Date(created_at)) %>%
  filter(created_at >= "2019-06-15") %>%
  ts_plot(by = "week")

17.3.3 Hashtags and Text Strings

We can also use RTweet to explore certain hashtags or text strings.

Let’s take Duke Ellington again – we can use search_tweets to pull the most recent n number of tweets that include the hashtag #DukeEllington or the string "Duke Ellington".

Hashtag Challenge

Using the documentation for search_tweets as a guide, try pulling the 2,000 most recent tweets that include #DukeEllington or "Duke Ellington" – be sure to exclude retweets from the query.

  1. Why didn’t your query return 2,000 results?

  2. Identify the user that has used either the hashtag or the string in the greatest number of tweets – where is this user from?

duke <- search_tweets(
  q = '#DukeEllington OR "Duke Ellington"',
  n = 2000,
  include_rts = FALSE
)

duke %>%
  group_by(user_id, location) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
#> # A tibble: 713 x 3
#> # Groups:   user_id [713]
#>   user_id             location                           n
#>   <chr>               <chr>                          <int>
#> 1 2561373848          Florida, USA                      99
#> 2 764290456599396353  Rochester, New York               46
#> 3 1022122568004894720 ""                                25
#> 4 1030069194539368448 ""                                25
#> 5 826869114           Paris, Ile-de-France              23
#> 6 714738130557992960  Chiclana de la Frontera, Spain    20
#> # … with 707 more rows

17.4 Writing API Queries

If no wrapper package is available, we have to write our own API query, and format the response ourselves using R. This is trickier, but definitely do-able.

In this unit, we’ll practice constructing our own API queries using the New York Times Article API. This API provides metadata (title, date, summary, etc.) on all New York Times articles.

Fortunately, this API is very well documented!

You can even try it out here.

Load the following packages to get started:

library(tidyverse)
library(httr)
library(jsonlite)
library(lubridate)

17.4.1 Constructing the API GET Request

Likely the most challenging part of using web APIs is learning how to format your GET request URLs. While there are common architectures for such URLs, each API has its own unique quirks. For this reason, carefully reviewing the API documentation is critical.

Most GET request URLs for API querying have three or four components:

  1. Authentication Key/Token: a user-specific character string appended to a base URL telling the server who is making the query; allows servers to efficiently manage database access

  2. Base URL: a link stub that will be at the beginning of all calls to a given API; points the server to the location of an entire database

  3. Search Parameters: a character string appended to a base URL that tells the server what to extract from the database; basically a series of filters used to point to specific parts of a database

  4. Response Format: a character string indicating how the response should be formatted; usually one of .csv, .json, or .xml

Let’s go ahead and store these values as variables:

key <- "Onz0BobMTn2IRJ7krcT5RXHknkGLqiaI"
base.url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
search_term <- "John Mearsheimer"

How did I know the base.url? I read the documentation. Notice that this base.url also includes the response format (.json), so we don’t need to specify that separately.

We’re ready to make the request. We can use the GET function from the httr package (part of the tidyverse ecosystem) to make an HTTP GET request.

r <- GET(base.url, query = list(`q` = search_term,
                                `api-key` = key))

Now, we have an object called r. We can get all the information we need from this object. For instance, we can see that the URL has been correctly encoded by printing the URL. Click on the link to see what happens.

r$url
#> [1] "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=John%20Mearsheimer&api-key=Onz0BobMTn2IRJ7krcT5RXHknkGLqiaI"

Challenge 1: Adding a date range

What if we only want to search within a particular date range? The NYT Article API allows us to specify start and end dates.

Alter the GET request code above so that the request only searches for articles from the year 2005.

You’ll need to look at the documentation here to see how to do this.

Challenge 2: Specifying a results page

The above will return the first 10 results. To get the next ten, you need to add a “page” parameter. Change the search parameters above to get the second 10 results.

17.4.2 Parsing the response

We can read the content of the server’s response using the content() function.

response <- content(r, "text")
substr(response, start = 1, stop = 1000)
#> [1] "{\"status\":\"OK\",\"copyright\":\"Copyright (c) 2019 The New York Times Company. All Rights Reserved.\",\"response\":{\"docs\":[{\"abstract\":\"Dr John J Mearsheimer of University of Chicago and Dr Stephen M Walt of John F Kennedy School of Government at Harvard University publish controversial essay on relationship between United States and Israel; hold that relationship has made Americans target of terrorists and that pro-Israel lobby is extremely powerful in shaping US agenda; many news outlets and academics have condemned paper, but it has been praised in some areas of world; content of paper and debate over its merits discussed; photo (M)\",\"web_url\":\"https://www.nytimes.com/2006/04/12/education/essay-stirs-debate-about-influence-of-a-jewish-lobby.html\",\"snippet\":\"Dr John J Mearsheimer of University of Chicago and Dr Stephen M Walt of John F Kennedy School of Government at Harvard University publish controversial essay on relationship between United States and Israel; hold that relationship has "

What you see here is JSON text, encoded as plain text. JSON stands for “JavaScript Object Notation.” Think of JSON as a nested structure built on key/value pairs and arrays.
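For intuition, here is a tiny made-up JSON snippet with the same kind of nesting we will see from the NYT API: an object of key/value pairs whose “docs” key holds an array of further objects:

{
  "status": "OK",
  "response": {
    "docs": [
      {"headline": "Example article", "pub_date": "2005-01-01"},
      {"headline": "Another article", "pub_date": "2005-02-01"}
    ],
    "meta": {"hits": 2}
  }
}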

We want to convert the results from JSON format to something easier to work with – notably a data frame.

The jsonlite package provides several easy conversion functions for moving between JSON and vectors, data.frames, and lists. Let’s use the function fromJSON to convert this response into something we can work with:

# Convert JSON response to a dataframe
response_df <- fromJSON(response, simplifyDataFrame = TRUE, flatten = TRUE)

# Inspect the dataframe
str(response_df, max.level = 2)
#> List of 3
#>  $ status   : chr "OK"
#>  $ copyright: chr "Copyright (c) 2019 The New York Times Company. All Rights Reserved."
#>  $ response :List of 2
#>   ..$ docs:'data.frame': 10 obs. of  27 variables:
#>   ..$ meta:List of 3

That looks intimidating! But it’s really just a big, nested list. Let’s see what we got in there.

names(response_df)
#> [1] "status"    "copyright" "response"

# This is boring
response_df$status
#> [1] "OK"

# So is this
response_df$copyright
#> [1] "Copyright (c) 2019 The New York Times Company. All Rights Reserved."

# This is what we want!
names(response_df$response)
#> [1] "docs" "meta"

Within response_df$response, we can extract a number of interesting results, including the number of total hits, as well as information on the first ten documents:

# What's in 'meta'?
response_df$response$meta
#> $hits
#> [1] 168
#> 
#> $offset
#> [1] 0
#> 
#> $time
#> [1] 288

# pull out number of hits
response_df$response$meta$hits
#> [1] 168

# Check out docs
names(response_df$response$docs)
#>  [1] "abstract"                "web_url"                
#>  [3] "snippet"                 "lead_paragraph"         
#>  [5] "print_section"           "print_page"             
#>  [7] "source"                  "multimedia"             
#>  [9] "keywords"                "pub_date"               
#> [11] "document_type"           "news_desk"              
#> [13] "section_name"            "type_of_material"       
#> [15] "_id"                     "word_count"             
#> [17] "uri"                     "headline.main"          
#> [19] "headline.kicker"         "headline.content_kicker"
#> [21] "headline.print_headline" "headline.name"          
#> [23] "headline.seo"            "headline.sub"           
#> [25] "byline.original"         "byline.person"          
#> [27] "byline.organization"

# put it in another variable
docs <- response_df$response$docs

17.4.3 Iterating Through Results Pages

That’s great. But we only have 10 items. The original response said we had 168 hits! Which means we have to make 168/10, rounded up to 17, requests to get them all. Sounds like a job for iteration!

First, let’s write a function that takes a search term and a page number, and returns a dataframe of articles.

nytapi <- function(term = NULL, n){
    base.url = "http://api.nytimes.com/svc/search/v2/articlesearch.json" 
    key = "Onz0BobMTn2IRJ7krcT5RXHknkGLqiaI"
    
    # Send GET request
    r <- GET(base.url, query = list(`q` = term,
                                  `api-key` = key,
                                  `page` = n))
    
    # Parse the JSON response
    response <- content(r, "text")  
    response_df <- fromJSON(response, simplifyDataFrame = T, flatten = T)
    
    print(paste("Scraping page: ", as.character(n)))
    
    return(response_df$response$docs)
}

docs <- nytapi("John Mearsheimer", 2)
#> [1] "Scraping page:  2"

Now, we’re ready to iterate over each page. First, let’s review what we’ve done so far:

# set key and base
base.url = "http://api.nytimes.com/svc/search/v2/articlesearch.json" 
key = "Onz0BobMTn2IRJ7krcT5RXHknkGLqiaI"
search_term = "John Mearsheimer" #change me
  
# Send GET request
r <- GET(base.url, query = list(`q` = search_term,
                                `api-key` = key))
  
# Parse the JSON response
response <- content(r, "text")  
response_df <- fromJSON(response, simplifyDataFrame = T, flatten = T)

# extract hits
hits = response_df$response$meta$hits

# get number of pages
pages = ceiling(hits/10)

# modify function to sleep
nytapi_slow <- slowly(nytapi, rate = rate_delay(6))

# iterate over pages, getting all docs
docs_list <- map((1:pages), ~nytapi_slow(term = search_term, n = .))
#> [1] "Scraping page:  1"
#> [1] "Scraping page:  2"
#> [1] "Scraping page:  3"
#> [1] "Scraping page:  4"
#> [1] "Scraping page:  5"
#> [1] "Scraping page:  6"
#> [1] "Scraping page:  7"
#> [1] "Scraping page:  8"
#> [1] "Scraping page:  9"
#> [1] "Scraping page:  10"
#> [1] "Scraping page:  11"
#> [1] "Scraping page:  12"
#> [1] "Scraping page:  13"
#> [1] "Scraping page:  14"
#> [1] "Scraping page:  15"
#> [1] "Scraping page:  16"
#> [1] "Scraping page:  17"

# flatten to create one big dataframe
docs_df <- bind_rows(docs_list)

17.4.4 Visualizing Results

To see how coverage of John Mearsheimer has changed over time, all we need to do is add an indicator for the year and month each article was published.

# Format pub_date using lubridate
docs_df$date <- ymd_hms(docs_df$pub_date)

by_month <- docs_df %>% group_by(floor_date(date, "month")) %>%
  summarise(count  = n()) %>%
  rename(month = 1)

by_month %>%
  ggplot(aes(x = month, y = count)) +
  geom_point() +
  theme_bw() + 
  xlab(label = "Date") +
  ylab(label = "Article Count") +
  ggtitle(label = "Coverage of John Mearsheimer")

17.4.5 More resources

The documentation for httr includes two useful vignettes:

  1. httr quickstart guide - summarizes the basic httr functions used above
  2. Best practices for writing an API package - document outlining the key issues involved in writing API wrappers in R

17.5 Webscraping

If no API is available, we can scrape the website directly. Webscraping has a number of benefits and challenges compared to APIs:

Webscraping Benefits

  • Any content that can be viewed on a webpage can be scraped. Period.
  • No API needed
  • No rate-limiting or authentication (usually)

Webscraping Challenges

  • Rarely tailored for researchers
  • Messy, unstructured, inconsistent
  • Entirely site-dependent

Some Disclaimers

  • Check a site’s terms and conditions before scraping.
  • Be nice - don’t hammer the site’s server. Review these ethical webscraping tips
  • Sites change their layout all the time. Your scraper will break.

17.5.1 What’s a website?

A website is some combination of codebase and database. The “front end” product is HTML + CSS stylesheets + JavaScript, looking something like this:

Your browser turns that into a nice layout.

17.5.2 HTML

The core of a website is HTML (Hyper Text Markup Language). HTML is composed of a tree of HTML nodes (or elements), such as headers, paragraphs, and lists.

<!DOCTYPE html>
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p>Hello world!</p>
    </body>
</html>

HTML elements can contain other elements:

Generally speaking, an HTML element has three components:

  1. Tags (starting and ending the element)
  2. Attributes (giving information about the element)
  3. Text, or Content (the text inside the element)
knitr::include_graphics(path = "img/html-element.png")

HTML: Tags

Common HTML tags

Tag                     Meaning
<head>                  page header (metadata, etc.)
<body>                  holds all of the content
<p>                     regular text (paragraph)
<h1>, <h2>, <h3>        header text, levels 1, 2, 3
<ol>, <ul>, <li>        ordered list, unordered list, list item
<a href="page.html">    link to “page.html”
<table>, <tr>, <td>     table, table row, table item
<div>, <span>           general containers

HTML Attributes

  • HTML elements can have attributes.
  • Attributes provide additional information about an element.
  • Attributes are always specified in the start tag.
  • Attributes come in name/value pairs like: name="value"

  • Sometimes we can find the data we want just by using HTML tags or attributes (e.g., all the <a> tags)
  • More often, this isn’t enough: there might be 1000 <a> tags on a page, but maybe we want only the <a> tags inside of a <p> tag (see the sketch below)
  • Enter CSS
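Here is a minimal sketch of that difference using rvest (which we will meet properly in the scraping section below); the HTML string is made up for illustration:

library(rvest)

# a tiny, made-up page with two links
page <- read_html("<div><p>See the <a href='a.html'>report</a></p><a href='b.html'>other link</a></div>")

# every <a> tag on the page (returns both links)
html_nodes(page, "a")

# only <a> tags inside a <p> tag, using a CSS selector (returns just the first)
html_nodes(page, "p a")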

17.5.3 CSS

CSS stands for Cascading Style Sheets. CSS defines how HTML elements are to be displayed.

HTML came first. But it was only meant to define content, not format it. While HTML contains presentational tags like <font>, styling a site this way is very inefficient.

To solve this problem, CSS was created specifically to describe how content on a webpage should be displayed. Now, one can change the look of an entire website just by changing one file.

Most web designers litter the HTML markup with tons of classes and ids to provide “hooks” for their CSS.

You can piggyback on these to jump to the parts of the markup that contain the data you need.

CSS Anatomy

  • Selectors
    • Element selector: p
    • Class selector: .blue (matches elements like <p class="blue">)
    • ID selector: #blue (matches the element with id="blue")
  • Declarations
    • Selector: p
    • Property: background-color
    • Value: yellow
  • Hooks
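Putting those pieces together, a minimal made-up CSS rule looks like this; p and p.blue are the selectors, background-color is the property, and yellow/lightblue are the values:

/* style every paragraph */
p {
  background-color: yellow;
}

/* style only paragraphs with class="blue" */
p.blue {
  background-color: lightblue;
}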

17.5.3.1 CSS + HTML

<body>
    <table id="content">
        <tr class='name'>
            <td class='firstname'>
                Kurtis
            </td>
            <td class='lastname'>
                McCoy
            </td>
        </tr>
        <tr class='name'>
            <td class='firstname'>
                Leah
            </td>
            <td class='lastname'>
                Guerrero
            </td>
        </tr>
    </table>
</body>

Challenge 1

Find the CSS selectors for the following elements in the HTML above.

(Hint: There will be multiple solutions for each)

  1. The entire table
  2. The row containing “Kurtis McCoy”
  3. Just the element containing first names

17.5.4 Finding Elements with Selector Gadget

Selector Gadget is a browser plugin to help you find HTML elements. Install Selector Gadget on your browser by following these instructions.

Once installed, run Selector Gadget and simply click on the type of information you want to select from the webpage. Once this is selected, you can then click the pieces of information you don’t want to keep. Do this until only the pieces you want to keep remain highlighted, then copy the selector from the bottom pane.

Here’s the basic strategy of webscraping:

  1. Use Selector Gadget to see how your data is structured
  2. Pay attention to HTML tags and CSS selectors
  3. Pray that there is some kind of pattern
  4. Use R and add-on packages like RVest to extract just that data.

Challenge 2

Go to http://rochelleterman.github.io/. Using Selector Gadget,

  1. Find the CSS selector capturing all rows in the table.
  2. Find the image source URL.
  3. Find the HREF attribute of the link.

17.6 Scraping Presidential Statements

To demonstrate webscraping in R, we’re going to collect records on presidential statements here: https://www.presidency.ucsb.edu/

Let’s say we’re interested in how presidents speak about “space exploration”. On the website, we punch in this search term, and we get the following 295 results.

Our goal is to scrape these records and store pertinent information in a dataframe. We will be doing this in two steps:

  1. Write a function to scrape each individual record page (these notes).
  2. Use this function to loop through all results, and collect all pages (homework).

Load the following packages to get started:

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(knitr)

17.6.1 Using RVest to Read HTML

The package RVest allows us to:

  1. Collect the HTML source code of a webpage
  2. Read the HTML of the page
  3. Select and keep certain elements of the page that are of interest

Let’s start with step one. We use the read_html function to call a document’s URL and grab the HTML response. Store this result as an object.

document1 <- read_html("https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration")

#Let's take a look at the object we just created
document1
#> {html_document}
#> <html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
#> [1] <head profile="http://www.w3.org/1999/xhtml/vocab">\n<meta charset=" ...
#> [2] <body class="html not-front not-logged-in one-sidebar sidebar-first  ...

This is pretty messy. We need to use RVest to make this information more usable.

17.6.2 Find Page Elements

RVest has a number of functions to find information on a page. Like other webscraping tools, RVest lets you find elements by their:

  1. HTML tags
  2. HTML Attributes
  3. CSS Selectors

Let’s search first for HTML tags.

The function html_nodes searches a parsed HTML object to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

html_nodes(document1, "a")
#> {xml_nodeset (78)}
#>  [1] <a href="#main-content" class="element-invisible element-focusable" ...
#>  [2] <a href="https://www.presidency.ucsb.edu/">The American Presidency  ...
#>  [3] <a class="btn btn-default" href="https://www.presidency.ucsb.edu/ab ...
#>  [4] <a class="btn btn-default" href="/advanced-search"><span class="gly ...
#>  [5] <a href="https://www.ucsb.edu/" target="_blank"><img alt="ucsb word ...
#>  [6] <a href="/documents" class="active-trail dropdown-toggle" data-togg ...
#>  [7] <a href="/documents/presidential-documents-archive-guidebook">Guide ...
#>  [8] <a href="/documents/category-attributes">Category Attributes</a>
#>  [9] <a href="/documents/proclamation-9895-national-maritime-day-2019">D ...
#> [10] <a href="/statistics">Statistics</a>
#> [11] <a href="/media" title="">Media Archive</a>
#> [12] <a href="/presidents" title="">Presidents</a>
#> [13] <a href="/analyses" title="">Analyses</a>
#> [14] <a href="https://giving.ucsb.edu/Funds/Give?id=185" title="">Suppor ...
#> [15] <a href="/documents/presidential-documents-archive-guidebook" title ...
#> [16] <a href="/documents" title="" class="active-trail">Categories</a>
#> [17] <a href="/documents/category-attributes" title="">Attributes</a>
#> [18] <a href="/documents/app-categories/presidential" title="Presidentia ...
#> [19] <a href="/documents/app-categories/spoken-addresses-and-remarks/pre ...
#> [20] <a href="/documents/app-categories/spoken-addresses-and-remarks/pre ...
#> ...

That’s a lot of results! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the a tag, you’re likely to get a lot of stuff, much of which you don’t want.

In our case, we only want the links corresponding to the speaker Dwight D. Eisenhower.

Using Selector Gadget, we found out that the CSS selector for the document’s speaker is .diet-title a.

We can then modify our argument in html_nodes to look for this more specific CSS selector.

html_nodes(document1, ".diet-title a")
#> {xml_nodeset (1)}
#> [1] <a href="/people/president/dwight-d-eisenhower">Dwight D. Eisenhower ...

17.6.3 Get Attributes and Text of Elements

Once we identify elements, we want to access information in that element. Oftentimes this means two things:

  1. Text
  2. Attributes

Getting the text inside an element is pretty straightforward. We can use the html_text() command inside of RVest to get the text of an element:

#Scrape individual document page
document1 <- read_html("https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration")

#identify element with Speaker name
speaker <- html_nodes(document1, ".diet-title a") %>% 
  html_text() #select text of element

speaker
#> [1] "Dwight D. Eisenhower"

You can access a tag’s attributes using html_attr. For example, we often want to get a URL from an a (link) element. This is the URL the link “points” to. It’s contained in the attribute href:

speaker_link <- html_nodes(document1, ".diet-title a") %>% 
  html_attr("href")

speaker_link
#> [1] "/people/president/dwight-d-eisenhower"

17.6.4 Let’s DO this.

Believe it or not, that’s all you need to scrape a website. Let’s apply these skills to scrape a sample document from the UCSB website – the first item in our search results.

We’ll collect the document’s date, speaker, title, and full text.

  1. Date
document1 <- read_html("https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration")

date <- html_nodes(document1, ".date-display-single") %>%
  html_text() %>% # grab element text
  mdy() #format using lubridate
date
#> [1] "1958-04-02"
  2. Speaker
#Speaker
speaker <- html_nodes(document1, ".diet-title a") %>%
  html_text()
speaker
#> [1] "Dwight D. Eisenhower"
  3. Title
#Title
title <- html_nodes(document1, "h1") %>%
  html_text()
title
#> [1] "Special Message to the Congress Relative to Space Science and Exploration."
  4. Text
#Text
text <- html_nodes(document1, "div.field-docs-content") %>%
          html_text()

#this is a long document, so let's just display the first 1000 characters
text %>% substr(1, 1000) 
#> [1] "\n    To the Congress of the United States:\nRecent developments in long-range rockets for military purposes have for the first time provided man with new machinery so powerful that it can put satellites into orbit, and eventually provide the means for space exploration. The United States of America and the Union of Soviet Socialist Republics have already successfully placed in orbit a number of earth satellites. In fact, it is now within the means of any technologically advanced nation to embark upon practicable programs for exploring outer space. The early enactment of appropriate legislation will help assure that the United States takes full advantage of the knowledge of its scientists, the skill of its engineers and technicians, and the resourcefulness of its industry in meeting the challenges of the space age.\nDuring the past several months my Special Assistant for Science and Technology and the President's Science Advisory Committee, of which he is the Chairman, have been conductin"

17.6.5 Challenge 1: Make a function

Make a function called scrape_docs that accepts a URL of an individual document, scrapes the page, and returns a list containing the document’s date, speaker, title, and full text.

This involves:

  • Requesting the HTML of the webpage using the full URL and RVest.
  • Using RVest to locate all elements on the page we want to save.
  • Storing each of these items into a list.
  • Returning this list.
scrape_docs <- function(URL){

  # YOUR CODE HERE
  
}

# uncomment to test
# scrape_docs("https://www.presidency.ucsb.edu/documents/letter-t-keith-glennan-administrator-national-aeronautics-and-space-administration")