Chapter 18 Text Analysis

This unit focuses on computational text analysis (or “text-as-data”). We will explore:

  1. Preprocessing a corpus for common text analysis.
  2. Sentiment Analysis and Dictionary Methods, a simple, supervised method for classification.
  3. Distinctive Words, or word-separating techniques to compare corpora.
  4. Structural Topic Models, a popular unsupervised method for text exploration and analysis.

These materials are based on a longer, week-long intensive workshop on computational text analysis. If you are interested in text-as-data, I encourage you to work through the full workshop materials on your own: https://github.com/rochelleterman/FSUtext

18.1 Preprocessing

First let’s load our required packages:

library(tm) # Framework for text mining
library(tidyverse) # Data preparation and pipes %>%
library(ggplot2) # For plotting word frequencies
library(wordcloud) # Wordclouds!

A corpus is a collection of texts, usually stored electronically, on which we perform our analysis. A corpus might be a collection of news articles from Reuters or the published works of Shakespeare.

Within each corpus we will have separate articles, stories, volumes, etc., each treated as a separate entity or record. Each unit is called a document.

For this unit, we will be using a section of Machiavelli’s Prince as our corpus. Since The Prince is a monograph, we have already “chunked” the text so that each short paragraph or “chunk” is considered a “document.”

18.1.1 From Words to Numbers

Corpus Readers

The tm package supports a variety of sources and formats. Run the code below to see what it includes.

getSources()
#> [1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"   
#> [5] "XMLSource"       "ZipSource"
getReaders()
#>  [1] "readDataframe"           "readDOC"                
#>  [3] "readPDF"                 "readPlain"              
#>  [5] "readRCV1"                "readRCV1asPlain"        
#>  [7] "readReut21578XML"        "readReut21578XMLasPlain"
#>  [9] "readTagged"              "readXML"

Here we will be reading documents from a CSV file in which each row is a document that includes columns for text and metadata (information about each document). This is the easiest option if you have metadata.

docs.df <- read.csv("data/mach.csv", header=TRUE) # Read in CSV file
docs.df <- docs.df %>%
  mutate(text = str_conv(text, "UTF-8"))
docs <- Corpus(VectorSource(docs.df$text))
docs
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 188

Once we have the corpus, we can inspect the documents using inspect().

# See the 16th document
inspect(docs[16])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 1
#> 
#> [1]  Therefore, since a ruler cannot both practise this virtue of generosity and be known to do so without harming himself, he would do well not to worry about being called miserly. For eventually he will come to be considered more generous, when it is realised that, because of his parsimony, his revenues are sufficient to defend himself against any enemies that attack him, and to undertake campaigns without imposing special taxes on the people. Thus he will be acting generously towards the vast majority, whose property he does not touch, and will be acting meanly towards the few to whom he gives nothing.  Those rulers who have achieved great things in our own times have all been considered mean; all the others have failed. Although Pope Julius II cultivated a reputation for generosity in order to become pope, he did not seek to maintain it afterwards, because he wanted to be

Preprocessing Functions

Many text analysis applications follow a similar “recipe” for preprocessing, involving some or all of the following steps (the exact order may differ from application to application):

  1. Tokenizing the text to unigrams (or bigrams, or trigrams).
  2. Converting all characters to lowercase.
  3. Removing punctuation.
  4. Removing numbers.
  5. Removing Stop Words, including custom stop words.
  6. “Stemming” words, or lemmatizing them. There are several stemming algorithms; Porter is the most popular.
  7. Creating a Document-Term Matrix.

tm lets us convert a corpus to a DTM while applying these preprocessing steps in a single call.

dtm <- DocumentTermMatrix(docs,
           control = list(stopwords = TRUE,
                          tolower = TRUE,
                          removeNumbers = TRUE,
                          removePunctuation = TRUE,
                          stemming=TRUE))
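
The same steps can also be applied one at a time with tm_map, which is useful when you want to inspect intermediate results or add custom stop words. Below is a minimal sketch: the object names (docs.clean, dtm.manual) and the custom stop word “chapter” are illustrative choices, not part of the recipe above.

# Step-by-step preprocessing with tm_map (a sketch)
docs.clean <- docs %>%
  tm_map(content_transformer(tolower)) %>%                    # Lowercase
  tm_map(removeWords, c(stopwords("english"), "chapter")) %>% # Stop words (plus a custom one), before punctuation so contractions like "don't" still match
  tm_map(removePunctuation) %>%                               # Remove punctuation
  tm_map(removeNumbers) %>%                                   # Remove numbers
  tm_map(stemDocument) %>%                                    # Porter stemming
  tm_map(stripWhitespace)                                     # Collapse leftover whitespace

dtm.manual <- DocumentTermMatrix(docs.clean)                  # Tokenize to unigrams and build the DTM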

Weighting

One common preprocessing step that some applications call for is applying tf-idf weights. The tf-idf, or term frequency-inverse document frequency, is a weight that reflects how important a term is to a document within a corpus. The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the word's frequency across the corpus, which adjusts for the fact that some words are simply common in general. In other words, it emphasizes terms that are frequent in a given document but rare in the corpus as a whole.

dtm.weighted <- DocumentTermMatrix(docs,
           control = list(weighting =function(x) weightTfIdf(x, normalize = TRUE),
                          stopwords = TRUE,
                          tolower = TRUE,
                          removeNumbers = TRUE,
                          removePunctuation = TRUE,
                          stemming=TRUE))

Compare the first 5 rows and 5 columns of the dtm and dtm.weighted. What do you notice?

inspect(dtm[1:5,1:5])
#> <<DocumentTermMatrix (documents: 5, terms: 5)>>
#> Non-/sparse entries: 3/22
#> Sparsity           : 88%
#> Maximal term length: 7
#> Weighting          : term frequency (tf)
#> Sample             :
#>     Terms
#> Docs abandon abil abject abl ablest
#>    1       0    0      0   0      0
#>    2       0    1      0   0      0
#>    3       0    0      0   0      0
#>    4       0    1      0   1      0
#>    5       0    0      0   0      0
inspect(dtm.weighted[1:5,1:5])
#> <<DocumentTermMatrix (documents: 5, terms: 5)>>
#> Non-/sparse entries: 3/22
#> Sparsity           : 88%
#> Maximal term length: 7
#> Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
#> Sample             :
#>     Terms
#> Docs abandon   abil abject    abl ablest
#>    1       0 0.0000      0 0.0000      0
#>    2       0 0.0402      0 0.0000      0
#>    3       0 0.0000      0 0.0000      0
#>    4       0 0.0310      0 0.0228      0
#>    5       0 0.0000      0 0.0000      0
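
To see where these weighted values come from, here is a minimal hand computation of tf-idf on a tiny invented matrix. It is a sketch that assumes tm’s default formulation for weightTfIdf (counts divided by document length, and a base-2 logarithm for the inverse document frequency), not a replacement for the code above.

# Toy tf-idf computation (the 2-document matrix is invented for illustration)
toy <- matrix(c(2, 0, 1,
                1, 1, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("ruler", "war", "peopl")))

tf    <- toy / rowSums(toy)                  # Normalized term frequency
idf   <- log2(nrow(toy) / colSums(toy > 0))  # Inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)              # Weight each term (column) by its idf
tfidf                                        # "ruler" appears in both documents, so its weight is 0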

18.1.2 Exploring the DTM

Dimensions

Let’s look at the structure of our DTM. Print the dimensions of the DTM. How many documents do we have? How many terms?

# How many documents? How many terms?
dim(dtm)
#> [1]  188 2368

Frequencies

We can obtain the term frequencies as a vector by converting the document term matrix into a matrix and using colSums to sum the column counts.

# Sum term counts across all documents
freq <- colSums(as.matrix(dtm))
freq[1:5]
#> abandon    abil  abject     abl  ablest 
#>       4      35       1      61       1
length(freq)
#> [1] 2368

By ordering the frequencies, we can list the most frequent terms and the least frequent terms.

# Order
sorted <- sort(freq, decreasing = T)

# Most frequent terms
head(sorted)
#> ruler  will power   one peopl alway 
#>   280   251   169   168    98    95

# Least frequent
tail(sorted)
#>  xxiv   xxv  xxvi yield  yoke youth 
#>     1     1     1     1     1     1

Plotting Frequencies

Let’s make a plot that shows the frequency of frequencies for the terms. (For example, how many words are used only once? 5 times? 10 times?)

# Frequency of frequencies
head(table(freq),15)
#> freq
#>   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
#> 988 363 202 140 103  73  55  39  33  29  24  22  20  20  19
tail(table(freq),15)
#> freq
#>  65  68  70  71  73  74  75  76  77  95  98 168 169 251 280 
#>   1   1   1   1   2   1   1   1   1   2   1   1   1   1   1

# Plot
plot(table(freq))

What does this tell us about the nature of language?

Exploring Common Words

The tm package has lots of useful functions to help you explore common words and associations:

# Have a look at common words
findFreqTerms(dtm, lowfreq=50) # Words that appear at least 50 times
#>  [1] "abl"      "act"      "alway"    "armi"     "becom"    "can"     
#>  [7] "consid"   "either"   "forc"     "great"    "king"     "maintain"
#> [13] "make"     "man"      "mani"     "men"      "much"     "must"    
#> [19] "never"    "new"      "one"      "order"    "other"    "peopl"   
#> [25] "power"    "reason"   "ruler"    "sinc"     "state"    "subject" 
#> [31] "time"     "troop"    "use"      "want"     "way"      "well"    
#> [37] "will"

# Which words correlate with "war"?
findAssocs(dtm, "war", 0.3)
#> $war
#>     wage    fight  antioch     argu     brew    induc      lip    maxim 
#>     0.73     0.52     0.45     0.45     0.45     0.45     0.45     0.45 
#>  relianc     sage    trifl  postpon     mere     evil    avoid     flee 
#>     0.45     0.45     0.45     0.41     0.35     0.34     0.32     0.32 
#>   occupi     glad glorious    heard     hunt ineffect     knew   produc 
#>     0.32     0.30     0.30     0.30     0.30     0.30     0.30     0.30 
#> temporis 
#>     0.30

We can even make wordclouds showing the most common terms:

# Wordclouds!
set.seed(123)
wordcloud(names(sorted), sorted, max.words=100, colors=brewer.pal(6,"Dark2"))
#> Warning in wordcloud(names(sorted), sorted, max.words = 100, colors =
#> brewer.pal(6, : ruler could not be fit on page. It will not be plotted.
#> Warning in wordcloud(names(sorted), sorted, max.words = 100, colors =
#> brewer.pal(6, : power could not be fit on page. It will not be plotted.

Removing Sparse Terms

Sometimes we want to remove sparse terms and, thus, increase efficiency. Look up the help file for the function removeSparseTerms. Using this function, create an object called dtm.s that contains only terms with < .9 sparsity (meaning they appear in at least 10% of documents).

dtm.s <- removeSparseTerms(dtm,.9)
dtm 
#> <<DocumentTermMatrix (documents: 188, terms: 2368)>>
#> Non-/sparse entries: 11754/433430
#> Sparsity           : 97%
#> Maximal term length: 15
#> Weighting          : term frequency (tf)
dtm.s 
#> <<DocumentTermMatrix (documents: 188, terms: 136)>>
#> Non-/sparse entries: 4353/21215
#> Sparsity           : 83%
#> Maximal term length: 12
#> Weighting          : term frequency (tf)

18.1.3 Exporting the DTM

We can convert a DTM to a matrix or dataframe in order to write it to a CSV, add metadata, etc.

First, create an object that converts the DTM to a dataframe (we first have to convert it to a matrix and then to a dataframe):

# Coerce into dataframe
dtm <- as.data.frame(as.matrix(dtm))
names(dtm)[1:10]  # Names of the first 10 terms (columns)
#>  [1] "abandon"   "abil"      "abject"    "abl"       "ablest"    "abovement"
#>  [7] "abovenam"  "absolut"   "absorb"    "accept"

# Write CSV
# write.csv(dtm, "dtm.csv", row.names = F)

18.1.3.1 Challenge.

Using one of the datasets in the data directory, create a document term matrix and a wordcloud of the most common terms.

# YOUR CODE HERE

18.2 Sentiment Analysis and Dictionary Methods

To demonstrate sentiment analysis, we are going to explore lyrics from Taylor Swift songs.

Run the code below to get started:

require(tm)
require(tidytext)
require(tidyverse)
require(stringr)
require(textdata)

18.2.1 Preprocessing and Setup

First, we must preprocess the corpus. Create a document-term matrix from the lyrics column of the ts dataframe. Complete the following preprocessing steps:

  • Convert to lower.
  • Remove stop words.
  • Remove numbers.
  • Remove punctuation.

Think: Why is stemming inappropriate for this application?

ts <- read.csv("data/taylor_swift.csv")

# Preprocess and create DTM
docs <- Corpus(VectorSource(ts$lyrics))

dtm <- DocumentTermMatrix(docs,
           control = list(tolower = TRUE,
                          removeNumbers = TRUE,
                          removePunctuation = TRUE,
                          stopwords = TRUE
                         ))

# Convert to dataframe
dtm <- as.data.frame(as.matrix(dtm))

Sentiment Dictionaries

We are going to use sentiment dictionaries from the tidytext package. Using the get_sentiments function, load the “bing” dictionary and store it in an object called sent.

sent <- get_sentiments("bing")
head(sent)
#> # A tibble: 6 x 2
#>   word       sentiment
#>   <chr>      <chr>    
#> 1 2-faces    negative 
#> 2 abnormal   negative 
#> 3 abolish    negative 
#> 4 abominable negative 
#> 5 abominably negative 
#> 6 abominate  negative
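
Tidytext provides several other dictionaries through the same get_sentiments function. For example, the “afinn” lexicon (not used below) already assigns integer scores from -5 to 5, so it would need no recoding; note, as a caveat, that the first call may prompt you to download the lexicon via the textdata package loaded above.

# An alternative dictionary (sketch): "afinn" scores each word from -5 to 5
afinn <- get_sentiments("afinn")
head(afinn)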

We will now add a column to sent called score. This column should hold a “1” for positive words and “-1” for negative words.

sent$score <- ifelse(sent$sentiment=="positive", 1, -1)

18.2.2 Scoring the Songs

We are now ready to score each song.

(NB: There are probably many ways to program a script that performs this task. If you can think of a more elegant way, go for it!)

First, we will create a dataframe that holds all the words in our DTM along with their sentiment score.

# Get all the words in our DTM and put them in a dataframe
words = data.frame(word = colnames(dtm), stringsAsFactors = F)
head(words)
#>        word
#> 1      back
#> 2 backroads
#> 3       bed
#> 4   believe
#> 5   beneath
#> 6    beside

# Get their sentiment scores
words_sent <- words %>%
  left_join(sent) %>% 
  mutate(score = replace_na(score, 0))
#> Joining, by = "word"

We can now use matrix algebra (!!) to multiply our DTM by the scoring vector. This will return to us a score for each document (i.e., song).

# Calculate documents scores with matrix algebra! 
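# Note: as.matrix(dtm) is songs-by-words; words_sent$score is a numeric vector with
# one entry per word, in the same column order thanks to the left_join above, so the
# matrix product yields one summed sentiment score per song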
doc_scores <- as.matrix(dtm) %*% words_sent$score

# Put the scores in the original documents dataframe
ts$sentiment <- doc_scores

Which song is happiest? Go listen to the song and see if you agree.
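
One quick way to check (a minimal sketch: the song-title column is assumed to be called title purely for illustration; substitute whatever title column your taylor_swift.csv actually contains):

# Highest- and lowest-scoring songs ("title" is a hypothetical column name)
ts[which.max(ts$sentiment), "title"]
ts[which.min(ts$sentiment), "title"]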

18.2.3 Challenges

Challenge 1.

Using the code we wrote above, make a function that accepts 1) a vector of texts and 2) a sentiment dictionary (i.e., a dataframe with words and scores) and returns a vector of sentiment scores for each text.

sentiment_score <- function(texts, sent_dict){
  
  # YOUR CODE HERE

  return(doc_scores)
}
  
# Uncomment to test it out!
# sentiment_score(ts$lyrics, sent)

Challenge 2.

Using the function you wrote above, find out what the most and least positive Taylor Swift album is.

# YOUR CODE HERE

18.3 Distinctive Words

This lesson finds distinctive words in the speeches of Obama and Trump.

Run the following code to:

  1. Import the corpus.
  2. Create a DTM.

require(tm)
require(matrixStats) # For statistics
require(tidyverse)

# Import corpus
docs <- Corpus(DirSource("data/trump_obama"))

# Preprocess and create DTM
dtm <- DocumentTermMatrix(docs,
           control = list(tolower = TRUE,
                          removePunctuation = TRUE,
                          removeNumbers = TRUE,
                          stopwords = TRUE,
                          stemming=TRUE))

# Print the dimensions of the DTM
dim(dtm)
#> [1]   11 4094

# Take a quick look
inspect(dtm[,100:104])
#> <<DocumentTermMatrix (documents: 11, terms: 5)>>
#> Non-/sparse entries: 14/41
#> Sparsity           : 75%
#> Maximal term length: 11
#> Weighting          : term frequency (tf)
#> Sample             :
#>                 Terms
#> Docs             alien align alik aliv allamerican
#>   Obama_2009.txt     0     0    1    0           0
#>   Obama_2010.txt     0     0    1    1           0
#>   Obama_2011.txt     0     1    0    0           0
#>   Obama_2012.txt     0     0    0    1           0
#>   Obama_2013.txt     0     0    0    0           0
#>   Obama_2014.txt     0     0    0    1           0
#>   Obama_2015.txt     1     0    1    0           0
#>   Trump_2017.txt     0     1    0    0           0
#>   Trump_2018.txt     1     0    0    0           1
#>   Trump_2019.txt     3     0    1    1           0

Oftentimes scholars will want to compare different corpora by finding the words (or features) distinctive to each corpus. But finding distinctive words requires a decision about what “distinctive” means. As we will see, there are a variety of definitions that we might use.

18.3.1 Unique Usage

The most obvious definition of distinctive is “exclusive.” That is, distinctive words are those found exclusively in texts associated with a single speaker (or group). For example, if Trump uses the word “access” and Obama never does, we should count “access” as distinctive. Finding words that are exclusive to a group is a simple exercise: all we have to do is sum each word’s usage across all texts for each speaker and then look for cases where the sum is zero for one speaker.

# Turn DTM into dataframe
dtm.m <- as.data.frame(as.matrix(dtm))
dtm.m$that <- NULL # Fix weird encoding error with stop words
dtm.m$dont <- NULL

# Subset into 2 DTMs, 1 for each speaker
obama <- dtm.m[1:8,]
trump <- dtm.m[9:11,]

# Sum word usage counts across all texts
obama <- colSums(obama)
trump <- colSums(trump)

# Put those sums back into a dataframe
df <- data.frame(rbind(obama, trump))
df[ ,1:5]
#>       abandon abess abid abil abject
#> obama       2     1    1    7      0
#> trump       1     0    0    1      1

# Get words where one speaker's usage is 0
solelyobama <- unlist(df[1, trump==0]) 
solelyobama <- solelyobama[order(solelyobama, decreasing = T)] # Order them by frequency
head(solelyobama, 10) # Get top 10 words for Obama
#> technolog      bank     innov    doesnt   teacher      loan      wont     debat 
#>        31        30        30        29        26        22        22        21 
#>    climat democraci 
#>        19        19

solelytrump <- unlist(df[2, obama==0])
solelytrump <- solelytrump[order(solelytrump, decreasing = T)] # Order them by frequency
head(solelytrump, 10) # Get top 10 words for Trump
#>      isi    agent america.  audienc    megan      it. obamacar     alic 
#>        9        8        8        8        8        7        7        6 
#>   beauti    elvin 
#>        6        6

This is a start, but oftentimes these words tend not to be terribly interesting or informative, so we will remove them from our corpus in order to focus on identifying distinctive words that appear in texts associated with every speaker.

# Subset df with non-zero entries
df <- df[,trump>0 & obama>0]

# How many words are we left with?
ncol(df)
#> [1] 1525
df[,1:5]
#>       abandon abil abl abort abraham
#> obama       2    7  15     1       1
#> trump       1    1   9     1       1

18.3.2 Differences in Frequencies

Another basic approach to identifying distinctive words is to compare the frequencies at which speakers use a word. If one speaker uses a word often across his or her oeuvre, and another barely uses the word at all, the difference in their respective frequencies will be large. We can calculate this quantity the following way:

# Take the differences in frequencies
diffFreq <- obama - trump

# Sort the words
diffFreq <- sort(diffFreq, decreasing = T)

# The top Obama words
head(diffFreq, 10)
#>     will     year      job     work     make      can american  america 
#>      306      217      214      186      177      172      165      155 
#>      new    peopl 
#>      150      147

# The top Trump words
tail(diffFreq, 10)
#>  illeg immigr    isi    usa   hero   ryan border  great  thank   drug 
#>     -9     -9     -9     -9    -11    -11    -13    -13    -19    -22

18.3.3 Differences in Averages

This is a good start. But what if one speaker uses more words overall? Instead of using raw frequencies, a better approach would look at the average rate at which speakers use various words.

We can calculate this quantity the following way:

  1. Normalize the DTM from counts to proportions.
  2. Take the difference between one speaker’s proportion of a word and another’s proportion of the same word.
  3. Find the words with the highest absolute difference.

# Normalize into proportions
rowTotals <- rowSums(df) # Create vector with row totals, i.e., total number of words per document
head(rowTotals) # Notice that one speaker uses more words than the other
#> obama trump 
#> 23021  7432

# Change frequencies to proportions
df <- df/rowTotals # Change frequencies to proportions
df[,1:5]
#>        abandon     abil      abl    abort  abraham
#> obama 8.69e-05 0.000304 0.000652 4.34e-05 4.34e-05
#> trump 1.35e-04 0.000135 0.001211 1.35e-04 1.35e-04

# Get difference in proportions
means.obama <- df[1,]
means.trump <- df[2,]
score <- unlist(means.obama - means.trump)

# Find words with highest difference
score <- sort(score, decreasing = T)
head(score,10) # Top Obama words
#>     job    make    busi     let    need    work    help economi  energi     can 
#> 0.00620 0.00541 0.00473 0.00426 0.00419 0.00407 0.00388 0.00378 0.00363 0.00346
tail(score,10) # Top Trump words
#>   border  tonight   immigr     unit    state     drug     must    great 
#> -0.00284 -0.00293 -0.00322 -0.00322 -0.00322 -0.00342 -0.00354 -0.00476 
#>    thank american 
#> -0.00483 -0.00650

This is a start. The problem with this measure is that it tends to highlight differences in very frequent words. For example, this method gives greater attention to a word that occurs 30 times per 1,000 words in Obama's speeches and 25 times per 1,000 in Trump's than to a word that occurs 5 times per 1,000 words in Obama's speeches and only 0.1 times per 1,000 in Trump's. That does not seem right: it seems more important to recognize cases where one speaker uses a word frequently and the other barely uses it at all.

As this initial attempt suggests, identifying distinctive words will be a balancing act. When comparing two groups of texts, differences in the rates of frequent words will tend to be large relative to differences in the rates of rarer words. Human language is variable; some words occur more frequently than others regardless of who is writing. We need to find a way of adjusting our definition of distinctive in light of this.

18.3.4 Differences in Averages, Adjustment

One adjustment that is easy to make is to divide the difference in speakers’ average rates by the average rate across all speakers. Since dividing a quantity by a large number will make that quantity smaller, our new distinctiveness score will tend to be lower for words that occur frequently. While this is merely a heuristic, it does move us in the right direction.

# Get the average rate of all words across all speakers
means.all <- colMeans(df)

# Now divide the difference in speakers' rates by the average rate across all speakers
score <- unlist((means.obama - means.trump) / means.all)
score <- sort(score, decreasing = T)
head(score,10) # Top Obama words
#>    student       cant       idea      money        oil     higher       earn 
#>       1.78       1.77       1.70       1.67       1.67       1.66       1.60 
#> leadership   research    respons 
#>       1.60       1.59       1.58
tail(score,10) # Top Trump words
#>     drug    grace    death    heart   pillar southern  terribl   unfair 
#>    -1.77    -1.80    -1.82    -1.82    -1.84    -1.84    -1.84    -1.84 
#>     gang     ryan 
#>    -1.87    -1.90

18.4 Structural Topic Models

This unit gives a brief overview of the stm (structural topic model) package. Please read the vignette for more detail.

The structural topic model is a way to estimate a topic model that incorporates document-level metadata. One can then see how topical prevalence changes according to that metadata.

library(stm)

The data we will be using for this unit consists of all articles about women published in the New York Times and Washington Post, 1980-2014. You worked with a subset of this data in your last homework.

Load the dataset. Notice that we have the text of the articles along with some metadata.

# Load data
women <- read.csv('data/women-full.csv')
names(women)
#>  [1] "BYLINE"              "TEXT.NO.NOUN"        "PUBLICATION"        
#>  [4] "TITLE"               "COUNTRY"             "COUNTRY_FINAL"      
#>  [7] "YEAR"                "UID"                 "COUNTRY_NR"         
#> [10] "entities"            "LENGTH"              "COUNTRY_TOP_PERCENT"
#> [13] "COUNTRY_CODE"        "TEXT"                "DATE"               
#> [16] "COUNTRY_MAJOR"       "TYPE"                "REGION"             
#> [19] "SUBJECT"

18.4.1 Preprocessing

STM has its own unique preprocessing functions and procedure, which I have coded below. Notice that we are going to use the TEXT.NO.NOUN column, which contains all the text of the articles without proper nouns (which I removed earlier).

# Pre-process
temp<-textProcessor(documents = women$TEXT.NO.NOUN, metadata = women)
#> Building corpus... 
#> Converting to Lower Case... 
#> Removing punctuation... 
#> Removing stopwords... 
#> Removing numbers... 
#> Stemming... 
#> Creating Output...
meta<-temp$meta
vocab<-temp$vocab
docs<-temp$documents

# Prep documents in the correct format
out <- prepDocuments(docs, vocab, meta)
#> Removing 19460 of 39403 terms (19460 of 1087166 tokens) due to frequency 
#> Your corpus now has 4531 documents, 19943 terms and 1067706 tokens.
docs<-out$documents
vocab<-out$vocab
meta <-out$meta

Challenge 1.

Read the help file for the prepDocuments function. Alter the preprocessing code above (in 18.4.1) to keep only words that appear in at least 10 documents.

# YOUR CODE HERE

18.4.2 Estimate Model

We are now going to estimate a topic model with 15 topics by regressing topical prevalence on region and year covariates.

Running the full model takes a long time to finish. For that reason, we are going to add the argument max.em.its, which caps the number of EM iterations. By keeping it low (15), we will get a rough estimate of the topics. You can always go back and estimate the model to convergence.

model <- stm(docs, vocab, 15, prevalence = ~ REGION + s(YEAR), data = meta, seed = 15, max.em.its = 15)

Let’s see what our model came up with! The following tools can be used to evaluate the model:

  • labelTopics gives the top words for each topic.
  • findThoughts gives the top documents for each topic (the documents with the highest proportion of each topic).

# Top Words
labelTopics(model)
#> Topic 1 Top Words:
#>       Highest Prob: show, design, fashion, women, art, one, like 
#>       FREX: coutur, fashion, museum, sculptur, ready--wear, jacket, galleri 
#>       Lift: ---inch, -ankl, alexandr, armatur, armhol, art-fair, avant 
#>       Score: coutur, art, artist, fashion, museum, exhibit, cloth 
#> Topic 2 Top Words:
#>       Highest Prob: said, polic, women, kill, report, offici, govern 
#>       FREX: polic, suicid, kill, attack, investig, suspect, arrest 
#>       Lift: abducte, charanjit, humanity-soak, male-control, sunil, kalpana, ciudad 
#>       Score: polic, rape, kill, said, arrest, attack, investig 
#> Topic 3 Top Words:
#>       Highest Prob: women, team, game, said, world, play, olymp 
#>       FREX: tournament, championship, olymp, soccer, player, game, medal 
#>       Lift: -america, -foot--inch, -hole, -kilomet, -rank, -round, -trump 
#>       Score: olymp, championship, tournament, team, player, game, medal 
#> Topic 4 Top Words:
#>       Highest Prob: book, year, life, first, write, novel, work 
#>       FREX: novel, literari, fiction, book, memoir, novelist, poet 
#>       Lift: buster, calla, goncourt, identical-twin, italian-american, kilcher, klone 
#>       Score: novel, book, fiction, literari, poet, writer, write 
#> Topic 5 Top Words:
#>       Highest Prob: women, said, femal, percent, militari, will, compani 
#>       FREX: combat, board, quota, militari, bank, corpor, infantri 
#>       Lift: -combat, cpr, gender-divers, nonexecut, outfitt, r-calif, r-ni 
#>       Score: women, militari, infantri, combat, percent, quota, femal 
#> Topic 6 Top Words:
#>       Highest Prob: protest, said, one, site, peopl, young, video 
#>       FREX: orthodox, internet, web, video, rabbi, prayer, site 
#>       Lift: balaclava, grrrl, tehrik-, braveheart, drawbridg, gravesit, guerrilla-styl 
#>       Score: protest, site, orthodox, video, jewish, rabbi, xxxfx 
#> Topic 7 Top Words:
#>       Highest Prob: women, work, said, year, percent, men, ese 
#>       FREX: ese, factori, employ, incom, worker, job, market 
#>       Lift: flextim, management-track, nec, nontransfer, rabenmutt, chiho, fumiko 
#>       Score: ese, percent, compani, work, job, women, factori 
#> Topic 8 Top Words:
#>       Highest Prob: women, sexual, sex, rape, men, violenc, said 
#>       FREX: harass, sexual, sex, assault, brothel, violenc, behavior 
#>       Lift: offenc, tarun, chaud, much-lov, newt, tiresom, sex-rel 
#>       Score: rape, sexual, harass, violenc, sex, assault, brothel 
#> Topic 9 Top Words:
#>       Highest Prob: women, said, right, law, islam, govern, religi 
#>       FREX: islam, religi, veil, constitut, saudi, secular, cleric 
#>       Lift: afghan-styl, anglo-, archdeacon, bien-aim, episcopaci, fez, government-encourag 
#>       Score: islam, law, women, right, religi, ordin, saudi 
#> Topic 10 Top Words:
#>       Highest Prob: said, one, famili, peopl, day, like, home 
#>       FREX: villag, room, smile, son, couldnt, recal, sit 
#>       Lift: charpoy, jet-black, mitra, schermerhorn, single-famili, tyson, uja-feder 
#>       Score: villag, husband, fistula, famili, school, girl, said 
#> Topic 11 Top Words:
#>       Highest Prob: women, film, one, like, woman, say, play 
#>       FREX: film, theater, movi, charact, actress, documentari, audienc 
#>       Lift: clive, fine-tun, kaffir, nushus, shrew, nushu, cadel 
#>       Score: film, theater, movi, nushu, play, orchestra, femin 
#> Topic 12 Top Words:
#>       Highest Prob: polit, elect, parti, minist, presid, govern, said 
#>       FREX: voter, elect, parti, prime, candid, vote, cabinet 
#>       Lift: ernesto, pinbal, influence-peddl, information-servic, kakuei, left--cent, marxist-leninist 
#>       Score: elect, parti, vote, minist, voter, polit, candid 
#> Topic 13 Top Words:
#>       Highest Prob: women, said, abort, cancer, health, studi, breast 
#>       FREX: implant, cancer, breast, pill, virus, patient, estrogen 
#>       Lift: acet, adren, ambulatori, analges, anastrozol, antioxid, ashkenazi 
#>       Score: cancer, abort, breast, pill, implant, health, virus 
#> Topic 14 Top Words:
#>       Highest Prob: women, said, confer, will, world, organ, right 
#>       FREX: deleg, confer, forum, page, peac, nongovernment, ambassador 
#>       Lift: -glass, barack, brooklyn-born, expansion, foreclosur, guarantor, holden 
#>       Score: deleg, confer, forum, page, palestinian, peac, mrs 
#> Topic 15 Top Words:
#>       Highest Prob: said, women, rape, court, case, girl, practic 
#>       FREX: mutil, genit, circumcis, asylum, sentenc, judg, tribun 
#>       Lift: labia, layli, minora, multifaith, paraleg, salim, strip-search 
#>       Score: rape, genit, circumcis, mutil, court, sentenc, prosecutor

# Example Docs
findThoughts(model, texts = meta$TITLE, n=2,topics = 1:15)
#> 
#>  Topic 1: 
#>       KENZO'S CAREFREE STYLES AT AN OFFBEAT SHOWING
#>      A MODERN LOOK, A CLASSIC TOUCH FROM SAINT LAURENT 
#>  Topic 2: 
#>       Assailants Kill 4 Iraqi Women Working for U.S.; Gunmen Follow Van Carrying Laundry Employees
#>      WORLD IN BRIEF 
#>  Topic 3: 
#>       AMERICANS LEAD EAST GERMANS IN TRACK
#>      Russians Chart a New Path 
#>  Topic 4: 
#>       BEST SELLERS: September 6, 1998
#>      BEST SELLERS: September 13, 1998 
#>  Topic 5: 
#>       In Britain, a Big Push for More Women to Serve on Corporate Boards
#>      Poll: Allow women in combat units 
#>  Topic 6: 
#>       Neda's Legacy; A woman's death moves Iranian protesters.
#>      Jewish Feminists Prompt Protests at Wailing Wall 
#>  Topic 7: 
#>       China Scrambles for Stability as Its Workers Age
#>      A high price for a paycheck; Caught between the demands of the corporate workplace and of their traditional roles in society, more South Korean women are putting off marriage and parenthood 
#>  Topic 8: 
#>       Confronting Rape in India, and Around the World
#>      Sexual Harassment Prosecutions Get Short Shrift in India, Lawyer Says 
#>  Topic 9: 
#>       English Church Advances Bid For Women As Bishops
#>      Egypt Passes Law On Women's Rights; Polygamy Still Allowed for Men 
#>  Topic 10: 
#>       An Old Cinema in Pakistan Has New Life After Quake
#>      Maria Duran's Endless Wait 
#>  Topic 11: 
#>       For France, An All-Purpose Heartthrob
#>      Film: Brazilian 'Vera' 
#>  Topic 12: 
#>       The Widow Of Ex-Leader Wins Race In Panama
#>      Cabinet Defeated in Iceland as Feminists Gain 
#>  Topic 13: 
#>       SECTION: HEALTH; Pg. T18
#>      Dense Breasts May Need Sonograms to Detect Cancer 
#>  Topic 14: 
#>       DISPUTES ON KEY ISSUES STALL KENYA PARLEY
#>      'CHAOTIC' CONDITIONS FEARED AT U.N.'S PARLEY ON WOMEN 
#>  Topic 15: 
#>       Woman Fleeing Tribal Rite Gains Asylum; Genital Mutilation Is Ruled Persecution
#>      Refugee From Mutilation

Challenge 2.

Estimate other models using 5 and 40 topics, respectively. Look at the top words for each topic. How do the topics vary when you change the number of topics?

Now look at your neighbor’s model. Did you get the same results? Why or why not?

# YOUR CODE HERE

18.4.3 Interpret Model

Let’s all load a fully-estimated model that I ran before class.

# Load the already-estimated model
load("data/stm.RData")

Challenge 3.

Using the functions labelTopics and findThoughts, hand-label the 15 topics. Store these labels in a character vector called labels.

# Store your hand labels below
labels = c()

Now look at your neighbor’s labels. Did you get the same results? Why or why not?

18.4.4 Analyze Topics

We are now going to see how the topics compare in terms of their prevalence across regions. What do you notice about the distribution of topic 9?

# Corpus summary
plot.STM(model, type="summary", custom.labels = labels, main="")

# Estimate covariate effects
prep <- estimateEffect(1:15 ~ REGION + s(YEAR), model, meta = meta, uncertainty = "Global", documents=docs)
#> Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
#>   Consider formula(paste(x, collapse = " ")) instead.

# Plot topic 9 over regions
regions = c("Asia", "EECA", "MENA", "Africa", "West", "LA")
plot.estimateEffect(prep, "REGION", method = "pointestimate", topics = 9, printlegend = TRUE, labeltype = "custom", custom.labels = regions, main = "Women's Rights", ci.level = .95, nsims=100)