Chapter 15 Strings and Regular Expressions

This unit focuses on character (or “string”) data. We’ll explore:

  1. string basics, like concatenating and subsetting.
  2. regular expressions, a powerful cross-language tool for working with string data.
  3. common tools that take regular expressions and apply them to real problems.

This chapter will focus on the stringr package for string manipulation. stringr has been part of the core tidyverse since tidyverse 1.2.0, so library(tidyverse) already loads it; the explicit library(stringr) call below is redundant but harmless, and it makes the dependency obvious.

library(tidyverse)
library(stringr)

15.1 String Basics

15.1.1 Creating Strings

You can create strings with either single quotes or double quotes. Unlike in some other languages, there is no difference in behavior between the two. I recommend always using ", unless you want to create a string that contains multiple ".

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

15.1.2 Escape and Special Characters

Single and double quotes are known as “metacharacters,” meaning that they have special meaning to the R language. To include a literal single or double quote in a string you can use \ to “escape” it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

That means if you want to include a literal backslash, you’ll need to double it up: "\\".

Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():

x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \
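
As a quick check of this point, the sketch below counts characters with str_length() (introduced in 15.1.3). The variable name escaped is only for illustration. Each element is one character long, even though its printed form shows two:

```r
library(stringr)

# Three strings written with escapes: a double quote, a backslash, a newline
escaped <- c("\"", "\\", "\n")

# The printed representation shows the escapes...
print(escaped)

# ...but each string holds exactly one character
n_chars <- str_length(escaped)
n_chars
#> [1] 1 1 1
```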

There are a handful of other special characters. The most common are "\n" (newline) and "\t" (tab); you can see the complete list by requesting help on ": ?'"' or ?"'". You’ll also sometimes see strings like "\u00b5", which is a way of writing non-English characters that works on all platforms:

x <- "\u00b5"
x
#> [1] "µ"

Multiple strings are often stored in a character vector, which you can create with c():

c("one", "two", "three")
#> [1] "one"   "two"   "three"

15.1.3 String length

Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember. Instead we’ll use functions from stringr. These have more intuitive names, and all start with str_. For example, str_length() tells you the number of characters in a string:

str_length(c("a", "R for data science", NA))
#> [1]  1 18 NA

The common str_ prefix is particularly useful if you use RStudio, because typing str_ will trigger autocomplete, allowing you to see all stringr functions.

15.1.4 Combining strings

To combine two or more strings, use str_c():

str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"

Use the sep argument to control how they’re separated:

str_c("x", "y", sep = ", ")
#> [1] "x, y"

str_c() is vectorised, and it automatically recycles shorter vectors to the same length as the longest:

x <- c("a", "b", "c")
str_c("prefix-", x)
#> [1] "prefix-a" "prefix-b" "prefix-c"

To collapse a vector of strings into a single string, use collapse:

x <- c("x", "y", "z")
str_c(x, collapse = ", ")
#> [1] "x, y, z"
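
One behaviour worth knowing when combining strings is how str_c() treats missing values. The sketch below, with a made-up input vector, shows that an NA in any input makes the corresponding result NA, and that str_replace_na() can convert NA to the literal string "NA" first if that is what you want:

```r
library(stringr)

x <- c("abc", NA)

# A missing value in any input makes that element of the result missing
with_na <- str_c("|-", x, "-|")
with_na
#> [1] "|-abc-|" NA

# str_replace_na() turns NA into the string "NA" before combining
without_na <- str_c("|-", str_replace_na(x), "-|")
without_na
#> [1] "|-abc-|" "|-NA-|"
```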

15.1.5 Subsetting strings

You can extract parts of a string using str_sub(). As well as the string, str_sub() takes start and end arguments, which give the (inclusive) position of the substring:

x <- c("Rochelle is the Greatest")
str_sub(x, 1, 8)
#> [1] "Rochelle"

# negative numbers count backwards from end
str_sub(x, -8, -1)
#> [1] "Greatest"

Note that str_sub() won’t fail if the string is too short; it will just return as much as possible:

str_sub("a", 1, 3)
#> [1] "a"

You can also use the assignment form of str_sub() to modify strings:

x <- c("Rochelle is the Greatest")
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "rochelle is the Greatest"

15.1.6 Locales

Above I used str_to_lower() to change the text to lower case. You can also use str_to_upper() or str_to_title(). However, changing case is more complicated than it might at first appear because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:

# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"

The locale is specified as an ISO 639 language code, which is a two- or three-letter abbreviation. If you don’t already know the code for your language, Wikipedia has a good list. If you leave the locale blank, it will use the current locale, as provided by your operating system.

Another important operation that’s affected by the locale is sorting. The base R order() and sort() functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use str_sort() and str_order() which take an additional locale argument:

x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en")  # English
#> [1] "apple"    "banana"   "eggplant"
str_sort(x, locale = "haw") # Hawaiian
#> [1] "apple"    "eggplant" "banana"

Challenges

  1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

  2. In your own words, describe the difference between the sep and collapse arguments to str_c().

  3. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

  4. What does str_trim() do? What’s the opposite of str_trim()?

15.2 Regular expressions

Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you understand them, you’ll find them extremely useful.

To learn regular expressions, we’ll use str_view() and str_view_all(). These functions take a character vector and a regular expression, and show you how they match. We’ll start with very simple regular expressions and then gradually get more and more complicated. Once you’ve mastered pattern matching, you’ll learn how to apply those ideas with various stringr functions.

15.2.1 Basic matches

The simplest patterns match exact strings:

x <- c("apple", "banana", "pear")
str_view(x, "an")

The next step up in complexity is ., which matches any character (except a newline):

x <- c("apple", "banana", "pear")
str_view(x, ".a.")

15.2.2 Escape Characters

If “.” matches any character, how do you match the character “.”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour.

Regexps use the backslash, \, to escape special behaviour. So to match a literal ., you need the regexp \.. Unfortunately, this creates a problem: we use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.".

# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
#> \.

In this lesson, I’ll write regular expressions as \. and strings that represent the regular expression as "\\.".
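
To see the two layers of escaping in action, here is a small sketch using str_detect() (covered in 15.3.1), which returns TRUE or FALSE for each string. The test vector is invented for illustration. The last line extends the same logic to a literal backslash, which the text above does not cover: the regexp \\ has to be written as the string "\\\\".

```r
library(stringr)

# Three strings: abc, a.c, and a\c (the last contains one backslash)
x <- c("abc", "a.c", "a\\c")

# Unescaped, . matches any character, so all three strings match
any_char <- str_detect(x, "a.c")
any_char
#> [1] TRUE TRUE TRUE

# The regexp \. (string "\\.") matches only a literal dot
literal_dot <- str_detect(x, "a\\.c")
literal_dot
#> [1] FALSE  TRUE FALSE

# The regexp \\ (string "\\\\") matches only a literal backslash
literal_backslash <- str_detect(x, "a\\\\c")
literal_backslash
#> [1] FALSE FALSE  TRUE
```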

15.2.3 Anchors

By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:

  • ^ to match the start of the string.
  • $ to match the end of the string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")

To remember which is which, try this mnemonic which I learned from Evan Misshula: if you begin with power (^), you end up with money ($).

To force a regular expression to only match a complete string, anchor it with both ^ and $:

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
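
str_view() shows matches visually. If you want the same information as plain TRUE/FALSE values, str_detect() (covered in 15.3.1) works with the same patterns. The sketch below uses a slightly different example vector, chosen so that each anchor gives a different answer:

```r
library(stringr)

x <- c("apple pie", "apple", "pineapple")

# Anchored at the start: the string must begin with "apple"
starts <- str_detect(x, "^apple")
starts
#> [1]  TRUE  TRUE FALSE

# Anchored at the end: the string must finish with "apple"
ends <- str_detect(x, "apple$")
ends
#> [1] FALSE  TRUE  TRUE

# Anchored at both ends: the whole string must be exactly "apple"
whole <- str_detect(x, "^apple$")
whole
#> [1] FALSE  TRUE FALSE
```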

15.2.4 Character classes and alternatives

There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:

  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.

Remember, to create a regular expression containing \d or \s, you’ll need to escape the \ for the string, so you’ll type "\\d" or "\\s".
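
The four tools above can be tried out as follows. This is a minimal sketch with an invented test vector; it uses str_detect() and str_extract(), which are introduced properly in 15.3:

```r
library(stringr)

# The third string contains a tab character, written as \t
x <- c("room 42", "no digits", "tab\there")

# \d: does the string contain a digit?
has_digit <- str_detect(x, "\\d")
has_digit
#> [1]  TRUE FALSE FALSE

# \s: does the string contain whitespace (here, a space or a tab)?
has_space <- str_detect(x, "\\s")
has_space
#> [1] TRUE TRUE TRUE

# [aeiou]: the first vowel in each string
first_vowel <- str_extract(x, "[aeiou]")
first_vowel
#> [1] "o" "o" "a"

# [^aeiou]: the first character that is not a vowel
first_non_vowel <- str_extract(x, "[^aeiou]")
first_non_vowel
#> [1] "r" "n" "t"
```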

A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.

# Look for a literal character that normally has special meaning in a regex
x <- c("abc", "a.c", "a*c", "a c")
str_view(x, "a[.]c")
str_view(x, ".[*]c")
str_view(x, "a[ ]")

This works for most (but not all) regex metacharacters: $ . | ? * + ( ) [ {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: ] \ ^ and -.

You can use alternation to pick between two or more alternative patterns. For example, abc|deaf will match either "abc" or "deaf".

Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

x <- c("grey", "gray")
str_view(x, "gr(e|a)y")
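
To illustrate why the parentheses matter, the sketch below compares the pattern above with an unparenthesised variant. Because | has low precedence, gre|ay means "gre" or "ay", not "gr" followed by "e" or "a" followed by "y". The extra test strings are made up for the comparison:

```r
library(stringr)

x <- c("grey", "gray", "gre", "ay")

# With parentheses: only the complete words match
grouped <- str_extract(x, "gr(e|a)y")
grouped
#> [1] "grey" "gray" NA     NA

# Without parentheses: matches either "gre" or "ay" anywhere
ungrouped <- str_extract(x, "gre|ay")
ungrouped
#> [1] "gre" "ay"  "gre" "ay"
```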

Challenges

Create regular expressions to find all words that:

  1. Start with a vowel.

  2. Contain only consonants. (Hint: think about matching “not”-vowels.)

  3. End with ed, but not with eed.

  4. End with ing or ise.

15.2.5 Repetition

The next step up in power involves controlling how many times a pattern matches:

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')
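
If you are not running these in an environment that renders str_view() output, str_extract() (covered in 15.3.2) returns the first match as plain text. Here is the same example as a sketch, showing what each pattern picks out:

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# C followed by an optional second C
optional_c <- str_extract(x, "CC?")
optional_c
#> [1] "CC"

# C followed by one or more further Cs
repeated_c <- str_extract(x, "CC+")
repeated_c
#> [1] "CCC"

# C followed by one or more characters that are each L or X
c_then_lx <- str_extract(x, "C[LX]+")
c_then_lx
#> [1] "CLXXX"
```

Note that the matches are as long as possible: CC+ takes all three Cs rather than stopping at two.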

15.2.6 Regex Resources

For more information on regular expressions, see:

  1. this tutorial
  2. this cheatsheet

15.3 Common Tools

Now that you’ve learned the basics of regular expressions, it’s time to learn how to apply them to real problems. In this section you’ll learn a wide array of stringr functions that let you:

  • Determine which strings match a pattern.
  • Find the positions of matches.
  • Extract the content of matches.
  • Replace matches with new values.
  • Split a string based on a match.

15.3.1 Detect matches

To determine if a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input:

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

Remember that when you use a logical vector in a numeric context, FALSE becomes 0 and TRUE becomes 1. That makes sum() and mean() useful if you want to answer questions about matches across a larger vector:

# see common words
words[1:10]
#>  [1] "a"        "able"     "about"    "absolute" "accept"   "account" 
#>  [7] "achieve"  "across"   "act"      "active"
# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277

A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient str_subset() wrapper:

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

Typically, however, your strings will be one column of a data frame, and you’ll want to use filter() instead:

df <- data.frame(
  i = seq_along(words),
  word = words
)
df %>% 
  filter(str_detect(word, "x$"))
#>     i word
#> 1 108  box
#> 2 747  sex
#> 3 772  six
#> 4 841  tax

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99

It’s natural to use str_count() with mutate():

df1 <- df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )

head(df1)
#>   i     word vowels consonants
#> 1 1        a      1          0
#> 2 2     able      2          2
#> 3 3    about      3          2
#> 4 4 absolute      4          4
#> 5 5   accept      2          4
#> 6 6  account      3          4
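
One detail to keep in mind when counting is that matches never overlap. In the sketch below, the pattern "aba" could be found starting at positions 1, 3, and 5 of "abababa", but the match at position 3 overlaps the first one, so only two matches are counted:

```r
library(stringr)

# Matches are found left to right and never overlap
n_matches <- str_count("abababa", "aba")
n_matches
#> [1] 2
```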

Challenges

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

  1. Find all words that start or end with x.

  2. Find all words that start with a vowel and end with a consonant.

15.3.2 Extract matches

To extract the actual text of a match, use str_extract(). To show that off, we’re going to need a more complicated example. I’m going to use the Harvard sentences. These are provided in stringr::sentences:

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

Imagine we want to find all sentences that contain a color. We first create a vector of color names, and then turn it into a single regular expression:

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")
color_match
#> [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a color, and then extract the color to figure out which one it is:

# find sentences with colors
has_color <- str_subset(sentences, color_match)
head(has_color)
#> [1] "Glue the sheet to the dark blue background."
#> [2] "Two blue fish swam in the tank."            
#> [3] "The colt reared and threw the tall rider."  
#> [4] "The wide road shimmered in the hot sun."    
#> [5] "See the cat glaring at the scared mouse."   
#> [6] "A wisp of cloud hung in the blue air."

# extract the color
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red"  "red"  "red"  "blue"

Note that str_extract() only extracts the first match. This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use str_extract_all(). It returns a list:

all_colors <- str_extract_all(has_color, color_match)
all_colors[15:20]
#> [[1]]
#> [1] "red"
#> 
#> [[2]]
#> [1] "red"
#> 
#> [[3]]
#> [1] "red"
#> 
#> [[4]]
#> [1] "blue"
#> 
#> [[5]]
#> [1] "red"
#> 
#> [[6]]
#> [1] "blue" "red"
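
The difference between the two functions is easier to see on a tiny input. The following sketch uses three made-up strings rather than the Harvard sentences; it shows that str_extract() returns NA where nothing matches, while str_extract_all() returns an empty character vector for that element:

```r
library(stringr)

x <- c("red and blue", "green", "none")
pattern <- "red|green|blue"

# First match only: one value per input string, NA when there is no match
first_match <- str_extract(x, pattern)
first_match
#> [1] "red"   "green" NA

# All matches: a list with one character vector per input string
all_matches <- str_extract_all(x, pattern)

# lengths() gives the number of matches found in each string
n_per_string <- lengths(all_matches)
n_per_string
#> [1] 2 1 0
```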

If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest:

str_extract_all(has_color, color_match, simplify = TRUE)
#>       [,1]     [,2] 
#>  [1,] "blue"   ""   
#>  [2,] "blue"   ""   
#>  [3,] "red"    ""   
#>  [4,] "red"    ""   
#>  [5,] "red"    ""   
#>  [6,] "blue"   ""   
#>  [7,] "yellow" ""   
#>  [8,] "red"    ""   
#>  [9,] "red"    ""   
#> [10,] "green"  ""   
#> [11,] "red"    ""   
#> [12,] "red"    ""   
#> [13,] "blue"   ""   
#> [14,] "red"    ""   
#> [15,] "red"    ""   
#> [16,] "red"    ""   
#> [17,] "red"    ""   
#> [18,] "blue"   ""   
#> [19,] "red"    ""   
#> [20,] "blue"   "red"
#> [21,] "red"    ""   
#> [22,] "green"  ""   
#> [23,] "red"    ""   
#> [24,] "red"    ""   
#> [25,] "red"    ""   
#> [26,] "red"    ""   
#> [27,] "red"    ""   
#> [28,] "red"    ""   
#> [29,] "green"  ""   
#> [30,] "red"    ""   
#> [31,] "green"  ""   
#> [32,] "red"    ""   
#> [33,] "purple" ""   
#> [34,] "green"  ""   
#> [35,] "red"    ""   
#> [36,] "red"    ""   
#> [37,] "red"    ""   
#> [38,] "red"    ""   
#> [39,] "red"    ""   
#> [40,] "blue"   ""   
#> [41,] "red"    ""   
#> [42,] "blue"   ""   
#> [43,] "red"    ""   
#> [44,] "red"    ""   
#> [45,] "red"    ""   
#> [46,] "red"    ""   
#> [47,] "green"  ""   
#> [48,] "green"  ""   
#> [49,] "green"  "red"
#> [50,] "red"    ""   
#> [51,] "red"    ""   
#> [52,] "yellow" ""   
#> [53,] "red"    ""   
#> [54,] "orange" "red"
#> [55,] "red"    ""   
#> [56,] "red"    ""   
#> [57,] "red"    ""

Challenges

In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a color. Modify the regex to fix the problem.

15.3.3 Replacing matches

str_replace() and str_replace_all() allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-") # replace the first instance of a match
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-") # replace all instances of a match
#> [1] "-ppl-"  "p--r"   "b-n-n-"

With str_replace_all() you can perform multiple replacements by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"

15.3.4 Splitting

Use str_split() to split a string up into pieces. For example, we could split sentences into words:

sentences %>%
  head(5) %>% 
  str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
#> 
#> [[4]]
#> [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
#> [8] "rare"    "dish."  
#> 
#> [[5]]
#> [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

Like the other stringr functions that return a list, you can use simplify = TRUE to return a matrix:

sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]    
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth"
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"  
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"    
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"     
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls."
#>      [,8]          [,9]   
#> [1,] "planks."     ""     
#> [2,] "background." ""     
#> [3,] "a"           "well."
#> [4,] "rare"        "dish."
#> [5,] ""            ""

You can also request a maximum number of pieces:

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#>      [,1]      [,2]    
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"    
#> [3,] "Age"     "35"

Instead of splitting up strings by patterns, you can also split them by character, line, sentence, or word boundaries using boundary():

x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"      
#> [7] "another"  "sentence"
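
For comparison, here is a sketch of what happens if you split the same string on a single space instead. The punctuation stays attached to the words, and the double space produces an empty string, which is why boundary("word") is usually the better choice for splitting into words:

```r
library(stringr)

x <- "This is a sentence.  This is another sentence."

# Splitting on a literal space keeps the full stops and yields an
# empty piece where two spaces sit next to each other
by_space <- str_split(x, " ")[[1]]
by_space
#> [1] "This"      "is"        "a"         "sentence." ""          "This"
#> [7] "is"        "another"   "sentence."
```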

Challenges

  1. Split up a string like "apples, pears, and bananas" into individual components.

  2. What does splitting with an empty string ("") do? Experiment, and then read the documentation.

15.4 Other types of patterns

When you use a pattern that’s a string, it’s automatically wrapped into a call to regex():

# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))

You can use the other arguments of regex() to control details of the match:

  • ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale.

    bananas <- c("banana", "Banana", "BANANA")
    str_view(bananas, "banana")
    str_view(bananas, regex("banana", ignore_case = TRUE))

  • multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string.

    x <- "Line 1\nLine 2\nLine 3"
    str_extract_all(x, "^Line")[[1]]
    #> [1] "Line"
    str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
    #> [1] "Line" "Line" "Line"
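
A regex() object can be passed to any stringr function that takes a pattern, not only str_view(). As a small sketch, here is the ignore_case example again with str_detect(), which makes the effect visible as plain TRUE/FALSE values:

```r
library(stringr)

bananas <- c("banana", "Banana", "BANANA")

# Default matching is case sensitive
case_sensitive <- str_detect(bananas, "banana")
case_sensitive
#> [1]  TRUE FALSE FALSE

# With ignore_case = TRUE, all three spellings match
case_insensitive <- str_detect(bananas, regex("banana", ignore_case = TRUE))
case_insensitive
#> [1] TRUE TRUE TRUE
```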

15.4.1 stringi

stringr is built on top of the stringi package. stringr is useful when you’re learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need.

If you find yourself struggling to do something in stringr, it’s worth taking a look at stringi. The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. The main difference is the prefix: str_ vs. stri_.
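
To give a feel for how the names line up, the sketch below pairs three stringr functions from this chapter with stringi counterparts. The pairing is only illustrative; stringi function names often also spell out the kind of pattern (for example, the _regex suffix), so check the stringi documentation rather than assuming a one-to-one mapping.

```r
library(stringr)
library(stringi)

x <- c("apple", "banana", "pear")

# Number of characters in each string
str_length(x)
stri_length(x)
#> [1] 5 6 4

# Does each string match a regular expression?
str_detect(x, "an")
stri_detect_regex(x, "an")
#> [1] FALSE  TRUE FALSE

# Extract the first three characters
str_sub(x, 1, 3)
stri_sub(x, 1, 3)
#> [1] "app" "ban" "pea"
```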

Challenges

Find the stringi functions that:

  1. Count the number of words.

  2. Find duplicated strings.

  3. Generate random text.

Acknowledgments

This page was adapted from the following sources:

  1. R for Data Science, licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0