Chapter 15 Strings and Regular Expressions

This unit focuses on character (or “string”) data. We will explore:

  1. String basics, like concatenating and subsetting.
  2. Regular expressions, a powerful cross-language tool for working with string data.
  3. Applying regex to real problems using stringr.

This chapter will focus on the stringr package for string manipulation. Since tidyverse 1.2.0, stringr has been part of the core tidyverse, so library(tidyverse) already attaches it. We also load it explicitly here to make the dependency clear.

library(tidyverse)
library(stringr)

15.1 String Basics

This section covers string basics, such as creating, concatenating, and subsetting strings.

15.1.1 Creating Strings

You can create strings with either single quotes or double quotes. Unlike some other languages, there is no difference in behavior. I recommend always using ", unless you want to create a string that contains multiple ".

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

15.1.2 Escape and Special Characters

Single and double quotes are known as “metacharacters,” meaning that they have special meaning to the R language. To include a literal single or double quote in a string you can use \ to “escape” it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

That means if you want to include a literal backslash, you will need to double it up: "\\".

Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():

x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \
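As a quick check of the point above, the following sketch confirms that each escape sequence stands for a single character. The raw-string syntax in the second half is not used elsewhere in this chapter and assumes R 4.0.0 or later.

```r
library(stringr)

# Each escape sequence stands for exactly one character.
escaped <- c("\"", "\\", "\n", "\t")
str_length(escaped)
#> [1] 1 1 1 1

# Raw strings (R >= 4.0.0) need no escaping of backslashes or quotes.
raw_backslash <- r"(\)"
identical(raw_backslash, "\\")
#> [1] TRUE
```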

There are a handful of other special characters. The most common are "\n", newline, and "\t", tab, but you can see the complete list by requesting help on ": ?'"', or ?"'".

Sometimes you will also see strings like "\u00b5". This is a way of writing non-English characters that works on all platforms:

x <- "\u00b5"
x
#> [1] "µ"

Multiple strings are often stored in a character vector, which you can create with c():

c("one", "two", "three")
#> [1] "one"   "two"   "three"

15.1.3 Measure string length with str_length()

Base R contains many functions to work with strings, but we will avoid them because they can be inconsistent, which makes them hard to remember.

Instead, we will use functions from stringr. stringr contains functions with more intuitive names, and all start with str_. For example, str_length() tells you the number of characters in a string:

str_length(c("a", "R for data science", NA))
#> [1]  1 18 NA

The common str_ prefix is particularly useful if you use RStudio, because typing str_ will trigger autocomplete, allowing you to see all stringr functions.

15.1.4 Combine strings with str_c()

To combine two or more strings, use str_c():

str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"

Use the sep argument to control how they are separated:

str_c("x", "y", sep = ", ")
#> [1] "x, y"

str_c() is vectorised, and it automatically recycles shorter vectors to the same length as the longest:

x <- c("a", "b", "c")
str_c("prefix-", x)
#> [1] "prefix-a" "prefix-b" "prefix-c"

To collapse a vector of strings into a single string, use collapse:

x <- c("x", "y", "z")
str_c(x, sep = ", ") # This will not work
#> [1] "x" "y" "z"
str_c(x, collapse = ", ") # But this will
#> [1] "x, y, z"
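The two arguments can also be combined. The sketch below uses two illustrative vectors (first and second are names I made up for this example) to show that sep works across arguments while collapse works across the elements of the result.

```r
library(stringr)

first <- c("a", "b", "c")
second <- c("1", "2", "3")

# sep goes between the arguments, element by element.
pairs <- str_c(first, second, sep = "-")
pairs
#> [1] "a-1" "b-2" "c-3"

# collapse then joins the resulting elements into one string.
joined <- str_c(first, second, sep = "-", collapse = ", ")
joined
#> [1] "a-1, b-2, c-3"
```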

15.1.5 Subset strings with str_sub()

You can extract parts of a string using str_sub(). As well as the string, str_sub() takes start and end arguments, which give the (inclusive) position of the substring:

x <- c("Rochelle is the GOAT")
str_sub(x, 1, 8)
#> [1] "Rochelle"

# Negative numbers count backwards from the end
str_sub(x, -8, -1)
#> [1] "the GOAT"

You can also use the assignment form of str_sub() to modify strings:

x <- c("Rochelle is the GOAT")
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "rochelle is the GOAT"
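Like most stringr functions, str_sub() works on whole vectors at once. The short sketch below, using a made-up fruits vector, also shows what happens when the requested range runs past the end of a string.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# str_sub() is vectorised over the input strings.
prefixes <- str_sub(fruits, 1, 3)
prefixes
#> [1] "App" "Ban" "Pea"

# Asking for more characters than exist returns as much as possible.
short <- str_sub("a", 1, 5)
short
#> [1] "a"
```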

15.1.6 Locales

Above I used str_to_lower() to change the text to lower case. You can also use str_to_upper() or str_to_title(). However, changing case is more complicated than it might at first appear, because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:

# Turkish has two i's (with and without a dot), and it
# has a different rule for capitalising each of them:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"

The locale is specified as an ISO 639 language code, which is a two- or three-letter abbreviation. If you do not already know the code for your language, Wikipedia has a good list. If you leave the locale blank, it will use the current locale, as provided by your operating system.

Another important operation that is affected by the locale is sorting. The base R order() and sort() functions sort strings using the current locale. If you want robust behavior across different computers, you may want to use str_sort() and str_order(), which take an additional locale argument:

x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en")  # English
#> [1] "apple"    "banana"   "eggplant"
str_sort(x, locale = "haw") # Hawaiian
#> [1] "apple"    "eggplant" "banana"
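str_order() is the companion to str_sort(): instead of the sorted strings it returns their positions. A minimal sketch, reusing the vector from above (idx is just an illustrative name):

```r
library(stringr)

x <- c("apple", "eggplant", "banana")

# str_order() returns the positions that would sort the vector.
idx <- str_order(x, locale = "en")
idx
#> [1] 1 3 2

# Indexing with those positions reproduces str_sort().
identical(x[idx], str_sort(x, locale = "en"))
#> [1] TRUE
```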

15.1.7 Challenges

Challenge 1.

In your own words, describe the difference between the sep and collapse arguments to str_c().

Challenge 2.

Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

Challenge 3.

What does str_trim() do? What is the opposite of str_trim()?

15.2 Regular Expressions

Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you understand them, you will find them extremely useful.

To learn regular expressions, we will use str_view() and str_view_all(). These functions take a character vector and a regular expression and show you how they match. (In stringr 1.5.0 and later, str_view() highlights every match and str_view_all() is deprecated.) We will start with very simple regular expressions and then gradually get more and more complicated. Once you have mastered pattern matching, you will learn how to apply those ideas with various stringr functions.

15.2.1 Basic Matches

The simplest patterns match exact strings:

x <- c("apple", "banana", "pear")
str_view(x, "an")

The next step up in complexity is ., which matches any character (except a newline):

x <- c("apple", "banana", "pear")
str_view(x, ".a.")

15.2.2 Escape Characters

If “.” matches any character, how do you match the character “.”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behavior.

Regexps use the backslash, \, to escape special behavior. So, to match a literal ., you need the regexp \.. Unfortunately, this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So, to create the regular expression \., we need the string "\\.".

# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
#> \.

In this lesson, I will write the regular expression as \. and strings that represent the regular expression as "\\.".
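To see the escaping rules in action, here is a small sketch. It uses str_detect(), which is covered in section 15.3.1, only because it prints a plain TRUE/FALSE result. The vectors x and path are illustrative.

```r
library(stringr)

x <- c("abc", "a.c", "bef")

# An unescaped . matches any character, so "abc" matches too.
any_char <- str_detect(x, "a.c")
any_char
#> [1]  TRUE  TRUE FALSE

# The regexp \. (written as the string "\\.") matches only a literal dot.
literal_dot <- str_detect(x, "a\\.c")
literal_dot
#> [1] FALSE  TRUE FALSE

# A literal backslash needs the regexp \\, which is the string "\\\\".
path <- "a\\b"
has_backslash <- str_detect(path, "\\\\")
has_backslash
#> [1] TRUE
```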

15.2.3 Anchors

By default, regular expressions will match any part of a string. It is often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:

  • ^ to match the start of the string.
  • $ to match the end of the string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")

To remember which is which, try this mnemonic which I learned from Evan Misshula: If you begin with power (^), you end up with money ($).

To force a regular expression to only match a complete string, anchor it with both ^ and $:

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
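Because str_view() produces highlighted output rather than a value, the same comparison can be sketched with str_detect() (introduced in section 15.3.1), which returns one TRUE or FALSE per string:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Unanchored: "apple" may appear anywhere in the string.
unanchored <- str_detect(x, "apple")
unanchored
#> [1] TRUE TRUE TRUE

# Anchored at both ends: the whole string must be exactly "apple".
anchored <- str_detect(x, "^apple$")
anchored
#> [1] FALSE  TRUE FALSE
```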

15.2.4 Character Classes and Alternatives

There are a number of special patterns that match more than one character. You have already seen ., which matches any character apart from a newline. There are four other useful tools:

  • \d: Matches any digit.
  • \s: Matches any whitespace (e.g., space, tab, newline).
  • [abc]: Matches a, b, or c.
  • [^abc]: Matches anything except a, b, or c.

Remember that, to create a regular expression containing \d or \s, you will need to escape the \ for the string, so you will type "\\d" or "\\s".
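A minimal sketch of the four tools, using made-up input vectors and the str_detect() and str_extract() functions from section 15.3:

```r
library(stringr)

x <- c("room 101", "nodigits", "tab\there")

# \d: does the string contain a digit?
has_digit <- str_detect(x, "\\d")
has_digit
#> [1]  TRUE FALSE FALSE

# \s: does the string contain whitespace (a space or a tab here)?
has_space <- str_detect(x, "\\s")
has_space
#> [1]  TRUE FALSE  TRUE

y <- c("apple", "cab", "xyz")

# [abc]: first character that is a, b, or c (NA if there is none).
first_in <- str_extract(y, "[abc]")
first_in
#> [1] "a" "c" NA

# [^abc]: first character that is not a, b, or c.
first_out <- str_extract(y, "[^abc]")
first_out
#> [1] "p" NA  "x"
```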

A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.

# Look for a literal character that normally has special meaning in a regex:
x <- c("abc", "a.c", "a*c", "a c")
str_view(x, "a[.]c")
str_view(x, ".[*]c")
str_view(x, "a[ ]")

This works for most (but not all) regex metacharacters: $, ., |, ?, *, +, (, ), [, and {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: ], \, ^, and -.

You can use alternation to pick between two or more alternative patterns. For example, abc|deaf will match either "abc" or "deaf". Note that | has low precedence, so abc|deaf never means ab, then c or d, then eaf.

Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

x <- c("grey", "gray")
str_view(x, "gr(e|a)y")
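Precedence matters most when alternation is combined with anchors. The sketch below, with an illustrative vector, shows how the parentheses change what ^ and $ apply to:

```r
library(stringr)

x <- c("abc", "deaf", "abcd")

# Without parentheses, ^ applies only to abc and $ only to deaf,
# so "abcd" matches because it starts with abc.
loose <- str_detect(x, "^abc|deaf$")
loose
#> [1] TRUE TRUE TRUE

# With parentheses, both anchors apply to the whole alternation.
strict <- str_detect(x, "^(abc|deaf)$")
strict
#> [1]  TRUE  TRUE FALSE
```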

15.2.5 Repetition

The next step up in power involves controlling how many times a pattern matches:

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, "C[LX]+")
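To see exactly which text each pattern picks up, the sketch below swaps str_view() for str_extract() (section 15.3.3), which returns the first match as a string. The last two patterns are additions of mine to contrast * with +.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

str_extract(x, "CC?")     # one C, then an optional second C
#> [1] "CC"
str_extract(x, "CC+")     # one C, then one or more further Cs
#> [1] "CCC"
str_extract(x, "C[LX]+")  # a C followed by one or more of L or X
#> [1] "CLXXX"
str_extract(x, "CL*")     # * allows zero Ls, so the first C matches alone
#> [1] "C"
str_extract(x, "CL+")     # + requires at least one L
#> [1] "CL"
```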

15.2.6 Regex Resources

For more information on regular expressions, see:

  1. This tutorial.
  2. This cheatsheet.

15.2.7 Challenges

Create regular expressions to find all words that

  1. Start with a vowel.

  2. Only contain consonants. (Hint: Think about matching “not”-vowels.)

  3. End with ed, but not with eed.

  4. End with ing or ise.

15.3 Applying regex

Now that you have learned the basics of regular expressions, it is time to learn how to apply them to real problems. In this section, you will learn a wide array of stringr functions that let you

  • Detect matches in a string with str_detect().
  • Count the number of matches with str_count().
  • Extract matches with str_extract() and str_extract_all().
  • Replace matches with str_replace() and str_replace_all().
  • Split a string based on a match with str_split().

15.3.1 Detect matches with str_detect()

To determine if a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input:

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

Remember that, when you use a logical vector in a numeric context, FALSE becomes 0 and TRUE becomes 1. That makes sum() and mean() useful if you want to answer questions about matches across a larger vector:

words <- stringr::words
# See common words
words[1:10]
#>  [1] "a"        "able"     "about"    "absolute" "accept"   "account" 
#>  [7] "achieve"  "across"   "act"      "active"
# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277
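The same idea is easy to verify by hand on a tiny vector. In this sketch, small is an illustrative three-word vector, not part of stringr:

```r
library(stringr)

small <- c("tree", "apple", "toad")

# TRUE counts as 1, so sum() gives the number of matches.
n_start_t <- sum(str_detect(small, "^t"))
n_start_t
#> [1] 2

# mean() gives the proportion: two of the three words end with a vowel.
prop_vowel_end <- mean(str_detect(small, "[aeiou]$"))
prop_vowel_end
#> [1] 0.6666667
```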

A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient str_subset() wrapper:

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

Typically, however, your strings will be one column of a data frame, and you will want to use filter() instead:

df <- data.frame(
  i = seq_along(words),
  word = words
)
df %>% 
  filter(str_detect(word, "x$"))
#>     i word
#> 1 108  box
#> 2 747  sex
#> 3 772  six
#> 4 841  tax

15.3.2 Count the number of matches with str_count()

A variation on str_detect() is str_count(). Rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99

It is natural to use str_count() with mutate():

df1 <- df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )

head(df1)
#>   i     word vowels consonants
#> 1 1        a      1          0
#> 2 2     able      2          2
#> 3 3    about      3          2
#> 4 4 absolute      4          4
#> 5 5   accept      2          4
#> 6 6  account      3          4
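One detail worth knowing when counting is that matches never overlap. A minimal sketch with a made-up string:

```r
library(stringr)

# "aba" could start at positions 1, 3, and 5, but once a match is found
# the search resumes after it, so only two matches are counted.
n_matches <- str_count("abababa", "aba")
n_matches
#> [1] 2
```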

Challenge 1.

For each of the following challenges, try solving it by using both a single regular expression and a combination of multiple str_detect() calls.

  1. Find all words that start or end with x.

  2. Find all words that start with a vowel and end with a consonant.

15.3.3 Extract matches with str_extract() and str_extract_all()

To extract the actual text of a match, use str_extract(). To show that off, we are going to need a more complicated example. I am going to use the Harvard sentences. These are provided in stringr::sentences:

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

Imagine we want to find all sentences that contain a color. We first create a vector of color names, and then turn it into a single regular expression:

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")
color_match
#> [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a color, and then extract the color to figure out which one it is:

# Find sentences with colors
has_color <- str_subset(sentences, color_match)
head(has_color)
#> [1] "Glue the sheet to the dark blue background."
#> [2] "Two blue fish swam in the tank."            
#> [3] "The colt reared and threw the tall rider."  
#> [4] "The wide road shimmered in the hot sun."    
#> [5] "See the cat glaring at the scared mouse."   
#> [6] "A wisp of cloud hung in the blue air."

# Extract the color
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red"  "red"  "red"  "blue"

Note that str_extract() only extracts the first match. This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use str_extract_all(). It returns a list:

all_colors <- str_extract_all(has_color, color_match)
all_colors[15:20]
#> [[1]]
#> [1] "red"
#> 
#> [[2]]
#> [1] "red"
#> 
#> [[3]]
#> [1] "red"
#> 
#> [[4]]
#> [1] "blue"
#> 
#> [[5]]
#> [1] "red"
#> 
#> [[6]]
#> [1] "blue" "red"
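The two functions also differ in how they report strings with no match at all. A small sketch on an illustrative vector (has_color above never contains such strings, because it was filtered first):

```r
library(stringr)

x <- c("red and blue", "green", "none")
pattern <- "red|green|blue"

# str_extract() returns one value per string, with NA for no match.
first_match <- str_extract(x, pattern)
first_match
#> [1] "red"   "green" NA

# str_extract_all() returns a list; an unmatched string gives character(0).
all_matches <- str_extract_all(x, pattern)
lengths(all_matches)
#> [1] 2 1 0
```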

If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest:

str_extract_all(has_color, color_match, simplify = TRUE)
#>       [,1]     [,2] 
#>  [1,] "blue"   ""   
#>  [2,] "blue"   ""   
#>  [3,] "red"    ""   
#>  [4,] "red"    ""   
#>  [5,] "red"    ""   
#>  [6,] "blue"   ""   
#>  [7,] "yellow" ""   
#>  [8,] "red"    ""   
#>  [9,] "red"    ""   
#> [10,] "green"  ""   
#> [11,] "red"    ""   
#> [12,] "red"    ""   
#> [13,] "blue"   ""   
#> [14,] "red"    ""   
#> [15,] "red"    ""   
#> [16,] "red"    ""   
#> [17,] "red"    ""   
#> [18,] "blue"   ""   
#> [19,] "red"    ""   
#> [20,] "blue"   "red"
#> [21,] "red"    ""   
#> [22,] "green"  ""   
#> [23,] "red"    ""   
#> [24,] "red"    ""   
#> [25,] "red"    ""   
#> [26,] "red"    ""   
#> [27,] "red"    ""   
#> [28,] "red"    ""   
#> [29,] "green"  ""   
#> [30,] "red"    ""   
#> [31,] "green"  ""   
#> [32,] "red"    ""   
#> [33,] "purple" ""   
#> [34,] "green"  ""   
#> [35,] "red"    ""   
#> [36,] "red"    ""   
#> [37,] "red"    ""   
#> [38,] "red"    ""   
#> [39,] "red"    ""   
#> [40,] "blue"   ""   
#> [41,] "red"    ""   
#> [42,] "blue"   ""   
#> [43,] "red"    ""   
#> [44,] "red"    ""   
#> [45,] "red"    ""   
#> [46,] "red"    ""   
#> [47,] "green"  ""   
#> [48,] "green"  ""   
#> [49,] "green"  "red"
#> [50,] "red"    ""   
#> [51,] "red"    ""   
#> [52,] "yellow" ""   
#> [53,] "red"    ""   
#> [54,] "orange" "red"
#> [55,] "red"    ""   
#> [56,] "red"    ""   
#> [57,] "red"    ""

Challenge 2.

In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a color. Modify the regex to fix the problem.

15.3.4 Replace matches with str_replace() and str_replace_all()

str_replace() and str_replace_all() allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-") # replace the first instance of a match
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-") # replace all instances of a match
#> [1] "-ppl-"  "p--r"   "b-n-n-"

With str_replace_all(), you can perform multiple replacements by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"
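Replacement combines naturally with the character classes and repetition from the previous section. As one illustrative use, this sketch collapses any run of whitespace into a single space (messy is a made-up vector):

```r
library(stringr)

messy <- c("too   many   spaces", "tabs\tand\nnewlines")

# \s+ matches one or more whitespace characters in a row.
tidy <- str_replace_all(messy, "\\s+", " ")
tidy
#> [1] "too many spaces"   "tabs and newlines"
```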

15.3.5 Split on a match with str_split()

Use str_split() to split a string up into pieces. For example, we could split sentences into words:

sentences %>%
  head(5) %>% 
  str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
#> 
#> [[4]]
#> [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
#> [8] "rare"    "dish."  
#> 
#> [[5]]
#> [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

Like the other stringr functions that return a list, you can use simplify = TRUE to return a matrix:

sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
#>      [,9]   
#> [1,] ""     
#> [2,] ""     
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""

You can also request a maximum number of pieces:

fields <- c("Name: Rochelle", "State: IL", "Age: 34")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#>      [,1]    [,2]      
#> [1,] "Name"  "Rochelle"
#> [2,] "State" "IL"      
#> [3,] "Age"   "34"

Instead of splitting up strings by patterns, you can also split them up by character, line, sentence, or word boundaries with boundary():

x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
#> [8] "sentence"
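The pattern passed to str_split() is itself a regular expression, so the separator does not have to be a fixed string. A minimal sketch with a made-up input:

```r
library(stringr)

x <- "a1b22c333d"

# Split wherever one or more digits appear.
pieces <- str_split(x, "\\d+")[[1]]
pieces
#> [1] "a" "b" "c" "d"

# With n = 2, splitting stops after the first separator.
two_pieces <- str_split(x, "\\d+", n = 2)[[1]]
two_pieces
#> [1] "a"        "b22c333d"
```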

Challenge 3.

  1. Split up a string like "apples, pears, and bananas" into individual components.

  2. What does splitting with an empty string ("") do? Experiment, and then read the documentation.

15.4 Other Types of Patterns

When you use a pattern that is a string, it is automatically wrapped into a call to regex():

# The regular call
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))

You can use the other arguments of regex() to control details of the match:

  • ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale.

    bananas <- c("banana", "Banana", "BANANA")
    str_view(bananas, "banana")
    str_view(bananas, regex("banana", ignore_case = TRUE))

  • multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string.

    x <- "Line 1\nLine 2\nLine 3"
    str_extract_all(x, "^Line")[[1]]
    #> [1] "Line"
    str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
    #> [1] "Line" "Line" "Line"
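A regex() object can be passed to any stringr function that takes a pattern, not only str_view(). As a sketch, here is the ignore_case example again with str_detect(), so the effect shows up as plain logical values:

```r
library(stringr)

bananas <- c("banana", "Banana", "BANANA")

# A bare string is case sensitive.
exact <- str_detect(bananas, "banana")
exact
#> [1]  TRUE FALSE FALSE

# Wrapping the pattern in regex() lets us relax that.
any_case <- str_detect(bananas, regex("banana", ignore_case = TRUE))
any_case
#> [1] TRUE TRUE TRUE
```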

Acknowledgments

This page was adapted from the following source:

R for Data Science, licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0.