Chapter 16 Programming in R
This unit covers some more advanced programming in R, namely conditional flow, functions, and iteration.
Mastering these skills will make you virtually invincible in R!
Note that these concepts are not specific to R. While the syntax might vary, the basic ideas of flow, functions, and iteration are common across all scripting languages. So if you ever think of picking up Python or something else, it is critical to familiarize yourself with these concepts.
16.1 Conditional Flow
Sometimes you only want to execute code if a certain condition is met. To do that, we use an if-else statement. It looks like this:
if (condition) {
  # Code executed when condition is TRUE
} else {
  # Code executed when condition is FALSE
}
condition is a statement that must always evaluate to either TRUE or FALSE. This is similar to filter(), except condition can only be a single value (i.e., a vector of length 1), whereas filter() works for entire vectors (or columns).
Let’s look at a simple example:
age = 84
if (age > 60) {
  print("OK Boomer")
} else {
  print("But you don't look like a professor!")
}
#> [1] "OK Boomer"
We refer to the first print command as the first branch.
Let’s change the age variable to execute the second branch:
age = 20
if (age > 60) {
  print("OK Boomer")
} else {
  print("But you don't look like a professor!")
}
#> [1] "But you don't look like a professor!"
16.1.1 Multiple Conditions
You can chain conditional statements together:
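A minimal sketch of chaining with else if. The cutoffs and labels are made up for illustration; R checks the conditions from top to bottom and runs only the first branch whose condition is TRUE.

```r
age <- 45

if (age > 60) {
  label <- "senior"
} else if (age > 30) {
  # Reached only when the first condition was FALSE
  label <- "mid-career"
} else {
  # Reached only when every condition above was FALSE
  label <- "junior"
}

print(label)
#> [1] "mid-career"
```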
16.1.2 Complex Statements
We can generate more complex conditional statements with Boolean operators like & and |:
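A short sketch of both operators; the variable names and thresholds are hypothetical. Inside if(), each side must be a single TRUE or FALSE. R also has && and ||, which work only on single values and skip the right-hand side when the left-hand side already settles the answer, so many style guides prefer them in if() conditions.

```r
age <- 25
is_student <- TRUE

# & is TRUE only when both sides are TRUE
if (age < 30 & is_student) {
  discount <- "student discount"
} else {
  discount <- "no discount"
}

# | is TRUE when at least one side is TRUE
if (age < 18 | age > 65) {
  fare <- "reduced fare"
} else {
  fare <- "full fare"
}

print(discount)
#> [1] "student discount"
print(fare)
#> [1] "full fare"
```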
16.1.3 Code Style
Both if and function should (almost) always be followed by squiggly brackets ({}), and the contents should be indented. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it is followed by else. Always indent the code inside curly braces.
16.1.4 if vs. if_else
Because if-else conditional statements like the ones outlined above must always resolve to a single TRUE or FALSE, they cannot be used for vector operations. Vector operations are where you make multiple comparisons simultaneously for each value stored inside a vector.
Consider the gapminder data and imagine you wanted to create a new column identifying whether or not a country-year observation has a life expectancy of at least 35.
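library(tidyverse)  # provides mutate(), if_else(), and the %>% pipe
library(gapminder)  # provides the gapminder data
gap <- gapminder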
head(gap)
#> # A tibble: 6 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
This sounds like a classic if-else operation. For each observation, if lifeExp is greater than or equal to 35, then the value in the new column should be 1. Otherwise, it should be 0. But what happens if we try to implement this using an if-else operation like above?
gap_if <- gap %>%
  mutate(life.35 = if (lifeExp >= 35) {
    1
  } else {
    0
  })
#> Warning: Problem with `mutate()` input `life.35`.
#> ℹ the condition has length > 1 and only the first element will be used
#> ℹ Input `life.35` is `if (...) NULL`.
#> Warning in if (lifeExp >= 35) {: the condition has length > 1 and only the first
#> element will be used
head(gap_if)
#> # A tibble: 6 x 7
#> country continent year lifeExp pop gdpPercap life.35
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. 0
#> 2 Afghanistan Asia 1957 30.3 9240934 821. 0
#> 3 Afghanistan Asia 1962 32.0 10267083 853. 0
#> 4 Afghanistan Asia 1967 34.0 11537966 836. 0
#> 5 Afghanistan Asia 1972 36.1 13079460 740. 0
#> 6 Afghanistan Asia 1977 38.4 14880372 786. 0
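This did not work correctly. Because if() can only handle a single TRUE/FALSE value, it only checked the first row of the data frame. That row's lifeExp is 28.801, which is below 35, so the else branch ran once and its value, 0, was recycled into a vector of length 1704. (In R 4.2.0 and later, a condition of length greater than 1 is an error rather than a warning, so this code stops instead of silently producing the wrong column.)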
Because we in fact want to make this if-else comparison 1704 times, we should instead use if_else(). This vectorizes the if-else comparison and makes a separate comparison for each row of the data frame. This allows us to correctly generate this new column.
gap_ifelse <- gap %>%
  mutate(life.35 = if_else(lifeExp >= 35, 1, 0))
gap_ifelse
#> # A tibble: 1,704 x 7
#> country continent year lifeExp pop gdpPercap life.35
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. 0
#> 2 Afghanistan Asia 1957 30.3 9240934 821. 0
#> 3 Afghanistan Asia 1962 32.0 10267083 853. 0
#> 4 Afghanistan Asia 1967 34.0 11537966 836. 0
#> 5 Afghanistan Asia 1972 36.1 13079460 740. 1
#> 6 Afghanistan Asia 1977 38.4 14880372 786. 1
#> # … with 1,698 more rows
16.2 Functions
Functions are the basic building blocks of programs. Think of them as “mini-scripts” or “tiny commands.” We have already used dozens of functions created by others (e.g., filter() and mean()).
This lesson teaches you how to write your own functions and why you would want to do so. The details are pretty simple, but this is one of those ideas where it is good to get lots of practice!
16.2.1 Why Write Functions?
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. For example, take a look at the following code:
gap <- gapminder
gap_norm <- gap %>%
mutate(pop_norm = (pop - min(pop)) / (max(pop) - min (pop)),
gdp_norm = (gdpPercap - min(gdpPercap)) / (max(gdpPercap) - min (gdpPercap)),
life_norm = (lifeExp - min(lifeExp) / (max(pop)) - min (lifeExp)))
summary(gap_norm$pop_norm)
You might be able to puzzle out that this rescales each numeric column to have a range from 0 to 1. But did you spot the mistakes? I made two errors when copying-and-pasting the code for lifeExp.
Functions have a number of advantages over this “copy-and-paste” approach:
- They are easy to reuse. If you need to change things, you only have to update code in one place instead of many.
- They are self-documenting. Functions name pieces of code the way variables name strings and numbers. Give your function a good name and you will easily remember the function and its purpose.
- They are easier to debug. There are fewer chances to make mistakes, because the code only exists in one location (e.g., you cannot update a variable name in one place but forget to update it in another).
16.2.2 Anatomy of a Function
Functions have three key components:
- A name. This should be informative and describe what the function does.
- The arguments, or list of inputs, to the function. They go inside the parentheses in function().
- The body. This is the block of code within {} that immediately follows function(...), and it is the code that you develop to perform the action described in the name using the arguments you provide.
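A minimal sketch of these three components, using the names my_function, x, and y. The body (adding the two arguments) is only a placeholder chosen for illustration.

```r
# my_function is the name; x and y are the arguments
my_function <- function(x, y) {
  # Everything between the braces is the body
  result <- x + y  # placeholder computation, assumed for this sketch
  return(result)
}

my_function(1, 2)
#> [1] 3
```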
In a definition of the form my_function <- function(x, y) { ... }, my_function is the name of the function, x and y are the arguments, and the stuff inside the {} is the body.
16.2.3 Writing a Function
Let’s re-write the scaling code above as a function. To write a function, you need to first analyze the code. How many inputs does it have?
# The corrected code
gap <- gapminder
gap_norm <- gap %>%
mutate(pop_norm = (pop - min(pop)) / (max(pop) - min (pop)),
gdp_norm = (gdpPercap - min(gdpPercap)) / (max(gdpPercap) - min (gdpPercap)),
life_norm = (lifeExp - min(lifeExp)) / (max(lifeExp) - min (lifeExp)))
# Focus on the line
# pop_norm = (pop - min(pop)) / (max(pop) - min (pop))
This code only has one input: gap$pop. To make the inputs clearer, it is a good idea to rewrite the code using temporary variables with general names. Here this code only requires a single numeric vector, which I will call x:
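A sketch of that rewrite. To keep it self-contained, x is a stand-in vector holding the six population values shown in the head(gap) output earlier; with the gapminder package loaded, x <- gap$pop would work the same way.

```r
# Stand-in for gap$pop: the first six population values printed above
x <- c(8425333, 9240934, 10267083, 11537966, 13079460, 14880372)

# Same formula as pop_norm, written against the general name x
x_scaled <- (x - min(x)) / (max(x) - min(x))

range(x_scaled)
#> [1] 0 1
```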
There is still some duplication in this code. We are calculating some version of the range three times. Pulling out intermediate calculations into named variables is a good practice, because it becomes clearer what the code is doing.
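One way to pull out the intermediate calculation, sketched on a small made-up vector: range() returns the minimum and maximum together, so they are computed once and then reused by position.

```r
x <- c(2, 4, 6, 10)  # made-up input for illustration

rng <- range(x)  # rng[1] is the minimum, rng[2] is the maximum
x_scaled <- (x - rng[1]) / (rng[2] - rng[1])

x_scaled
#> [1] 0.00 0.25 0.50 1.00
```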
Now that I have simplified the code and checked that it still works, I can turn it into a function:
rescale01 <- function(x) {
  rng <- range(x)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}
Note the overall process: I only made the function after I had figured out how to make it work with a simple input. It is easier to start with working code and turn it into a function; it is harder to create a function and then try to make it work.
At this point, it is a good idea to check your function with a few different inputs:
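A couple of quick checks, with the function repeated so the snippet runs on its own. The second input is made up for illustration.

```r
rescale01 <- function(x) {
  rng <- range(x)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}

rescale01(c(-10, 0, 10))
#> [1] 0.0 0.5 1.0

rescale01(c(1, 2, 3, 4, 5))
#> [1] 0.00 0.25 0.50 0.75 1.00
```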
16.2.4 Using a Function
Two important points about using (or calling) functions:
- Notice that when we call a function, we are passing a value into it that is assigned to the parameter we defined when writing the function. For example, in the call rescale01(c(-10, 0, 10)), the parameter x is automatically assigned to c(-10, 0, 10).
- When using functions, by default the returned object is merely printed to the screen. If you want it saved, you need to assign it to an object.
Let’s see if we can simplify the original example with our brand new function:
rescale01 <- function(x) {
  rng <- range(x)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}
gap_norm <- gap %>%
  mutate(pop_norm = rescale01(pop),
         gdp_norm = rescale01(gdpPercap),
         life_norm = rescale01(lifeExp))
Compared to the original, this code is easier to understand, and we have eliminated one class of copy-and-paste errors. There is still quite a bit of duplication, since we are doing the same thing to multiple columns. We will learn how to eliminate that duplication in the lesson on iteration.
Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include NA values, and rescale01() fails:
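A short illustration of the failure, using a made-up vector with one missing value. Because range() returns NA for both ends when the input contains an NA, every rescaled value becomes NA.

```r
rescale01 <- function(x) {
  rng <- range(x)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}

rescale01(c(1, 2, 3, NA, 5))
#> [1] NA NA NA NA NA
```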
Because we have extracted the code into a function, we only need to make the fix in one place:
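One possible fix, sketched under the assumption that missing values should be ignored when finding the range: pass na.rm = TRUE to range(). The NA input stays NA in the output, but the other values are rescaled correctly.

```r
rescale01 <- function(x) {
  # na.rm = TRUE drops missing values before computing min and max
  rng <- range(x, na.rm = TRUE)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}

rescale01(c(1, 2, 3, NA, 5))
#> [1] 0.00 0.25 0.50   NA 1.00
```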
16.2.5 Challenges
Challenge 1.
Write a function that calculates the sum of the squared value of two numbers. For instance, it should generate the following output:
Challenge 2.
Write both_na(), a function that takes two vectors and returns the total number of NAs in both vectors.
For instance, it should generate the following output:
Challenge 3.
Fill in the blanks to create a function that takes a name like "Rochelle Terman" and returns that name in uppercase and reversed, like "TERMAN, ROCHELLE".
standard_names <- function(name){
  upper_case = toupper(____) # Make upper
  upper_case_vec = strsplit(_____, split = ' ')[[1]] # Turn into a vector
  first_name = ______ # Take first name
  last_name = _______ # Take last name
  reversed_name = paste(______, _______, sep = ", ") # Reverse and separate by a comma and space
  return(reversed_name)
}
16.3 Iteration
In the last unit, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. Avoiding duplication allows for more readable, more flexible, and less error-prone code.
Functions are one method of reducing duplication in your code. Another tool for reducing duplication is iteration, which lets you do the same task to multiple inputs.
In this chapter, you will learn about three approaches to iteration:
- Vectorized functions.
- map and functional programming.
- Scoped verbs in dplyr.
16.3.1 Vectorized Functions
Most of R’s built-in functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on one element at a time.
That means you should never need to perform explicit iteration when performing simple mathematical computations.
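A minimal example, using a made-up vector: multiplying a vector by a number multiplies every element, with no loop required.

```r
x <- 1:4

x * 2
#> [1] 2 4 6 8
```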
Notice that the multiplication happened to each element of the vector. Most built-in functions also operate element-wise on vectors:
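For instance, sqrt() and abs() each return one result per input element. The input vectors are made up for illustration.

```r
sqrt(c(1, 4, 9, 16))
#> [1] 1 2 3 4

abs(c(-2, 0, 3))
#> [1] 2 0 3
```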
We can also add two vectors together:
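A small sketch with two made-up vectors of equal length, named x and y to match the discussion:

```r
x <- 1:4
y <- 6:9

x + y  # 1 + 6, 2 + 7, 3 + 8, 4 + 9
#> [1]  7  9 11 13
```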
Notice that each element of x was added to its corresponding element of y.
What happens if you add two vectors of different lengths?
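A sketch with made-up vectors. In the first sum the shorter vector is reused five times without any message. In the second, the longer length is not a multiple of the shorter one, so R still returns a result but also issues a warning.

```r
# Length 10 plus length 2: the shorter vector is reused silently
1:10 + 1:2
#> [1]  2  4  4  6  6  8  8 10 10 12

# Length 3 plus length 2: same reuse, but R warns that the longer
# object length is not a multiple of the shorter object length
1:3 + 1:2
#> [1] 2 4 4
```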
Here, R will expand the shorter vector to the same length as the longer one. This is called recycling. This usually (but not always) happens silently: R warns you only when the longer length is not a multiple of the shorter one. Beware!
16.3.2 Functional Programming and map
You might have used for loops in other languages. Loops are not as important in R as they are in other languages, because R is a functional programming language. This means that it is possible to wrap up for loops in a function and call that function instead of using the for loop directly.
The pattern of looping over a vector, doing something to each element, and saving the results is so common that the purrr package (part of tidyverse) provides a family of functions to do it for you. They effectively eliminate the need for many common for loops.
There is one function for each type of output:
- map() makes a list.
- map_lgl() makes a logical vector.
- map_int() makes an integer vector.
- map_dbl() makes a double vector.
- map_chr() makes a character vector.
Each function takes a vector as input, applies a function to each piece, and then returns a new vector that is the same length (and has the same names) as the input.
NB: Some people will tell you to avoid for loops because they are slow. They are wrong! (Well, at least they are rather out of date, as for loops have not been slow for many years.) The main benefit of using functions like map() is not speed, but clarity: They make your code easier to write and to read.
To see how map works, consider this simple data frame:
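A sketch of such a data frame. The column names a, b, c, and d match the output shown in this section; the number of rows (10), the use of rnorm(), and the seed are assumptions, so the summary values you get will differ from the ones printed here.

```r
library(tibble)

set.seed(123)  # arbitrary seed, assumed here only for reproducibility

# Four numeric columns of random normal draws
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
```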
What if we wanted to calculate the mean, median, and standard deviation of each column?
map_dbl(df, mean)
#> a b c d
#> -0.441 -0.179 -0.124 0.152
map_dbl(df, median)
#> a b c d
#> -0.2458 -0.2873 -0.0567 0.1443
map_dbl(df, sd)
#> a b c d
#> 1.118 1.176 1.047 0.964
The data can even be piped!
df %>% map_dbl(mean)
#> a b c d
#> -0.441 -0.179 -0.124 0.152
df %>% map_dbl(median)
#> a b c d
#> -0.2458 -0.2873 -0.0567 0.1443
df %>% map_dbl(sd)
#> a b c d
#> 1.118 1.176 1.047 0.964
We can also pass additional arguments. For example, the function mean accepts an optional argument trim. From the help file: “The fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.”
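A minimal sketch of passing trim through map_dbl(), using a small made-up data frame named toy (hypothetical) so the effect is easy to check by hand. Any arguments after the function name are handed on to that function for every column.

```r
library(purrr)

# Made-up data: column a has one extreme value
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 20, 30, 40, 50))

# trim = 0.2 drops the lowest and highest value of each 5-element column
# before averaging, so a becomes mean(2, 3, 4) and b becomes mean(20, 30, 40)
trimmed <- map_dbl(toy, mean, trim = 0.2)

trimmed[["a"]]
#> [1] 3
trimmed[["b"]]
#> [1] 30
```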
Check out other fun applications of map functions here.
16.3.3 Challenges
Write code that uses one of the map functions to:
Challenge 1.
Calculate the arithmetic mean for every column in mtcars.
Challenge 2.
Calculate the number of unique values in each column of iris.
Challenge 3.
Generate 10 random normals for each of \(\mu = -10\), \(0\), \(10\), and \(100\).
16.3.4 Scoped Verbs
The last iteration technique we will discuss is scoped verbs in dplyr.
Frequently, when working with dataframes, we want to apply a function to multiple columns. For example, let’s say we want to calculate the mean value of each column in mtcars.
If we wanted to calculate the average of a single column, it would be pretty simple using just regular dplyr functions:
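For a single column, a sketch might look like this, using mpg as the example column. The result is a one-row data frame whose value is about 20.1.

```r
library(dplyr)

# Average miles per gallon across all cars in mtcars
mpg_mean <- mtcars %>%
  summarize(mpg = mean(mpg))

mpg_mean
```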
But if we want to calculate the mean for all of them, we would have to duplicate this code many times over:
mtcars %>%
  summarize(mpg = mean(mpg),
            cyl = mean(cyl),
            disp = mean(disp),
            hp = mean(hp),
            drat = mean(drat),
            wt = mean(wt),
            qsec = mean(qsec),
            vs = mean(vs),
            am = mean(am),
            gear = mean(gear),
            carb = mean(carb))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 20.1 6.19 231 147 3.6 3.22 17.8 0.438 0.406 3.69 2.81
This is very repetitive and prone to mistakes!
We just saw one approach to solve this problem: map. Another approach is scoped verbs.
Scoped verbs allow you to use standard verbs (or functions) in dplyr that affect multiple variables at once.
- _if allows you to pick variables based on a predicate function like is.numeric() or is.character().
- _at allows you to pick variables using the same syntax as select().
- _all operates on all variables.
These verbs can apply to summarize, filter, or mutate. Let’s go over summarize:
summarize_all()
summarize_all() takes a dataframe and a function and applies that function to each column.
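A minimal sketch on mtcars, where every column is numeric. Note that since dplyr 1.0.0 the scoped verbs are superseded by across(); the second call shows what is, as far as I know, the equivalent in that newer style.

```r
library(dplyr)

# Mean of every column, one row of output
all_means <- mtcars %>%
  summarize_all(.funs = mean)

# Newer equivalent (requires dplyr 1.0.0 or later)
all_means_across <- mtcars %>%
  summarize(across(everything(), mean))

all_means
```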
summarize_at()
summarize_at() allows you to pick columns in the same way as select(), that is, based on their names. There is one small difference: You need to wrap the complete selection with the vars() helper (this avoids ambiguity).
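A short sketch, with mpg and wt chosen arbitrarily as the columns to summarize. Inside vars() you can use the same selection helpers as in select(), such as starts_with().

```r
library(dplyr)

# Mean of two named columns only
some_means <- mtcars %>%
  summarize_at(.vars = vars(mpg, wt), .funs = mean)

some_means
```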
summarize_if()
summarize_if() allows you to pick variables to summarize based on some property of the column. For example, what if we want to apply a numeric summary function only to numeric columns?
iris %>%
  summarize_if(.predicate = is.numeric, .funs = mean)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.84 3.06 3.76 1.2
mutate and filter work in a similar way. To see more, check out Scoped verbs by the Data Challenge Lab.
Acknowledgments
A good portion of this lesson is based on: