Chapter 16 Programming in R

This unit covers some more advanced programming in R - namely:

  1. Conditional Flow.
  2. Functions.
  3. Iteration.

Mastering these skills will make you virtually invincible in R!

Note that these concepts are not specific to R. While the syntax might vary, the basic ideas of conditional flow, functions, and iteration are common across scripting languages. So if you ever think of picking up Python or another language, it pays to be familiar with these concepts.

library(tidyverse)
library(gapminder)

16.1 Conditional Flow

Sometimes you only want to execute code if a certain condition is met. To do that, we use an if-else statement. It looks like this:

if (condition) {
  # Code executed when condition is TRUE
} else {
  # Code executed when condition is FALSE
}

condition is a statement that must always evaluate to either TRUE or FALSE. This is similar to filter(), except condition can only be a single value (i.e., a vector of length 1), whereas filter() works for entire vectors (or columns).

Let’s look at a simple example:

age = 84
if (age > 60) {
    print("OK Boomer")
} else {
    print("But you don't look like a professor!")
}
#> [1] "OK Boomer"

We refer to the first print command as the first branch.

Let’s change the age variable to execute the second branch:

age = 20
if (age > 60) {
    print("OK Boomer")
} else {
    print("But you don't look like a professor!")
}
#> [1] "But you don't look like a professor!"

16.1.1 Multiple Conditions

You can chain conditional statements together:

if (this) {
  # Do that
} else if (that) {
  # Do something else
} else {
  # Do something completely different
}
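
As a concrete version of the template above, here is a small sketch that sorts a temperature into three categories. The variable temp_c and the cutoffs are made up for illustration. R checks the conditions from top to bottom and runs only the first branch whose condition is TRUE:

```r
# Hypothetical temperature reading in degrees Celsius
temp_c <- 18

if (temp_c < 10) {
  label <- "cold"
} else if (temp_c < 25) {
  label <- "mild"
} else {
  label <- "hot"
}

print(label)
#> [1] "mild"
```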

16.1.2 Complex Statements

We can build more complex conditional statements with Boolean operators. Inside if(), use && (and) and || (or), which work on single values; & and | are their vectorized counterparts:

age = 45 

if (age > 60) {
    print("OK Boomer")
} else if (age <= 60 && age > 40) {
    print("How's the midlife crisis?")
} else {
    print("But you don't look like a professor!")
}
#> [1] "How's the midlife crisis?"
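
The difference between the single and double forms of these operators is easy to miss, so here is a short sketch of it. The ages vector is made up for illustration:

```r
ages <- c(45, 70, 20)

# & compares element by element and returns one value per element
in_range <- ages > 40 & ages < 60
in_range
#> [1]  TRUE FALSE FALSE

# && expects single values, which is exactly what if() needs
first_age <- ages[1]
first_ok <- first_age > 40 && first_age < 60
first_ok
#> [1] TRUE

# && also stops as soon as the answer is known: the right-hand side
# is never evaluated here, so stop() does not raise an error
short_circuit <- FALSE && stop("never reached")
short_circuit
#> [1] FALSE
```

Because if() needs exactly one TRUE or FALSE, the double forms (&& and ||) are usually the safer choice inside a condition.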

16.1.3 Code Style

Both if and function should (almost) always be followed by curly braces ({}), and the contents should be indented. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.

An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it is followed by else. Always indent the code inside curly braces.

# Bad
if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

# Good
if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

16.1.4 if vs. if_else

Because if-else conditional statements like the ones outlined above must always resolve to a single TRUE or FALSE, they cannot be used for vector operations. Vector operations are where you make multiple comparisons simultaneously for each value stored inside a vector.
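
To see what a vector operation looks like on its own, here is a tiny sketch with a few made-up life expectancy values. A single comparison returns one TRUE or FALSE per element, which is more than if() can handle:

```r
life_exp <- c(28.8, 30.3, 36.1, 38.4)

# One comparison per element: the result has the same length as life_exp
over_35 <- life_exp >= 35
over_35
#> [1] FALSE FALSE  TRUE  TRUE

length(over_35)
#> [1] 4
```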

Consider the gapminder data and imagine you wanted to create a new column identifying whether or not a country-year observation has a life expectancy of at least 35.

gap <- gapminder
head(gap)
#> # A tibble: 6 x 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.
#> 2 Afghanistan Asia       1957    30.3  9240934      821.
#> 3 Afghanistan Asia       1962    32.0 10267083      853.
#> 4 Afghanistan Asia       1967    34.0 11537966      836.
#> 5 Afghanistan Asia       1972    36.1 13079460      740.
#> 6 Afghanistan Asia       1977    38.4 14880372      786.

This sounds like a classic if-else operation. For each observation, if lifeExp is greater than or equal to 35, then the value in the new column should be 1. Otherwise, it should be 0. But what happens if we try to implement this using an if-else operation like above?

gap_if <- gap %>%
   mutate(life.35 = if(lifeExp >= 35){
     1
   } else {
     0
   })
#> Warning: Problem with `mutate()` input `life.35`.
#> ℹ the condition has length > 1 and only the first element will be used
#> ℹ Input `life.35` is `if (...) NULL`.
#> Warning in if (lifeExp >= 35) {: the condition has length > 1 and only the first
#> element will be used

head(gap_if)
#> # A tibble: 6 x 7
#>   country     continent  year lifeExp      pop gdpPercap life.35
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>   <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.       0
#> 2 Afghanistan Asia       1957    30.3  9240934      821.       0
#> 3 Afghanistan Asia       1962    32.0 10267083      853.       0
#> 4 Afghanistan Asia       1967    34.0 11537966      836.       0
#> 5 Afghanistan Asia       1972    36.1 13079460      740.       0
#> 6 Afghanistan Asia       1977    38.4 14880372      786.       0

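This did not work correctly. Because if() can only handle a single TRUE/FALSE value, it only checked the first row of the data frame. That row contained 28.801, which is less than 35, so if() returned a single 0, and mutate() recycled that value into a column of 1704 zeros. (In R 4.2.0 and later, a condition of length greater than 1 is an error rather than a warning, so this code stops instead of quietly returning a wrong answer.)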

Because we in fact want to make this if-else comparison 1704 times, we should instead use if_else(). This vectorizes the if-else comparison and makes a separate comparison for each row of the data frame. This allows us to correctly generate this new column.

gap_ifelse <- gap %>%
  mutate(life.35 = if_else(lifeExp >= 35, 1, 0))

gap_ifelse
#> # A tibble: 1,704 x 7
#>   country     continent  year lifeExp      pop gdpPercap life.35
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>   <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.       0
#> 2 Afghanistan Asia       1957    30.3  9240934      821.       0
#> 3 Afghanistan Asia       1962    32.0 10267083      853.       0
#> 4 Afghanistan Asia       1967    34.0 11537966      836.       0
#> 5 Afghanistan Asia       1972    36.1 13079460      740.       1
#> 6 Afghanistan Asia       1977    38.4 14880372      786.       1
#> # … with 1,698 more rows
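
When there are more than two outcomes, the vectorized counterpart of an else if chain is dplyr's case_when(). Here is a minimal sketch on a small made-up data frame. The name toy, the cutoffs of 50 and 70, and the column name life_group are all assumptions for illustration:

```r
library(dplyr)

# A few made-up life expectancy values for illustration
toy <- data.frame(lifeExp = c(28.8, 55.2, 78.1))

# Each row gets the label of the first condition that is TRUE;
# the final TRUE acts like the closing else
toy_groups <- toy %>%
  mutate(life_group = case_when(
    lifeExp < 50 ~ "low",
    lifeExp < 70 ~ "medium",
    TRUE         ~ "high"
  ))

toy_groups
#>   lifeExp life_group
#> 1    28.8        low
#> 2    55.2     medium
#> 3    78.1       high
```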

16.2 Functions

Functions are the basic building blocks of programs. Think of them as “mini-scripts” or “tiny commands.” We have already used dozens of functions created by others (e.g., filter() and mean()).

This lesson teaches you how to write your own functions and why you would want to do so. The details are pretty simple, but this is one of those ideas where it is good to get lots of practice!

16.2.1 Why Write Functions?

Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. For example, take a look at the following code:

gap <- gapminder

gap_norm <- gap %>%
  mutate(pop_norm = (pop - min(pop)) / (max(pop) - min (pop)),
         gdp_norm = (gdpPercap - min(gdpPercap)) / (max(gdpPercap) - min (gdpPercap)),
         life_norm = (lifeExp - min(lifeExp) / (max(pop)) - min (lifeExp)))

summary(gap_norm$pop_norm)

You might be able to puzzle out that this rescales each numeric column to have a range from 0 to 1. But did you spot the mistakes? I made two errors when copying-and-pasting the code for lifeExp.

Functions have a number of advantages over this “copy-and-paste” approach:

  • They are easy to reuse. If you need to change things, you only have to update code in one place instead of many.

  • They are self-documenting. Functions name pieces of code the way variables name strings and numbers. Give your function a good name and you will easily remember the function and its purpose.

  • They are easier to debug. There are fewer chances to make mistakes, because the code only exists in one location (e.g., you cannot update a variable name in one place but forget to update it in another).

16.2.2 Anatomy of a Function

Functions have three key components:

  1. A name. This should be informative and describe what the function does.

  2. The arguments, or list of inputs, to the function. They go inside the parentheses in function().

  3. The body. This is the block of code within {} that immediately follows function(...), and it is the code that you develop to perform the action described in the name using the arguments you provide.

my_function <- function(x, y){
  # do
  # something
  # here
  return(result)
}

In this example, my_function is the name of the function, x and y are the arguments, and the stuff inside the {} is the body.
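
To make the three components concrete, here is a small complete function. The name fahrenheit_to_celsius and its argument temp_f are made up for illustration:

```r
# Name: fahrenheit_to_celsius
# Argument: temp_f (a numeric vector of temperatures in Fahrenheit)
# Body: the code inside {}
fahrenheit_to_celsius <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9
  return(temp_c)
}

fahrenheit_to_celsius(212)
#> [1] 100

# Because the arithmetic is vectorized, the function also accepts a vector
fahrenheit_to_celsius(c(32, 50))
#> [1]  0 10
```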

16.2.3 Writing a Function

Let’s re-write the scaling code above as a function. To write a function, you need to first analyze the code. How many inputs does it have?

# The corrected code
gap <- gapminder

gap_norm <- gap %>%
  mutate(pop_norm = (pop - min(pop)) / (max(pop) - min (pop)),
         gdp_norm = (gdpPercap - min(gdpPercap)) / (max(gdpPercap) - min (gdpPercap)),
         life_norm = (lifeExp - min(lifeExp)) / (max(lifeExp) - min (lifeExp)))

# Focus on the line
# pop_norm = (pop - min(pop)) / (max(pop) - min (pop))

This code only has one input: gap$pop. To make the inputs clearer, it is a good idea to rewrite the code using temporary variables with general names. Here this code only requires a single numeric vector, which I will call x:

x <- gap$pop

(x - min(x)) / (max(x) - min(x))

There is still some duplication in this code. We are calculating pieces of the range (min(x) and max(x)) three times. Pulling out intermediate calculations into named variables is a good practice, because it becomes clearer what the code is doing.

x <- gap$pop

rng <- range(x)
(x - rng[1]) / (rng[2] - rng[1])

Now that I have simplified the code and checked that it still works, I can turn it into a function:

rescale01 <- function(x) {
  rng <- range(x)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}

Note the overall process: I only made the function after I had figured out how to make it work with a simple input. It is easier to start with working code and turn it into a function; it is harder to create a function and then try to make it work.

At this point, it is a good idea to check your function with a few different inputs:

rescale01(c(-10, 0, 10))
#> [1] 0.0 0.5 1.0

rescale01(c(1, 2, 3, 5))
#> [1] 0.00 0.25 0.50 1.00
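
If you want these checks to run automatically, one option is to wrap them in stopifnot(), which stays silent when everything is TRUE and raises an error otherwise. The sketch below repeats the two checks above and adds one edge case that is my own choice, a vector whose values are all identical:

```r
rescale01 <- function(x) {
  rng <- range(x)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}

# Silent when every condition holds
stopifnot(
  all.equal(rescale01(c(-10, 0, 10)), c(0, 0.5, 1)),
  all.equal(rescale01(c(1, 2, 3, 5)), c(0, 0.25, 0.5, 1))
)

# Edge case: a constant vector has a range of zero, so we divide 0 by 0
rescale01(c(5, 5, 5))
#> [1] NaN NaN NaN
```

Whether NaN is an acceptable answer for constant input depends on your use case, but it is better to discover this behavior now than in the middle of an analysis.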

16.2.4 Using a Function

Two important points about using (or calling) functions:

  1. Notice that when we call a function, the value we pass in is assigned to the parameter we defined when writing the function. In this case, c(-10, 0, 10) is automatically assigned to the parameter x.

  2. When using functions, by default the returned object is merely printed to the screen. If you want it saved, you need to assign it to an object.
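
A quick sketch of the second point, using the rescale01() function from above (the object name scaled is just an example):

```r
rescale01 <- function(x) {
  rng <- range(x)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}

# Printed to the console, but not saved anywhere
rescale01(c(-10, 0, 10))
#> [1] 0.0 0.5 1.0

# Saved in an object, so we can keep working with the result
scaled <- rescale01(c(-10, 0, 10))
scaled[2]
#> [1] 0.5
```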

Let’s see if we can simplify the original example with our brand new function:

rescale01 <- function(x) {
  rng <- range(x)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}

gap_norm <- gap %>%
  mutate(pop_norm = rescale01(pop),
         gdp_norm = rescale01(gdpPercap),
         life_norm = rescale01(lifeExp))

Compared to the original, this code is easier to understand, and we have eliminated one class of copy-and-paste errors. There is still quite a bit of duplication, since we are doing the same thing to multiple columns. We will learn how to eliminate that duplication in the lesson on iteration.

Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include NA values, and rescale01() fails:

rescale01(c(1, 2, NA, 3, 4, 5))
#> [1] NA NA NA NA NA NA

Because we have extracted the code into a function, we only need to make the fix in one place:

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  scales <- (x - rng[1]) / (rng[2] - rng[1])
  return(scales)
}

rescale01(c(1, 2, NA, 3, 4, 5))
#> [1] 0.00 0.25   NA 0.50 0.75 1.00

16.2.5 Challenges

Challenge 1.

Write a function that calculates the sum of the squared value of two numbers. For instance, it should generate the following output:

my_function(3, 4)
# [1] 25

Challenge 2.

Write both_na(), a function that takes two vectors and returns the total number of NAs in both vectors.

For instance, it should generate the following output:

vec1 <- c(NA, 4, 6, 2, NA, 5, NA)
vec2 <- c(NA, "Dec", "Apr", NA, "Jul", "Apr")

both_na(vec1, vec2)
# [1] 5

# Hints
is.na(vec1)
sum(c(TRUE, FALSE))

Challenge 3.

Fill in the blanks to create a function that takes a name like "Rochelle Terman" and returns that name in uppercase and reversed, like "TERMAN, ROCHELLE".

standard_names <- function(name){
    upper_case = toupper(____) # Make upper
    upper_case_vec = strsplit(_____, split = ' ')[[1]] # Turn into a vector
    first_name = ______ # Take first name
    last_name = _______ # Take last name
    reversed_name = paste(______, _______, sep = ", ") # Reverse and separate by a comma and space
    return(reversed_name)
}

Challenge 4.

Look at the following function:

print_date <- function(year, month, day){
    joined = paste(as.character(year), as.character(month), as.character(day), sep = "/")
    return(joined)
}

Why do the two lines of code below return different values?

print_date(day=1, month=2, year=2003)
print_date(1, 2, 2003)

16.3 Iteration

In the last unit, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. Avoiding duplication allows for more readable, more flexible, and less error-prone code.

Functions are one method of reducing duplication in your code. Another tool for reducing duplication is iteration, which lets you do the same task to multiple inputs.

In this chapter, you will learn about three approaches to iteration:

  1. Vectorized functions.
  2. map and functional programming.
  3. Scoped verbs in dplyr.

16.3.1 Vectorized Functions

Most of R’s built-in functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on one element at a time.

That means you should never need to perform explicit iteration when performing simple mathematical computations.

x <- 1:4
x * 2
#> [1] 2 4 6 8

Notice that the multiplication happened to each element of the vector. Most built-in functions also operate element-wise on vectors:

x <- 1:4
log(x)
#> [1] 0.000 0.693 1.099 1.386

We can also add two vectors together:

x <- 1:4
y <- 6:9
x + y
#> [1]  7  9 11 13

Notice that each element of x was added to its corresponding element of y:

x:  1  2  3  4
    +  +  +  +
y:  6  7  8  9
---------------
    7  9 11 13

What happens if you add two vectors of different lengths?

1:10 + 1:2
#>  [1]  2  4  4  6  6  8  8 10 10 12

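Here, R expands the shorter vector to the length of the longer one by repeating its values. This is called recycling. When the longer length is a multiple of the shorter one (as with 10 and 2 here), this happens silently, meaning R will not warn you. R only issues a warning when the lengths do not divide evenly. Beware!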
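
The sketch below shows both cases side by side. It uses tryCatch() to capture the text of the warning instead of the sum, purely so the message is easy to inspect:

```r
# Lengths 10 and 2: 10 is a multiple of 2, so recycling is silent
silent <- 1:10 + 1:2
silent
#>  [1]  2  4  4  6  6  8  8 10 10 12

# Lengths 10 and 3: 10 is not a multiple of 3, so R recycles and warns.
# The handler returns the warning message rather than the result.
warned <- tryCatch(
  1:10 + 1:3,
  warning = function(w) conditionMessage(w)
)
warned
#> [1] "longer object length is not a multiple of shorter object length"
```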

16.3.2 Functional Programming and map

You might have used for loops in other languages. Loops are not as important in R as they are in other languages, because R is a functional programming language. This means that it is possible to wrap up for loops in a function and call that function instead of using the for loop directly.

The pattern of looping over a vector, doing something to each element, and saving the results is so common that the purrr package (part of tidyverse) provides a family of functions to do it for you. They effectively eliminate the need for many common for loops.

library(tidyverse)

There is one function for each type of output:

  1. map() makes a list.
  2. map_lgl() makes a logical vector.
  3. map_int() makes an integer vector.
  4. map_dbl() makes a double vector.
  5. map_chr() makes a character vector.

Each function takes a vector as input, applies a function to each piece, and then returns a new vector that is the same length (and has the same names) as the input.
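
Here is a short sketch of how the suffix decides the type of the output. The list x is made up for illustration:

```r
library(purrr)

# A small made-up list with three named elements
x <- list(a = 1:3, b = c(2.5, 4), c = 10)

# map() always returns a list
class(map(x, length))
#> [1] "list"

# The suffixed versions return atomic vectors of the named type
lengths_int <- map_int(x, length)
lengths_int
#> a b c 
#> 3 2 1 

types_chr <- map_chr(x, typeof)
types_chr
#>         a         b         c 
#> "integer"  "double"  "double" 

is_int_lgl <- map_lgl(x, is.integer)
is_int_lgl
#>     a     b     c 
#>  TRUE FALSE FALSE 
```

If the function you supply returns something that does not fit the requested type, the suffixed versions throw an error instead of silently converting, which is a useful safety check.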

NB: Some people will tell you to avoid for loops because they are slow. They are wrong! (Well, at least they are rather out of date, as for loops have not been slow for many years.) The main benefit of using functions like map() is not speed, but clarity: They make your code easier to write and to read.

To see how map works, consider this simple data frame:

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

What if we wanted to calculate the mean, median, and standard deviation of each column?

map_dbl(df, mean)
#>      a      b      c      d 
#> -0.441 -0.179 -0.124  0.152
map_dbl(df, median)
#>       a       b       c       d 
#> -0.2458 -0.2873 -0.0567  0.1443
map_dbl(df, sd)
#>     a     b     c     d 
#> 1.118 1.176 1.047 0.964

The data can even be piped!

df %>% map_dbl(mean)
#>      a      b      c      d 
#> -0.441 -0.179 -0.124  0.152
df %>% map_dbl(median)
#>       a       b       c       d 
#> -0.2458 -0.2873 -0.0567  0.1443
df %>% map_dbl(sd)
#>     a     b     c     d 
#> 1.118 1.176 1.047 0.964
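
For comparison, here is a sketch of the for loop that map_dbl(df, mean) replaces. It builds its own data frame, df_demo, with an arbitrary seed so that it can be run on its own:

```r
library(purrr)

set.seed(123)  # arbitrary seed so the sketch is reproducible
df_demo <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

# 1. Allocate space for the output
means <- vector("double", ncol(df_demo))
names(means) <- names(df_demo)

# 2. Loop over the columns and 3. store each result
for (i in seq_along(df_demo)) {
  means[[i]] <- mean(df_demo[[i]])
}

# The loop and map_dbl() produce the same named vector
all.equal(means, map_dbl(df_demo, mean))
#> [1] TRUE
```

The loop needs three separate steps (allocate, iterate, store), while map_dbl() expresses the same idea in one line.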

We can also pass additional arguments: anything supplied after the function name is handed on to that function for every element. For example, the function mean() accepts an optional argument trim. From the help file: “The fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.”

map_dbl(df, mean, trim = 0.5)
#>       a       b       c       d 
#> -0.2458 -0.2873 -0.0567  0.1443
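
The same mechanism is handy for missing values. Here is a minimal sketch with a made-up data frame, df_na, that has one NA in column b:

```r
library(purrr)

# Made-up data frame with one missing value in column b
df_na <- data.frame(a = c(1, 2, 3), b = c(4, NA, 6))

# Without na.rm, the mean of column b is NA
with_na <- map_dbl(df_na, mean)
with_na
#>  a  b 
#>  2 NA 

# na.rm = TRUE is handed on to mean() for every column
without_na <- map_dbl(df_na, mean, na.rm = TRUE)
without_na
#> a b 
#> 2 5 

# An equivalent spelling with an explicit anonymous function
map_dbl(df_na, function(col) mean(col, na.rm = TRUE))
#> a b 
#> 2 5 
```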

Check out other fun applications of map functions here.

16.3.3 Challenges

Write code that uses one of the map functions to:

Challenge 1.

Calculate the arithmetic mean for every column in mtcars.

Challenge 2.

Calculate the number of unique values in each column of iris.

Challenge 3.

Generate 10 random normals for each of \(\mu = -10\), \(0\), \(10\), and \(100\).

16.3.4 Scoped Verbs

The last iteration technique we will discuss is scoped verbs in dplyr.

Frequently, when working with data frames, we want to apply a function to multiple columns. For example, let’s say we want to calculate the mean value of each column in mtcars.

If we wanted to calculate the average of a single column, it would be pretty simple using just regular dplyr functions:

mtcars %>%
  summarize(mpg = mean(mpg))
#>    mpg
#> 1 20.1

But if we want to calculate the mean for all of them, we would have to duplicate this code many times over:

mtcars %>%
  summarize(mpg = mean(mpg),
            cyl = mean(cyl),
            disp = mean(disp),
            hp = mean(hp),
            drat = mean(drat),
            wt = mean(wt),
            qsec = mean(qsec),
            vs = mean(vs),
            am = mean(am),
            gear = mean(gear),
            carb = mean(carb))
#>    mpg  cyl disp  hp drat   wt qsec    vs    am gear carb
#> 1 20.1 6.19  231 147  3.6 3.22 17.8 0.438 0.406 3.69 2.81

This is very repetitive and prone to mistakes!

We just saw one approach to solve this problem: map. Another approach is scoped verbs.

Scoped verbs allow you to use standard verbs (or functions) in dplyr that affect multiple variables at once.

  • _if allows you to pick variables based on a predicate function like is.numeric() or is.character().
  • _at allows you to pick variables using the same syntax as select().
  • _all operates on all variables.

These verbs can apply to summarize, filter, or mutate. Let’s go over summarize:

summarize_all()

summarize_all() takes a data frame and a function and applies that function to each column.

mtcars %>%
  summarize_all(.funs = mean)
#>    mpg  cyl disp  hp drat   wt qsec    vs    am gear carb
#> 1 20.1 6.19  231 147  3.6 3.22 17.8 0.438 0.406 3.69 2.81

summarize_at()

summarize_at() allows you to pick columns in the same way as select(), that is, based on their names. There is one small difference: You need to wrap the complete selection with the vars() helper (this avoids ambiguity).

mtcars %>%
  summarize_at(.vars = vars(mpg, wt), .funs = mean)
#>    mpg   wt
#> 1 20.1 3.22

summarize_if()

summarize_if() allows you to pick variables to summarize based on some property of the column. For example, what if we want to apply a numeric summary function only to numeric columns?

iris %>%
  summarize_if(.predicate = is.numeric, .funs = mean)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1         5.84        3.06         3.76         1.2

mutate and filter work in a similar way. To see more, check out Scoped verbs by the Data Challenge Lab.
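
One caveat if you are using a recent version of dplyr: since version 1.0.0, the scoped verbs are marked as superseded in favor of across(). They still work, but new code is usually written with across(). The sketch below shows what I believe are the equivalents of the three summarize examples above. The object names all_means, some_means, and numeric_means are made up for illustration:

```r
library(dplyr)

# Equivalent of summarize_all(.funs = mean)
all_means <- mtcars %>%
  summarize(across(everything(), mean))

# Equivalent of summarize_at(.vars = vars(mpg, wt), .funs = mean)
some_means <- mtcars %>%
  summarize(across(c(mpg, wt), mean))

# Equivalent of summarize_if(.predicate = is.numeric, .funs = mean)
numeric_means <- iris %>%
  summarize(across(where(is.numeric), mean))

names(some_means)
#> [1] "mpg" "wt" 
```

The same across() call can be used inside mutate(), which is why a single helper can stand in for the whole family of _all, _at, and _if variants.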

Acknowledgments

A good portion of this lesson is based on: