Chapter 16 Programming in R

This unit covers some more advanced programming in R - namely:

  1. Conditional Flow
  2. Functions
  3. Iteration

Mastering these skills will make you virtually invincible in R!

Note that these concepts are not specific to R. While the syntax might vary, the basic ideas of flow, functions, and iteration are common across all scripting languages. So if you ever think of picking up Python or something else, it’s critical to familiarize yourself with these concepts.

16.1 Conditional Flow

Sometimes you only want to execute code if a certain condition is met. To do that, we use an if-else statement. It looks like this:

if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}

condition is a statement that must always evaluate to either TRUE or FALSE. This is similar to filter(), except condition can only be a single value (i.e. a vector of length 1), whereas filter() works for entire vectors (or columns).

Let’s look at a simple example:

age = 84
if (age > 60) {
    print("OK Boomer")
} else {
    print("But you don't look like a professor!")
}
#> [1] "OK Boomer"

We refer to the first print command as the first branch.

Let’s change the age variable to execute the second branch:

age = 20
if (age > 60) {
    print("OK Boomer")
} else {
    print("But you don't look like a professor!")
}
#> [1] "But you don't look like a professor!"

16.1.1 Multiple Conditions

You can chain conditional statements together:

if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # do something completely different
}

16.1.2 Complex Statements

We can generate more complex conditional statements with boolean operators like & and |:

age = 45 

if (age > 60) {
    print("OK Boomer")
} else if (age > 40 & age <= 60) {
    print("How's the midlife crisis?")
} else {
    print("But you don't look like a professor!")
}
#> [1] "How's the midlife crisis?"
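The example above uses &, which compares element by element. As a minimal sketch (the variable names here are illustrative, not from the lesson), the following contrasts & with its scalar counterpart &&, which is the form usually preferred inside if() because the condition must be a single TRUE or FALSE:

```r
# Illustrative values; none of these names come from the lesson.
age <- 45
ages <- c(20, 45, 84)

# && works on single values, which is what if() needs.
is_midlife <- age > 40 && age <= 60
is_midlife
#> [1] TRUE

# & works element by element and returns one result per value.
in_midlife <- ages > 40 & ages <= 60
in_midlife
#> [1] FALSE  TRUE FALSE
```

The second result has three values, so it could not be used directly as the condition of an if statement.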

16.1.3 Code Style

Both if and function should (almost) always be followed by squiggly brackets ({}), and the contents should be indented. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.

An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else. Always indent the code inside curly braces.

# Bad
if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

# Good
if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

16.1.4 if vs. if_else

Because if-else conditional statements like the ones outlined above must always resolve to a single TRUE or FALSE, they cannot be used for vector operations. Vector operations are where you make multiple comparisons simultaneously for each value stored inside a vector.

Consider the gapminder data and imagine you wanted to create a new column identifying whether or not a country-year observation has a life expectancy of at least 35.

gap <- read.csv("Data/gapminder-FiveYearData.csv")
head(gap)
#>       country year      pop continent lifeExp gdpPercap
#> 1 Afghanistan 1952  8425333      Asia    28.8       779
#> 2 Afghanistan 1957  9240934      Asia    30.3       821
#> 3 Afghanistan 1962 10267083      Asia    32.0       853
#> 4 Afghanistan 1967 11537966      Asia    34.0       836
#> 5 Afghanistan 1972 13079460      Asia    36.1       740
#> 6 Afghanistan 1977 14880372      Asia    38.4       786

This sounds like a classic if-else operation. For each observation, if lifeExp is greater than or equal to 35, then the value in the new column should be 1. Otherwise, it should be 0. But what happens if we try to implement this using an if-else operation like above?

gap_if <- gap %>%
   mutate(life.35 = if(lifeExp >= 35){
     1
   } else {
     0
   })
#> Warning in if (lifeExp >= 35) {: the condition has length > 1 and only the
#> first element will be used

head(gap_if)
#>       country year      pop continent lifeExp gdpPercap life.35
#> 1 Afghanistan 1952  8425333      Asia    28.8       779       0
#> 2 Afghanistan 1957  9240934      Asia    30.3       821       0
#> 3 Afghanistan 1962 10267083      Asia    32.0       853       0
#> 4 Afghanistan 1967 11537966      Asia    34.0       836       0
#> 5 Afghanistan 1972 13079460      Asia    36.1       740       0
#> 6 Afghanistan 1977 14880372      Asia    38.4       786       0

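This did not work correctly. Because if() can only handle a single TRUE/FALSE value, it only checked the first row of the data frame. That row contained 28.801, which is below 35, so it generated a vector of length 1704 with each value being 0. (The output above comes from an older version of R; since R 4.2.0, a condition of length greater than 1 is an error rather than a warning.)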

Because we in fact want to make this if-else comparison 1704 times, we should instead use if_else(). This vectorizes the if-else comparison and makes a separate comparison for each row of the data frame. This allows us to correctly generate this new column.

gap_ifelse <- gap %>%
  mutate(life.35 = if_else(lifeExp >= 35, 1, 0))

head(gap_ifelse)
#>       country year      pop continent lifeExp gdpPercap life.35
#> 1 Afghanistan 1952  8425333      Asia    28.8       779       0
#> 2 Afghanistan 1957  9240934      Asia    30.3       821       0
#> 3 Afghanistan 1962 10267083      Asia    32.0       853       0
#> 4 Afghanistan 1967 11537966      Asia    34.0       836       0
#> 5 Afghanistan 1972 13079460      Asia    36.1       740       1
#> 6 Afghanistan 1977 14880372      Asia    38.4       786       1
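To experiment with the idea without the gapminder file, here is a minimal sketch on a small made-up vector. It uses base R's ifelse(), the built-in counterpart of dplyr's if_else() (if_else() is stricter: it requires the two outcomes to have the same type). The life_exp values are invented for illustration.

```r
# Made-up life expectancies, for illustration only.
life_exp <- c(28.8, 34.0, 36.1, 38.4)

# One comparison per element: 1 if at least 35, otherwise 0.
life_35 <- ifelse(life_exp >= 35, 1, 0)
life_35
#> [1] 0 0 1 1
```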

16.2 Functions

Functions are the basic building blocks of programs. Think of them as “mini-scripts” or “tiny commands.” We’ve already used dozens of functions created by others (e.g. filter(), mean()).

This lesson teaches you how to write your own functions, and why you would want to do so. The details are pretty simple, but this is one of those ideas where it’s good to get lots of practice!

16.2.1 Why Write Functions?

Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. For example, take a look at the following code:

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a)) / (max(df$a) - min(df$a))
df$b <- (df$b - min(df$b)) / (max(df$b) - min(df$a))
df$c <- (df$c - min(df$c)) / (max(df$c) - min(df$c))
df$d <- (df$d - min(df$d)) / (max(df$d) - min(df$d))

You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? I made an error when copying-and-pasting the code for df$b: I forgot to change an a to a b.

Functions have a number of advantages over this “copy-and-paste” approach:

  • They are easy to reuse. If you need to change things, you only have to update code in one place, instead of many.

  • They are self-documenting. Functions name pieces of code the way variables name strings and numbers. Give your function a good name and you will easily remember the function and its purpose.

  • They are easier to debug. There are fewer chances to make mistakes because the code only exists in one location (e.g. you can no longer update a variable name in one place but forget to update it in another).

16.2.2 Anatomy of a Function

Functions have three key components:

  1. A name. This should be informative and describe what the function does.
  2. The arguments, or list of inputs, to the function. They go inside the parentheses in function().
  3. The body. This is the block of code within {} that immediately follows function(...), and is the code that you develop to perform the action described in the name using the arguments you provide.

my_function <- function(x, y){
  # do
  # something
  # here
  return(result)
}

In this example, my_function is the name of the function, x and y are the arguments, and the stuff inside the {} is the body.

16.2.3 Writing a Function

Let’s re-write the scaling code above as a function. To write a function you need to first analyze the code. How many inputs does it have?

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

(df$a - min(df$a)) / (max(df$a) - min(df$a))
#>  [1] 0.289 0.751 0.000 0.678 0.853 1.000 0.172 0.611 0.612 0.601

This code only has one input: df$a. To make the inputs more clear, it’s a good idea to rewrite the code using temporary variables with general names. Here this code only requires a single numeric vector, which I’ll call x:

x <- df$a
(x - min(x)) / (max(x) - min(x))
#>  [1] 0.289 0.751 0.000 0.678 0.853 1.000 0.172 0.611 0.612 0.601

There is some duplication in this code. We’re computing the minimum and maximum of the data three times in total, so it makes sense to do it in one step with range():

rng <- range(x)
rng
#> [1] -2.44  1.15

(x - rng[1]) / (rng[2] - rng[1])
#>  [1] 0.289 0.751 0.000 0.678 0.853 1.000 0.172 0.611 0.612 0.601

Pulling out intermediate calculations into named variables is a good practice because it becomes more clear what the code is doing. Now that I’ve simplified the code, and checked that it still works, I can turn it into a function:

rescale01 <- function(x) {
  rng <- range(x)
  scaled <- (x - rng[1]) / (rng[2] - rng[1])
  return(scaled)
}

Note the overall process: I only made the function after I’d figured out how to make it work with a simple input. It’s easier to start with working code and turn it into a function; it’s harder to create a function and then try to make it work.

At this point it’s a good idea to check your function with a few different inputs:

rescale01(c(-10, 0, 10))
#> [1] 0.0 0.5 1.0

rescale01(c(1, 2, 3, 5))
#> [1] 0.00 0.25 0.50 1.00

16.2.4 Using a Function

Two important points about using (or calling) functions:

  1. Notice that when we call a function, we’re passing a value into it that is assigned to the parameter we defined when writing the function. For example, in rescale01(c(-10, 0, 10)), the parameter x is automatically assigned the value c(-10, 0, 10).

  2. When using functions, by default the returned object is merely printed to the screen. If you want it saved, you need to assign it to an object.
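As a small sketch of the second point, the snippet below repeats the rescale01() definition from the previous section so that it runs on its own, then stores the result in an object. The name scaled_values is just an illustrative choice.

```r
# Same logic as rescale01() from the previous section.
rescale01 <- function(x) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Without assignment, the result is printed and then lost.
rescale01(c(-10, 0, 10))
#> [1] 0.0 0.5 1.0

# With assignment, nothing is printed but the result is kept for later use.
scaled_values <- rescale01(c(-10, 0, 10))
scaled_values
#> [1] 0.0 0.5 1.0
```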

Let’s see if we can simplify the original example with our brand new function:

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

Compared to the original, this code is easier to understand and we’ve eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we’re doing the same thing to multiple columns. We’ll learn how to eliminate that duplication in the lesson on iteration.

Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include NA values, and rescale01() fails:

rescale01(c(1, 2, NA, 3, 4, 5))
#> [1] NA NA NA NA NA NA

Because we’ve extracted the code into a function, we only need to make the fix in one place:

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(1, 2, NA, 3, 4, 5))
#> [1] 0.00 0.25   NA 0.50 0.75 1.00

16.2.5 Variable Scope

Analyze the following function:

  1. Identify the name, arguments, and body
  2. What does it do?
  3. If a = 3 and b = 4, what should we expect the output to be?

pythagorean <- function(a, b){
  hypotenuse <- sqrt(a^2 + b^2)
  return(hypotenuse)
}

Now take a look at the following code:

pythagorean(a = 3, b = 4)
#> [1] 5

hypotenuse
#> Error in eval(expr, envir, enclos): object 'hypotenuse' not found

Why does this generate an error? Why can we not see the results of hypotenuse? After all, it was generated by pythagorean, right?

When you call a function, a temporary workspace is set up that will be destroyed when the function returns, either by:

  1. getting to the end, or
  2. an explicit return statement

So think of functions as an alternative reality, where objects are created and destroyed in a function call.

This is why you do not see hypotenuse listed in the environment - it has already been destroyed.

Global vs. Local Environments

Things can get confusing when you use the same names for variables both inside and outside a function. Check out this example:

pressure = 103.9
adjust <- function(t){
    temperature = t * 1.43 / pressure
    return(temperature)
}
pressure
#> [1] 104
temperature
#> Error in eval(expr, envir, enclos): object 'temperature' not found

t and temperature are local variables in adjust. Local variables are:

  • Defined in the function.
  • Not visible in the main program.
  • Remember: a function parameter is a variable that is automatically assigned a value when the function is called.

pressure is a global variable. Global variables are:

  • Defined outside any particular function.
  • Visible everywhere.

This difference is referred to as scope. The scope of a variable is the part of a program that can ‘see’ that variable.
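A short sketch of how scope plays out in practice, using hypothetical names (x and double_it() are not part of the lesson): a function can read a global variable, but assigning to that name inside the function creates a separate local variable and leaves the global one unchanged.

```r
x <- 10  # global variable

double_it <- function() {
  # Reads the global x, then creates a *local* x holding the result.
  x <- x * 2
  x
}

double_it()
#> [1] 20

# The global x was never modified.
x
#> [1] 10
```

Relying on global variables inside a function can make it harder to reason about, so passing values in as arguments is usually the safer design.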

16.2.6 Arguments

Functions do not need to take input.

print_hello <- function(){
    print("hello")
}
print_hello()
#> [1] "hello"

But if a function takes input, arguments can be passed to functions in different ways.

  1. Positional arguments are mandatory and have no default values.

send <- function(message, recipient){
  message <- paste(message, recipient)
  return(message)
}
send("Hello", "world")
#> [1] "Hello world"

In the case above, it is possible to use argument names when calling the functions and, by doing so, it is possible to switch the order of arguments. For instance:

send(recipient='World', message='Hello')
#> [1] "Hello World"

However, positional arguments (send('Hello', 'World')) are greatly preferred over names (send(recipient='World', message='Hello')), as it is very easy to accidentally specify incorrect argument values.

  2. Keyword arguments are not mandatory and have default values. They are often used for optional parameters sent to the function.

send <- function(message, recipient, cc=NULL){
  message <- paste(message, recipient, "cc:", cc)
  return(message)
}
send("Hello", "world") 
#> [1] "Hello world cc: "
send("Hello", "world", "rochelle")
#> [1] "Hello world cc: rochelle"

Here cc is optional, and evaluates to NULL when it is not passed another value.
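Notice that the first call above still prints a dangling "cc: ". One possible way to handle this, sketched here under the assumption that we want to omit the cc part entirely when none is given, is to combine a default value with an if statement. The function name send_with_cc() is hypothetical.

```r
# Hypothetical variant of send() that only mentions cc when one is supplied.
send_with_cc <- function(message, recipient, cc = NULL) {
  out <- paste(message, recipient)
  if (!is.null(cc)) {
    out <- paste(out, "cc:", cc)
  }
  out
}

send_with_cc("Hello", "world")
#> [1] "Hello world"

send_with_cc("Hello", "world", cc = "rochelle")
#> [1] "Hello world cc: rochelle"
```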

16.2.7 Challenges

Challenge 1

Write a function that calculates the sum of the squared value of two numbers. For instance, it should generate the following output:

my_function(3, 4)
#> [1] 25

Challenge 2

Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

Challenge 3

Fill in the blanks to create a function that takes a name like “Rochelle Terman” and returns that name in uppercase and reversed, like “TERMAN, ROCHELLE”

standard_names <- function(name){
    upper_case = toupper(____) # make upper
    upper_case_vec = strsplit(_____, split = ' ')[[1]] # turn into a vector
    first_name = ______ # take first name
    last_name = _______ # take last name
    reversed_name = paste(______, _______, sep = ", ") # reverse and separate by a comma and space
    return(reversed_name)
}

Challenge 4

Look at the following function:

print_date <- function(year, month, day){
    joined = paste(as.character(year), as.character(month), as.character(day), sep = "/")
    return(joined)
}

What does this short program print?

print_date(day=1, month=2, year=2003)

16.3 Iteration

In the last unit, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. Avoiding duplication allows for more readable, more flexible, and less error-prone code.

Functions are one method of reducing duplication in your code. Another tool for reducing duplication is iteration, which lets you do the same task to multiple inputs.

In this chapter you’ll learn about four approaches to iteration:

  1. Vectorized functions
  2. For-loops
  3. map and functional programming
  4. Scoped verbs in dplyr

16.3.1 Vectorized Functions

Most of R’s built-in functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time.

That means you should never need to perform explicit iteration when performing simple mathematical computations.

x <- 1:4
x * 2
#> [1] 2 4 6 8

Notice that the multiplication happened to each element of the vector. Most built-in functions also operate element-wise on vectors:

x <- 1:4
log(x)
#> [1] 0.000 0.693 1.099 1.386

We can also add two vectors together:

x <- 1:4
y <- 6:9
x + y
#> [1]  7  9 11 13

Notice that each element of x was added to its corresponding element of y:

x:  1  2  3  4
    +  +  +  +
y:  6  7  8  9
---------------
    7  9 11 13

What happens if you add two vectors of different lengths?

1:10 + 1:2
#>  [1]  2  4  4  6  6  8  8 10 10 12

Here, R will expand the shortest vector to the same length as the longest. This is called recycling. This usually (but not always) happens silently: R only warns you when the length of the longer vector is not a multiple of the length of the shorter one. Beware!
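A minimal sketch of both cases follows. The object names even and warned are illustrative, and tryCatch() is used only to record whether a warning was raised.

```r
# Lengths 6 and 2: 6 is a multiple of 2, so recycling is silent.
even <- 1:6 + 1:2
even
#> [1] 2 4 4 6 6 8

# Lengths 5 and 2: 5 is not a multiple of 2, so R issues a warning.
warned <- tryCatch(
  {
    1:5 + 1:2
    FALSE
  },
  warning = function(w) TRUE
)
warned
#> [1] TRUE
```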

16.3.2 For-loops

You will frequently need to iterate over vectors or data frames, perform an operation on each element, and save the results somewhere.

For example, imagine we have this simple dataframe:

df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

We want to compute the median of each column. You could do this with copy-and-paste:

median(df$a)
#> [1] -0.246
median(df$b)
#> [1] -0.287
median(df$c)
#> [1] -0.0567
median(df$d)
#> [1] 0.144

But that breaks our rule of thumb: never copy and paste more than twice. Instead, we could use a for loop:

output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
#> [1] -0.2458 -0.2873 -0.0567  0.1443

Components of a for Loop

Every for loop has three components:

  1. The output: output <- vector("double", ncol(df)).

Before you start a loop, you need to create an empty vector to store the output of the loop. Notice that the object is created outside the loop!

Preallocating space for your output is very important for efficiency: if you grow the output vector at each iteration using c() (for example), your for loop will be very slow.

  2. The sequence: i in seq_along(df).

This determines what to loop over. In this case, the sequence is seq_along(df), which creates a numeric vector for a sequence of numbers beginning at 1 and continuing until it reaches the length of df (the length here is the number of columns in df).

seq_along(df)
#> [1] 1 2 3 4

It’s useful to think of i as a pronoun, like “it”. Each iteration of the for loop will assign i to a new value based on the defined sequence:

| Iteration | i =  |
|-----------|------|
| 1         | 1    |
| 2         | 2    |
| 3         | 3    |
| 4         | 4    |

NB: seq_along(x) is a safe version of the more familiar 1:length(x), with an important difference: if you have a zero-length vector, seq_along() does the right thing:

y <- vector("double", 0)
seq_along(y)
#> integer(0)
1:length(y)
#> [1] 1 0

You probably won’t create a zero-length vector deliberately, but it’s easy to create one accidentally. If you use 1:length(x) instead of seq_along(x), you’re likely to get a confusing error message.

  3. The body: output[[i]] <- median(df[[i]]).

This is the code that does the work. It runs repeatedly, each time with a different value for i:

| Iteration | i =  | body                           |
|-----------|------|--------------------------------|
| 1         | 1    | output[[1]] <- median(df[[1]]) |
| 2         | 2    | output[[2]] <- median(df[[2]]) |
| 3         | 3    | output[[3]] <- median(df[[3]]) |
| 4         | 4    | output[[4]] <- median(df[[4]]) |

NB: We use [[ notation to reference each column of df using indices of columns, instead of $ and column names.
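The advice about preallocating can be sketched as follows. This is an illustrative comparison (the names n, grown, and squares are made up for the example, and seq_len(n) is base R's way of generating 1 through n). Both loops produce the same result, but the second avoids copying the output vector on every iteration.

```r
n <- 5

# Growing the output with c(): works, but copies the vector each time.
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating the output: the vector is created once, then filled in.
squares <- vector("double", n)
for (i in seq_len(n)) {
  squares[[i]] <- i^2
}

squares
#> [1]  1  4  9 16 25
identical(grown, squares)
#> [1] TRUE
```

With only five elements the difference is negligible; it tends to matter when the loop runs many thousands of times.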

16.3.3 Challenges

Challenge 1.

Fill in the blanks to write a for loop that calculates the arithmetic mean for every column in mtcars.

mtcars.means <- vector("double", ______)
for(i in ______){
  ______[i] <- mean(______[[i]])
}

Challenge 2.

Check out the iris dataset:

kable(head(iris))
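| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|--------------|-------------|--------------|-------------|---------|
| 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 4.9          | 3.0         | 1.4          | 0.2         | setosa  |
| 4.7          | 3.2         | 1.3          | 0.2         | setosa  |
| 4.6          | 3.1         | 1.5          | 0.2         | setosa  |
| 5.0          | 3.6         | 1.4          | 0.2         | setosa  |
| 5.4          | 3.9         | 1.7          | 0.4         | setosa  |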

Write a for loop that calculates the number of unique values in each column of iris. Before you write the for loop, identify the three components you need:

  1. Output
  2. Sequence
  3. Body

Challenge 3.

Generate 10 random normals for each of \(\mu = -10\), \(0\), \(10\), and \(100\). Store them in a list.

16.3.4 Functional Programming and map

Loops are not as important in R as they are in other languages because R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly.
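To make this concrete, here is a minimal sketch of wrapping a for loop in a function. The helper col_summary() and the data frame small are illustrative names, not functions or objects provided by R or purrr. Because the summary function is itself passed in as the argument fun, the same loop can be reused for different summaries.

```r
# Hypothetical helper: applies `fun` to every column of `df`.
col_summary <- function(df, fun) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- fun(df[[i]])
  }
  output
}

# Small made-up data frame for illustration.
small <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

col_summary(small, mean)
#> [1]  2 30
col_summary(small, median)
#> [1]  2 20
```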

The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. They effectively eliminate the need for many common for loops.

There is one function for each type of output:

  1. map() makes a list.
  2. map_lgl() makes a logical vector.
  3. map_int() makes an integer vector.
  4. map_dbl() makes a double vector.
  5. map_chr() makes a character vector.

Each function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input.

NB: Some people will tell you to avoid for loops because they are slow. They’re wrong! (Well, at least they’re rather out of date, as for loops haven’t been slow for many years). The main benefit of using functions like map() is not speed, but clarity: they make your code easier to write and to read.

To see how map works, consider (again) this simple data frame:

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

What if we wanted to calculate the mean, median, and standard deviation of each column?

map_dbl(df, mean)
#>      a      b      c      d 
#>  0.116  0.127 -0.089  0.281
map_dbl(df, median)
#>       a       b       c       d 
#>  0.0583  0.0244 -0.0571  0.2604
map_dbl(df, sd)
#>     a     b     c     d 
#> 1.161 1.226 1.024 0.798

Compared to using a for loop, this approach is much easier to read, and less error-prone.

The data can even be piped!

df %>% map_dbl(mean)
#>      a      b      c      d 
#>  0.116  0.127 -0.089  0.281
df %>% map_dbl(median)
#>       a       b       c       d 
#>  0.0583  0.0244 -0.0571  0.2604
df %>% map_dbl(sd)
#>     a     b     c     d 
#> 1.161 1.226 1.024 0.798

We can also pass additional arguments. For example, the function mean accepts an optional argument trim. From the help file: “the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.”

map_dbl(df, mean, trim = 0.5)
#>       a       b       c       d 
#>  0.0583  0.0244 -0.0571  0.2604

16.3.5 Challenges

Write code that uses one of the map functions to:

  1. Calculate the arithmetic mean for every column in mtcars.

  2. Calculate the number of unique values in each column of iris.

  3. Generate 10 random normals for each of \(\mu = -10\), \(0\), \(10\), and \(100\).

16.3.6 Scoped Verbs

The last iteration technique we’ll discuss is scoped verbs in dplyr.

Frequently, when working with dataframes, we want to apply a function to multiple columns. For example, let’s say we want to calculate the mean value of each column in mtcars.

If we wanted to calculate the average of a single column, it would be pretty simple using just regular dplyr functions:

mtcars %>%
  summarize(mpg = mean(mpg))
#>    mpg
#> 1 20.1

But if we want to calculate the mean for all of them, we’d have to duplicate this code many times over:

mtcars %>%
  summarize(mpg = mean(mpg),
            cyl = mean(cyl),
            disp = mean(disp),
            hp = mean(hp),
            drat = mean(drat),
            wt = mean(wt),
            qsec = mean(qsec),
            vs = mean(vs),
            am = mean(am),
            gear = mean(gear),
            carb = mean(carb))
#>    mpg  cyl disp  hp drat   wt qsec    vs    am gear carb
#> 1 20.1 6.19  231 147  3.6 3.22 17.8 0.438 0.406 3.69 2.81

This is very repetitive and prone to mistakes!

We just saw one approach to solve this problem: map. Another approach is scoped verbs.

Scoped verbs allow you to use standard verbs (or functions) in dplyr that affect multiple variables at once.

  • _if allows you to pick variables based on a predicate function like is.numeric() or is.character()
  • _at allows you to pick variables using the same syntax as select()
  • _all operates on all variables

These verbs can apply to summarize, filter, or mutate. Let’s go over summarize:

summarize_all()

summarize_all() takes a dataframe and a function and applies that function to each column:

mtcars %>%
  summarize_all(.funs = mean)
#>    mpg  cyl disp  hp drat   wt qsec    vs    am gear carb
#> 1 20.1 6.19  231 147  3.6 3.22 17.8 0.438 0.406 3.69 2.81

summarize_at()

summarize_at() allows you to pick columns in the same way as select(), that is, based on their names. There is one small difference: you need to wrap the complete selection with the vars() helper (this avoids ambiguity).

mtcars %>%
  summarize_at(.vars = vars(mpg, wt), .funs = mean)
#>    mpg   wt
#> 1 20.1 3.22

summarize_if()

summarize_if() allows you to pick variables to summarize based on some property of the column. For example, we might want to apply a numeric summary function only to numeric columns:

iris %>%
  summarize_if(.predicate = is.numeric, .funs = mean)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1         5.84        3.06         3.76         1.2

mutate and filter work in a similar way. To see more, check out Scoped verbs by the Data Challenge Lab
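Note that in dplyr 1.0.0 and later, the scoped verbs are marked as superseded in favor of across(), although they continue to work. As a hedged sketch, assuming dplyr 1.0.0 or newer is installed, the three summaries above could be written like this (the object names all_means, some_means, and numeric_means are illustrative):

```r
library(dplyr)

# Counterpart of summarize_all(): every column.
all_means <- mtcars %>%
  summarize(across(everything(), mean))

# Counterpart of summarize_at(): columns chosen by name.
some_means <- mtcars %>%
  summarize(across(c(mpg, wt), mean))

# Counterpart of summarize_if(): columns chosen by a predicate.
numeric_means <- iris %>%
  summarize(across(where(is.numeric), mean))
```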

Acknowledgments

A good portion of this lesson is based on: