Chapter 8 Subsetting

When working with data, you'll need to subset objects early and often. Luckily, R's subsetting operators are powerful and fast. Mastery of subsetting allows you to succinctly express complex operations in a way that few other languages can match. Subsetting is hard to learn because you need to master a number of interrelated concepts:

  • The three subsetting operators: [, [[, and $.

  • The four types of subsetting.

  • The important differences in behaviour for different objects (e.g., vectors, lists, factors, matrices, and data frames).

  • The use of subsetting in conjunction with assignment.

This unit helps you master subsetting by starting with the simplest type of subsetting: subsetting an atomic vector with [. It then gradually extends your knowledge, first to more complicated data types (like dataframes and lists), and then to the other subsetting operators, [[ and $. You'll then learn how subsetting and assignment can be combined to modify parts of an object, and, finally, you'll see a large number of useful applications.

8.1 Subsetting Vectors

It's easiest to learn how subsetting works for vectors, and then how it generalises to higher dimensions and other more complicated objects. We'll start with [, the most commonly used operator.

8.1.1 Subsetting Types

Let's explore the different types of subsetting with a simple vector, x.

x <- c(2.1, 4.2, 3.3, 5.4)

Note that the number after the decimal point gives the original position in the vector.

There are four things you can use to subset a vector:

1. Positive integers return elements at the specified positions:

(x <- c(2.1, 4.2, 3.3, 5.4))
#> [1] 2.1 4.2 3.3 5.4
x[1]
#> [1] 2.1

We can also index multiple values by passing a vector of integers:

(x <- c(2.1, 4.2, 3.3, 5.4))
#> [1] 2.1 4.2 3.3 5.4
x[c(3, 1)]
#> [1] 3.3 2.1

# Duplicated indices yield duplicated values
x[c(1, 1)]
#> [1] 2.1 2.1

Note that you have to wrap the positions in c() inside the [ for this to work!
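
As a small illustration of why c() is needed, the sketch below compares the two forms. The tryCatch() wrapper and the variable names picked and result are added here only so the example runs to completion; they are not part of the original text.

```r
x <- c(2.1, 4.2, 3.3, 5.4)

# Positions wrapped in c() form a single index vector
picked <- x[c(3, 1)]
picked
#> [1] 3.3 2.1

# Without c(), R reads the comma as separating dimensions,
# which fails because x has only one dimension
result <- tryCatch(x[3, 1], error = function(e) "error")
result
#> [1] "error"
```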

More examples:

# `order(x)` gives the index positions of smallest to largest values.
(x <- c(2.1, 4.2, 3.3, 5.4))
#> [1] 2.1 4.2 3.3 5.4
order(x)
#> [1] 1 3 2 4

# use this to order values.
x[order(x)]
#> [1] 2.1 3.3 4.2 5.4
x[c(1, 3, 2, 4)]
#> [1] 2.1 3.3 4.2 5.4

2. Negative integers omit elements at the specified positions:

x <- c(2.1, 4.2, 3.3, 5.4)
x[-1]
#> [1] 4.2 3.3 5.4
x[-c(3, 1)]
#> [1] 4.2 5.4

You can't mix positive and negative integers in a single subset:

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(-1, 2)]
#> Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts

3. Character vectors return elements with matching names. This only works if the vector is named.

x <- c(2.1, 4.2, 3.3, 5.4)

# apply names
names(x) <- c("a", "b", "c", "d")

# subset using names
x[c("d", "c", "a")]
#>   d   c   a 
#> 5.4 3.3 2.1

# Like integer indices, you can repeat indices
x[c("a", "a", "a")]
#>   a   a   a 
#> 2.1 2.1 2.1

# Careful! Names are always matched exactly
x <- c(abc = 1, def = 2)
x[c("a", "d")]
#> <NA> <NA> 
#>   NA   NA

4. Logical vectors select elements where the corresponding logical value is TRUE.

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(TRUE, TRUE, FALSE, FALSE)]
#> [1] 2.1 4.2
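
Two further behaviours of logical subsetting are worth knowing. The sketch below uses the same x; the variable names recycled and with_na are illustrative only.

```r
x <- c(2.1, 4.2, 3.3, 5.4)

# A shorter logical vector is recycled to the length of x,
# so this is read as c(TRUE, FALSE, TRUE, FALSE)
recycled <- x[c(TRUE, FALSE)]
recycled
#> [1] 2.1 3.3

# A missing value in the index yields a missing value in the output
with_na <- x[c(TRUE, TRUE, NA, FALSE)]
with_na
#> [1] 2.1 4.2  NA
```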

8.1.2 Conditional Subsetting

Logical subsetting is the most useful type of subsetting, because you use it to subset based on conditional or comparative statements.

The (logical) comparison operators known to R are:

  • < for less than
  • > for greater than
  • <= for less than or equal to
  • >= for greater than or equal to
  • == for equal to each other
  • != not equal to each other

The nice thing about R is that you can also use these comparison operators on vectors. For example:

x <- c(2.1, 4.2, 3.3, 5.4)
x > 3
#> [1] FALSE  TRUE  TRUE  TRUE

This command tests, for every element of the vector, whether the condition stated by the comparison operator is TRUE or FALSE, and it returns a logical vector!

We can now pass this statement between the square brackets that follow x to subset only those items that match TRUE:

x[x > 3]
#> [1] 4.2 3.3 5.4

You can combine conditional statements with & (and), | (or), and ! (not):

x <- c(2.1, 4.2, 3.3, 5.4)

# combining two conditional statements with &
x > 3 & x < 5
#> [1] FALSE  TRUE  TRUE FALSE
x[x > 3 & x < 5]
#> [1] 4.2 3.3

# combining two conditional statements with |
x < 3 | x > 5 
#> [1]  TRUE FALSE FALSE  TRUE
x[x < 3 | x > 5]
#> [1] 2.1 5.4

# negating a conditional statement with ! (the parentheses make the order explicit)
!(x > 5)
#> [1]  TRUE  TRUE  TRUE FALSE
x[!(x > 5)]
#> [1] 2.1 4.2 3.3

Another way to generate implicit conditional statements is using the %in% operator, which tests whether an item is in a set:

x <- c(2.1, 4.2, 3.3, 5.4)

# generate implicit logical vectors through the %in% operator
x %in% c(3.3, 4.2)
#> [1] FALSE  TRUE  TRUE FALSE
x[x %in% c(3.3, 4.2)]
#> [1] 4.2 3.3
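
One caveat when matching decimal numbers, sketched here with a hypothetical vector y: both == and %in% compare values exactly, so a number that is the result of arithmetic may not match the literal you expect. The tolerance 1e-8 below is an arbitrary choice for illustration.

```r
y <- c(0.1 + 0.2, 0.5)

# Exact comparison misses a value that prints as 0.3
is_exact <- y %in% 0.3
is_exact
#> [1] FALSE FALSE

# Comparing within a small tolerance finds it
is_close <- abs(y - 0.3) < 1e-8
is_close
#> [1]  TRUE FALSE
y[is_close]
#> [1] 0.3
```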

8.1.3 Challenge

Subset country.vector below to return every value EXCEPT "Canada" and "Brazil".

country.vector <- c("Afghanistan", "Canada", "Sierra Leone", "Denmark", "Japan", "Brazil")

# Do it using positive integers

# Do it using negative integers

# Do it using a logical vector

# Do it using a conditional statement (and an implicit logical vector)

8.2 Subsetting Lists

Subsetting a list works in the same way as subsetting an atomic vector. However, there's one important difference: [ will always return a list. [[ and $, as described below, let you pull out the components of the list.

Let's illustrate with the following list my_list:

my_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

8.2.1 With [

[ extracts a sub-list, so the result is always a list. As with vectors, you can subset with a logical, integer, or character vector.

my_list[1:2]
#> $a
#> [1] 1 2 3
#> 
#> $b
#> [1] "a string"
str(my_list[1:2])
#> List of 2
#>  $ a: int [1:3] 1 2 3
#>  $ b: chr "a string"

my_list[4]
#> $d
#> $d[[1]]
#> [1] -1
#> 
#> $d[[2]]
#> [1] -5
str(my_list[4])
#> List of 1
#>  $ d:List of 2
#>   ..$ : num -1
#>   ..$ : num -5

my_list["a"]
#> $a
#> [1] 1 2 3
str(my_list["a"])
#> List of 1
#>  $ a: int [1:3] 1 2 3

8.2.2 With [[

[[ extracts a single component from a list. In other words, it removes that hierarchy and returns whatever object is stored inside.

my_list[[1]]
#> [1] 1 2 3
str(my_list[[1]])
#>  int [1:3] 1 2 3

# compare to
my_list[1]
#> $a
#> [1] 1 2 3
str(my_list[1])
#> List of 1
#>  $ a: int [1:3] 1 2 3

The distinction between [ and [[ is really important for lists, because [[ drills down into the list while [ returns a new, smaller list.

"If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6."

--- (???)
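
The difference is easiest to see on the nested element d of my_list. This is a minimal sketch; the names inner and first_of_d are illustrative only.

```r
my_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# [ keeps the list wrapper: the result is a list of length 1
class(my_list["d"])
#> [1] "list"
length(my_list["d"])
#> [1] 1

# [[ returns the stored object, which here is itself a list of length 2
inner <- my_list[["d"]]
length(inner)
#> [1] 2

# Chaining [[ drills down one level at a time
first_of_d <- my_list[["d"]][[1]]
first_of_d
#> [1] -1
```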

8.2.3 With $

$ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.

my_list$a
#> [1] 1 2 3

# same as
my_list[["a"]]
#> [1] 1 2 3

The $ operator becomes especially helpful when applied to dataframes, explained more below.
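
Two differences between $ and [[ can cause surprises. The sketch below uses a hypothetical list called scores to show them: $ treats what follows it as a literal name, and $ matches names partially while [[ matches them exactly by default.

```r
scores <- list(alpha = 1:3, beta = "a string")

# $ takes the name literally, so a name stored in a variable needs [[
var <- "alpha"
scores$var
#> NULL
scores[[var]]
#> [1] 1 2 3

# $ does partial matching on names, whereas [[ matches exactly by default
scores$al
#> [1] 1 2 3
scores[["al"]]
#> NULL
```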

8.2.4 Challenge

Take a look at the linear model below:

mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -4.543 -2.365 -0.125  1.410  6.873 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   37.285      1.878   19.86  < 2e-16 ***
#> wt            -5.344      0.559   -9.56  1.3e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.05 on 30 degrees of freedom
#> Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
#> F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

Extract the R squared from the model summary.

8.3 Subsetting Matrices

Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. But whereas vectors have one dimension, matrices have two dimensions. We therefore have to use two subsetting vectors -- one for rows to select, another for columns -- separated by a comma.

Check out the following matrix:

a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a
#>      A B C
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9

We can subset this matrix by passing two subsetting vectors: one to select rows, another to select columns:

# selects the value at the first row and second column
a[1, 2] 
#> B 
#> 4

# selects first row, and the first and third columns
a[1, -2] 
#> A C 
#> 1 7

# selects first two rows, and the first and third columns
a[c(1,2), c(1, 3)] 
#>      A C
#> [1,] 1 7
#> [2,] 2 8

Blank subsetting is also useful because it lets you keep all rows or all columns.

a[c(1, 2), ] # selects first two rows and all columns
#>      A B C
#> [1,] 1 4 7
#> [2,] 2 5 8
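
When a subset has only one row or one column, [ simplifies the result to a vector by default. If you need the result to stay a matrix, the drop = FALSE argument prevents this. A minimal sketch (row1 and row1_matrix are illustrative names):

```r
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")

# By default a single row is simplified to a vector
row1 <- a[1, ]
row1
#> A B C
#> 1 4 7
is.matrix(row1)
#> [1] FALSE

# drop = FALSE keeps the result as a 1 x 3 matrix
row1_matrix <- a[1, , drop = FALSE]
row1_matrix
#>      A B C
#> [1,] 1 4 7
dim(row1_matrix)
#> [1] 1 3
```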

8.4 Subsetting Dataframes

Data from data frames can be addressed like matrices, using two vectors separated by a comma.

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

planets <- data.frame(name, type, diameter, rings, stringsAsFactors = FALSE)
planets
#>      name               type diameter rings
#> 1 Mercury Terrestrial planet    0.382 FALSE
#> 2   Venus Terrestrial planet    0.949 FALSE
#> 3   Earth Terrestrial planet    1.000 FALSE
#> 4    Mars Terrestrial planet    0.532 FALSE
#> 5 Jupiter          Gas giant   11.209  TRUE
#> 6  Saturn          Gas giant    9.449  TRUE
#> 7  Uranus          Gas giant    4.007  TRUE
#> 8 Neptune          Gas giant    3.883  TRUE

Let's try some subsetting now.

# Print out diameter of Mercury (row 1, column 3)
planets[1, 3]
#> [1] 0.382

# Print out data for Mars (entire fourth row)
planets[4, ]
#>   name               type diameter rings
#> 4 Mars Terrestrial planet    0.532 FALSE

# Print first two rows of the first two columns
planets[1:2, 1:2]
#>      name               type
#> 1 Mercury Terrestrial planet
#> 2   Venus Terrestrial planet

8.4.1 Subsetting Names and $

Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

Suppose you want to select the first three elements of the type column. One way to do this is

planets[1:3, 2]
#> [1] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet"

A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:

planets[1:3, "type"]
#> [1] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet"

You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable "diameter", for example, both of these will do the trick:

planets[,3]
#> [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883
planets[,"diameter"]
#> [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883

However, there is a short-cut. If your columns have names, you can use the $ sign:

planets$diameter
#> [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883

Remember that data frames are really lists of vectors (one vector per column). Just as list$name selects the name element from the list, df$name selects the name column (vector) from the dataframe.

8.4.2 Conditional Subsetting

What if we want to subset the dataset based on some condition? Let's say we want to extract all the planets with a diameter greater than 3. We could inspect the dataset and record all the observations that fit that description, but that's tedious and error prone.

There's a better way! We can combine two powerful subsetting tools: the $ operator and conditional subsetting.

First, we extract the diameter column.

diameters <- planets$diameter

Then, we find the elements that are greater than 3.

diameters > 3
#> [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

It's a logical vector! We can now use this inside [ , ] to extract all planets with diameter > 3.

Think: Are we subsetting rows or columns here?

planets[diameters > 3, ]
#>      name      type diameter rings
#> 5 Jupiter Gas giant    11.21  TRUE
#> 6  Saturn Gas giant     9.45  TRUE
#> 7  Uranus Gas giant     4.01  TRUE
#> 8 Neptune Gas giant     3.88  TRUE

# same as
# planets[planets$diameter > 3, ]

Because it allows you to easily combine conditions from multiple columns, logical subsetting is probably the most commonly used technique for extracting rows out of a data frame.
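
As a sketch of combining conditions from more than one column, the example below selects ringed planets with a diameter below 5 and keeps only two columns. The planets data frame is rebuilt here without the type column so the block is self-contained; small_ringed is an illustrative name.

```r
planets <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Rows: planets with rings AND diameter below 5; columns: name and diameter
small_ringed <- planets[planets$rings & planets$diameter < 5, c("name", "diameter")]
small_ringed$name
#> [1] "Uranus"  "Neptune"
```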

8.4.3 List-Like and Matrix-Like Subsetting

Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and return only the columns; if you subset with two vectors, they behave like matrices.

df <- data.frame(x = 4:6, y = 3:1, z = letters[1:3])

# Like a list:
df[c("x", "z")]
#>   x z
#> 1 4 a
#> 2 5 b
#> 3 6 c

# Like a matrix
df[, c("x", "z")]
#>   x z
#> 1 4 a
#> 2 5 b
#> 3 6 c

But there’s an important difference when you select a single column: matrix subsetting simplifies by default, list subsetting does not.

df <- data.frame(x = 4:6, y = 3:1, z = letters[1:3])

# like a list
df["x"]
#>   x
#> 1 4
#> 2 5
#> 3 6
class(df["x"])
#> [1] "data.frame"

# like a matrix
df[, "x"]
#> [1] 4 5 6
class(df[, "x"])
#> [1] "integer"
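
If you want matrix-style subsetting but need a data frame back even for a single column, drop = FALSE switches off the simplification. A minimal sketch (kept is an illustrative name):

```r
df <- data.frame(x = 4:6, y = 3:1, z = letters[1:3])

# drop = FALSE keeps the one-column result as a data frame
kept <- df[, "x", drop = FALSE]
kept
#>   x
#> 1 4
#> 2 5
#> 3 6
class(kept)
#> [1] "data.frame"
```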

8.4.4 Challenges

Challenge 1.

Fix each of the following common data frame subsetting errors:

# check out what we're dealing with
mtcars

# fix
mtcars[mtcars$cyl = 4, ]
mtcars[-1:4, ]
mtcars[mtcars$cyl <= 5]
mtcars[mtcars$cyl == 4 | 6, ]

Challenge 2.

Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

8.5 Sub-assignment

8.5.1 Basics of Sub-assignment

All subsetting operators can be combined with assignment to modify selected values of the input vector.

x <- 1:5
x[c(1, 2)] <- 2:3
x
#> [1] 2 3 3 4 5

This is especially useful when conditionally modifying vectors. For example, let's say we wanted to replace all values less than 3 with NA.

x <- 1:5
x[x < 3] <- NA
x
#> [1] NA NA  3  4  5
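
The same pattern works in the other direction. As a sketch, the example below replaces missing values with 0, using is.na() to build the logical index; the input vector is made up for illustration.

```r
x <- c(1, NA, 3, NA, 5)

# is.na(x) gives a logical vector marking the missing positions
is.na(x)
#> [1] FALSE  TRUE FALSE  TRUE FALSE

# Assign 0 to exactly those positions
x[is.na(x)] <- 0
x
#> [1] 1 0 3 0 5
```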

This also works on dataframes. Let's say we wanted to modify our planets dataframe.

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

planets <- data.frame(name, type, diameter, rings, stringsAsFactors = FALSE)
planets
#>      name               type diameter rings
#> 1 Mercury Terrestrial planet    0.382 FALSE
#> 2   Venus Terrestrial planet    0.949 FALSE
#> 3   Earth Terrestrial planet    1.000 FALSE
#> 4    Mars Terrestrial planet    0.532 FALSE
#> 5 Jupiter          Gas giant   11.209  TRUE
#> 6  Saturn          Gas giant    9.449  TRUE
#> 7  Uranus          Gas giant    4.007  TRUE
#> 8 Neptune          Gas giant    3.883  TRUE

Let's say we want to replace the term "Terrestrial planet" with "TP". First we need to subset type for those elements:

planets$type == "Terrestrial planet"
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

Now we can re-assign the values of type:

planets$type[planets$type == "Terrestrial planet"]
#> [1] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet"
#> [4] "Terrestrial planet"
planets$type[planets$type == "Terrestrial planet"] <- "TP"
planets
#>      name      type diameter rings
#> 1 Mercury        TP    0.382 FALSE
#> 2   Venus        TP    0.949 FALSE
#> 3   Earth        TP    1.000 FALSE
#> 4    Mars        TP    0.532 FALSE
#> 5 Jupiter Gas giant   11.209  TRUE
#> 6  Saturn Gas giant    9.449  TRUE
#> 7  Uranus Gas giant    4.007  TRUE
#> 8 Neptune Gas giant    3.883  TRUE

8.5.2 Recycling

When applying an operation to two vectors that requires them to be the same length, R automatically recycles, or repeats, the shorter one, until it is long enough to match the longer one.

df <- data.frame(x = 4:7, y = letters[1:4])

# R recycles values
df$x <- c(1, 2)
df
#>   x y
#> 1 1 a
#> 2 2 b
#> 3 1 c
#> 4 2 d

# recycling is helpful when you want to apply a single value to every element of a vector.
df$x <- df$x + 3
df
#>   x y
#> 1 4 a
#> 2 5 b
#> 3 4 c
#> 4 5 d
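
Recycling is silent only when the longer length is a multiple of the shorter one. The sketch below shows both cases on plain vectors; the tryCatch() wrapper and the name outcome are added only so the warning can be shown as a value.

```r
# Length 6 is a multiple of length 2, so recycling is silent
1:6 + c(10, 20)
#> [1] 11 22 13 24 15 26

# Length 5 is not a multiple of 2: R signals a warning
outcome <- tryCatch(1:5 + c(10, 20), warning = function(w) "warning")
outcome
#> [1] "warning"
```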

8.5.3 Applications

The basic principles described above give rise to a wide variety of useful applications. Some of the most important applications are described below. Many of these basic techniques are wrapped up into more concise functions (e.g., subset(), merge(), plyr::arrange()), but it is useful to understand how they are implemented with basic subsetting. This will allow you to adapt to new situations that are not dealt with by existing functions.

Ordering Columns

Suppose we have this data frame:

df <- data.frame(
  Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
  Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
  Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)
df
#>          Country         Region Language
#> 1           Iraq    Middle East   Arabic
#> 2          China           Asia Mandarin
#> 3         Mexico  North America  Spanish
#> 4         Russia Eastern Europe  Russian
#> 5 United Kingdom Western Europe  English

What if we wanted to reorder the columns so that Region is first? We can do so using subsetting with the names (or number) of the columns:

df <- data.frame(
  Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
  Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
  Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)

# reorder columns using names
names(df)
#> [1] "Country"  "Region"   "Language"
df1 <- df[, c("Region", "Country", "Language")]
df1
#>           Region        Country Language
#> 1    Middle East           Iraq   Arabic
#> 2           Asia          China Mandarin
#> 3  North America         Mexico  Spanish
#> 4 Eastern Europe         Russia  Russian
#> 5 Western Europe United Kingdom  English

# reorder columns using indices
names(df)
#> [1] "Country"  "Region"   "Language"
df1 <- df[, c(2,1,3)]
df1
#>           Region        Country Language
#> 1    Middle East           Iraq   Arabic
#> 2           Asia          China Mandarin
#> 3  North America         Mexico  Spanish
#> 4 Eastern Europe         Russia  Russian
#> 5 Western Europe United Kingdom  English

One helpful function is order(). It takes a vector as input and returns an integer vector describing how to subset that vector to put it in sorted order:

x <- c("b", "c", "a")
order(x)
#> [1] 3 1 2
x[order(x)]
#> [1] "a" "b" "c"

Knowing this, we can use order to reorder our columns by alphabetical order.
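
A minimal sketch of that idea, applied to the same data frame (col_order and df_sorted are illustrative names):

```r
df <- data.frame(
  Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
  Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
  Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)

# order() on the column names gives their alphabetical positions
col_order <- order(names(df))
col_order
#> [1] 1 3 2

# Use those positions as the column index
df_sorted <- df[, col_order]
names(df_sorted)
#> [1] "Country"  "Language" "Region"
```

The same approach reorders rows if the result of order() goes before the comma, for example df[order(df$Country), ].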

Removing (or Keeping) Columns

There are two ways to remove columns from a data frame. You can set individual columns to NULL:

df <- data.frame(
  Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
  Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
  Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)

df$Language <- NULL

Or you can subset to return only the columns you want:

df <- data.frame(
  Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
  Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
  Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)

df1 <- df[, c("Country", "Region")]
df1
#>          Country         Region
#> 1           Iraq    Middle East
#> 2          China           Asia
#> 3         Mexico  North America
#> 4         Russia Eastern Europe
#> 5 United Kingdom Western Europe

# using negative integers
df2 <- df[, -3]
df2
#>          Country         Region
#> 1           Iraq    Middle East
#> 2          China           Asia
#> 3         Mexico  North America
#> 4         Russia Eastern Europe
#> 5 United Kingdom Western Europe
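
If you know the names of the columns you don't want, one further option is to use setdiff() to work out which columns to keep. This is a sketch; keep and df3 are illustrative names.

```r
df <- data.frame(
  Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
  Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
  Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)

# All column names except "Language"
keep <- setdiff(names(df), "Language")
keep
#> [1] "Country" "Region"

df3 <- df[, keep]
names(df3)
#> [1] "Country" "Region"
```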