Chapter 8 Subsetting
When working with data, you'll need to subset objects early and often. Luckily, R's subsetting operators are powerful and fast. Mastery of subsetting allows you to succinctly express complex operations in a way that few other languages can match. Subsetting is hard to learn because you need to master a number of interrelated concepts:
The three subsetting operators: [, [[, and $.
The four types of subsetting.
The important differences in behaviour for different objects (e.g., vectors, lists, factors, matrices, and data frames).
The use of subsetting in conjunction with assignment.
This unit helps you master subsetting by starting with the simplest type of subsetting: subsetting an atomic vector with [. It then gradually extends your knowledge, first to more complicated data types (like dataframes and lists), and then to the other subsetting operators, [[ and $. You'll then learn how subsetting and assignment can be combined to modify parts of an object, and, finally, you'll see a large number of useful applications.
8.1 Subsetting Vectors
It's easiest to learn how subsetting works for vectors, and then how it generalises to higher dimensions and other more complicated objects. We'll start with [, the most commonly used operator.
8.1.1 Subsetting Types
Let's explore the different types of subsetting with a simple vector, x.
x <- c(2.1, 4.2, 3.3, 5.4)
Note that the number after the decimal point gives the original position in the vector.
There are four things you can use to subset a vector:
1. Positive integers return elements at the specified positions:
(x <- c(2.1, 4.2, 3.3, 5.4))
#> [1] 2.1 4.2 3.3 5.4
x[1]
#> [1] 2.1
We can also index multiple values by passing a vector of integers:
(x <- c(2.1, 4.2, 3.3, 5.4))
#> [1] 2.1 4.2 3.3 5.4
x[c(3, 1)]
#> [1] 3.3 2.1
# Duplicated indices yield duplicated values
x[c(1, 1)]
#> [1] 2.1 2.1
Note that you have to wrap the positions in c() inside the [ for this to work!
More examples:
# `order(x)` gives the index positions of smallest to largest values.
(x <- c(2.1, 4.2, 3.3, 5.4))
#> [1] 2.1 4.2 3.3 5.4
order(x)
#> [1] 1 3 2 4
# use this to order values.
x[order(x)]
#> [1] 2.1 3.3 4.2 5.4
x[c(1, 3, 2, 4)]
#> [1] 2.1 3.3 4.2 5.4
2. Negative integers omit elements at the specified positions:
x <- c(2.1, 4.2, 3.3, 5.4)
x[-1]
#> [1] 4.2 3.3 5.4
x[-c(3, 1)]
#> [1] 4.2 5.4
You can't mix positive and negative integers in a single subset:
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(-1, 2)]
#> Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
3. Character vectors return elements with matching names. This only works if the vector is named.
x <- c(2.1, 4.2, 3.3, 5.4)
# apply names
names(x) <- c("a", "b", "c", "d")
# subset using names
x[c("d", "c", "a")]
#> d c a
#> 5.4 3.3 2.1
# Like integer indices, you can repeat indices
x[c("a", "a", "a")]
#> a a a
#> 2.1 2.1 2.1
# Careful! Names are always matched exactly
x <- c(abc = 1, def = 2)
x[c("a", "d")]
#> <NA> <NA>
#> NA NA
4. Logical vectors select elements where the corresponding logical value is TRUE.
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(TRUE, TRUE, FALSE, FALSE)]
#> [1] 2.1 4.2
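The four index types can all express the same selection. The following is a minimal sketch, assuming a named copy of x; the names a to d and the by_* variables are illustrative and not part of the original example.

```r
# A named copy of x, so that all four index types can be used
x <- c(a = 2.1, b = 4.2, c = 3.3, d = 5.4)

by_position <- x[c(1, 3)]                      # positive integers
by_omission <- x[-c(2, 4)]                     # negative integers
by_name     <- x[c("a", "c")]                  # character vector
by_logical  <- x[c(TRUE, FALSE, TRUE, FALSE)]  # logical vector

by_position
#>   a   c 
#> 2.1 3.3 

# Each approach returns the same named vector
identical(by_position, by_omission)
#> [1] TRUE
identical(by_position, by_name)
#> [1] TRUE
identical(by_position, by_logical)
#> [1] TRUE
```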
8.1.2 Conditional Subsetting
Logical subsetting is the most useful type of subsetting, because you use it to subset based on conditional or comparative statements.
The (logical) comparison operators known to R are:
< for less than
> for greater than
<= for less than or equal to
>= for greater than or equal to
== for equal to each other
!= for not equal to each other
A nice feature of R is that these comparison operators also work on whole vectors. For example:
x <- c(2.1, 4.2, 3.3, 5.4)
x > 3
#> [1] FALSE TRUE TRUE TRUE
This command tests, for every element of the vector, whether the condition stated by the comparison operator is TRUE or FALSE, and it returns a logical vector!
We can now pass this statement between the square brackets that follow x to subset only those items that match TRUE:
x[x > 3]
#> [1] 4.2 3.3 5.4
You can combine conditional statements with & (and), | (or), and ! (not):
x <- c(2.1, 4.2, 3.3, 5.4)
# combining two conditional statements with &
x > 3 & x < 5
#> [1] FALSE TRUE TRUE FALSE
x[x > 3 & x < 5]
#> [1] 4.2 3.3
# combining two conditional statements with |
x < 3 | x > 5
#> [1] TRUE FALSE FALSE TRUE
x[x < 3 | x > 5]
#> [1] 2.1 5.4
# negating a conditional statement with !
# (the parentheses make explicit that ! applies to the whole comparison)
!(x > 5)
#> [1] TRUE TRUE TRUE FALSE
x[!(x > 5)]
#> [1] 2.1 4.2 3.3
Another way to generate implicit conditional statements is using the %in% operator, which tests whether an item is in a set:
x <- c(2.1, 4.2, 3.3, 5.4)
# generate implicit logical vectors through the %in% operator
x %in% c(3.3, 4.2)
#> [1] FALSE TRUE TRUE FALSE
x[x %in% c(3.3, 4.2)]
#> [1] 4.2 3.3
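It can help to see why %in% is used for set membership rather than ==. The sketch below reuses x and introduces an illustrative vector called targets. With ==, R compares the two vectors position by position, repeating the shorter one as needed, which is a different question from "is this value in the set?".

```r
x <- c(2.1, 4.2, 3.3, 5.4)
targets <- c(4.2, 3.3)  # illustrative set of values to look for

# == compares position by position, repeating targets along x:
# 2.1 vs 4.2, 4.2 vs 3.3, 3.3 vs 4.2, 5.4 vs 3.3
x == targets
#> [1] FALSE FALSE FALSE FALSE

# %in% asks, for each element of x, whether it appears anywhere in targets
x %in% targets
#> [1] FALSE  TRUE  TRUE FALSE
```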
8.1.3 Challenge
Subset country.vector below to return every value EXCEPT "Canada" and "Brazil".
country.vector <- c("Afghanistan", "Canada", "Sierra Leone", "Denmark", "Japan", "Brazil")
# Do it using positive integers
# Do it using negative integers
# Do it using a logical vector
# Do it using a conditional statement (and an implicit logical vector)
8.2 Subsetting Lists
Subsetting a list works in the same way as subsetting an atomic vector. However, there's one important difference: [ will always return a list. [[ and $, as described below, let you pull out the components of the list.
Let's illustrate with the following list my_list:
my_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
8.2.1 With [
[ extracts a sub-list; the result is always a list. As with vectors, you can subset with a logical, integer, or character vector.
my_list[1:2]
#> $a
#> [1] 1 2 3
#>
#> $b
#> [1] "a string"
str(my_list[1:2])
#> List of 2
#> $ a: int [1:3] 1 2 3
#> $ b: chr "a string"
my_list[4]
#> $d
#> $d[[1]]
#> [1] -1
#>
#> $d[[2]]
#> [1] -5
str(my_list[4])
#> List of 1
#> $ d:List of 2
#> ..$ : num -1
#> ..$ : num -5
my_list["a"]
#> $a
#> [1] 1 2 3
str(my_list["a"])
#> List of 1
#> $ a: int [1:3] 1 2 3
8.2.2 With [[
[[ extracts a single component from a list. In other words, it removes one level of hierarchy and returns whatever object is stored inside.
my_list[[1]]
#> [1] 1 2 3
str(my_list[[1]])
#> int [1:3] 1 2 3
# compare to
my_list[1]
#> $a
#> [1] 1 2 3
str(my_list[1])
#> List of 1
#> $ a: int [1:3] 1 2 3
The distinction between [ and [[ is really important for lists, because [[ drills down into the list while [ returns a new, smaller list.
"If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6." --- (???)
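The train analogy can be checked directly on my_list. This is a small sketch that rebuilds the list defined earlier; the variable names wrapped and inner are illustrative.

```r
my_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# [ keeps the list wrapper: a list of length 1 that contains the inner list
wrapped <- my_list["d"]
class(wrapped)
#> [1] "list"
length(wrapped)
#> [1] 1

# [[ returns the inner list itself, so a second [[ can reach its elements
inner <- my_list[["d"]]
length(inner)
#> [1] 2
inner[[1]]
#> [1] -1
my_list[["d"]][[2]]
#> [1] -5
```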
8.2.3 With $
$ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.
my_list$a
#> [1] 1 2 3
# same as
my_list[["a"]]
#> [1] 1 2 3
The $ operator becomes especially helpful when applied to dataframes, explained more below.
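One practical consequence of $ not taking quotes is that it cannot use a name stored in a variable. The sketch below illustrates this with a hypothetical variable called var; in that situation [[ is the operator to reach for.

```r
my_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
var <- "a"  # hypothetical variable holding an element name

# $ looks for an element literally named "var", which does not exist
my_list$var
#> NULL

# [[ evaluates var first, so it looks up the element named "a"
my_list[[var]]
#> [1] 1 2 3
```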
8.2.4 Challenge
Take a look at the linear model below:
mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
#>
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.543 -2.365 -0.125 1.410 6.873
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.285 1.878 19.86 < 2e-16 ***
#> wt -5.344 0.559 -9.56 1.3e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.05 on 30 degrees of freedom
#> Multiple R-squared: 0.753, Adjusted R-squared: 0.745
#> F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
Extract the R squared from the model summary.
8.3 Subsetting Matrices
Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. But whereas vectors have one dimension, matrices have two dimensions. We therefore have to use two subsetting vectors -- one for rows to select, another for columns -- separated by a comma.
Check out the following matrix:
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a
#> A B C
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
We can subset this matrix by passing two subsetting vectors: one to select rows, another to select columns:
# selects the value at the first row and second column
a[1, 2]
#> B
#> 4
# selects first row, and the first and third columns
a[1, -2]
#> A C
#> 1 7
# selects first two rows, and the first and third columns
a[c(1,2), c(1, 3)]
#> A C
#> [1,] 1 7
#> [2,] 2 8
Blank subsetting is also useful because it lets you keep all rows or all columns.
a[c(1, 2), ] # selects first two rows and all columns
#> A B C
#> [1,] 1 4 7
#> [2,] 2 5 8
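The same idea works in the other direction: leaving the row index blank keeps every row of the chosen columns. A short sketch using the matrix a from above:

```r
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")

# Blank row index: all rows, columns A and C only
a[, c("A", "C")]
#>      A C
#> [1,] 1 7
#> [2,] 2 8
#> [3,] 3 9

# Selecting a single column returns a plain vector
a[, "B"]
#> [1] 4 5 6
```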
8.4 Subsetting Dataframes
Data from data frames can be addressed like matrices, using two vectors separated by a comma.
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets <- data.frame(name, type, diameter, rings, stringsAsFactors = FALSE)
planets
#> name type diameter rings
#> 1 Mercury Terrestrial planet 0.382 FALSE
#> 2 Venus Terrestrial planet 0.949 FALSE
#> 3 Earth Terrestrial planet 1.000 FALSE
#> 4 Mars Terrestrial planet 0.532 FALSE
#> 5 Jupiter Gas giant 11.209 TRUE
#> 6 Saturn Gas giant 9.449 TRUE
#> 7 Uranus Gas giant 4.007 TRUE
#> 8 Neptune Gas giant 3.883 TRUE
Let's try some subsetting now.
# Print out diameter of Mercury (row 1, column 3)
planets[1, 3]
#> [1] 0.382
# Print out data for Mars (entire fourth row)
planets[4, ]
#> name type diameter rings
#> 4 Mars Terrestrial planet 0.532 FALSE
# Print first two rows of the first two columns
planets[1:2, 1:2]
#> name type
#> 1 Mercury Terrestrial planet
#> 2 Venus Terrestrial planet
8.4.1 Subsetting Names and $
Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.
Suppose you want to select the first three elements of the type column. One way to do this is
planets[1:3, 2]
#> [1] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet"
A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:
planets[1:3, "type"]
#> [1] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet"
You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable "diameter", for example, both of these will do the trick:
planets[,3]
#> [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
planets[,"diameter"]
#> [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
However, there is a short-cut. If your columns have names, you can use the $ sign:
planets$diameter
#> [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
Remember that data frames are really lists of vectors (one vector per column). Just as list$name selects the name element from the list, df$name selects the name column (vector) from the dataframe.
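The list nature of a data frame can be verified with a few base R calls. The sketch below uses a shortened copy of planets (three rows, two columns) purely for illustration.

```r
# Shortened, illustrative copy of the planets data frame
planets <- data.frame(
  name = c("Mercury", "Venus", "Earth"),
  diameter = c(0.382, 0.949, 1),
  stringsAsFactors = FALSE
)

# A data frame is a list under the hood
is.list(planets)
#> [1] TRUE

# so $, [[ and matrix-style indexing all return the same column vector
identical(planets$diameter, planets[["diameter"]])
#> [1] TRUE
identical(planets$diameter, planets[, "diameter"])
#> [1] TRUE

# and its length is the number of columns, not rows
length(planets)
#> [1] 2
```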
8.4.2 Conditional Subsetting
What if we want to subset the dataset based on some condition? Let's say we want to extract all the planets with a diameter greater than 3. We could inspect the dataset and record all the observations that fit that description, but that's tedious and error prone.
There's a better way! We can combine two powerful subsetting tools: the $ operator and conditional subsetting.
First, we extract the diameter column.
diameters <- planets$diameter
Then, we find the elements that are greater than 3.
diameters > 3
#> [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
It's a boolean vector! We can now use this inside [ , ] to extract all planets with diameter > 3.
Think: Are we subsetting rows or columns here?
planets[diameters > 3, ]
#> name type diameter rings
#> 5 Jupiter Gas giant 11.21 TRUE
#> 6 Saturn Gas giant 9.45 TRUE
#> 7 Uranus Gas giant 4.01 TRUE
#> 8 Neptune Gas giant 3.88 TRUE
# same as
# planets[planets$diameter > 3, ]
Because it allows you to easily combine conditions from multiple columns, logical subsetting is probably the most commonly used technique for extracting rows out of a data frame.
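As a sketch of combining conditions from more than one column, the example below selects ringed planets with a diameter under 5. It rebuilds a three-column version of planets so that it runs on its own; the variable small_ringed is illustrative.

```r
planets <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Two conditions, one per column, combined with &
small_ringed <- planets[planets$rings & planets$diameter < 5, ]
small_ringed$name
#> [1] "Uranus"  "Neptune"
```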
8.4.3 List-Like and Matrix-Like Subsetting
Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and return only the selected columns; if you subset with two vectors, they behave like matrices.
df <- data.frame(x = 4:6, y = 3:1, z = letters[1:3])
# Like a list:
df[c("x", "z")]
#> x z
#> 1 4 a
#> 2 5 b
#> 3 6 c
# Like a matrix
df[, c("x", "z")]
#> x z
#> 1 4 a
#> 2 5 b
#> 3 6 c
But there’s an important difference when you select a single column: matrix subsetting simplifies by default, list subsetting does not.
df <- data.frame(x = 4:6, y = 3:1, z = letters[1:3])
# like a list
df["x"]
#> x
#> 1 4
#> 2 5
#> 3 6
class(df["x"])
#> [1] "data.frame"
# like a matrix
df[, "x"]
#> [1] 4 5 6
class(df[, "x"])
#> [1] "integer"
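If you want matrix-style subsetting to keep the data frame structure for a single column, base R's `[` accepts a drop argument. A minimal sketch, with kept as an illustrative variable name:

```r
df <- data.frame(x = 4:6, y = 3:1, z = letters[1:3])

# drop = FALSE turns off simplification, so the result stays a data frame
kept <- df[, "x", drop = FALSE]
class(kept)
#> [1] "data.frame"
dim(kept)
#> [1] 3 1
```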
8.4.4 Challenges
Challenge 1.
Fix each of the following common data frame subsetting errors:
# check out what we're dealing with
mtcars
# fix
mtcars[mtcars$cyl = 4, ]
mtcars[-1:4, ]
mtcars[mtcars$cyl <= 5]
mtcars[mtcars$cyl == 4 | 6, ]
Challenge 2.
Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?
8.5 Sub-assignment
8.5.1 Basics of Sub-assignment
All subsetting operators can be combined with assignment to modify selected values of the input vector.
x <- 1:5
x[c(1, 2)] <- 2:3
x
#> [1] 2 3 3 4 5
This is especially useful when conditionally modifying vectors. For example, let's say we wanted to replace all values less than 3 with NA.
x <- 1:5
x[x < 3] <- NA
x
#> [1] NA NA 3 4 5
This also works on dataframes. Let's say we wanted to modify our planets dataframe.
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets <- data.frame(name, type, diameter, rings, stringsAsFactors = FALSE)
planets
#> name type diameter rings
#> 1 Mercury Terrestrial planet 0.382 FALSE
#> 2 Venus Terrestrial planet 0.949 FALSE
#> 3 Earth Terrestrial planet 1.000 FALSE
#> 4 Mars Terrestrial planet 0.532 FALSE
#> 5 Jupiter Gas giant 11.209 TRUE
#> 6 Saturn Gas giant 9.449 TRUE
#> 7 Uranus Gas giant 4.007 TRUE
#> 8 Neptune Gas giant 3.883 TRUE
Let's say we want to replace the term "Terrestrial planet" with "TP". First we need to subset type for those elements:
planets$type == "Terrestrial planet"
#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
Now we can re-assign the values of type:
planets$type[planets$type == "Terrestrial planet"]
#> [1] "Terrestrial planet" "Terrestrial planet" "Terrestrial planet"
#> [4] "Terrestrial planet"
planets$type[planets$type == "Terrestrial planet"] <- "TP"
planets
#> name type diameter rings
#> 1 Mercury TP 0.382 FALSE
#> 2 Venus TP 0.949 FALSE
#> 3 Earth TP 1.000 FALSE
#> 4 Mars TP 0.532 FALSE
#> 5 Jupiter Gas giant 11.209 TRUE
#> 6 Saturn Gas giant 9.449 TRUE
#> 7 Uranus Gas giant 4.007 TRUE
#> 8 Neptune Gas giant 3.883 TRUE
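The same replacement can also be written with matrix-style indexing, giving the row condition and the column name inside a single pair of brackets. The sketch below uses a shortened copy of planets and an illustrative replacement label "GG".

```r
# Shortened, illustrative copy of the planets data frame
planets <- data.frame(
  name = c("Earth", "Mars", "Jupiter", "Saturn"),
  type = c("Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant"),
  stringsAsFactors = FALSE
)

# Rows are chosen by the condition, the column by name, then the value is assigned
planets[planets$type == "Gas giant", "type"] <- "GG"
unique(planets$type)
#> [1] "Terrestrial planet" "GG"
```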
8.5.2 Recycling
When applying an operation to two vectors that requires them to be the same length, R automatically recycles, or repeats, the shorter one, until it is long enough to match the longer one.
df <- data.frame(x = 4:7, y = letters[1:4])
# R recycles the two values to fill all four rows
df$x <- c(1, 2)
df
#> x y
#> 1 1 a
#> 2 2 b
#> 3 1 c
#> 4 2 d
# recycling is also what lets a single value apply to a whole vector: here 3 is added to every element of x
df$x <- df$x + 3
df
#> x y
#> 1 4 a
#> 2 5 b
#> 3 4 c
#> 4 5 d
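Recycling is not specific to data frames; it can be seen with two plain vectors. In this sketch the names long and short are illustrative.

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is repeated to 10, 20, 10, 20 before the addition
long + short
#> [1] 11 22 13 24

# Same result when the repetition is written out by hand
identical(long + short, long + rep(short, times = 2))
#> [1] TRUE
```

When the longer length is not a multiple of the shorter one, arithmetic still recycles but R issues a warning, which is usually a sign that the lengths were not what you intended.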
8.5.3 Applications
The basic principles described above give rise to a wide variety of useful applications. Some of the most important applications are described below. Many of these basic techniques are wrapped up into more concise functions (e.g., subset(), merge(), plyr::arrange()), but it is useful to understand how they are implemented with basic subsetting. This will allow you to adapt to new situations that are not dealt with by existing functions.
Ordering Columns
Suppose we have this data frame:
df <- data.frame(
Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)
df
#> Country Region Language
#> 1 Iraq Middle East Arabic
#> 2 China Asia Mandarin
#> 3 Mexico North America Spanish
#> 4 Russia Eastern Europe Russian
#> 5 United Kingdom Western Europe English
What if we wanted to reorder the columns so that Region is first? We can do so using subsetting with the names (or number) of the columns:
df <- data.frame(
Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)
# reorder columns using names
names(df)
#> [1] "Country" "Region" "Language"
df1 <- df[, c("Region", "Country", "Language")]
df1
#> Region Country Language
#> 1 Middle East Iraq Arabic
#> 2 Asia China Mandarin
#> 3 North America Mexico Spanish
#> 4 Eastern Europe Russia Russian
#> 5 Western Europe United Kingdom English
# reorder columns using indices
names(df)
#> [1] "Country" "Region" "Language"
df1 <- df[, c(2,1,3)]
df1
#> Region Country Language
#> 1 Middle East Iraq Arabic
#> 2 Asia China Mandarin
#> 3 North America Mexico Spanish
#> 4 Eastern Europe Russia Russian
#> 5 Western Europe United Kingdom English
One helpful function is order(). It takes a vector as input and returns an integer vector of positions that, when used as an index, puts the input in sorted order:
x <- c("b", "c", "a")
order(x)
#> [1] 3 1 2
x[order(x)]
#> [1] "a" "b" "c"
Knowing this, we can use order() to reorder our columns by alphabetical order.
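One way this could look is sketched below, using a three-row version of the data frame; df_sorted is an illustrative name.

```r
df <- data.frame(
  Country = c("Iraq", "China", "Mexico"),
  Region = c("Middle East", "Asia", "North America"),
  Language = c("Arabic", "Mandarin", "Spanish")
)

# order() on the column names gives their positions in alphabetical order
order(names(df))
#> [1] 1 3 2

# Use those positions as the column index
df_sorted <- df[, order(names(df))]
names(df_sorted)
#> [1] "Country"  "Language" "Region"
```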
Removing (or Keeping) Columns
There are two ways to remove columns from a data frame. You can set individual columns to NULL:
df <- data.frame(
Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)
df$Language <- NULL
Or you can subset to return only the columns you want:
df <- data.frame(
Country = c("Iraq", "China", "Mexico", "Russia", "United Kingdom"),
Region = c("Middle East", "Asia", "North America", "Eastern Europe", "Western Europe"),
Language = c("Arabic", "Mandarin", "Spanish", "Russian", "English")
)
df1 <- df[, c("Country", "Region")]
df1
#> Country Region
#> 1 Iraq Middle East
#> 2 China Asia
#> 3 Mexico North America
#> 4 Russia Eastern Europe
#> 5 United Kingdom Western Europe
# using negative integers
df2 <- df[, -3]
df2
#> Country Region
#> 1 Iraq Middle East
#> 2 China Asia
#> 3 Mexico North America
#> 4 Russia Eastern Europe
#> 5 United Kingdom Western Europe
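If you know the columns you do not want by name rather than by position, one option is to compute the names to keep with setdiff() and subset with those. This is a sketch on a three-row version of the data frame; keep and df3 are illustrative names.

```r
df <- data.frame(
  Country = c("Iraq", "China", "Mexico"),
  Region = c("Middle East", "Asia", "North America"),
  Language = c("Arabic", "Mandarin", "Spanish")
)

# Column names to keep: everything except "Language"
keep <- setdiff(names(df), "Language")
keep
#> [1] "Country" "Region"

df3 <- df[, keep]
names(df3)
#> [1] "Country" "Region"
```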