Chapter 14 Data Classes and Structures

To make the best use of the R language, you will need a strong understanding of basic data structures and how to operate on them.

This matters because these are the objects you will manipulate on a day-to-day basis in R, and they are not always as easy to work with as they seem at the outset. Dealing with object types and conversions is one of the most common sources of frustration for beginners.

R’s base data structures can be organized by their dimensionality (1d, 2d, or nd) and whether they are homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis:

Dimensions   Homogeneous     Heterogeneous
1d           Atomic vector   List
2d           Matrix          Dataframe
nd           Array

Each data structure has its own specifications and behavior. In the rest of this chapter, we will cover the types of data objects that exist in R and how they work.

  1. Vectors
  2. Lists
  3. Matrices
  4. Dataframes
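
As a rough orientation, the sketch below builds one small object of each kind and checks its dimensions with dim(). The object names (v, l, m, df, a) and all values are made up for illustration.

```r
# One small object of each kind (all values are made up for illustration)
v  <- c(1.5, 2.5, 3.5)                       # 1d, homogeneous
l  <- list(1.5, "a", TRUE)                   # 1d, heterogeneous
m  <- matrix(1:6, nrow = 2)                  # 2d, homogeneous
df <- data.frame(n = 1:2, s = c("a", "b"))   # 2d, heterogeneous
a  <- array(1:24, dim = c(2, 3, 4))          # nd, homogeneous

# Atomic vectors and lists are 1d, so they have no dim attribute
dim(v)
#> NULL
dim(l)
#> NULL

# Matrices, dataframes, and arrays report their dimensions
dim(m)
#> [1] 2 3
dim(df)
#> [1] 2 2
dim(a)
#> [1] 2 3 4
```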

14.1 Vectors

Your garden variety R object is a vector. Vectors are 1-dimensional chains of values. We call each value an element of a vector.

14.1.1 Creating Vectors

A single piece of information that you regard as a scalar is just a vector of length 1. R will cheerfully let you add stuff to it with c(), which is short for ‘combine’:

x <- 3 * 4
x
#> [1] 12
is.vector(x)
#> [1] TRUE
length(x)
#> [1] 1

x <- c(1, 2, 3)
x
#> [1] 1 2 3
length(x)
#> [1] 3

# Other ways to make a vector
x <- 1:3

We can also add elements to the end of a vector by passing the original vector to c(), like so:

z <- c("Beyonce", "Kelly", "Michelle", "LeToya")
z <- c(z, "Farrah")
z
#> [1] "Beyonce"  "Kelly"    "Michelle" "LeToya"   "Farrah"

Notice that vectors are always flat, even if you nest c()’s:

# These are equivalent
c(1, c(2, c(3, 4)))
#> [1] 1 2 3 4
c(1, 2, 3, 4)
#> [1] 1 2 3 4

14.1.2 Vectors Are Everywhere

R is built to work with vectors. Many operations are vectorized, meaning they will perform calculations on each component by default. Novices often do not internalize or exploit this and they write lots of unnecessary for loops.

a <- c(1, -2, 3)
a^2
#> [1] 1 4 9
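
To see what vectorization saves you, here is a small comparison of a for loop and the vectorized expression. The names squares_loop and squares_vec are only illustrative.

```r
a <- c(1, -2, 3)

# Loop version: fill a pre-allocated result one element at a time
squares_loop <- numeric(length(a))
for (i in seq_along(a)) {
  squares_loop[i] <- a[i]^2
}

# Vectorized version: one expression, same result
squares_vec <- a^2

identical(squares_loop, squares_vec)
#> [1] TRUE
```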

We can also add two vectors. When you add two vectors in R, the result is the element-wise sum. For example, the following three statements are completely equivalent:

c(1, 2, 3) + c(4, 5, 6)
c(1 + 4, 2 + 5, 3 + 6)
c(5, 7, 9)

When reading function documentation, keep your eyes peeled for arguments that can be vectors. You will be surprised how common they are. For example, the mean of random normal variables can be provided as a vector.

set.seed(1999)
rnorm(5, mean = c(10, 100, 1000, 10000, 100000))
#> [1]     10.7    100.0   1001.2  10001.5 100000.1

This could be awesome in some settings, but dangerous in others, i.e., if you exploit this by mistake and get no warning. This is one of the reasons it is so important to keep close tabs on your R objects: Are they what you expect in terms of their flavor and length or dimensions? Check early and check often.
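
One possible way to "check early" is to assert what you expect before using an object. This is a minimal sketch using stopifnot() from base R; the names n and means are assumptions chosen for this example.

```r
n <- 5
means <- c(10, 100, 1000, 10000, 100000)

# Fail fast if `means` is not the type and length we expect;
# stopifnot() throws an error as soon as one condition is FALSE
stopifnot(
  is.numeric(means),
  length(means) == n
)

draws <- rnorm(n, mean = means)
length(draws)
#> [1] 5
```

If means had only two elements, the check would stop with an error instead of letting rnorm() quietly reuse those two values.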

14.1.3 Recycling

R recycles vectors if they are not the necessary length. You will get a warning if R suspects recycling is unintended, i.e., when one length is not an integer multiple of another, but recycling is silent if it seems like you know what you are doing. This can be a beautiful thing when you are doing it deliberately, but devastating when you are not.

(y <- 1:3)
#> [1] 1 2 3
(z <- 3:7)
#> [1] 3 4 5 6 7
y + z
#> Warning in y + z: longer object length is not a multiple of shorter object
#> length
#> [1] 4 6 8 7 9

(y <- 1:10)
#>  [1]  1  2  3  4  5  6  7  8  9 10
(z <- 3:7)
#> [1] 3 4 5 6 7
y + z
#>  [1]  4  6  8 10 12  9 11 13 15 17
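
Deliberate recycling is very common, often without being noticed. The short examples below (with made-up values) show a length-1 and a length-2 vector being recycled against a longer one.

```r
y <- 1:6

# A length-1 vector is recycled across every element
y * 10
#> [1] 10 20 30 40 50 60

# A length-2 vector is recycled three times: 0, 100, 0, 100, 0, 100
y + c(0, 100)
#> [1]   1 102   3 104   5 106

# Recycling also happens in many functions, for example paste0()
paste0("x", 1:3)
#> [1] "x1" "x2" "x3"
```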

14.1.4 Types of Vectors

There are four common types of vectors, depending on the type of their elements:

  1. integer
  2. numeric (same as double)
  3. character
  4. logical

Numeric Vectors

Numeric vectors contain numbers. They can be stored as integers (whole numbers) or doubles (numbers with decimal points). In practice, you rarely need to concern yourself with this difference, but just know that they are different but related things.

c(1, 2, 335)
#> [1]   1   2 335
c(4.2, 4, 6, 53.2)
#> [1]  4.2  4.0  6.0 53.2
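
If you ever do need to tell integers and doubles apart, typeof() shows how a value is stored. A brief sketch:

```r
typeof(1)      # a bare number is stored as a double
#> [1] "double"
typeof(1L)     # the L suffix asks for an integer
#> [1] "integer"
typeof(1:3)    # the colon operator produces integers
#> [1] "integer"

# class() reports "numeric" for doubles and "integer" for integers
class(1.5)
#> [1] "numeric"
class(1L)
#> [1] "integer"

# Both count as numeric for most purposes
is.numeric(1L)
#> [1] TRUE
```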

Character Vectors

Character vectors contain character (or ‘string’) values. Each value must be surrounded by quotation marks, with the separating comma placed outside the quotes.

c("Beyonce", "Kelly", "Michelle", "LeToya")
#> [1] "Beyonce"  "Kelly"    "Michelle" "LeToya"

Logical (Boolean) Vectors

Each element of a logical vector takes one of three possible values:

  1. TRUE
  2. FALSE
  3. NA (missing value)

They are often used in conjunction with Boolean expressions.

b1 <- c(TRUE, TRUE, FALSE, NA)
b1
#> [1]  TRUE  TRUE FALSE    NA

vec_1 <- c(1, 2, 3)
vec_2 <- c(1, 9, 3)
vec_1 == vec_2
#> [1]  TRUE FALSE  TRUE

b2 <- vec_1 == vec_2

14.1.5 Coercion

We can change or convert a vector’s type using the as.*() family of functions, such as as.character(), as.numeric(), and as.logical().

num_var <- c(1, 2.5, 4.5)
class(num_var)
#> [1] "numeric"
as.character(num_var)
#> [1] "1"   "2.5" "4.5"

Remember that all elements of a vector must be the same type. So when you attempt to combine different types, they will be coerced to the most “flexible” type.

For example, combining a character and a number yields a character:

c("a", 1)
#> [1] "a" "1"

Guess what the following do without running them first:

c(1.7, "a") 
c(TRUE, 2) 
c("a", TRUE) 

TRUE == 1 and FALSE == 0

Notice that when a logical vector is coerced to an integer or double, TRUE becomes 1 and FALSE becomes 0. This is very useful in conjunction with sum() and mean().

vec_1 <- c(1, 2, 3)
vec_2 <- c(1, 9, 3)
boo_1 <- vec_1 == vec_2

# Total number of TRUEs
sum(boo_1)
#> [1] 2

# Proportion that are TRUE
mean(boo_1)
#> [1] 0.667

Coercion often happens automatically.

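This is called implicit coercion. Most mathematical functions (+, log, abs, etc.) will coerce to a double or integer, and most logical operations (&, |, any, etc.) will coerce to a logical. You will usually get a warning message if the coercion might lose information. Comparison operators follow the same rule as c(): when one side is a character, the other side is converted to a character as well, so the two comparisons below are made on strings.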

1 < "2"
#> [1] TRUE
"1" > 2
#> [1] FALSE

Sometimes coercions, especially nonsensical ones, will not work.

x <- c("a", "b", "c")
as.numeric(x)
#> Warning: NAs introduced by coercion
#> [1] NA NA NA
as.logical(x)
#> [1] NA NA NA
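
When only some elements fail to convert, it helps to find out which ones. The sketch below is one way to do that; the vector raw is made up, and suppressWarnings() is used only to keep the output quiet.

```r
raw <- c("1", "2.5", "three", "4")

# Elements that cannot be read as numbers become NA
nums <- suppressWarnings(as.numeric(raw))
nums
#> [1] 1.0 2.5  NA 4.0

# is.na() marks the failed conversions, so we can look up the offending input
raw[is.na(nums)]
#> [1] "three"
```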

14.1.6 Naming a Vector

We can also attach names to our vector. This helps us understand what each element refers to.

You can give names to the elements of a vector with the names() function. Have a look at this example:

days_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
names(days_month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

days_month
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
#>  31  28  31  30  31  30  31  31  30  31  30  31

You can name a vector when you create it:

some_vector <- c(name = "Rochelle Terman", profession = "Professor Extraordinaire")
some_vector
#>                       name                 profession 
#>          "Rochelle Terman" "Professor Extraordinaire"

Notice that in the first case, we surrounded each name with quotation marks. But we do not have to do this when creating a named vector.

Names do not have to be unique, and not all values need to have a name associated with them. However, names are most useful for subsetting, described in the next section, and subsetting by name works best when the names are unique.
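
To illustrate why unique names matter, here is a small made-up vector v with a duplicated name and an unnamed element. Looking up a duplicated name returns only the first match.

```r
v <- c(a = 1, a = 2, 3)

# The unnamed element gets an empty string as its name
names(v)
#> [1] "a" "a" ""

# Looking up "a" finds the first match only; the second "a" is unreachable by name
v[["a"]]
#> [1] 1
```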

14.1.7 Challenges

Challenge 1: Create and examine your vector.

Create a character vector called fruit that contains 4 of your favorite fruits. Then evaluate its structure using the commands below:


# First create your fruit vector 
# YOUR CODE HERE


# Examine your vector
length(fruit)
class(fruit)
str(fruit)

Challenge 2: Coercion.


# 1. Create a vector of a sequence of numbers from 1 to 10.

# 2. Coerce that vector into a character vector.

# 3. Add the element "11" to the end of the vector.

# 4. Coerce it back to a numeric vector.

Challenge 3: Calculations on Vectors.

Create a vector of the numbers 11 to 20 and multiply it by the original vector from Challenge 2.

14.2 Subsetting Vectors

Sometimes we want to isolate elements of a vector for inspection, modification, etc. This is often called indexing or subsetting.

By the way, indexing begins at 1 in R, unlike many other languages that index from 0.

14.2.1 Subsetting Types

Let’s explore the different types of subsetting with a simple vector, x:

x <- c(2.1, 4.2, 3.3, 5.4)

Note that the number after the decimal point gives the original position in the vector.

There are four things you can use to subset a vector:

1. Positive integers return elements at the specified positions.

The simplest way to subset a vector is with a single integer:

x <- c(2.1, 4.2, 3.3, 5.4)
x[1]
#> [1] 2.1
x[3]
#> [1] 3.3

We can also index multiple values by passing a vector of integers:

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(3, 1)]
#> [1] 3.3 2.1

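Note that you have to wrap the positions in c() inside the [ for this to work! Writing x[3, 1] is an error, because R reads the comma as separating two dimensions.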

More examples:

# `order(x)` gives the index positions of smallest to largest values
(x <- c(2.1, 4.2, 3.3, 5.4))
#> [1] 2.1 4.2 3.3 5.4
order(x)
#> [1] 1 3 2 4

# Use this to order values
x[order(x)]
#> [1] 2.1 3.3 4.2 5.4
x[c(1, 3, 2, 4)]
#> [1] 2.1 3.3 4.2 5.4

2. Negative integers omit elements at the specified positions.

x <- c(2.1, 4.2, 3.3, 5.4)
x[-1]
#> [1] 4.2 3.3 5.4
x[-c(1, 3)]
#> [1] 4.2 5.4

You cannot mix positive and negative integers in a single subset:

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(-1, 2)]
#> Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts

3. Character vectors return elements with matching names. This only works if the vector is named.

x <- c(2.1, 4.2, 3.3, 5.4)

# Apply names
names(x) <- c("a", "b", "c", "d")

# Subset using names
x["d"]
#>   d 
#> 5.4
x[c("d", "c", "a")]
#>   d   c   a 
#> 5.4 3.3 2.1

4. Logical vectors select elements where the corresponding logical value is TRUE.

x <- c(2.1, 4.2, 3.3, 5.4)
x[c(TRUE, TRUE, FALSE, FALSE)]
#> [1] 2.1 4.2

Logical subsetting is the most useful type of subsetting, because you use it to subset based on comparative statements.

x <- c(2.1, 4.2, 3.3, 5.4)
x > 3
#> [1] FALSE  TRUE  TRUE  TRUE

This command tests if the condition stated by the comparison operator is TRUE or FALSE for every element of the vector, and it returns a logical vector!

We can now pass this statement between the square brackets that follow x to subset only those items that match TRUE:

x <- c(2.1, 4.2, 3.3, 5.4)
x[x > 3]
#> [1] 4.2 3.3 5.4

# With !
!x > 5 
#> [1]  TRUE  TRUE  TRUE FALSE
x[!x > 5]
#> [1] 2.1 4.2 3.3

# With %in% 
x %in% c(3.3, 4.2)
#> [1] FALSE  TRUE  TRUE FALSE
x[x %in% c(3.3, 4.2)]
#> [1] 4.2 3.3
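
Conditions can be combined with & (and) and | (or), and the same logical subsetting works on the left-hand side of an assignment to modify elements. A short sketch using the same x; the name capped is only illustrative.

```r
x <- c(2.1, 4.2, 3.3, 5.4)

# Both conditions must hold
x[x > 3 & x < 5]
#> [1] 4.2 3.3

# Either condition may hold
x[x < 3 | x > 5]
#> [1] 2.1 5.4

# which() converts a logical vector into the positions that are TRUE
which(x > 3)
#> [1] 2 3 4

# Subsetting plus assignment: replace every value above 5 with 5
capped <- x
capped[capped > 5] <- 5
capped
#> [1] 2.1 4.2 3.3 5.0
```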

Challenge.

Subset country_vector below to return every value EXCEPT "Canada" and "Brazil".

country_vector <- c("Afghanistan", "Canada", "Sierra Leone", "Denmark", "Japan", "Brazil")

# Do it using positive integers.

# Do it using negative integers.

# Do it using a logical vector.

# Do it using a conditional statement (and an implicit logical vector).

14.3 Factors

Factors are special vectors that represent categorical data: Variables that have a fixed and known set of possible values. Think: Democrat, Republican, Independent; Male, Female, Other; etc.

It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently.

Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors (before R 4.0.0, data.frame() and read.csv() did so by default). This means that factors often pop up in places where they are not actually helpful.

14.3.1 Creating Factors

To create factors in R, you use the function factor(). The first thing that you have to do is create a vector that contains all the observations that belong to a limited number of categories. For example, party_vector contains the partyID of 5 different individuals:

party_vector <- c("Rep", "Rep", "Dem", "Rep", "Dem")

It is clear that there are two categories – or, in R-terms, factor levels – at work here: Dem and Rep.

The function factor() will encode the vector as a factor:

party_factor <- factor(party_vector)
party_vector
#> [1] "Rep" "Rep" "Dem" "Rep" "Dem"
party_factor
#> [1] Rep Rep Dem Rep Dem
#> Levels: Dem Rep

14.3.2 Summarizing a Factor

One of your favorite functions in R will be summary(). This will give you a quick overview of the contents of a variable. Let’s compare using summary() on both the character vector and the factor:

summary(party_vector)
#>    Length     Class      Mode 
#>         5 character character
summary(party_factor)
#> Dem Rep 
#>   2   3

14.3.3 Changing Factor Levels

When you create the factor, the factor levels are set to specific values. We can access those values with the levels() function:

levels(party_factor)
#> [1] "Dem" "Rep"

Any value not in the set of levels will be converted to NA. When you assign such a value to an existing factor, R also issues a warning. Let’s say we want to add an Independent to our sample:

party_factor[5] <- "Ind"
#> Warning in `[<-.factor`(`*tmp*`, 5, value = "Ind"): invalid factor level, NA
#> generated
party_factor
#> [1] Rep  Rep  Dem  Rep  <NA>
#> Levels: Dem Rep

We first need to add “Ind” to our factor levels. This will allow us to add Independents to our sample:

levels(party_factor)
#> [1] "Dem" "Rep"
levels(party_factor) <- c("Dem", "Rep", "Ind")

party_factor[5] <- "Ind"
party_factor
#> [1] Rep Rep Dem Rep Ind
#> Levels: Dem Rep Ind
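
An alternative, sketched below, is to declare all possible levels up front with the levels argument of factor(). Note that when factor() itself meets a value outside the declared levels, it produces NA without any warning. The value "Grn" and the name quiet_na are made up for this example.

```r
party_vector <- c("Rep", "Rep", "Dem", "Rep", "Dem")

# Declare "Ind" as a level even though nobody has that value yet
party_factor <- factor(party_vector, levels = c("Dem", "Rep", "Ind"))
party_factor[5] <- "Ind"
party_factor
#> [1] Rep Rep Dem Rep Ind
#> Levels: Dem Rep Ind

# Counts per level
table(party_factor)
#> party_factor
#> Dem Rep Ind 
#>   1   3   1

# A value outside the declared levels becomes NA, with no warning
quiet_na <- factor(c("Rep", "Grn"), levels = c("Dem", "Rep"))
is.na(quiet_na)
#> [1] FALSE  TRUE
```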

14.3.4 Factors Are Integers

Factors are pretty much integers that have labels on them. Underneath, they are really numbers (1, 2, 3…).

str(party_factor)
#>  Factor w/ 3 levels "Dem","Rep","Ind": 2 2 1 2 3

They are better than using simple integer labels, because factors are self-describing. For example, democrat and republican are more descriptive than 1s and 2s.

However, factors are NOT characters!!

While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings.

x <- c("a", "b", "b", "a")
x <- as.factor(x)
c(x, "c")
#> [1] "1" "2" "2" "1" "c"

For this reason, it is usually best to explicitly convert factors to character vectors if you need string-like behavior.

x <- c("a", "b", "b", "a")
x <- as.factor(x)
x <- as.character(x)
c(x, "c")
#> [1] "a" "b" "b" "a" "c"
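
The same trap appears when a factor holds values that look like numbers. In this small sketch (the factor f is made up), converting directly to a number returns the underlying integer codes, not the labels.

```r
f <- factor(c("10", "20", "10"))

# Direct conversion returns the integer codes behind the labels
as.integer(f)
#> [1] 1 2 1

# Going through character first recovers the values the labels spell out
as.numeric(as.character(f))
#> [1] 10 20 10
```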

14.3.5 Challenges

Challenge 1.

What happens to a factor when you modify its levels?

f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
f1
#>  [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
#> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

Challenge 2.

What does this code do? How do f2 and f3 differ from f1?

f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))

14.4 Lists

Lists are different from atomic vectors because their elements can be of any type, including other lists. For that reason, lists are sometimes called recursive vectors.

In data analysis, you will not make lists very often, at least not consciously, but you should still know about them. Why?

  1. Dataframes are lists! They are a special case where each element is an atomic vector, all having the same length.
  2. Many functions will return lists to you, and you will want to extract goodies from them, such as the p-value for a hypothesis test or the estimated error variance in a regression model.

14.4.1 Creating Lists

You construct lists by using list() instead of c():

x <- list(1, "a", TRUE, c(4, 5, 6))
x
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] "a"
#> 
#> [[3]]
#> [1] TRUE
#> 
#> [[4]]
#> [1] 4 5 6

14.4.2 Naming Lists

As with vectors, we can attach names to each element on our list:

my_list <- list(name1 = elem1, 
                name2 = elem2)

This creates a list with components that are named name1, name2, and so on. If you want to name your lists after you have created them, you can use the names() function as you did with vectors. The following commands are fully equivalent to the assignment above:

my_list <- list(elem1, elem2)
names(my_list) <- c("name1", "name2")
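
Here is a concrete version of the two approaches, with elem1 and elem2 given made-up values so the code can run. The names list_a and list_b are only for this example.

```r
# Hypothetical elements to put in the list
elem1 <- 1:3
elem2 <- "a string"

# Name the elements while creating the list
list_a <- list(name1 = elem1, name2 = elem2)

# Or create the list first and name it afterwards
list_b <- list(elem1, elem2)
names(list_b) <- c("name1", "name2")

identical(list_a, list_b)
#> [1] TRUE
```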

14.4.3 List Structure

A very useful tool for working with lists is str(), because it focuses on reviewing the structure of a list, not its contents.

x <- list(a = c(1, 2, 3),
          b = c("Hello", "there"),
          c = 1:10)
str(x)
#> List of 3
#>  $ a: num [1:3] 1 2 3
#>  $ b: chr [1:2] "Hello" "there"
#>  $ c: int [1:10] 1 2 3 4 5 6 7 8 9 10

A list does not print to the console like a vector. Instead, each element of the list starts on a new line.

x_vec <- c(1, 2, 3)
x_list <- list(1, 2, 3)
x_vec
#> [1] 1 2 3
x_list
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2
#> 
#> [[3]]
#> [1] 3

Lists are used to build up many of the more complicated data structures in R. For example, both dataframes and linear model objects (as produced by lm()) are lists:

head(mtcars)
#>                    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
is.list(mtcars)
#> [1] TRUE
mod <- lm(mpg ~ wt, data = mtcars)
is.list(mod)
#> [1] TRUE

You could say that a list is some kind of super data type: You can store practically any piece of information in it!

For this reason, lists are extremely useful inside functions. You can “staple” together lots of different kinds of results into a single object that a function can return.

mod <- lm(mpg ~ wt, data = mtcars)
str(mod)
#> List of 12
#>  $ coefficients : Named num [1:2] 37.29 -5.34
#>   ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
#>  $ residuals    : Named num [1:32] -2.28 -0.92 -2.09 1.3 -0.2 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ effects      : Named num [1:32] -113.65 -29.116 -1.661 1.631 0.111 ...
#>   ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "" "" ...
#>  $ rank         : int 2
#>  $ fitted.values: Named num [1:32] 23.3 21.9 24.9 20.1 18.9 ...
#>   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>  $ assign       : int [1:2] 0 1
#>  $ qr           :List of 5
#>   ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#>   .. .. ..$ : chr [1:2] "(Intercept)" "wt"
#>   .. ..- attr(*, "assign")= int [1:2] 0 1
#>   ..$ qraux: num [1:2] 1.18 1.05
#>   ..$ pivot: int [1:2] 1 2
#>   ..$ tol  : num 1e-07
#>   ..$ rank : int 2
#>   ..- attr(*, "class")= chr "qr"
#>  $ df.residual  : int 30
#>  $ xlevels      : Named list()
#>  $ call         : language lm(formula = mpg ~ wt, data = mtcars)
#>  $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
#>   .. ..- attr(*, "variables")= language list(mpg, wt)
#>   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. ..$ : chr [1:2] "mpg" "wt"
#>   .. .. .. ..$ : chr "wt"
#>   .. ..- attr(*, "term.labels")= chr "wt"
#>   .. ..- attr(*, "order")= int 1
#>   .. ..- attr(*, "intercept")= int 1
#>   .. ..- attr(*, "response")= int 1
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. ..- attr(*, "predvars")= language list(mpg, wt)
#>   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#>   .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
#>  $ model        :'data.frame':   32 obs. of  2 variables:
#>   ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>   ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
#>   ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
#>   .. .. ..- attr(*, "variables")= language list(mpg, wt)
#>   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
#>   .. .. .. .. ..$ : chr "wt"
#>   .. .. ..- attr(*, "term.labels")= chr "wt"
#>   .. .. ..- attr(*, "order")= int 1
#>   .. .. ..- attr(*, "intercept")= int 1
#>   .. .. ..- attr(*, "response")= int 1
#>   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. .. ..- attr(*, "predvars")= language list(mpg, wt)
#>   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#>   .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
#>  - attr(*, "class")= chr "lm"
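
You can use the same pattern in your own functions. The sketch below defines a hypothetical helper, describe_vec(), that staples three different results into one list.

```r
# Hypothetical helper: return several summaries of a numeric vector at once
describe_vec <- function(x) {
  list(
    n     = length(x),   # an integer
    mean  = mean(x),     # a double
    range = range(x)     # a vector of length 2
  )
}

res <- describe_vec(c(2, 4, 9))
names(res)
#> [1] "n"     "mean"  "range"
res$mean
#> [1] 5
res$range
#> [1] 2 9
```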

14.5 Subsetting Lists

Subsetting a list works in the same way as subsetting an atomic vector. However, there is one important difference: [ will always return a list. [[ and $, as described below, let you pull out the components of the list.

The “pepper shaker photos” in R for Data Science are a splendid visual explanation of the different ways to get stuff out of a list. Highly recommended.

Let’s illustrate with the following list my_list:

my_list <- list(a = 1:3, 
                b = "a string", 
                c = pi, 
                d = list(-1, -5))

14.5.1 With [

[ extracts a sub-list where the result will always be a list. Like with vectors, you can subset with a logical, integer, or character vector.

my_list[1]
#> $a
#> [1] 1 2 3
str(my_list[1])
#> List of 1
#>  $ a: int [1:3] 1 2 3

my_list[1:2]
#> $a
#> [1] 1 2 3
#> 
#> $b
#> [1] "a string"
str(my_list[1:2])
#> List of 2
#>  $ a: int [1:3] 1 2 3
#>  $ b: chr "a string"

14.5.2 With [[

[[ extracts a single component from a list. In other words, it removes that hierarchy and returns whatever object is stored inside.

my_list[[1]]
#> [1] 1 2 3
str(my_list[[1]])
#>  int [1:3] 1 2 3

# Compare to
my_list[1]
#> $a
#> [1] 1 2 3
str(my_list[1])
#> List of 1
#>  $ a: int [1:3] 1 2 3

The distinction between [ and [[ is really important for lists, because [[ drills down into the list while [ returns a new, smaller list.

“If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6.”

14.5.3 With $

$ is a shorthand for extracting a single named element of a list. It works especially well when coupled with tab completion.

my_list$a
#> [1] 1 2 3
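
$ and [[ with a quoted name usually give the same result, but only [[ can look up a name stored in a variable. A short sketch using the same my_list; the variable key is an assumption for this example.

```r
my_list <- list(a = 1:3,
                b = "a string",
                c = pi,
                d = list(-1, -5))

# These two are equivalent
identical(my_list$a, my_list[["a"]])
#> [1] TRUE

# [[ evaluates its argument, so a name held in a variable works
key <- "b"
my_list[[key]]
#> [1] "a string"

# $ takes the name literally: it looks for an element called "key"
my_list$key
#> NULL

# [[ can be chained to drill into a nested list
my_list[["d"]][[2]]
#> [1] -5
```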

14.5.4 Challenges

Challenge 1.

What are the four basic types of atomic vectors? How does a list differ from an atomic vector?

Challenge 2.

Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false?

Challenge 3.

Create three vectors and combine them into a list. Assign them names.

Challenge 4.

If x is a list, what is the class of x[1]? How about x[[1]]?

Challenge 5.

Take a look at the linear model below:

mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -4.543 -2.365 -0.125  1.410  6.873 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   37.285      1.878   19.86  < 2e-16 ***
#> wt            -5.344      0.559   -9.56  1.3e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.05 on 30 degrees of freedom
#> Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
#> F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

Extract the R squared from the model summary.

14.6 Matrices

Matrices are like 2-d vectors, that is, they are a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

m <- matrix(1:6, nrow = 2, ncol = 3)
m
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
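
Notice that matrix() fills column by column. As a brief sketch, the byrow argument fills row by row instead, and a matrix is really just a vector with a dim attribute. The names m_col, m_row, and v are only illustrative.

```r
# Default: values fill down the columns
m_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: values fill across the rows
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_row
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6

# Giving a plain vector a dim attribute turns it into a matrix
v <- 1:6
dim(v) <- c(2, 3)
identical(v, m_col)
#> [1] TRUE
```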

General arrays are available in R, where a matrix is an important special case having dimension 2.

Practically speaking, matrices are good for large tables of numbers. However, as social scientists, we rarely work with purely numerical data.

By definition, if you want to combine different types of data (one column numbers, another column characters…), you want a dataframe, not a matrix.

Let’s make a simple matrix and give it decent row and column names. You will see familiar or self-explanatory functions below for getting to know a matrix.

## Do not worry if the construction of this matrix confuses you;
## just focus on the product
m <- outer(as.character(1:4), as.character(1:4),
           function(x, y) {
             paste0("x", x, "-", y)
           })
m
#>      [,1]   [,2]   [,3]   [,4]  
#> [1,] "x1-1" "x1-2" "x1-3" "x1-4"
#> [2,] "x2-1" "x2-2" "x2-3" "x2-4"
#> [3,] "x3-1" "x3-2" "x3-3" "x3-4"
#> [4,] "x4-1" "x4-2" "x4-3" "x4-4"
str(m)
#>  chr [1:4, 1:4] "x1-1" "x2-1" "x3-1" "x4-1" "x1-2" "x2-2" "x3-2" "x4-2" ...
class(m)
#> [1] "matrix" "array"
dim(m)
#> [1] 4 4
nrow(m)
#> [1] 4
ncol(m)
#> [1] 4
rownames(m)
#> NULL

rownames(m) <- c("row1", "row2", "row3", "row4")
colnames(m) <- c("col1", "col2", "col3", "col4")

m
#>      col1   col2   col3   col4  
#> row1 "x1-1" "x1-2" "x1-3" "x1-4"
#> row2 "x2-1" "x2-2" "x2-3" "x2-4"
#> row3 "x3-1" "x3-2" "x3-3" "x3-4"
#> row4 "x4-1" "x4-2" "x4-3" "x4-4"

14.7 Indexing a Matrix

Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. But whereas vectors have one dimension, matrices have two. We therefore use two subsetting vectors, one for rows and one for columns, separated by a comma. Leaving either position blank is also useful, because it keeps all rows or all columns.

m[2, 3] # Selects the value at the second row and third column
#> [1] "x2-3"

m[2, ] # We get row 2
#>   col1   col2   col3   col4 
#> "x2-1" "x2-2" "x2-3" "x2-4"

m[ , 3, drop = FALSE] # We get column 3
#>      col3  
#> row1 "x1-3"
#> row2 "x2-3"
#> row3 "x3-3"
#> row4 "x4-3"

dim(m[ , 3, drop = FALSE]) # We get column 3 as a 4 x 1 matrix
#> [1] 4 1

m[c("row1", "row4"), c("col2", "col3")] # We get rows 1, 4 and columns 2, 3
#>      col2   col3  
#> row1 "x1-2" "x1-3"
#> row4 "x4-2" "x4-3"

m[-c(2, 3), c(TRUE, TRUE, FALSE, FALSE)] # Wacky but possible
#>      col1   col2  
#> row1 "x1-1" "x1-2"
#> row4 "x4-1" "x4-2"
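
The drop = FALSE argument used above deserves a closer look. By default, [ simplifies a single row or column to a plain vector; drop = FALSE keeps the result as a matrix. A small sketch with a made-up numeric matrix:

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Default: a single column is simplified to a named vector
col3 <- m[, 3]
col3
#> r1 r2 
#>  5  6
dim(col3)
#> NULL

# drop = FALSE keeps the 2 x 1 matrix shape
dim(m[, 3, drop = FALSE])
#> [1] 2 1
```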

14.8 Dataframes

Dataframes are a very important data type in R. They are pretty much the de facto data structure for most tabular data, and they are also what we use for statistics.

Hopefully the slog through vectors, matrices, and lists will be redeemed by greater prowess at manipulating dataframes. Why should this be true?

  1. A dataframe is a list.
  2. The list elements are the variables, and they are atomic vectors.
  3. Dataframes are rectangular, like their matrix friends, so your intuition – and even some syntax – can be borrowed from the matrix world.

NB: You might have heard of “tibbles,” used in the tidyverse suite of packages. Tibbles are like dataframes 2.0, tweaking some of the behavior of dataframes to make life easier for data analysis. For now, just think of tibbles and dataframes as the same thing and do not worry about the difference.

14.8.1 Creating Dataframes

We have already worked extensively with dataframes that we have imported through a package or read.csv.

library(gapminder)
gap <- gapminder

We can create a dataframe from scratch using data.frame(). This function takes vectors as input:

vec_1 <- 1:3
vec_2 <- c("a", "b", "c")
df <- data.frame(vec_1, vec_2)
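
You can also name the columns directly inside data.frame(). The sketch below uses made-up column names (id, letter) and also shows what happens when the vectors have lengths that cannot be matched up; tryCatch() is used only to capture the error message instead of stopping.

```r
# Name the columns while creating the dataframe
df <- data.frame(id = 1:3, letter = c("a", "b", "c"))
names(df)
#> [1] "id"     "letter"

# Columns of lengths 3 and 2 cannot form a rectangle, so this fails
bad <- tryCatch(
  data.frame(id = 1:3, letter = c("a", "b")),
  error = function(e) conditionMessage(e)
)
bad
#> [1] "arguments imply differing number of rows: 3, 2"
```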

14.8.2 The Structure of Dataframes

Under the hood, a dataframe is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list.

vec_1 <- 1:3
vec_2 <- c("a", "b", "c")
df <- data.frame(vec_1, vec_2)

str(df)
#> 'data.frame':    3 obs. of  2 variables:
#>  $ vec_1: int  1 2 3
#>  $ vec_2: chr  "a" "b" "c"

The length() of a dataframe is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.

vec_1 <- 1:3
vec_2 <- c("a", "b", "c")
df <- data.frame(vec_1, vec_2)

# These two are equivalent - number of columns
length(df)
#> [1] 2
ncol(df)
#> [1] 2

# Get number of rows
nrow(df)
#> [1] 3

# Get number of both columns and rows
dim(df)
#> [1] 3 2

14.8.3 Naming Dataframes

Dataframes have colnames() and rownames(). However, since dataframes are really lists (of vectors) under the hood, names() and colnames() are the same thing.

vec_1 <- 1:3
vec_2 <- c("a", "b", "c")
df <- data.frame(vec_1, vec_2)

# These two are equivalent
names(df)
#> [1] "vec_1" "vec_2"
colnames(df)
#> [1] "vec_1" "vec_2"

# Change the colnames
colnames(df) <- c("Number", "Character")

# Change the rownames
rownames(df) 
#> [1] "1" "2" "3"
rownames(df) <- c("donut", "pickle", "pretzel")
df
#>         Number Character
#> donut        1         a
#> pickle       2         b
#> pretzel      3         c

14.9 Indexing Dataframes

A dataframe is a list that quacks like a matrix.

Remember that dataframes are really lists of vectors (one vector per column). That means that dataframes have both list- and matrix-like behavior.

For example, just as list$name selects the name element from the list, df$name selects the name column (vector) from the dataframe:

library(gapminder)
gap <- gapminder

head(gap$country)
#> [1] Afghanistan Afghanistan Afghanistan Afghanistan Afghanistan Afghanistan
#> 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe

Likewise, we can use square brackets to subset rows and columns:

# Row 1, column 3
gap[1, 3]
#> # A tibble: 1 x 1
#>    year
#>   <int>
#> 1  1952

# Fourth row
gap[4, ]
#> # A tibble: 1 x 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1967    34.0 11537966      836.

# First two rows of the columns 1 and 5
gap[c(1, 2), c(1, 5)]
#> # A tibble: 2 x 2
#>   country         pop
#>   <fct>         <int>
#> 1 Afghanistan 8425333
#> 2 Afghanistan 9240934

We can also use subsetting in conjunction with assignment to quickly add a column:

names(gap)
#> [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
gap$new_col <- NA
head(gap)
#> # A tibble: 6 x 7
#>   country     continent  year lifeExp      pop gdpPercap new_col
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <lgl>  
#> 1 Afghanistan Asia       1952    28.8  8425333      779. NA     
#> 2 Afghanistan Asia       1957    30.3  9240934      821. NA     
#> 3 Afghanistan Asia       1962    32.0 10267083      853. NA     
#> 4 Afghanistan Asia       1967    34.0 11537966      836. NA     
#> 5 Afghanistan Asia       1972    36.1 13079460      740. NA     
#> 6 Afghanistan Asia       1977    38.4 14880372      786. NA
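
Because gap is a tibble, gap[1, 3] above prints as a 1 x 1 tibble. A base data.frame behaves slightly differently: a single cell or column is simplified to a plain vector unless you ask otherwise. The sketch below uses a small made-up data.frame (columns id and score are assumptions) to show the list-like and matrix-like forms side by side.

```r
df <- data.frame(id = c("a", "b", "c"), score = c(10, 25, 40))

# List-like: $ and [[ return the column as a vector
df$score
#> [1] 10 25 40
identical(df$score, df[["score"]])
#> [1] TRUE

# List-like: [ with one index returns a one-column dataframe
class(df["score"])
#> [1] "data.frame"

# Matrix-like: rows chosen by a logical condition, all columns kept
df[df$score > 20, ]
#>   id score
#> 2  b    25
#> 3  c    40

# A single cell is simplified to a vector unless drop = FALSE
df[1, 2]
#> [1] 10
class(df[1, 2, drop = FALSE])
#> [1] "data.frame"
```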

14.9.1 Challenges

Challenge 1.

Create a 3x2 dataframe called basket. The first column should contain the names of 3 fruits. The second column should contain the price of those fruits. Now give your dataframe appropriate column and row names.

Challenge 2.

Add a third column called color that tells what color each fruit is.