# Chapter 13 Data Classes and Structures

To make the best use of the R language, you will need a strong understanding of basic data structures and how to operate on them.

This is **critical** to understand because these are the objects you will manipulate on a day-to-day basis in R. But they are not always as easy to work with as they seem at the outset. Dealing with object types and conversions is one of the most common sources of frustration for beginners.

R's base data structures can be organized by their dimensionality (1d, 2d, or nd) and whether they are homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data structures most often used in data analysis:

| | Homogeneous | Heterogeneous |
|---|---|---|
| 1d | Atomic vector | List |
| 2d | Matrix | Dataframe |
| nd | Array | |

Each data structure has its own specifications and behavior. In the rest of this chapter, we will cover the types of data objects that exist in R and how they work.
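As a quick orientation, the sketch below builds one small object for each cell of the table and inspects it. The object names and values are made up for illustration.

```r
# One illustrative object per cell of the table (names and values are hypothetical)
v   <- c(1.5, 2.5, 3.5)                           # 1d, homogeneous: atomic vector
l   <- list(1.5, "a", TRUE)                       # 1d, heterogeneous: list
m   <- matrix(1:6, nrow = 2)                      # 2d, homogeneous: matrix
df  <- data.frame(id = 1:2, name = c("a", "b"))   # 2d, heterogeneous: dataframe
arr <- array(1:24, dim = c(2, 3, 4))              # nd, homogeneous: array

class(v)
#> [1] "numeric"
class(l)
#> [1] "list"
dim(m)
#> [1] 2 3
dim(arr)
#> [1] 2 3 4
# Atomic vectors and lists have no dim attribute
dim(v)
#> NULL
```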

## 13.1 Vectors

Your garden variety R object is a vector. Vectors are 1-dimensional chains of values. We call each value an *element* of a vector.

### 13.1.1 Creating Vectors

A single piece of information that you regard as a scalar is just a vector of length 1. R will cheerfully let you add stuff to it with `c()`, which is short for 'combine':

```
x <- 3 * 4
x
#> [1] 12
is.vector(x)
#> [1] TRUE
length(x)
#> [1] 1
x <- c(1, 2, 3)
x
#> [1] 1 2 3
length(x)
#> [1] 3
# Other ways to make a vector
x <- 1:3
```

We can also add elements to the end of a vector by passing the original vector into the `c()` function, like so:

```
z <- c("Beyonce", "Kelly", "Michelle", "LeToya")
z <- c(z, "Farrah")
z
#> [1] "Beyonce" "Kelly" "Michelle" "LeToya" "Farrah"
```

Notice that vectors are always flat, even if you nest calls to `c()`:

```
# These are equivalent
c(1, c(2, c(3, 4)))
#> [1] 1 2 3 4
c(1, 2, 3, 4)
#> [1] 1 2 3 4
```

### 13.1.2 Vectors Are Everywhere

R is built to work with vectors. Many operations are vectorized, meaning they operate on every element of a vector by default. Novices often do not internalize or exploit this, and they write lots of unnecessary `for` loops.

```
a <- c(1, -2, 3)
a^2
#> [1] 1 4 9
```
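To make the contrast with loops concrete, here is a minimal sketch that computes the same squares both ways. The variable names `squares_loop` and `squares_vec` are illustrative only.

```r
a <- c(1, -2, 3)

# Loop version: square each element one at a time
squares_loop <- numeric(length(a))
for (i in seq_along(a)) {
  squares_loop[i] <- a[i]^2
}

# Vectorized version: one expression, same result
squares_vec <- a^2

identical(squares_loop, squares_vec)
#> [1] TRUE
```

The vectorized form is shorter and leaves less room for indexing mistakes.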

We can also add two vectors. It is important to know that if you add two vectors with `+` in R, it takes the element-wise sum. For example, the following three statements are completely equivalent:

```
c(1, 2, 3) + c(4, 5, 6)
c(1 + 4, 2 + 5, 3 + 6)
c(5, 7, 9)
```

When reading function documentation, keep your eyes peeled for arguments that can be vectors. You will be surprised how common they are. For example, the mean of random normal variables can be provided as a vector.

```
set.seed(1999)
rnorm(5, mean = c(10, 100, 1000, 10000, 100000))
#> [1] 10.7 100.0 1001.2 10001.5 100000.1
```

This could be awesome in some settings, but dangerous in others, i.e., if you exploit this by mistake and get no warning. This is one of the reasons it is so important to keep close tabs on your R objects: Are they what you expect in terms of their flavor and length or dimensions? Check early and check often.
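A small sketch of what "check early" can look like in practice. It assumes a hypothetical situation where one mean per draw was intended but only three were supplied; `n`, `means`, and `draws` are made-up names.

```r
n <- 5
means <- c(10, 100, 1000)   # hypothetical: only 3 means for 5 draws

# rnorm() recycles `means` without any warning, so the call still returns n values
draws <- rnorm(n, mean = means)
length(draws)
#> [1] 5

# A quick length check before the call reveals the mismatch
length(means) == n
#> [1] FALSE
```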

### 13.1.3 Recycling

R recycles vectors if they are not the necessary length. You will get a warning if R suspects recycling is unintended, i.e., when one length is not an integer multiple of another, but recycling is silent if it seems like you know what you are doing. This can be a beautiful thing when you are doing it deliberately, but devastating when you are not.

```
(y <- 1:3)
#> [1] 1 2 3
(z <- 3:7)
#> [1] 3 4 5 6 7
y + z
#> Warning in y + z: longer object length is not a multiple of shorter object
#> length
#> [1] 4 6 8 7 9
(y <- 1:10)
#> [1] 1 2 3 4 5 6 7 8 9 10
(z <- 3:7)
#> [1] 3 4 5 6 7
y + z
#> [1] 4 6 8 10 12 9 11 13 15 17
```
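Deliberate recycling is most common with a length-1 vector. The sketch below also shows that recycling a shorter vector gives the same result as repeating it explicitly with `rep()`; the values are arbitrary.

```r
y <- 1:6

# A length-1 vector is recycled to match the longer one
y * 2
#> [1]  2  4  6  8 10 12

# Recycling a length-2 vector is the same as repeating it explicitly
y + c(10, 20)
#> [1] 11 22 13 24 15 26
y + rep(c(10, 20), times = 3)
#> [1] 11 22 13 24 15 26
```

Writing the `rep()` call out is one way to make the intent visible to a reader.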

### 13.1.4 Types of Vectors

There are four common types of vectors, depending on the class:

- `integer`
- `numeric` (same as `double`)
- `character`
- `logical`

#### Numeric Vectors

Numeric vectors contain numbers. They can be stored as *integers* (whole numbers) or *doubles* (numbers with decimal points). In practice, you rarely need to concern yourself with this difference, but just know that they are different but related things.

```
c(1, 2, 335)
#> [1] 1 2 335
c(4.2, 4, 6, 53.2)
#> [1] 4.2 4.0 6.0 53.2
```
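For the curious, the difference between integers and doubles can be inspected with `typeof()`. This is a minimal sketch; the names `x_dbl` and `x_int` are illustrative.

```r
x_dbl <- c(1, 2, 335)      # plain numbers are stored as doubles
x_int <- c(1L, 2L, 335L)   # the L suffix asks for integers

typeof(x_dbl)
#> [1] "double"
typeof(x_int)
#> [1] "integer"

# Both count as numeric, and they compare equal element-wise
is.numeric(x_dbl)
#> [1] TRUE
is.numeric(x_int)
#> [1] TRUE
x_dbl == x_int
#> [1] TRUE TRUE TRUE

# The colon operator creates an integer vector
typeof(1:3)
#> [1] "integer"
```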

#### Character Vectors

Character vectors contain character (or 'string') values. Note that each value has to be surrounded by quotation marks *before* the comma.

```
c("Beyonce", "Kelly", "Michelle", "LeToya")
#> [1] "Beyonce" "Kelly" "Michelle" "LeToya"
```

#### Logical (Boolean) Vectors

Logical vectors take on one of three possible values:

- `TRUE`
- `FALSE`
- `NA` (missing value)

They are often used in conjunction with Boolean expressions.

```
b1 <- c(TRUE, TRUE, FALSE, NA)
b1
#> [1] TRUE TRUE FALSE NA
vec_1 <- c(1, 2, 3)
vec_2 <- c(1, 9, 3)
vec_1 == vec_2
#> [1] TRUE FALSE TRUE
b2 <- vec_1 == vec_2
```

### 13.1.5 Coercion

We can change or convert a vector's type using the `as.*()` family of functions, such as `as.character()` and `as.numeric()`.

```
num_var <- c(1, 2.5, 4.5)
class(num_var)
#> [1] "numeric"
as.character(num_var)
#> [1] "1" "2.5" "4.5"
```

Remember that all elements of a vector must be the same type. So when you attempt to combine different types, they will be **coerced** to the most "flexible" type.

For example, combining a character and a number yields a character:

```
c("a", 1)
#> [1] "a" "1"
```

Guess what the following do without running them first:

```
c(1.7, "a")
c(TRUE, 2)
c("a", TRUE)
```

#### TRUE == 1 and FALSE == 0

Notice that when a logical vector is coerced to an integer or double, `TRUE` becomes 1 and `FALSE` becomes 0. This is very useful in conjunction with `sum()` and `mean()`.

```
vec_1 <- c(1, 2, 3)
vec_2 <- c(1, 9, 3)
boo_1 <- vec_1 == vec_2
# Total number of TRUEs
sum(boo_1)
#> [1] 2
# Proportion that are TRUE
mean(boo_1)
#> [1] 0.667
```

#### Coercion often happens automatically.

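This is called *implicit coercion*. Most mathematical functions (`+`, `log`, `abs`, etc.) will coerce to a double or integer, and most logical operations (`&`, `|`, `any`, etc.) will coerce to a logical. You will usually get a warning message if the coercion might lose information. Comparison operators follow the same rule as `c()`: when one side is a character, the other side is converted to a character before the two are compared as strings.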

```
1 < "2"
#> [1] TRUE
"1" > 2
#> [1] FALSE
```

Sometimes coercions, especially nonsensical ones, will not work.

```
x <- c("a", "b", "c")
as.numeric(x)
#> Warning: NAs introduced by coercion
#> [1] NA NA NA
as.logical(x)
#> [1] NA NA NA
```
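When only some values fail to convert, it helps to find out which ones. The following is a sketch under made-up input; `raw` and `converted` are hypothetical names.

```r
raw <- c("1", "2.5", "three", "4")

# suppressWarnings() silences the "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(raw))
converted
#> [1] 1.0 2.5  NA 4.0

# Which inputs could not be converted?
raw[is.na(converted)]
#> [1] "three"
```

Silencing the warning is only reasonable when you check for the resulting `NA` values yourself, as done here.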

### 13.1.6 Naming a Vector

We can also attach names to a vector, which helps us understand what each element refers to.

You can give names to the elements of a vector with the `names()` function. Have a look at this example:

```
days_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
names(days_month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
days_month
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 31 28 31 30 31 30 31 31 30 31 30 31
```

You can name a vector when you create it:

```
some_vector <- c(name = "Rochelle Terman", profession = "Professor Extraordinaire")
some_vector
#> name profession
#> "Rochelle Terman" "Professor Extraordinaire"
```

Notice that in the first case, we surrounded each name with quotation marks. But we do not have to do this when creating a named vector.

Names do not have to be unique, and not every element needs to have a name. Names are most useful for subsetting, described in the next section, and there they work best when they are unique.
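A short sketch of these corner cases, using made-up vectors `scores` and `dup`:

```r
# Only some elements named; unnamed ones get an empty string as their name
scores <- c(math = 90, 85, art = 70)
names(scores)
#> [1] "math" ""     "art"

# Duplicate names are allowed
dup <- c(a = 1, a = 2, b = 3)
anyDuplicated(names(dup)) > 0
#> [1] TRUE

# unname() strips the names and leaves the values
unname(dup)
#> [1] 1 2 3
```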

### 13.1.7 Challenges

#### Challenge 1: Create and examine your vector.

Create a character vector called `fruit` that contains 4 of your favorite fruits. Then evaluate its structure using the commands below:

```
# First create your fruit vector
# YOUR CODE HERE
# Examine your vector
length(fruit)
class(fruit)
str(fruit)
```

#### Challenge 2: Coercion.

```
# 1. Create a vector of a sequence of numbers from 1 to 10.
# 2. Coerce that vector into a character vector.
# 3. Add the element "11" to the end of the vector.
# 4. Coerce it back to a numeric vector.
```

#### Challenge 3: Calculations on Vectors.

Create a vector of the numbers 11 to 20 and multiply it by the original vector from Challenge 2.

## 13.2 Subsetting Vectors

Sometimes we want to isolate elements of a vector for inspection, modification, etc. This is often called **indexing** or **subsetting**.

By the way, indexing begins at 1 in R, unlike many other languages that index from 0.

### 13.2.1 Subsetting Types

Let's explore the different types of subsetting with a simple vector, `x`:

`x <- c(2.1, 4.2, 3.3, 5.4)`

Note that the number after the decimal point gives the original position in the vector.

There are four things you can use to subset a vector:

#### 1. **Positive integers** return elements at the specified positions.

The simplest way to subset a vector is with a single integer:

```
x <- c(2.1, 4.2, 3.3, 5.4)
x[1]
#> [1] 2.1
x[3]
#> [1] 3.3
```

We can also index multiple values by passing a vector of integers:

```
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(3, 1)]
#> [1] 3.3 2.1
```

Note that you *have* to use `c()` inside the `[ ]` for this to work!

More examples:

```
# `order(x)` gives the index positions of smallest to largest values
(x <- c(2.1, 4.2, 3.3, 5.4))
#> [1] 2.1 4.2 3.3 5.4
order(x)
#> [1] 1 3 2 4
# Use this to order values
x[order(x)]
#> [1] 2.1 3.3 4.2 5.4
x[c(1, 3, 2, 4)]
#> [1] 2.1 3.3 4.2 5.4
```
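A few more behaviors of positive-integer subsetting are worth seeing once. This is a minimal sketch on the same vector `x`:

```r
x <- c(2.1, 4.2, 3.3, 5.4)

# Repeating a position repeats the element
x[c(1, 1, 4)]
#> [1] 2.1 2.1 5.4

# A range of positions
x[2:3]
#> [1] 4.2 3.3

# Asking for a position past the end returns NA rather than an error
x[6]
#> [1] NA
```

The last case is easy to miss, because R does not complain about an out-of-range position.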

#### 2. **Negative integers** omit elements at the specified positions.

```
x <- c(2.1, 4.2, 3.3, 5.4)
x[-1]
#> [1] 4.2 3.3 5.4
x[-c(1, 3)]
#> [1] 4.2 5.4
```

You cannot mix positive and negative integers in a single subset:

```
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(-1, 2)]
#> Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
```

#### 3. **Character vectors** return elements with matching names. This only works if the vector is named.

```
x <- c(2.1, 4.2, 3.3, 5.4)
# Apply names
names(x) <- c("a", "b", "c", "d")
# Subset using names
x["d"]
#> d
#> 5.4
x[c("d", "c", "a")]
#> d c a
#> 5.4 3.3 2.1
```

#### 4. **Logical vectors** select elements where the corresponding logical value is `TRUE`.

```
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(TRUE, TRUE, FALSE, FALSE)]
#> [1] 2.1 4.2
```

Logical subsetting is the most useful type of subsetting, because you use it to subset based on **comparative** statements.

```
x <- c(2.1, 4.2, 3.3, 5.4)
x > 3
#> [1] FALSE TRUE TRUE TRUE
```

This command tests whether the condition stated by the comparison operator is `TRUE` or `FALSE` for every element of the vector, and it returns a logical vector!

We can now pass this statement between the square brackets that follow `x` to subset only those items that match `TRUE`:

```
x <- c(2.1, 4.2, 3.3, 5.4)
x[x > 3]
#> [1] 4.2 3.3 5.4
# With ! (negation); note that !x > 5 is evaluated as !(x > 5)
!x > 5
#> [1] TRUE TRUE TRUE FALSE
x[!x > 5]
#> [1] 2.1 4.2 3.3
# With %in%
x %in% c(3.3, 4.2)
#> [1] FALSE TRUE TRUE FALSE
x[x %in% c(3.3, 4.2)]
#> [1] 4.2 3.3
```
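Conditions can also be combined, and a logical vector can be turned into positions or counts. A brief sketch on the same vector `x`:

```r
x <- c(2.1, 4.2, 3.3, 5.4)

# Combine conditions with & (and) and | (or)
x[x > 3 & x < 5]
#> [1] 4.2 3.3
x[x < 3 | x > 5]
#> [1] 2.1 5.4

# which() converts a logical vector into the positions that are TRUE
which(x > 3)
#> [1] 2 3 4

# sum() counts how many elements satisfy the condition
sum(x > 3)
#> [1] 3
```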

#### Challenge.

Subset `country_vector` below to return every value EXCEPT "Canada" and "Brazil".

```
country_vector <- c("Afghanistan", "Canada", "Sierra Leone", "Denmark", "Japan", "Brazil")
# Do it using positive integers.
# Do it using negative integers.
# Do it using a logical vector.
# Do it using a conditional statement (and an implicit logical vector).
```

## 13.3 Lists

Lists are different from atomic vectors because their elements can be of **any type**, including other lists. For that reason, lists are sometimes called recursive vectors.

In data analysis, you will not make lists very often, at least not consciously, but you should still know about them. Why?

- Dataframes are lists! They are a special case where each element is an atomic vector, all having the same length.
- Many functions will return lists to you, and you will want to extract goodies from them, such as the p-value for a hypothesis test or the estimated error variance in a regression model.

### 13.3.1 Creating Lists

You construct lists by using `list()` instead of `c()`:

```
x <- list(1, "a", TRUE, c(4, 5, 6))
x
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] "a"
#>
#> [[3]]
#> [1] TRUE
#>
#> [[4]]
#> [1] 4 5 6
```

### 13.3.2 Naming Lists

As with vectors, we can attach names to each element on our list:

```
my_list <- list(name1 = elem1,
                name2 = elem2)
```

This creates a list with components that are named `name1`, `name2`, and so on. If you want to name your lists after you have created them, you can use the `names()` function as you did with vectors. The following commands are fully equivalent to the assignment above:

```
my_list <- list(elem1, elem2)
names(my_list) <- c("name1", "name2")
```
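The two templates above use placeholders. Here is a runnable sketch with hypothetical elements `elem1` and `elem2` that confirms both routes give the same list:

```r
# Hypothetical elements to store
elem1 <- c(1, 2, 3)
elem2 <- "hello"

# Name at creation time
list_a <- list(name1 = elem1, name2 = elem2)

# Name after creation
list_b <- list(elem1, elem2)
names(list_b) <- c("name1", "name2")

identical(list_a, list_b)
#> [1] TRUE
names(list_a)
#> [1] "name1" "name2"
```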

### 13.3.3 List Structure

A very useful tool for working with lists is `str()`, because it focuses on reviewing the structure of a list, not its contents.

```
x <- list(a = c(1, 2, 3),
          b = c("Hello", "there"),
          c = 1:10)
str(x)
#> List of 3
#> $ a: num [1:3] 1 2 3
#> $ b: chr [1:2] "Hello" "there"
#> $ c: int [1:10] 1 2 3 4 5 6 7 8 9 10
```

A list does not print to the console like a vector. Instead, each element of the list starts on a new line.

```
x_vec <- c(1, 2, 3)
x_list <- list(1, 2, 3)
x_vec
#> [1] 1 2 3
x_list
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
```

Lists are used to build up many of the more complicated data structures in R. For example, both dataframes and linear model objects (as produced by `lm()`) are lists:

```
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
is.list(mtcars)
#> [1] TRUE
mod <- lm(mpg ~ wt, data = mtcars)
is.list(mod)
#> [1] TRUE
```

You could say that a list is some kind of super data type: You can store practically any piece of information in it!

For this reason, lists are extremely useful inside functions. You can "staple" together lots of different kinds of results into a single object that a function can return.

```
mod <- lm(mpg ~ wt, data = mtcars)
str(mod)
#> List of 12
#> $ coefficients : Named num [1:2] 37.29 -5.34
#> ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
#> $ residuals : Named num [1:32] -2.28 -0.92 -2.09 1.3 -0.2 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ effects : Named num [1:32] -113.65 -29.116 -1.661 1.631 0.111 ...
#> ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "" "" ...
#> $ rank : int 2
#> $ fitted.values: Named num [1:32] 23.3 21.9 24.9 20.1 18.9 ...
#> ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ assign : int [1:2] 0 1
#> $ qr :List of 5
#> ..$ qr : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> .. .. ..$ : chr [1:2] "(Intercept)" "wt"
#> .. ..- attr(*, "assign")= int [1:2] 0 1
#> ..$ qraux: num [1:2] 1.18 1.05
#> ..$ pivot: int [1:2] 1 2
#> ..$ tol : num 1e-07
#> ..$ rank : int 2
#> ..- attr(*, "class")= chr "qr"
#> $ df.residual : int 30
#> $ xlevels : Named list()
#> $ call : language lm(formula = mpg ~ wt, data = mtcars)
#> $ terms :Classes 'terms', 'formula' language mpg ~ wt
#> .. ..- attr(*, "variables")= language list(mpg, wt)
#> .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:2] "mpg" "wt"
#> .. .. .. ..$ : chr "wt"
#> .. ..- attr(*, "term.labels")= chr "wt"
#> .. ..- attr(*, "order")= int 1
#> .. ..- attr(*, "intercept")= int 1
#> .. ..- attr(*, "response")= int 1
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. ..- attr(*, "predvars")= language list(mpg, wt)
#> .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#> .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
#> $ model :'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
#> ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ wt
#> .. .. ..- attr(*, "variables")= language list(mpg, wt)
#> .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#> .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
#> .. .. .. .. ..$ : chr "wt"
#> .. .. ..- attr(*, "term.labels")= chr "wt"
#> .. .. ..- attr(*, "order")= int 1
#> .. .. ..- attr(*, "intercept")= int 1
#> .. .. ..- attr(*, "response")= int 1
#> .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. .. ..- attr(*, "predvars")= language list(mpg, wt)
#> .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
#> .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
#> - attr(*, "class")= chr "lm"
```
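The same "stapling" idea works in your own functions. Below is a minimal sketch of a function that returns several results in one list; `describe()` is a made-up helper, not part of base R.

```r
# A hypothetical helper that returns several summaries of a numeric vector
describe <- function(v) {
  list(n = length(v),
       mean = mean(v),
       range = range(v))
}

res <- describe(c(2, 4, 9))
str(res)
#> List of 3
#>  $ n    : int 3
#>  $ mean : num 5
#>  $ range: num [1:2] 2 9
names(res)
#> [1] "n"     "mean"  "range"
```

Naming the elements lets the caller pick out results by name instead of by position.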

## 13.4 Subsetting Lists

Subsetting a list works in the same way as subsetting an atomic vector. However, there is one important difference: `[` will always return a list. `[[` and `$`, as described below, let you pull out the components of the list.

The "pepper shaker photos" in R for Data Science are a splendid visual explanation of the different ways to get stuff out of a list. Highly recommended.

Let's illustrate with the following list `my_list`:

```
my_list <- list(a = 1:3,
                b = "a string",
                c = pi,
                d = list(-1, -5))
```

### 13.4.1 With `[`

`[` extracts a sub-list where the result will always be a list. Like with vectors, you can subset with a logical, integer, or character vector.

```
my_list[1]
#> $a
#> [1] 1 2 3
str(my_list[1])
#> List of 1
#> $ a: int [1:3] 1 2 3
my_list[1:2]
#> $a
#> [1] 1 2 3
#>
#> $b
#> [1] "a string"
str(my_list[1:2])
#> List of 2
#> $ a: int [1:3] 1 2 3
#> $ b: chr "a string"
```

### 13.4.2 With `[[`

`[[` extracts a single *component* from a list. In other words, it removes that hierarchy and returns whatever object is stored inside.

```
my_list[[1]]
#> [1] 1 2 3
str(my_list[[1]])
#> int [1:3] 1 2 3
# Compare to
my_list[1]
#> $a
#> [1] 1 2 3
str(my_list[1])
#> List of 1
#> $ a: int [1:3] 1 2 3
```

The distinction between `[` and `[[` is really important for lists, because `[[` drills down into the list while `[` returns a new, smaller list.

"If list `x` is a train carrying objects, then `x[[5]]` is the object in car 5; `x[4:6]` is a train of cars 4-6."
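The practical consequence shows up as soon as you try to compute with the result. A small sketch using the same `my_list` as above:

```r
my_list <- list(a = 1:3,
                b = "a string",
                c = pi,
                d = list(-1, -5))

# [ returns a list holding one element; [[ returns the element itself
length(my_list["a"])
#> [1] 1
length(my_list[["a"]])
#> [1] 3

# Arithmetic works on the extracted element
sum(my_list[["a"]])
#> [1] 6

# Chain [[ to reach into the nested list stored in d
my_list[["d"]][[2]]
#> [1] -5
```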

### 13.4.3 With `$`

`$` is a shorthand for extracting a single named element of a list. It works especially well when coupled with tab completion.

```
my_list$a
#> [1] 1 2 3
```
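One limitation is worth knowing: `$` takes the name literally, so it cannot use a name stored in a variable. The sketch below uses a hypothetical variable `key` to show the difference.

```r
my_list <- list(a = 1:3, b = "a string")

# $ looks for an element literally called "key" and finds none
key <- "b"
my_list$key
#> NULL

# [[ evaluates the variable first
my_list[[key]]
#> [1] "a string"

# For a literal name, the two forms agree
identical(my_list$a, my_list[["a"]])
#> [1] TRUE
```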

### 13.4.4 Challenges

#### Challenge 1.

What are the four basic types of atomic vectors? How does a list differ from an atomic vector?

#### Challenge 2.

Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?

#### Challenge 3.

Create three vectors and combine them into a list. Assign them names.

#### Challenge 4.

If `x` is a list, what is the class of `x[1]`? How about `x[[1]]`?

#### Challenge 5.

Take a look at the linear model below:

```
mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
#>
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.543 -2.365 -0.125 1.410 6.873
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.285 1.878 19.86 < 2e-16 ***
#> wt -5.344 0.559 -9.56 1.3e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.05 on 30 degrees of freedom
#> Multiple R-squared: 0.753, Adjusted R-squared: 0.745
#> F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
```

Extract the R squared from the model summary.

## 13.5 Matrices

Matrices are like 2-d vectors, that is, they are a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

```
m <- matrix(1:6, nrow = 2, ncol = 3)
m
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
```

General arrays are available in R, where a matrix is an important special case having dimension 2.
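Two details may help here: how `matrix()` fills its cells, and what a higher-dimensional array looks like. This is a sketch with arbitrary values; the names `m_col`, `m_row`, and `arr` are illustrative.

```r
# matrix() fills column by column unless byrow = TRUE
m_col <- matrix(1:6, nrow = 2)
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_row
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6

# A 3-dimensional array: 2 rows, 3 columns, 2 layers
arr <- array(1:12, dim = c(2, 3, 2))
dim(arr)
#> [1] 2 3 2
arr[1, 2, 2]   # row 1, column 2, layer 2
#> [1] 9
```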

Practically speaking, matrices are good for large tables of numbers. However, as social scientists, we rarely work with purely numerical data.

By definition, if you want to combine different types of data (one column numbers, another column characters...), you want a **dataframe**, not a matrix.

Let’s make a simple matrix and give it decent row and column names. You will see familiar or self-explanatory functions below for getting to know a matrix.

```
## Do not worry if the construction of this matrix confuses you;
## just focus on the product
m <- outer(as.character(1:4), as.character(1:4),
           function(x, y) {
             paste0('x', x, '-', y)
           })
m
#> [,1] [,2] [,3] [,4]
#> [1,] "x1-1" "x1-2" "x1-3" "x1-4"
#> [2,] "x2-1" "x2-2" "x2-3" "x2-4"
#> [3,] "x3-1" "x3-2" "x3-3" "x3-4"
#> [4,] "x4-1" "x4-2" "x4-3" "x4-4"
str(m)
#> chr [1:4, 1:4] "x1-1" "x2-1" "x3-1" "x4-1" "x1-2" "x2-2" "x3-2" "x4-2" ...
class(m)
#> [1] "matrix" "array"
dim(m)
#> [1] 4 4
nrow(m)
#> [1] 4
ncol(m)
#> [1] 4
rownames(m)
#> NULL
rownames(m) <- c("row1", "row2", "row3", "row4")
colnames(m) <- c("col1", "col2", "col3", "col4")
m
#> col1 col2 col3 col4
#> row1 "x1-1" "x1-2" "x1-3" "x1-4"
#> row2 "x2-1" "x2-2" "x2-3" "x2-4"
#> row3 "x3-1" "x3-2" "x3-3" "x3-4"
#> row4 "x4-1" "x4-2" "x4-3" "x4-4"
```

## 13.6 Indexing a Matrix

Similar to vectors, you can use the square brackets `[ ]` to select one or multiple elements from a matrix. But whereas vectors have one dimension, matrices have two. We therefore have to use two subsetting vectors, one for rows and one for columns, separated by a comma. Blank subsetting is also useful because it lets you keep all rows or all columns. By default, a result with a single row or column is simplified to a vector; adding `drop = FALSE` keeps it as a matrix.

```
m[2, 3] # Selects the value at the second row and third column
#> [1] "x2-3"
m[2, ] # We get row 2
#> col1 col2 col3 col4
#> "x2-1" "x2-2" "x2-3" "x2-4"
m[ , 3, drop = FALSE] # We get column 3
#> col3
#> row1 "x1-3"
#> row2 "x2-3"
#> row3 "x3-3"
#> row4 "x4-3"
dim(m[ , 3, drop = FALSE]) # We get column 3 as a 4 x 1 matrix
#> [1] 4 1
m[c("row1", "row4"), c("col2", "col3")] # We get rows 1, 4 and columns 2, 3
#> col2 col3
#> row1 "x1-2" "x1-3"
#> row4 "x4-2" "x4-3"
m[-c(2, 3), c(TRUE, TRUE, FALSE, FALSE)] # Wacky but possible
#> col1 col2
#> row1 "x1-1" "x1-2"
#> row4 "x4-1" "x4-2"
```
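The effect of `drop = FALSE` is easiest to see side by side. Here is a minimal sketch on a small numeric matrix with made-up dimension names:

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("row1", "row2"), c("col1", "col2", "col3")))

# By default, a single column is simplified to a plain (named) vector
m[, 2]
#> row1 row2 
#>    3    4 
dim(m[, 2])
#> NULL

# drop = FALSE keeps the result as a matrix
dim(m[, 2, drop = FALSE])
#> [1] 2 1
```

Keeping the matrix shape matters when later code calls `nrow()`, `ncol()`, or matrix-style indexing on the result.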

## 13.7 Dataframes

Dataframes are a very important data structure in R. They are pretty much the *de facto* data structure for most tabular data, and they are also what we use for statistics.

Hopefully the slog through vectors, matrices, and lists will be redeemed by greater prowess at manipulating `data.frame`s. Why should this be true?

- A dataframe is a *list*.
- The list elements are the variables, and they are atomic vectors.
- Dataframes are rectangular, like their matrix friends, so your intuition – and even some syntax – can be borrowed from the matrix world.

NB: You might have heard of "tibbles," used in the `tidyverse` suite of packages. Tibbles are like dataframes 2.0, tweaking some of the behavior of dataframes to make life easier for data analysis. For now, just think of tibbles and dataframes as the same thing and do not worry about the difference.

### 13.7.1 Creating Dataframes

We have already worked extensively with dataframes that we have imported through a package or `read.csv()`.

```
library(gapminder)
gap <- gapminder
```

We can create a dataframe from scratch using `data.frame()`. This function takes vectors as input:

```
vec_1 <- 1:3
vec_2 <- c("a", "b", "c")
df <- data.frame(vec_1, vec_2)
```
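Column names can also be supplied directly in the call. The sketch below uses made-up columns and, as an aside, shows that `data.frame()` refuses vectors whose lengths do not fit together.

```r
# Column names given in the call (values are made up)
df <- data.frame(id = 1:3,
                 letter = c("a", "b", "c"),
                 passed = c(TRUE, FALSE, TRUE))
df
#>   id letter passed
#> 1  1      a   TRUE
#> 2  2      b  FALSE
#> 3  3      c   TRUE

# Vectors of length 3 and 2 cannot form a rectangle, so this call fails;
# tryCatch() turns the error into a plain string for display
result <- tryCatch(data.frame(id = 1:3, letter = c("a", "b")),
                   error = function(e) "error")
result
#> [1] "error"
```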

### 13.7.2 The Structure of Dataframes

Under the hood, a dataframe is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list.

```
vec_1 <- 1:3
vec_2 <- c("a", "b", "c")
df <- data.frame(vec_1, vec_2)
str(df)
#> 'data.frame': 3 obs. of 2 variables:
#> $ vec_1: int 1 2 3
#> $ vec_2: chr "a" "b" "c"
```

The `length()` of a dataframe is the length of the underlying list and so is the same as `ncol()`; `nrow()` gives the number of rows.

```
vec_1 <- 1:3
vec_2 <- c("a", "b", "c")
df <- data.frame(vec_1, vec_2)
# These two are equivalent - number of columns
length(df)
#> [1] 2
ncol(df)
#> [1] 2
# Get number of rows
nrow(df)
#> [1] 3
# Get number of both columns and rows
dim(df)
#> [1] 3 2
```
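Because a dataframe is a list of columns, the list tools from earlier in the chapter apply to it. A brief sketch, with `stringsAsFactors = FALSE` set explicitly so the result does not depend on the R version:

```r
df <- data.frame(vec_1 = 1:3,
                 vec_2 = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

is.list(df)
#> [1] TRUE

# List-style extraction returns the column as a vector
df[[2]]
#> [1] "a" "b" "c"
df$vec_1
#> [1] 1 2 3

# Applying a function to each list element means applying it to each column
sapply(df, class)
#>       vec_1       vec_2 
#>   "integer" "character" 
```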

### 13.7.3 Naming Dataframes

Dataframes have `colnames()` and `rownames()`. However, since dataframes are really lists (of vectors) under the hood, `names()` and `colnames()` are the same thing.

```
vec_1 <- 1:3
vec_2 <- c("a", "b", "c")
df <- data.frame(vec_1, vec_2)
# These two are equivalent
names(df)
#> [1] "vec_1" "vec_2"
colnames(df)
#> [1] "vec_1" "vec_2"
# Change the colnames
colnames(df) <- c("Number", "Character")
# Change the rownames
rownames(df)
#> [1] "1" "2" "3"
rownames(df) <- c("donut", "pickle", "pretzel")
df
#> Number Character
#> donut 1 a
#> pickle 2 b
#> pretzel 3 c
```

## 13.8 Indexing Dataframes

A dataframe is a list that quacks like a matrix.

Remember that dataframes are really lists of vectors (one vector per column). That means that dataframes have both list- and matrix-like behavior.

For example, just as `list$name` selects the `name` element from the list, `df$name` selects the `name` column (vector) from the dataframe:

```
library(gapminder)
gap <- gapminder
head(gap$country)
#> [1] Afghanistan Afghanistan Afghanistan Afghanistan Afghanistan Afghanistan
#> 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
```

Likewise, we can use square brackets to subset rows and columns:

```
# Row 1, column 3
gap[1, 3]
#> # A tibble: 1 x 1
#> year
#> <int>
#> 1 1952
# Fourth row
gap[4, ]
#> # A tibble: 1 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1967 34.0 11537966 836.
# First two rows of the columns 1 and 5
gap[c(1, 2), c(1, 5)]
#> # A tibble: 2 x 2
#> country pop
#> <fct> <int>
#> 1 Afghanistan 8425333
#> 2 Afghanistan 9240934
```
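The same bracket syntax works on a plain base-R dataframe, with one visible difference from the tibble output above. This is a sketch on a small made-up dataframe, not on the `gapminder` data.

```r
# A small base-R dataframe with made-up values
df <- data.frame(country = c("A", "B", "C"),
                 year = c(1952, 1957, 1962),
                 pop = c(10, 20, 30),
                 stringsAsFactors = FALSE)

# Rows that satisfy a condition, all columns
df[df$year > 1955, ]
#>   country year pop
#> 2       B 1957  20
#> 3       C 1962  30

# Columns selected by name
names(df[, c("country", "pop")])
#> [1] "country" "pop"

# A single cell: a base dataframe returns the bare value,
# whereas a tibble returns a 1 x 1 tibble
df[1, 3]
#> [1] 10
```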

We can also use subsetting in conjunction with assignment to quickly add a column:

```
names(gap)
#> [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
gap$new_col <- NA
head(gap)
#> # A tibble: 6 x 7
#> country continent year lifeExp pop gdpPercap new_col
#> <fct> <fct> <int> <dbl> <int> <dbl> <lgl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. NA
#> 2 Afghanistan Asia 1957 30.3 9240934 821. NA
#> 3 Afghanistan Asia 1962 32.0 10267083 853. NA
#> 4 Afghanistan Asia 1967 34.0 11537966 836. NA
#> 5 Afghanistan Asia 1972 36.1 13079460 740. NA
#> 6 Afghanistan Asia 1977 38.4 14880372 786. NA
```
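New columns are more often computed from existing ones than filled with `NA`. Here is a minimal sketch on a made-up dataframe; the column names are hypothetical.

```r
df <- data.frame(pop = c(10, 20, 30),
                 gdp_per_cap = c(2, 3, 4))

# New column computed element-wise from existing columns
df$gdp <- df$pop * df$gdp_per_cap

# [[ ]] assignment does the same and accepts a name stored in a variable
new_name <- "big"
df[[new_name]] <- df$pop > 15

df
#>   pop gdp_per_cap gdp   big
#> 1  10           2  20 FALSE
#> 2  20           3  60  TRUE
#> 3  30           4 120  TRUE
```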

### 13.8.1 Challenges

#### Challenge 1.

Create a 3x2 dataframe called `basket`. The first column should contain the names of 3 fruits. The second column should contain the price of those fruits. Now give your dataframe appropriate column and row names.

#### Challenge 2.

Add a third column called `color` that tells what color each fruit is.