At the end of this exercise, you will be able to:
1. Define data structure.
2. Build a new vector and call elements within it.
3. Combine a series of vectors into a data frame.
4. Name columns and rows in a data frame.
5. Select columns and rows and use summary functions.
6. Write your data frame to a csv file!
A library is a collection of R functions and data sets. The tidyverse is a collection of R packages designed for data science. For this course, we will be using many of the packages in the tidyverse. We load the tidyverse with the command below.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In addition to classes of data, R also organizes data in different ways. These are called data structures and include vectors, lists, matrices, data frames, and factors. Here, we will introduce vectors and data frames.
Vectors are a common way of organizing data in R. We create vectors
using the c() function. The c stands for
concatenate. We used this function in lab 2.
A numeric vector.
my_vector <- c(10, 20, 30)
A character vector.
days_of_the_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
A convenient trick for creating a vector to play around with is to generate a sequence of numbers.
my_vector_sequence <- c(1:100)
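The colon operator in 1:100 always counts in steps of 1, and it already returns a vector, so the surrounding c() is optional. As a small sketch, base R's seq() builds sequences with other step sizes. The names plain_sequence and evens are made up for this illustration.

```r
# 1:100 is already a vector; wrapping it in c() gives the same result
plain_sequence <- 1:100
identical(plain_sequence, c(1:100))   # TRUE

# seq() lets us pick the step size with the `by` argument
evens <- seq(from = 2, to = 20, by = 2)
evens
length(evens)                         # 10
```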
We can use [] to pull out elements of a vector. We just
need to specify their position in the vector; for example, position 3 of
days_of_the_week is Wednesday.
days_of_the_week[4]
## [1] "Thursday"
my_vector_sequence[10]
## [1] 10
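Square brackets also accept more than one position at a time. The sketch below rebuilds days_of_the_week so that it runs on its own; the names weekend, weekdays_only, and without_monday are illustrative and not part of the lab.

```r
days_of_the_week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday")

# A vector of positions pulls out several elements at once
weekend <- days_of_the_week[c(6, 7)]
weekend

# A range of positions works the same way
weekdays_only <- days_of_the_week[1:5]
weekdays_only

# A negative position drops that element and keeps the rest
without_monday <- days_of_the_week[-1]
without_monday
```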
Use [] to determine which element in
my_vector_sequence has a value of 15.
my_vector_sequence[15]
## [1] 15
Find the values in my_vector_sequence that are less than or equal to
10.
my_vector_sequence <= 10
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
If you place the comparison inside [] then you only get the values, not the
logical evaluation of the entire vector. Experiment with this by
adjusting the chunk below.
my_vector_sequence[my_vector_sequence <= 10]
## [1] 1 2 3 4 5 6 7 8 9 10
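In my_vector_sequence every value equals its position, which can hide the difference between asking for a position and asking for a value. The sketch below uses a made-up vector, fives, where the two differ, and uses base R's which() to look up a position from a value.

```r
# A vector whose values differ from their positions (illustrative name)
fives <- seq(from = 5, to = 50, by = 5)

fives[3]            # the value stored at position 3: 15
which(fives == 15)  # the position that holds the value 15: 3

# Logical subsetting keeps only the elements where the test is TRUE
small_fives <- fives[fives <= 20]
small_fives         # 5 10 15 20

# TRUE counts as 1, so sum() counts how many elements pass the test
sum(fives <= 20)    # 4
```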
The data frame is the most common way to organize data within R. A data frame stores data of many different classes. Essentially, data frames are spreadsheets like you would find in Excel. We usually don’t build data frames in RStudio from scratch, but this example will show you how they are structured.
Let’s build separate vectors that include the length (in), weight (g), and sex of three ruby-throated hummingbirds.
Sex <- c("male", "female", "male")
Length <- c(3.2, 3.7, 3.4)
Weight <- c(2.9, 4.0, 3.1)
Since we work in the tidyverse, we use tibble() to
create a data frame.
hbirds <- tibble(Sex, Length, Weight)
Notice that not only do the data look neat and clean, the printout
also reports the class of each column. chr marks a character
column, and dbl marks a numeric column stored as double-precision
floating point.
hbirds
## # A tibble: 3 × 3
## Sex Length Weight
## <chr> <dbl> <dbl>
## 1 male 3.2 2.9
## 2 female 3.7 4
## 3 male 3.4 3.1
What are the column names of our data frame? Notice that R defaulted to using the names of our vectors, but we could name them something else when creating the data frame, or rename them later.
names(hbirds)
## [1] "Sex" "Length" "Weight"
What are the dimensions of the hbirds data frame? The
dim() and str() commands provide this
information.
dim(hbirds)
## [1] 3 3
str(hbirds)
## tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
## $ Sex : chr [1:3] "male" "female" "male"
## $ Length: num [1:3] 3.2 3.7 3.4
## $ Weight: num [1:3] 2.9 4 3.1
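A few related base R helpers report the same information one piece at a time. This is a minimal sketch that assumes a base data.frame() copy of the data, named hbirds_df here, so it runs without loading any packages; the same calls also work on the tibble.

```r
# Base R copy of the hummingbird data (hbirds_df is an illustrative name)
hbirds_df <- data.frame(
  Sex    = c("male", "female", "male"),
  Length = c(3.2, 3.7, 3.4),
  Weight = c(2.9, 4.0, 3.1)
)

nrow(hbirds_df)            # number of rows: 3
ncol(hbirds_df)            # number of columns: 3

# Class of every column, returned as a named character vector
column_classes <- sapply(hbirds_df, class)
column_classes
```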
Let’s use different names when we create the data frame. Here we switch to lowercase and add the unit to the weight column, but we could use any names we wish.
hbirds <- tibble(sex=Sex, length=Length, weight_g=Weight)
hbirds
## # A tibble: 3 × 3
## sex length weight_g
## <chr> <dbl> <dbl>
## 1 male 3.2 2.9
## 2 female 3.7 4
## 3 male 3.4 3.1
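Column names can also be changed after the data frame exists. The sketch below assigns to names() in base R, using a data.frame() copy called hbirds_df so it runs without packages. The name length_in is a hypothetical choice for illustration and is not used elsewhere in this lab.

```r
# Base R copy of the data with the original capitalized names
hbirds_df <- data.frame(
  Sex    = c("male", "female", "male"),
  Length = c(3.2, 3.7, 3.4),
  Weight = c(2.9, 4.0, 3.1)
)

# Replace every column name at once
names(hbirds_df) <- c("sex", "length", "weight_g")

# Replace a single name by its position
names(hbirds_df)[2] <- "length_in"

names(hbirds_df)
```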
The same methods of selecting elements in vectors and matrices
apply to data frames. We use [] with two positions:
the first applies to the rows, and the second to the columns.
The first row.
hbirds[1,]
## # A tibble: 1 × 3
## sex length weight_g
## <chr> <dbl> <dbl>
## 1 male 3.2 2.9
The third column.
hbirds[ ,3]
## # A tibble: 3 × 1
## weight_g
## <dbl>
## 1 2.9
## 2 4
## 3 3.1
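Giving both positions selects a single cell, and a logical test in the row position selects rows by condition. This sketch assumes a base data.frame() copy named hbirds_df so it runs without packages. A tibble accepts the same calls, but it keeps single-cell and single-column results as tibbles instead of plain values.

```r
# Base R copy of the hummingbird data (hbirds_df is an illustrative name)
hbirds_df <- data.frame(
  sex      = c("male", "female", "male"),
  length   = c(3.2, 3.7, 3.4),
  weight_g = c(2.9, 4.0, 3.1)
)

# Row 2, column 3: the weight of the second bird
second_weight <- hbirds_df[2, 3]
second_weight                 # 4

# Keep only the rows where sex is "male"
males <- hbirds_df[hbirds_df$sex == "male", ]
males

# [[ ]] pulls out one column as a vector, by name
weights <- hbirds_df[["weight_g"]]
weights                       # 2.9 4.0 3.1
```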
We can use the $ to access a column (variable) in a data
frame. Here we calculate the mean length of the hummingbirds.
mean(hbirds$length)
## [1] 3.433333
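Other base R summary functions take a column in the same way. The sketch below uses a base data.frame() copy named hbirds_df so it runs on its own, and adds tapply() as one possible way to summarize by group. It is an illustration and not a required step in the lab.

```r
# Base R copy of the hummingbird data (hbirds_df is an illustrative name)
hbirds_df <- data.frame(
  sex      = c("male", "female", "male"),
  length   = c(3.2, 3.7, 3.4),
  weight_g = c(2.9, 4.0, 3.1)
)

median(hbirds_df$length)   # 3.4
max(hbirds_df$length)      # 3.7
min(hbirds_df$weight_g)    # 2.9

# summary() reports several statistics at once
summary(hbirds_df$weight_g)

# Mean length for each sex
mean_length_by_sex <- tapply(hbirds_df$length, hbirds_df$sex, mean)
mean_length_by_sex
```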
We should save our hbirds data frame so we can use it again later!
There are many ways to save data in R; here we write our data frame to a
csv file. We use row.names = FALSE so that row numbers are not
written to the file.
write.csv(hbirds, "hbirds_data.csv", row.names = FALSE)
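To check that the file was written as expected, we can read it back in. This sketch makes two assumptions that differ from the lab: it uses a base data.frame() copy named hbirds_df, and it writes to a temporary folder (via tempdir()) so that nothing in your working directory is overwritten.

```r
# Base R copy of the hummingbird data (hbirds_df is an illustrative name)
hbirds_df <- data.frame(
  sex      = c("male", "female", "male"),
  length   = c(3.2, 3.7, 3.4),
  weight_g = c(2.9, 4.0, 3.1)
)

# Build a path inside R's temporary folder for this session
csv_path <- file.path(tempdir(), "hbirds_data.csv")

# Write the file without row numbers, then read it back
write.csv(hbirds_df, csv_path, row.names = FALSE)
hbirds_from_file <- read.csv(csv_path)

names(hbirds_from_file)   # "sex" "length" "weight_g"
dim(hbirds_from_file)     # 3 3
```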