At the end of this exercise, you will be able to:
1. Define data structure.
2. Build a new vector and call elements within it.
3. Combine a series of vectors into a data frame.
4. Name columns and rows in a data frame.
5. Select columns and rows and use summary functions.
6. Write your data frame to a csv file!
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In addition to classes of data, R also organizes data in different ways. These are called data structures and include vectors, lists, matrices, data frames, and factors. Here, we will introduce vectors and data frames.
Vectors are a common way of organizing data in R. We create vectors using the c command. The c stands for concatenate. We used this in the first part of today’s lab.
A numeric vector.
my_vector <- c(10, 20, 30)
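As a small aside, c can also join whole vectors, and every element of a vector must share one class. The sketch below uses made-up names (`first_half`, `second_half`, `mixed`) that are not part of the lab data.

```r
# Hypothetical vectors for illustration only.
first_half <- c(10, 20, 30)
second_half <- c(40, 50)

# c() accepts whole vectors and returns one flat vector.
combined <- c(first_half, second_half)
combined
length(combined)  # 5

# Mixing numbers and text coerces every element to character.
mixed <- c(1, "two", 3)
class(mixed)  # "character"
```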
A character vector.
days_of_the_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
A convenient trick for creating a vector to play around with is to generate a sequence of numbers.
my_vector_sequence <- c(1:100)
We can use [] to pull out elements in a vector. We just need to specify their position in the vector; e.g., day 4 is Thursday.
days_of_the_week[4]
## [1] "Thursday"
my_vector_sequence[10]
## [1] 10
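Square brackets also accept more than one position at a time. The following is a minimal sketch using a stand-in vector (`weekdays_demo`, a hypothetical name) so that it runs on its own.

```r
# Stand-in vector so the example is self-contained.
weekdays_demo <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday")

# A vector of positions returns several elements.
weekdays_demo[c(1, 5)]  # "Monday" "Friday"

# A range of positions.
weekdays_demo[6:7]      # "Saturday" "Sunday"

# Negative positions drop elements instead of selecting them.
work_days <- weekdays_demo[-(6:7)]
work_days               # Monday through Friday
```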
Use [] to determine which element in my_vector_sequence has a value of 15.
my_vector_sequence[15]
## [1] 15
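In a 1:100 sequence the position and the value happen to match, which can hide the difference between the two. As a sketch, assuming a hypothetical vector of even numbers, `[]` returns the value at a position, while base R's `which()` returns the position of a value.

```r
# Hypothetical vector where position and value differ.
even_numbers <- seq(2, 100, by = 2)

# Value stored at position 15.
even_numbers[15]           # 30

# Position(s) that hold the value 30.
which(even_numbers == 30)  # 15
```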
Evaluate which values in my_vector_sequence are less than or equal to 10.
my_vector_sequence <= 10
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
If you put the comparison inside [] then you only get the values, not the logical evaluation of the entire vector. Experiment with this by adjusting the chunk below.
my_vector_sequence[my_vector_sequence <= 10]
## [1] 1 2 3 4 5 6 7 8 9 10
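One way to experiment is to combine comparisons with `&` (and) or `|` (or). This is only a sketch; `number_line` is a hypothetical stand-in for the sequence used above.

```r
# Stand-in sequence so the example runs on its own.
number_line <- 1:100

# Both conditions must hold.
number_line[number_line > 10 & number_line <= 15]  # 11 12 13 14 15

# Either condition may hold.
number_line[number_line < 3 | number_line > 98]    # 1 2 99 100

# TRUE counts as 1, so sum() counts how many elements pass the test.
sum(number_line <= 10)                             # 10
```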
The data frame is the most common way to organize data within R. A data frame can store data of many different classes. We usually don’t build data frames in RStudio, but this example will show you how they are structured.
Let’s build separate vectors that include length (in), weight (g), and sex of three ruby-throated hummingbirds.
Sex <- c("male", "female", "male")
Length <- c(3.2, 3.7, 3.4)
Weight <- c(2.9, 4.0, 3.1)
Here we combine our three vectors to create a data frame with the function data.frame().
hbirds <- data.frame(Sex, Length, Weight)
Since we work in the tidyverse, we can also use the tibble() function to create a data frame. A tibble is a modern take on data frames. Tibbles are data frames, but they tweak some older behaviors to make life a little easier.
hbirds <- tibble(Sex, Length, Weight)
Notice that not only are the data neat and clean looking, there is also information provided about the class of each column. chr means the column holds character strings, and dbl means it holds numeric values stored as double-precision floating-point numbers.
hbirds
## # A tibble: 3 × 3
## Sex Length Weight
## <chr> <dbl> <dbl>
## 1 male 3.2 2.9
## 2 female 3.7 4
## 3 male 3.4 3.1
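One of the "tweaked behaviors" can be seen when pulling out a single column with `[]`. The sketch below uses hypothetical objects (`demo_df`, `demo_tbl`) and assumes the tibble package is installed.

```r
library(tibble)

# Hypothetical example data.
demo_sex <- c("male", "female", "male")
demo_length <- c(3.2, 3.7, 3.4)

demo_df <- data.frame(sex = demo_sex, length = demo_length)
demo_tbl <- tibble(sex = demo_sex, length = demo_length)

# A tibble is still a data frame.
is.data.frame(demo_tbl)  # TRUE

# A base data frame drops a single column down to a plain vector...
class(demo_df[, 2])      # "numeric"

# ...while a tibble always returns another tibble.
class(demo_tbl[, 2])     # "tbl_df" "tbl" "data.frame"
```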
What are the column names of our data frame? Notice that R defaulted to using the names of our vectors, but we could name them something else when creating the data frame, or rename them later.
names(hbirds)
## [1] "Sex" "Length" "Weight"
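Renaming after the fact could look something like the sketch below. `birds_demo` and the new column names are hypothetical; the example shows both the base R `names()` assignment and `dplyr::rename()`, which takes `new_name = old_name`.

```r
library(tibble)
library(dplyr)

# Hypothetical two-column data frame.
birds_demo <- tibble(Sex = c("male", "female"), Length = c(3.2, 3.7))

# Base R: assign a new vector of names.
names(birds_demo) <- c("sex", "length")

# dplyr: rename one column, written as new_name = old_name.
birds_demo <- rename(birds_demo, length_in = length)

names(birds_demo)  # "sex" "length_in"
```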
What are the dimensions of the hbirds data frame? The dim() and str() commands provide this information.
dim(hbirds)
## [1] 3 3
str(hbirds)
## tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
## $ Sex : chr [1:3] "male" "female" "male"
## $ Length: num [1:3] 3.2 3.7 3.4
## $ Weight: num [1:3] 2.9 4 3.1
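When only one dimension is needed, base R also offers `nrow()` and `ncol()`. A minimal sketch with a hypothetical data frame:

```r
# Hypothetical data frame with 3 rows and 2 columns.
size_demo <- data.frame(
  sex = c("male", "female", "male"),
  length = c(3.2, 3.7, 3.4)
)

nrow(size_demo)  # 3 (number of rows)
ncol(size_demo)  # 2 (number of columns)

# dim() returns both at once, rows first.
dim(size_demo)   # 3 2
```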
Let’s choose our own names when we create the data frame. Here we use lowercase and rename Weight to weight_g, but we could use any names we wish.
hbirds <- tibble(sex = Sex, length = Length, weight_g = Weight)
hbirds
## # A tibble: 3 × 3
## sex length weight_g
## <chr> <dbl> <dbl>
## 1 male 3.2 2.9
## 2 female 3.7 4
## 3 male 3.4 3.1
The same methods of selecting elements in vectors and matrices apply to data frames. We use [] with two positions separated by a comma: the first applies to the rows, and the second to the columns.
The first row.
hbirds[1,]
## # A tibble: 1 × 3
## sex length weight_g
## <chr> <dbl> <dbl>
## 1 male 3.2 2.9
The third column.
hbirds[ ,3]
## # A tibble: 3 × 1
## weight_g
## <dbl>
## 1 2.9
## 2 4
## 3 3.1
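The row and column positions can also be names or logical tests. The sketch below rebuilds the hummingbird values in a hypothetical base data frame (`select_demo`) so it runs on its own.

```r
# Hypothetical copy of the hummingbird values.
select_demo <- data.frame(
  sex = c("male", "female", "male"),
  length = c(3.2, 3.7, 3.4),
  weight_g = c(2.9, 4.0, 3.1)
)

# One cell: row 2, column named "length".
select_demo[2, "length"]  # 3.7

# Rows that pass a logical test, all columns.
males <- select_demo[select_demo$sex == "male", ]
nrow(males)               # 2

# Several rows and several named columns at once.
select_demo[c(1, 3), c("sex", "weight_g")]
```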
We can use the $ to access a column (variable) in a data frame. Here we calculate the mean length of the hummingbirds.
mean(hbirds$length)
## [1] 3.433333
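Other base R summary functions work the same way on a column. As a sketch, `lengths_demo` below is a hypothetical stand-in for `hbirds$length`; the last two lines show how a missing value (NA) changes the result unless `na.rm = TRUE` is set.

```r
# Stand-in for the length column.
lengths_demo <- c(3.2, 3.7, 3.4)

median(lengths_demo)  # 3.4
max(lengths_demo)     # 3.7
range(lengths_demo)   # 3.2 3.7

# Minimum, quartiles, mean, and maximum in one call.
summary(lengths_demo)

# A single NA makes the mean NA unless it is removed first.
mean(c(lengths_demo, NA))                # NA
mean(c(lengths_demo, NA), na.rm = TRUE)  # 3.433333
```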
We should save our hbirds data frame so we can use it again later! There are many ways to save data in R; here we write our data frame to a csv file. We use row.names = FALSE to keep row numbers from being written to the file.
write.csv(hbirds, "hbirds_data.csv", row.names = FALSE)
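To check that a saved file can be used again later, it can be read back with `read.csv()`. This sketch writes a hypothetical data frame (`save_demo`) to a temporary file rather than the working directory, so nothing is overwritten.

```r
# Hypothetical data frame to save.
save_demo <- data.frame(
  sex = c("male", "female", "male"),
  length = c(3.2, 3.7, 3.4)
)

# Temporary path, used here only to avoid touching real files.
csv_path <- tempfile(fileext = ".csv")

write.csv(save_demo, csv_path, row.names = FALSE)

# Read the file back and confirm the structure survived.
save_back <- read.csv(csv_path)
names(save_back)  # "sex" "length"
nrow(save_back)   # 3
```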
Please review the learning goals and be sure to use the code here as
a reference when completing the homework.