Lab 3.1

Learning Goals

At the end of this exercise, you will be able to:
1. Define data structure.
2. Build a new vector and call elements within it.
3. Combine a series of vectors into a data frame.
4. Name columns and rows in a data frame.
5. Select columns and rows and use summary functions.
6. Write your data frame to a csv file!

Working directories

getwd() #by default you are in the directory that you first opened

## [1] "/Users/switters/Desktop/datascibiol/lab3"

#setwd()
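
The sketch below shows one way getwd() and setwd() can work together when you want to change directories and then return. It assumes the temporary folder from tempdir() as a stand-in for a real project folder; that folder and the variable names are illustrative and not part of the lab.

```r
# Remember where we started so we can come back later
original_dir <- getwd()

# tempdir() stands in for a real project folder (an assumption for this sketch)
setwd(tempdir())
new_dir <- getwd()

# Return to the original working directory
setwd(original_dir)
restored_dir <- getwd()
```

Saving the starting directory first makes it easy to undo the change, which helps keep file paths in the rest of a script predictable.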

Load the tidyverse

A package is a collection of R functions and data sets, and the library() command loads an installed package into the current session. The tidyverse is a collection of R packages designed for data science. For this course, we will be using many of the packages in the tidyverse. We load the tidyverse with the command below.

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#install.packages("package_name")
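
A package only needs to be installed once, but it must be loaded in every new session. As a rough sketch, one way to check whether the tidyverse is already installed before deciding to install it is shown below; the variable name has_tidyverse is made up for this example.

```r
# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (!has_tidyverse) {
  # install.packages("tidyverse")  # uncomment to install; needs an internet connection
  message("tidyverse is not installed yet")
}
```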

Data Structures

In addition to classes of data, R also organizes data in different ways. These are called data structures and include vectors, lists, matrices, data frames, and factors. Here, we will introduce vectors and data frames.

Vectors

Vectors are a common way of organizing data in R. We create vectors using the c() function; the c stands for combine (or concatenate). We used this function in lab 2.

A numeric vector.

my_vector <- c(10, 20, 30) #this makes a new vector

A character vector.

days_of_the_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
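
To connect vectors back to classes of data, the short example below checks the class and length of the two vectors above. The vector named mixed is an extra illustration, not part of the lab; it shows that a vector holds only one class, so mixing numbers and text turns everything into text.

```r
my_vector <- c(10, 20, 30)
days_of_the_week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday")

class(my_vector)          # "numeric"
class(days_of_the_week)   # "character"
length(days_of_the_week)  # 7

# A vector holds a single class, so the numbers are converted to text
mixed <- c(1, "two", 3)
class(mixed)              # "character"
```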

A convenient trick for creating a vector to play around with is to generate a sequence of numbers.

my_vector_sequence <- c(1:100)
my_vector_sequence

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100
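
The colon operator is one of several ways to generate a sequence. As a small sketch, the base R functions seq() and seq_len() give more control over the step size and length; the variable names here are only for illustration.

```r
# 1:100 already returns a vector, so wrapping it in c() is optional
one_to_hundred <- 1:100

# Count by twos from 2 to 20
evens <- seq(2, 20, by = 2)

# The first five positive integers
first_five <- seq_len(5)

# Five evenly spaced values between 0 and 1
quarters <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```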

Identifying vector elements

We can use [] to pull out elements in a vector. We just need to specify their position in the vector; e.g., the element in position 4 of days_of_the_week is Thursday.

days_of_the_week[4]

## [1] "Thursday"

my_vector_sequence[10]

## [1] 10
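
The brackets also accept more than one position. The example below is a brief sketch of three common patterns: a vector of positions, a range of positions, and a negative position, which drops an element instead of selecting it. The variable names are illustrative.

```r
days_of_the_week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday")

# Several positions at once
weekend <- days_of_the_week[c(6, 7)]   # "Saturday" "Sunday"

# A range of positions
midweek <- days_of_the_week[2:4]       # "Tuesday" "Wednesday" "Thursday"

# A negative position removes that element
no_monday <- days_of_the_week[-1]
```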

Practice

Use [] to determine which element in my_vector_sequence has a value of 15.

my_vector_sequence[15]

## [1] 15
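
In my_vector_sequence the position and the value happen to be the same number, which can hide the difference between the two. As a small sketch, the base R function which() reports the position of the elements that match a condition. The vector named shifted is a made-up example in which positions and values differ.

```r
my_vector_sequence <- c(1:100)

# Position of the element whose value is 15
which(my_vector_sequence == 15)   # 15

# In a vector that starts at 101, the value 115 sits in position 15
shifted <- 101:200
position_of_115 <- which(shifted == 115)
shifted[position_of_115]          # 115
```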

We can use comparison operators such as <, >, ==, <=, and >=. Test which values in my_vector_sequence are less than or equal to 10.

my_vector_sequence<=10

##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE

If you place the comparison inside [] then you get only the values that satisfy it, not a TRUE or FALSE for every element of the vector. Experiment with this by adjusting the chunk below.

my_vector_sequence[my_vector_sequence <= 10]

##  [1]  1  2  3  4  5  6  7  8  9 10
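
Conditions can also be combined. The sketch below uses & (and) and | (or) inside the brackets, and uses sum() to count how many elements satisfy a condition, since each TRUE counts as 1. The variable names are only for illustration.

```r
my_vector_sequence <- c(1:100)

# Values greater than 10 AND less than or equal to 15
between <- my_vector_sequence[my_vector_sequence > 10 & my_vector_sequence <= 15]

# Values less than 3 OR greater than 98
extremes <- my_vector_sequence[my_vector_sequence < 3 | my_vector_sequence > 98]

# How many values are less than or equal to 10?
n_small <- sum(my_vector_sequence <= 10)
```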

Data Frames

The data frame is the most common way to organize data within R. A data frame can store columns of different classes, much like a spreadsheet in Excel: each column is a variable and each row is an observation. We usually don’t build data frames in RStudio from scratch, but this example will show you how they are structured.

Let’s build separate vectors that include length (in), weight (g), and sex of three ruby-throated hummingbirds.

Sex <- c("male", "female", "male")
Length <- c(3.2, 3.7, 3.4)
Weight <- c(2.9, 4.0, 3.1)

Since we work in the tidyverse, we use tibble() to create a data frame.

hbirds <- tibble(Sex, Length, Weight)

Notice that the printed data frame is neat and clean looking, and it also reports the class of each column. chr means the column holds character data, and dbl means the column is numeric, stored as double-precision floating point.

hbirds

## # A tibble: 3 × 3
##   Sex    Length Weight
##   <chr>   <dbl>  <dbl>
## 1 male      3.2    2.9
## 2 female    3.7    4  
## 3 male      3.4    3.1

What are the column names of our data frame? Notice that R defaulted to using the names of our vectors, but we could name them something else when creating the data frame, or rename them later.

names(hbirds)

## [1] "Sex"    "Length" "Weight"
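
As one possible way to rename columns after the data frame exists, the sketch below uses rename() from dplyr, which takes new_name = old_name pairs. The new names body_length and body_mass are hypothetical and chosen only for illustration.

```r
library(tibble)
library(dplyr)

hbirds <- tibble(Sex = c("male", "female", "male"),
                 Length = c(3.2, 3.7, 3.4),
                 Weight = c(2.9, 4.0, 3.1))

# rename() uses the pattern new_name = old_name
hbirds_renamed <- rename(hbirds, sex = Sex, body_length = Length, body_mass = Weight)
names(hbirds_renamed)
```

Because the result is stored in a new object, the original hbirds data frame is left unchanged.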

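What are the dimensions of the hbirds data frame? The dim() command returns the number of rows followed by the number of columns. The str() command shows the same dimensions along with the class and first values of each column.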

dim(hbirds)

## [1] 3 3

str(hbirds)

## tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Sex   : chr [1:3] "male" "female" "male"
##  $ Length: num [1:3] 3.2 3.7 3.4
##  $ Weight: num [1:3] 2.9 4 3.1
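
A few related functions report the same information in smaller pieces. This is a brief sketch: nrow() and ncol() are base R, and glimpse() is provided by the tibble package as a compact alternative to str(). The names n_rows and n_cols are illustrative.

```r
library(tibble)

hbirds <- tibble(Sex = c("male", "female", "male"),
                 Length = c(3.2, 3.7, 3.4),
                 Weight = c(2.9, 4.0, 3.1))

n_rows <- nrow(hbirds)   # 3
n_cols <- ncol(hbirds)   # 3

# One line per column: name, class, and the first few values
glimpse(hbirds)
```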

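Lowercase column names are easier to type and less error-prone, so use lowercase always! Below we check the current names and then rebuild the data frame, giving each column a lowercase name with the pattern name=vector. We only changed the case here, but we could use any names we wish.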

names(hbirds)

## [1] "Sex"    "Length" "Weight"

hbirds <- tibble(sex=Sex, length=Length, weight=Weight)
hbirds

## # A tibble: 3 × 3
##   sex    length weight
##   <chr>   <dbl>  <dbl>
## 1 male      3.2    2.9
## 2 female    3.7    4  
## 3 male      3.4    3.1

Accessing Data Frame Columns and Rows

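The same method of selecting elements in vectors also applies to matrices and data frames: we use []. For a data frame there are two positions inside the brackets, separated by a comma; the first applies to the rows and the second to the columns. Leaving a position blank selects all rows or all columns.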

hbirds

## # A tibble: 3 × 3
##   sex    length weight
##   <chr>   <dbl>  <dbl>
## 1 male      3.2    2.9
## 2 female    3.7    4  
## 3 male      3.4    3.1

The first row.

hbirds[1,]

## # A tibble: 1 × 3
##   sex   length weight
##   <chr>  <dbl>  <dbl>
## 1 male     3.2    2.9

The value in the first row and third column.

hbirds[1,3]

## # A tibble: 1 × 1
##   weight
##    <dbl>
## 1    2.9

The third column.

hbirds[,3]

## # A tibble: 3 × 1
##   weight
##    <dbl>
## 1    2.9
## 2    4  
## 3    3.1
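
Notice that single brackets on a tibble return another tibble, even for one column. When a plain vector is needed, double brackets or $ can be used instead. The sketch below compares the two; the variable names are illustrative.

```r
library(tibble)

hbirds <- tibble(sex = c("male", "female", "male"),
                 length = c(3.2, 3.7, 3.4),
                 weight = c(2.9, 4.0, 3.1))

# Single brackets keep the tibble structure
weight_tbl <- hbirds[, 3]

# Double brackets pull the column out as a plain numeric vector
weight_vec <- hbirds[[3]]

# Columns can also be selected by name instead of position
weight_by_name <- hbirds[["weight"]]

# Once we have a vector, [] selects elements by position as before
second_weight <- hbirds$weight[2]   # 4
```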

Calculations

We can use the $ to access a column (variable) in a data frame. Here we calculate the mean length of the hummingbirds.

mean(hbirds$length)

## [1] 3.433333
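
Other summary functions work the same way as mean(). The example below is a short sketch of a few of them in base R, plus a mean restricted to the male birds by combining $ with a logical condition. The name male_mean_length is made up for this example.

```r
library(tibble)

hbirds <- tibble(sex = c("male", "female", "male"),
                 length = c(3.2, 3.7, 3.4),
                 weight = c(2.9, 4.0, 3.1))

median(hbirds$length)   # 3.4
max(hbirds$weight)      # 4
range(hbirds$length)    # 3.2 3.7

# Minimum, quartiles, mean, and maximum in one call
summary(hbirds$weight)

# Mean length of the male birds only
male_mean_length <- mean(hbirds$length[hbirds$sex == "male"])
```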

Writing Data to File

We should save our hbirds data frame so we can use it again later! There are many ways to save data in R; here we write our data frame to a csv file. We use row.names = FALSE so that row numbers are not written to the file as an extra column.

write.csv(hbirds, "hbirds_data.csv", row.names=FALSE)
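
To confirm that the file was written correctly, it can be read back in. The sketch below writes to a temporary folder (an assumption made so the example does not overwrite any of your files) and reads the file with the base R function read.csv(). The names csv_path and hbirds_check are illustrative.

```r
library(tibble)

hbirds <- tibble(sex = c("male", "female", "male"),
                 length = c(3.2, 3.7, 3.4),
                 weight = c(2.9, 4.0, 3.1))

# Write to a temporary folder so no existing file is overwritten
csv_path <- file.path(tempdir(), "hbirds_data.csv")
write.csv(hbirds, csv_path, row.names = FALSE)

# Read the file back in and check that the structure survived
hbirds_check <- read.csv(csv_path)
names(hbirds_check)   # "sex" "length" "weight"
nrow(hbirds_check)    # 3
```

In the tidyverse, read_csv() from readr does the same job and returns a tibble rather than a base data frame.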