Learning Goals

At the end of this exercise, you will be able to:
1. Import .csv files as data frames using read_csv().
2. Understand the importance of paths and working directories to import data.
2. Use summary functions to explore the dimensions, structure, and contents of a data frame.

Load tidyverse

library("tidyverse")

Data Frames

During lab 2, you learned how to work with vectors and data frames. For the remainder of the course, we will work exclusively with data frames. Recall that data frames store multiple classes of data. Last time, you were shown how to build data frames using the data.frame() and tibble commands.

Review

Below are data collected by three scientists (Jill, Steve, Susan in order) measuring temperatures of three hot springs near Mono Lake.

temp <- c(36.25, 35.40, 35.30, 35.15, 35.35, 33.35, 30.70, 29.65, 29.20)
name <- c("Jill", "Susan", "Steve", "Jill", "Susan", "Steve", "Jill", "Susan", "Steve")
spring <- c("Buckeye", "Buckeye", "Buckeye", "Benton", "Benton", "Benton", "Travertine", "Travertine", "Travertine")
  1. Build a data frame called hsprings with the above data. Name the temperature column temp_C. Print out the data frame.
hsprings <- data.frame(temp_C = temp, name = name, spring = spring)
  1. Change the column titled name to scientist, leave the other column names the same. Print out the data frame names.
names(hsprings)[2] <- "scientist"
  1. Our scientists forgot to record the depth data for each spring. Here it is as a vector, add it as a new column called depth_ft: c(4.15, 4.13, 4.12, 3.21, 3.23, 3.20, 5.67, 5.65, 5.66). Print out the data frame.
depth <- c(4.15, 4.13, 4.12, 3.21, 3.23, 3.20, 5.67, 5.65, 5.66)
  1. Calculate the mean temperature of all of the temperature measurements.
mean(hsprings$temp_C)
## [1] 33.37222
  1. Save your hot springs data as a .csv file! Do not allow row names.
write_csv(hsprings, "hsprings_data.csv")
  1. Go to the environment panel and push the broom to clear your environment.

Importing Data

Scientists often make their data available as supplementary material associated with a publication. This is excellent scientific practice as it insures repeatability by showing exactly how analyses were performed. As data scientists, we capitalize on this by importing data directly into R.

R allows us to import a wide variety of data types. The most common type of file is a .csv file which stands for comma separated values. Spreadsheets are often developed in Excel then saved as .csv files for use in R. There are packages that allow you to open excel files and many other formats but .csv is the most common.

An opinionated word about excel. It is fine to use excel for data entry and basic analysis. But, it often adds proprietary formatting that makes excel files difficult to work with in any program besides excel. R can read excel files, but I know of no R programmers that routinely use them. Instead they save copies of their excel files as .csv which strips away formatting but makes them easier to use in a variety of programs. We won’t work with excel files in BIS 15L, but we will learn to import them.

The same goes for Google Sheets. Google Sheets is a great tool for collaboration and data entry. But, while Google states that you are the owner of the data you enter they store the files. Also, it is a little unclear what Google can do with your data so for sensitive projects it may be best to use a different tool.

To import any file, first make sure that you are in the correct working directory. If you are not in the correct directory, R will not “see” the file.

getwd()
## [1] "/Users/switters/Desktop/datascibiol/lab3"

Load the data

Here we open a .csv file. Since we are using the tidyverse, we open the file using read_csv(). readr is included in the tidyverse set of packages.

In the previous part of the lab, you exported a .csv of hot springs data. Let’s try to reload that .csv.

hot_springs <- read_csv("hsprings_data.csv")
## Rows: 9 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): scientist, spring
## dbl (1): temp_C
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use the str() function to get an idea of the data structure of hot_springs.

str(hot_springs)
## spc_tbl_ [9 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ temp_C   : num [1:9] 36.2 35.4 35.3 35.1 35.4 ...
##  $ scientist: chr [1:9] "Jill" "Susan" "Steve" "Jill" ...
##  $ spring   : chr [1:9] "Buckeye" "Buckeye" "Buckeye" "Benton" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   temp_C = col_double(),
##   ..   scientist = col_character(),
##   ..   spring = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

What is the class of the scientist column? Change it to factor and then show the levels of that factor.

class(hot_springs$scientist)
## [1] "character"
hot_springs$scientist <- as.factor(hot_springs$scientist)

Change the class of the springs column to factor.

hot_springs$spring <- as.factor(hot_springs$spring)

Did our changes work?

str(hot_springs)
## spc_tbl_ [9 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ temp_C   : num [1:9] 36.2 35.4 35.3 35.1 35.4 ...
##  $ scientist: Factor w/ 3 levels "Jill","Steve",..: 1 3 2 1 3 2 1 3 2
##  $ spring   : Factor w/ 3 levels "Benton","Buckeye",..: 2 2 2 1 1 1 3 3 3
##  - attr(*, "spec")=
##   .. cols(
##   ..   temp_C = col_double(),
##   ..   scientist = col_character(),
##   ..   spring = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We can also check the levels of each column.

levels(hot_springs$scientist)
## [1] "Jill"  "Steve" "Susan"
levels(hot_springs$spring)
## [1] "Benton"     "Buckeye"    "Travertine"

Practice

  1. As part of homework 2, you saved data on the five tallest mountains in the world. Reload that data below and use the str() function to explore it’s structure.

That’s it! Let’s take a break and then move on to part 2!

–>Home