At the end of this exercise, you will be able to:
1. Import .csv files as data frames using read_csv().
2. Understand the importance of paths and working directories to import
data.
3. Use summary functions to explore the dimensions, structure, and
contents of a data frame.
library("tidyverse")
During lab 2, you learned how to work with vectors and data frames.
For the remainder of the course, we will work exclusively with data
frames. Recall that data frames store multiple classes of data. Last
time, you were shown how to build data frames using the data.frame()
and tibble() commands.
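As a quick refresher, the sketch below builds the same small table with both commands. The object names and values are made up for illustration, and tibble() assumes the tidyverse is installed.

```r
library(tibble)  # part of the tidyverse

# Hypothetical example data: three organisms and their body lengths
organism <- c("newt", "frog", "toad")
length_cm <- c(8.5, 6.2, 7.1)

# Base R data frame
df_base <- data.frame(organism = organism, length_cm = length_cm)

# Tidyverse tibble: same data, but it prints more compactly
df_tbl <- tibble(organism = organism, length_cm = length_cm)

# A tibble is still a data frame, with extra classes layered on top
class(df_base)
class(df_tbl)
```

Either object works with the functions used in the rest of this lab.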
Below are data collected by three scientists (Jill, Susan, and Steve, in that order) measuring the temperatures of three hot springs near Mono Lake.
temp <- c(36.25, 35.40, 35.30, 35.15, 35.35, 33.35, 30.70, 29.65, 29.20)
name <- c("Jill", "Susan", "Steve", "Jill", "Susan", "Steve", "Jill", "Susan", "Steve")
spring <- c("Buckeye", "Buckeye", "Buckeye", "Benton", "Benton", "Benton", "Travertine", "Travertine", "Travertine")
Build a data frame called hsprings with the above data. Name the
temperature column temp_C. Print out the data frame.
hsprings <- data.frame(temp_C = temp, name = name, spring = spring)
Change the column named name to scientist, leaving the other column
names the same. Print out the data frame names.
names(hsprings)[2] <- "scientist"
Add the following depth data: c(4.15, 4.13, 4.12, 3.21, 3.23, 3.20, 5.67, 5.65, 5.66).
Print out the data frame.
depth <- c(4.15, 4.13, 4.12, 3.21, 3.23, 3.20, 5.67, 5.65, 5.66)
mean(hsprings$temp_C)
## [1] 33.37222
Save the hot springs data as a .csv file! Do not allow row names.
write_csv(hsprings, "hsprings_data.csv")
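write_csv() from readr never writes row names, so nothing extra is needed there. If you use base R instead, write.csv() adds a column of row names unless you turn it off. The sketch below shows the base R version; the data frame and temporary file are hypothetical stand-ins, not the lab's actual objects.

```r
# Hypothetical stand-in for the hot springs data frame
hsprings_demo <- data.frame(
  temp_C = c(36.25, 35.40),
  scientist = c("Jill", "Susan"),
  spring = c("Buckeye", "Buckeye")
)

# Write to a temporary file so nothing in the working directory is overwritten
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE stops base R from adding a column of row numbers
write.csv(hsprings_demo, out_file, row.names = FALSE)

# Inspect the raw text: one header line followed by one line per row
readLines(out_file)
```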
Scientists often make their data available as supplementary material associated with a publication. This is excellent scientific practice because it ensures repeatability by showing exactly how analyses were performed. As data scientists, we capitalize on this by importing data directly into R.
R allows us to import a wide variety of data types. The most common type of file is a .csv file, which stands for comma-separated values. Spreadsheets are often developed in Excel and then saved as .csv files for use in R. There are packages that allow you to open Excel files and many other formats, but .csv is the most common.
An opinionated word about Excel. It is fine to use Excel for data entry and basic analysis. But it often adds proprietary formatting that makes Excel files difficult to work with in any program besides Excel. R can read Excel files, but I know of no R programmers who routinely use them. Instead, they save copies of their Excel files as .csv, which strips away formatting but makes them easier to use in a variety of programs. We won’t work with Excel files in BIS 15L, but we will learn to import them.
The same goes for Google Sheets. Google Sheets is a great tool for collaboration and data entry. But while Google states that you are the owner of the data you enter, Google stores the files. It is also a little unclear what Google can do with your data, so for sensitive projects it may be best to use a different tool.
To import any file, first make sure that you are in the correct working directory. If you are not in the correct directory, R will not “see” the file.
getwd()
## [1] "/Users/switters/Desktop/datascibiol/lab3"
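Before importing, it can help to confirm that R can actually see the file. The sketch below uses base R functions only; the file name is the one exported earlier in this lab, and the result depends on your own working directory.

```r
# Show the current working directory
getwd()

# List the .csv files R can see in that directory
list.files(pattern = "\\.csv$")

# Check for one specific file before trying to import it;
# FALSE means R cannot see the file from the current working directory
file.exists("hsprings_data.csv")
```

If file.exists() returns FALSE, either change the working directory or give read_csv() the full path to the file.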
Here we open a .csv file. Since we are using the tidyverse, we open
the file using read_csv(). readr is included in the tidyverse set of
packages.
In the previous part of the lab, you exported a .csv of hot springs
data. Let’s try to reload that .csv.
hot_springs <- read_csv("hsprings_data.csv")
## Rows: 9 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): scientist, spring
## dbl (1): temp_C
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
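The message above suggests specifying the column types. A minimal sketch of what that could look like follows; it writes a small made-up file to a temporary location so the example runs on its own, and the choice of col_factor() for the scientist column is an illustration rather than something the lab requires.

```r
library(readr)

# Write a small demonstration file to a temporary location (hypothetical data)
demo_file <- tempfile(fileext = ".csv")
write_csv(
  data.frame(temp_C = c(36.25, 35.40), scientist = c("Jill", "Susan")),
  demo_file
)

# Declare the column types up front; supplying col_types also quiets the message
demo <- read_csv(
  demo_file,
  col_types = cols(
    temp_C = col_double(),
    scientist = col_factor()
  )
)

# The scientist column is read in as a factor directly
class(demo$scientist)
```

Declaring types at import time is one way to avoid converting columns by hand afterwards.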
Use the str() function to get an idea of the data structure of
hot_springs.
str(hot_springs)
## spc_tbl_ [9 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ temp_C : num [1:9] 36.2 35.4 35.3 35.1 35.4 ...
## $ scientist: chr [1:9] "Jill" "Susan" "Steve" "Jill" ...
## $ spring : chr [1:9] "Buckeye" "Buckeye" "Buckeye" "Benton" ...
## - attr(*, "spec")=
## .. cols(
## .. temp_C = col_double(),
## .. scientist = col_character(),
## .. spring = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
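str() is one of several summary functions for exploring a data frame. The sketch below shows a few others on a small made-up data frame; the object name springs_demo is hypothetical, and glimpse() assumes dplyr (part of the tidyverse) is installed.

```r
library(dplyr)  # provides glimpse()

# Hypothetical data frame standing in for hot_springs
springs_demo <- data.frame(
  temp_C = c(36.25, 35.40, 35.30),
  spring = c("Buckeye", "Buckeye", "Buckeye")
)

dim(springs_demo)      # number of rows, then number of columns
nrow(springs_demo)     # number of rows only
ncol(springs_demo)     # number of columns only
names(springs_demo)    # column names
glimpse(springs_demo)  # compact, one-line-per-column overview
summary(springs_demo)  # per-column summary statistics
```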
What is the class of the scientist column? Change it to factor and then show the levels of that factor.
class(hot_springs$scientist)
## [1] "character"
hot_springs$scientist <- as.factor(hot_springs$scientist)
Change the class of the spring column to factor.
hot_springs$spring <- as.factor(hot_springs$spring)
Did our changes work?
str(hot_springs)
## spc_tbl_ [9 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ temp_C : num [1:9] 36.2 35.4 35.3 35.1 35.4 ...
## $ scientist: Factor w/ 3 levels "Jill","Steve",..: 1 3 2 1 3 2 1 3 2
## $ spring : Factor w/ 3 levels "Benton","Buckeye",..: 2 2 2 1 1 1 3 3 3
## - attr(*, "spec")=
## .. cols(
## .. temp_C = col_double(),
## .. scientist = col_character(),
## .. spring = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We can also check the levels of each factor column.
levels(hot_springs$scientist)
## [1] "Jill" "Steve" "Susan"
levels(hot_springs$spring)
## [1] "Benton" "Buckeye" "Travertine"
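Notice that as.factor() sorts the levels alphabetically, not in the order they appear in the data. If a different order matters, factor() accepts an explicit levels argument. The sketch below uses a short made-up vector rather than the lab's data frame.

```r
# Hypothetical character vector of scientist names
scientist <- c("Jill", "Susan", "Steve", "Jill", "Susan", "Steve")

# Default conversion: levels are sorted alphabetically
sci_default <- as.factor(scientist)
levels(sci_default)

# Explicit level order: levels follow the order we supply
sci_ordered <- factor(scientist, levels = c("Jill", "Susan", "Steve"))
levels(sci_ordered)

# Count how many observations fall in each level
table(sci_ordered)
```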