For Big-Data Scientists, “Janitor Work” Is Key Hurdle to Insights.
“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information.”
New York Times
At the end of this exercise, you will be able to:
1. Explain the difference between tidy and messy data.
2. Evaluate a data set as tidy or messy.
3. Use the pivot_longer() function of tidyr to
transform data from wide to long format.
4. Use separate() to split observations within a
column.
The quote above sums up much of the work in data science. Simply put, most of the data that you end up working with will be messy, disorganized, and not tidy. By the end of this week, you will have the tools necessary to wrangle messy data into tidy data that are ready for analysis. I know that we have spent a lot of time of data transformation, but this last step is essential to building plots and performing other analyses.
library("tidyverse")
library("janitor")
Most “wild” data are organized incorrectly for work in R and, as you might expect, the tools used to transform data are a core part of the tidyverse. I will attempt to summarize the most important points below, but you should read chapter 5 of the data science text for a more thorough explanation.
Tidy data follows three conventions:
(1) each variable has its own column
(2) each observation has its own row
(3) each value has its own cell
This is summarized in the image below. The package used to tidy data
is called tidyr and is a core part of the
tidyverse.
pivot_longer()Scientists use spreadsheets that are organized to make data entry
efficient. This is often referred to as wide format.
Unfortunately, the wide format creates a problem because column names
may actually represent values of a variable. The command
pivot_longer() shifts data from wide to long format.
Rules:
+ pivot_longer(cols, names_to, values_to) +
cols - Columns to pivot to longer format +
names_to - Name of the new column; it will contain the
column names of gathered columns as values + values_to -
Name of the new column; it will contain the data stored in the values of
gathered columns
The following data show results from a drug trial with four treatments on six patients. The values represent resting heart rate.
heartrate <- read_csv("data/heartrate.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate
## # A tibble: 6 × 5
## patient a b c d
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Margaret 72 74 80 68
## 2 Frank 84 84 88 76
## 3 Hawkeye 64 66 68 64
## 4 Trapper 60 58 64 58
## 5 Radar 74 72 78 70
## 6 Henry 88 87 88 72
Let’s assess whether or not these data are tidy.
(1) each variable has its own column?
each observation has its own row?
each value has its own cell?
Let’s think about this another way- can we make a plot that shows the effect of each drug on Margaret’s heart rate? No, the data are not tidy.
To fix this problem, we need to reshape the table to long format
while keeping track of column names and values. We do this using
pivot_longer(). Notice that the dimensions of the data
frame change.
Import the file relig_income.csv and store it as a
new object relig_income.
Why are these data untidy?
Use pivot_longer() to make the data tidy.
Some (but not all) of the column names are data. We also have NA’s.
billboard <- read_csv("data/billboard.csv")
## Rows: 317 Columns: 79
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): artist, track
## dbl (65): wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk8, wk9, wk10, wk11, wk12, wk...
## lgl (11): wk66, wk67, wk68, wk69, wk70, wk71, wk72, wk73, wk74, wk75, wk76
## date (1): date.entered
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
Solution 1: specify a range of columns that you want to pivot.
Solution 2: OR, specify columns that you want to stay fixed.
Solution 3: identify columns by a prefix, remove the prefix and all NA’s.
Import plant_data.csv as a new object
plant_data.
Why are these data not tidy?
Use pivot_longer() to make the data tidy. Focus the
data only on genotype, water_sched_prog, and greenhouse.
In this case, there are two observations in each column name.
qpcr_untidy <- read_csv("data/qpcr_untidy.csv")
## Rows: 5 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gene
## dbl (9): exp1_rep1, exp1_rep2, exp1_rep3, exp2_rep1, exp2_rep2, exp2_rep3, e...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qpcr_untidy
## # A tibble: 5 × 10
## gene exp1_rep1 exp1_rep2 exp1_rep3 exp2_rep1 exp2_rep2 exp2_rep3 exp3_rep1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 21.7 19.8 20.7 18.3 20.4 17.6 20.6
## 2 B 24.3 24.8 25.2 26.0 29.9 26.4 25.4
## 3 C 20.7 21.5 21.3 25.5 18.7 22.3 21.9
## 4 D 26.9 28.0 27.7 33.1 24.3 28.9 28.5
## 5 E 25.0 22.7 23.8 21.1 23.4 20.2 23.7
## # ℹ 2 more variables: exp3_rep2 <dbl>, exp3_rep3 <dbl>
names_sep helps pull these apart, but we still have
“exp” and “rep” to deal with.
–>Home