pivot_long()
For Big-Data Scientists, “Janitor Work” Is Key Hurdle to Insights.
“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information.”
New York Times
At the end of this exercise, you will be able to:
1. Explain the difference between tidy and messy data.
2. Evaluate a data set as tidy or messy.
3. Use the pivot_longer()
function of tidyr
to
transform data from wide to long format.
4. Use separate()
to split observations within a
column.
The quote above sums up much of the work in data science. Simply put, most of the data that you end up working with will be messy, disorganized, and not tidy. By the end of this week, you will have the tools necessary to wrangle messy data into tidy data that are ready for analysis. I know that we have spent a lot of time of data transformation, but this last step is essential to building plots and performing other analyses.
library("tidyverse")
library("naniar")
library("janitor")
Most “wild” data are organized incorrectly for work in R and, as you might expect, the tools used to transform data are a core part of the tidyverse. I will attempt to summarize the most important points below, but you should read chapter 12 of the data science text for a more thorough explanation.
Tidy
data follows three conventions:
(1) each variable has its own column
(2) each observation has its own row
(3) each value has its own cell
This is summarized in the image below. The package used to tidy data
is called tidyr and is a core part of the
tidyverse.
pivot_longer()
Scientists frequently use spreadsheets that are organized to make
data entry efficient. This is often referred to as wide
format. Unfortunately, the wide format creates a problem
because column names may actually represent values of a variable. The
command pivot_longer()
shifts data from wide to long
format.
Rules:
+ pivot_longer
(cols, names_to, values_to) +
cols
- Columns to pivot to longer format +
names_to
- Name of the new column; it will contain the
column names of gathered columns as values + values_to
-
Name of the new column; it will contain the data stored in the values of
gathered columns
The following data show results from a drug trial with four treatments on six patients. The values represent resting heart rate.
heartrate <- read_csv("data/heartrate.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate
## # A tibble: 6 × 5
## patient a b c d
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Margaret 72 74 80 68
## 2 Frank 84 84 88 76
## 3 Hawkeye 64 66 68 64
## 4 Trapper 60 58 64 58
## 5 Radar 74 72 78 70
## 6 Henry 88 87 88 72
Let’s assess whether or not these data are tidy.
(1) each variable has its own column?
each observation has its own row?
each value has its own cell?
To fix this problem, we need to reshape the table to long format
while keeping track of column names and values. We do this using
pivot_longer()
. Notice that the dimensions of the data
frame change.
heartrate %>%
pivot_longer(-patient, #patient will not move
names_to = "drug", #make a new column called "drug"
values_to="heartrate" #values moved to a new column called "heartrate"
)
## # A tibble: 24 × 3
## patient drug heartrate
## <chr> <chr> <dbl>
## 1 Margaret a 72
## 2 Margaret b 74
## 3 Margaret c 80
## 4 Margaret d 68
## 5 Frank a 84
## 6 Frank b 84
## 7 Frank c 88
## 8 Frank d 76
## 9 Hawkeye a 64
## 10 Hawkeye b 66
## # ℹ 14 more rows
relig_income.csv
and store it as a new
object relig_income
.relig_income <- read_csv("data/relig_income.csv")
## Rows: 18 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): religion
## dbl (10): <$10k, $10-20k, $20-30k, $30-40k, $40-50k, $50-75k, $75-100k, $100...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
relig_income
## # A tibble: 18 × 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori… 228 244 236 238 197 223 131
## 9 Jehovah… 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin… 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C… 9 7 11 13 13 14 18
## 16 Other F… 20 33 40 46 49 63 46
## 17 Other W… 5 2 3 4 2 7 3
## 18 Unaffil… 217 299 374 365 341 528 407
## # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
## # `Don't know/refused` <dbl>
Why are these data untidy?
Variable names are actually data
Use pivot_longer()
to make the data tidy.
relig_income %>%
pivot_longer(-religion,
names_to = "income", #make a new column called "income"
values_to = "total" #values moved to a new column called "total"
)
## # A tibble: 180 × 3
## religion income total
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # ℹ 170 more rows
Some (but not all) of the column names are data. We also have NA’s.
billboard <- read_csv("data/billboard.csv")
## Rows: 317 Columns: 79
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): artist, track
## dbl (65): wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk8, wk9, wk10, wk11, wk12, wk...
## lgl (11): wk66, wk67, wk68, wk69, wk70, wk71, wk72, wk73, wk74, wk75, wk76
## date (1): date.entered
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
Solution 1: specify a range of columns that you want to pivot.
billboard2 <-
billboard %>%
pivot_longer(wk1:wk76, # a range of columns
names_to = "week",
values_to = "rank",
values_drop_na = TRUE #this will drop the NA's
)
billboard2
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Solution 2: OR, specify columns that you want to stay fixed.
billboard3 <-
billboard %>%
pivot_longer(-c(artist, track, date.entered), #specific columns to stay fixed
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
billboard3
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Solution 3: identify columns by a prefix, remove the prefix and all NA’s.
billboard %>%
pivot_longer(
cols = starts_with("wk"), #columns that start with "wk"
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE)
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # ℹ 5,297 more rows
plant_data.csv
as a new object
plant_data
.plant_data <- read_csv("data/plant_data.csv")
## Rows: 3 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): genotype, water_sched_prog, greenhouse
## dbl (30): day1, day2, day3, day4, day5, day6, day7, day8, day9, day10, day11...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
plant_data
## # A tibble: 3 × 33
## genotype water_sched_prog greenhouse day1 day2 day3 day4 day5 day6 day7
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 control A A761 21.7 19.9 20.7 19.4 20.2 19.2 20.6
## 2 mutant1 A A761 24.3 24.9 25.2 25.3 27.4 25.5 25.4
## 3 mutant2 A A761 20.7 21.3 21.3 22.5 19.8 21.4 21.9
## # ℹ 23 more variables: day8 <dbl>, day9 <dbl>, day10 <dbl>, day11 <dbl>,
## # day12 <dbl>, day13 <dbl>, day14 <dbl>, day15 <dbl>, day16 <dbl>,
## # day17 <dbl>, day18 <dbl>, day19 <dbl>, day20 <dbl>, day21 <dbl>,
## # day22 <dbl>, day23 <dbl>, day24 <dbl>, day25 <dbl>, day26 <dbl>,
## # day27 <dbl>, day28 <dbl>, day29 <dbl>, day30 <dbl>
names(plant_data)
## [1] "genotype" "water_sched_prog" "greenhouse" "day1"
## [5] "day2" "day3" "day4" "day5"
## [9] "day6" "day7" "day8" "day9"
## [13] "day10" "day11" "day12" "day13"
## [17] "day14" "day15" "day16" "day17"
## [21] "day18" "day19" "day20" "day21"
## [25] "day22" "day23" "day24" "day25"
## [29] "day26" "day27" "day28" "day29"
## [33] "day30"
pivot_longer()
to make the data tidy. Focus the
data only on genotype, water_sched_prog, and greenhouse.plant_data %>%
pivot_longer(-c(genotype, water_sched_prog, greenhouse),
names_to = "v1",
values_to = "v2",
values_drop_na = T
)
## # A tibble: 90 × 5
## genotype water_sched_prog greenhouse v1 v2
## <chr> <chr> <chr> <chr> <dbl>
## 1 control A A761 day1 21.7
## 2 control A A761 day2 19.9
## 3 control A A761 day3 20.7
## 4 control A A761 day4 19.4
## 5 control A A761 day5 20.2
## 6 control A A761 day6 19.2
## 7 control A A761 day7 20.6
## 8 control A A761 day8 19.9
## 9 control A A761 day9 19.2
## 10 control A A761 day10 20.4
## # ℹ 80 more rows
In this case, there are two observations in each column name.
qpcr_untidy <- read_csv("data/qpcr_untidy.csv")
## Rows: 5 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gene
## dbl (9): exp1_rep1, exp1_rep2, exp1_rep3, exp2_rep1, exp2_rep2, exp2_rep3, e...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qpcr_untidy
## # A tibble: 5 × 10
## gene exp1_rep1 exp1_rep2 exp1_rep3 exp2_rep1 exp2_rep2 exp2_rep3 exp3_rep1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 21.7 19.8 20.7 18.3 20.4 17.6 20.6
## 2 B 24.3 24.8 25.2 26.0 29.9 26.4 25.4
## 3 C 20.7 21.5 21.3 25.5 18.7 22.3 21.9
## 4 D 26.9 28.0 27.7 33.1 24.3 28.9 28.5
## 5 E 25.0 22.7 23.8 21.1 23.4 20.2 23.7
## # ℹ 2 more variables: exp3_rep2 <dbl>, exp3_rep3 <dbl>
names_sep
helps pull these apart, but we still have
“exp” and “rep” to deal with.
qpcr <- qpcr_untidy %>%
pivot_longer(
exp1_rep1:exp3_rep3,
names_to= c("experiment", "replicate"),
names_sep="_",
values_to="mRNA_expression"
)
qpcr %>%
separate(experiment, into = c("experiment", "number"), sep = 3)
## # A tibble: 45 × 5
## gene experiment number replicate mRNA_expression
## <chr> <chr> <chr> <chr> <dbl>
## 1 A exp 1 rep1 21.7
## 2 A exp 1 rep2 19.8
## 3 A exp 1 rep3 20.7
## 4 A exp 2 rep1 18.3
## 5 A exp 2 rep2 20.4
## 6 A exp 2 rep3 17.6
## 7 A exp 3 rep1 20.6
## 8 A exp 3 rep2 19.9
## 9 A exp 3 rep3 19.2
## 10 B exp 1 rep1 24.3
## # ℹ 35 more rows
Please review the learning goals and be sure to use the code here as a reference when completing the homework. –>Home