For Big-Data Scientists, “Janitor Work” Is Key Hurdle to Insights.
“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information.”
New York Times

Learning Goals

At the end of this exercise, you will be able to:
1. Explain the difference between tidy and messy data.
2. Evaluate a data set as tidy or messy.
3. Use the pivot_longer() function of tidyr to transform data from wide to long format.
4. Use separate() to split observations within a column.

Overview

The quote above sums up much of the work in data science. Simply put, most of the data that you end up working with will be messy, disorganized, and not tidy. By the end of this week, you will have the tools necessary to wrangle messy data into tidy data that are ready for analysis. I know that we have spent a lot of time of data transformation, but this last step is essential to building plots and performing other analyses.

Load the tidyverse

library("tidyverse")
library("janitor")

Tidy data

Most “wild” data are organized incorrectly for work in R and, as you might expect, the tools used to transform data are a core part of the tidyverse. I will attempt to summarize the most important points below, but you should read chapter 5 of the data science text for a more thorough explanation.

Tidy data follows three conventions:
(1) each variable has its own column
(2) each observation has its own row
(3) each value has its own cell

This is summarized in the image below. The package used to tidy data is called tidyr and is a core part of the tidyverse.
Tidy Data.

pivot_longer()

Scientists use spreadsheets that are organized to make data entry efficient. This is often referred to as wide format. Unfortunately, the wide format creates a problem because column names may actually represent values of a variable. The command pivot_longer() shifts data from wide to long format.

Rules:
+ pivot_longer(cols, names_to, values_to) + cols - Columns to pivot to longer format + names_to - Name of the new column; it will contain the column names of gathered columns as values + values_to - Name of the new column; it will contain the data stored in the values of gathered columns

Example 1: column names are data

The following data show results from a drug trial with four treatments on six patients. The values represent resting heart rate.

heartrate <- read_csv("data/heartrate.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate
## # A tibble: 6 × 5
##   patient      a     b     c     d
##   <chr>    <dbl> <dbl> <dbl> <dbl>
## 1 Margaret    72    74    80    68
## 2 Frank       84    84    88    76
## 3 Hawkeye     64    66    68    64
## 4 Trapper     60    58    64    58
## 5 Radar       74    72    78    70
## 6 Henry       88    87    88    72

Let’s assess whether or not these data are tidy.
(1) each variable has its own column?

  1. each observation has its own row?

  2. each value has its own cell?

Let’s think about this another way- can we make a plot that shows the effect of each drug on Margaret’s heart rate? No, the data are not tidy.

To fix this problem, we need to reshape the table to long format while keeping track of column names and values. We do this using pivot_longer(). Notice that the dimensions of the data frame change.

Practice

  1. Import the file relig_income.csv and store it as a new object relig_income.

  2. Why are these data untidy?

  3. Use pivot_longer() to make the data tidy.

Example 2: some column names are data

Some (but not all) of the column names are data. We also have NA’s.

billboard <- read_csv("data/billboard.csv")
## Rows: 317 Columns: 79
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): artist, track
## dbl  (65): wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk8, wk9, wk10, wk11, wk12, wk...
## lgl  (11): wk66, wk67, wk68, wk69, wk70, wk71, wk72, wk73, wk74, wk75, wk76
## date  (1): date.entered
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
billboard
## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

Solution 1: specify a range of columns that you want to pivot.

Solution 2: OR, specify columns that you want to stay fixed.

Solution 3: identify columns by a prefix, remove the prefix and all NA’s.

Practice

  1. Import plant_data.csv as a new object plant_data.

  2. Why are these data not tidy?

  3. Use pivot_longer() to make the data tidy. Focus the data only on genotype, water_sched_prog, and greenhouse.

Example 3: more than one variable in a column name

In this case, there are two observations in each column name.

qpcr_untidy <- read_csv("data/qpcr_untidy.csv")
## Rows: 5 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gene
## dbl (9): exp1_rep1, exp1_rep2, exp1_rep3, exp2_rep1, exp2_rep2, exp2_rep3, e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qpcr_untidy
## # A tibble: 5 × 10
##   gene  exp1_rep1 exp1_rep2 exp1_rep3 exp2_rep1 exp2_rep2 exp2_rep3 exp3_rep1
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 A          21.7      19.8      20.7      18.3      20.4      17.6      20.6
## 2 B          24.3      24.8      25.2      26.0      29.9      26.4      25.4
## 3 C          20.7      21.5      21.3      25.5      18.7      22.3      21.9
## 4 D          26.9      28.0      27.7      33.1      24.3      28.9      28.5
## 5 E          25.0      22.7      23.8      21.1      23.4      20.2      23.7
## # ℹ 2 more variables: exp3_rep2 <dbl>, exp3_rep3 <dbl>

names_sep helps pull these apart, but we still have “exp” and “rep” to deal with.

That’s it! Let’s move on to part 2!

–>Home