Learning Goals

At the end of this exercise, you will be able to:
1. Explain the difference between tidy and messy data.
2. Evaluate a data set as tidy or untidy.
3. Use the pivot_wider() function of tidyr to transform data from long to wide format.

Review

Last time we learned the fundamentals of tidy data and used the pivot_longer() function to wrangle a few examples of frequently encountered untidy data. In the second part of today’s lab we will look at a few more examples of pivot_longer() but also use the pivot_wider() function to deal with another type of untidy data.

Load the tidyverse

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

pivot_longer()

Recall that we use pivot_longer() when our column names actually represent variables. A classic example would be that the column names represent observations of a variable.

datasets::USPersonalExpenditure
##                       1940   1945  1950 1955  1960
## Food and Tobacco    22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health   3.530  5.760  9.71 14.0 21.10
## Personal Care        1.040  1.980  2.45  3.4  5.40
## Private Education    0.341  0.974  1.80  2.6  3.64
?USPersonalExpenditure

Here we add a new column of expenditure types, which are stored as rownames above, with mutate(). The USPersonalExpenditures data also needs to be converted to a data frame before we can use the tidyverse functions, because it comes as a matrix.

expenditures <- as_tibble(USPersonalExpenditure) %>% 
  mutate(expenditure = rownames(USPersonalExpenditure)) %>% 
  relocate(expenditure)
expenditures
## # A tibble: 5 × 6
##   expenditure         `1940` `1945` `1950` `1955` `1960`
##   <chr>                <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Food and Tobacco    22.2   44.5    59.6    73.2  86.8 
## 2 Household Operation 10.5   15.5    29      36.5  46.2 
## 3 Medical and Health   3.53   5.76    9.71   14    21.1 
## 4 Personal Care        1.04   1.98    2.45    3.4   5.4 
## 5 Private Education    0.341  0.974   1.8     2.6   3.64

Practice

  1. Are these data tidy? Please use pivot_longer() to tidy the data.

  2. Restrict the data to medical and health expenditures only. Sort in ascending order.

separate()

In this new heart rate example, we have the sex of each patient included with their name. Are these data tidy? No, there is more than one value per cell in the patient column and the columns a, b, c, d once again represent values.

heartrate2 <- read_csv("data/heartrate2.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate2
## # A tibble: 6 × 5
##   patient        a     b     c     d
##   <chr>      <dbl> <dbl> <dbl> <dbl>
## 1 Margaret_f    72    74    80    68
## 2 Frank_m       84    84    88    76
## 3 Hawkeye_m     64    66    68    64
## 4 Trapper_m     60    58    64    58
## 5 Radar_m       74    72    78    70
## 6 Henry_m       88    87    88    72

We need to start by separating the patient names from their sexes. separate() needs to know which column you want to split, the names of the new columns, and what to look for in terms of breaks in the data.

Another way of using separate is to specify the position of the split. In this case, we want to split the patient column at the last character.

We would still need to deal with the “_” in the patient column. We can do this using the stringr package.

Practice

  1. Re-examine heartrate2. Use separate() for the sexes, pivot_longer() to tidy, and arrange() to organize by patient and drug. Store this as a new object heartrate3.

  2. unite() is the opposite of separate(). Its syntax is straightforward. You only need to give a new column name and then list the columns to combine with a separation character. Give it a try below by recombining patient and sex from heartrate3.

pivot_wider()

The opposite of pivot_longer(). You use pivot_wider() when you have an observation scattered across multiple rows. In the example below, cases and population represent variable names not observations.

Rules:
+ pivot_wider(names_from, values_from)
+ names_from - Values in the names_from column will become new column names
+ values_from - Cell values will be taken from the values_from column

tb_data <- read_csv("data/tb_data.csv")
## Rows: 12 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, key
## dbl (2): year, value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tb_data
## # A tibble: 12 × 4
##    country      year key             value
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

When using pivot_wider() we use names_from to identify the variables (new column names) and values_from to identify the values associated with the new columns.

Practice

  1. Load the gene_exp.csv data as a new object gene_exp. Are these data tidy? Use pivot_wider() to tidy the data.

Practice

For the last practice example, I will use data from the R Ladies Sydney blog. This data is compiled by the NSW Office of Environment and contains the Enterococci counts in water samples obtained from Sydney beaches as part of the Beachwatch Water Quality Program.

  1. Load the beachbugs data and have a look.

  2. Use pivot_wider to transform the data into wide format.

  3. Now, use pivot_longer to transform them back to long!

Wrap-up

Please review the learning goals and be sure to use the code here as a reference when completing the homework.
–>Home