At the end of this exercise, you will be able to:
1. Explain the difference between tidy and messy data.
2. Evaluate a data set as tidy or untidy.
3. Use the pivot_wider() function of tidyr to
transform data from long to wide format.
Last time we learned the fundamentals of tidy data and used the
pivot_longer() function to wrangle a few examples of
frequently encountered untidy data. In the second part of today’s lab we
will look at a few more examples of pivot_longer() but also
use the pivot_wider() function to deal with another type of
untidy data.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pivot_longer()Recall that we use pivot_longer() when our column names
actually represent variables. A classic example would be that the column
names represent observations of a variable.
datasets::USPersonalExpenditure
## 1940 1945 1950 1955 1960
## Food and Tobacco 22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health 3.530 5.760 9.71 14.0 21.10
## Personal Care 1.040 1.980 2.45 3.4 5.40
## Private Education 0.341 0.974 1.80 2.6 3.64
?USPersonalExpenditure
Here we add a new column of expenditure types, which are stored as
rownames above, with mutate(). The
USPersonalExpenditures data also needs to be converted to a
data frame before we can use the tidyverse functions, because it comes
as a matrix.
expenditures <- as_tibble(USPersonalExpenditure) %>%
mutate(expenditure = rownames(USPersonalExpenditure)) %>%
relocate(expenditure)
expenditures
## # A tibble: 5 × 6
## expenditure `1940` `1945` `1950` `1955` `1960`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Food and Tobacco 22.2 44.5 59.6 73.2 86.8
## 2 Household Operation 10.5 15.5 29 36.5 46.2
## 3 Medical and Health 3.53 5.76 9.71 14 21.1
## 4 Personal Care 1.04 1.98 2.45 3.4 5.4
## 5 Private Education 0.341 0.974 1.8 2.6 3.64
Are these data tidy? Please use pivot_longer() to
tidy the data.
Restrict the data to medical and health expenditures only. Sort in ascending order.
separate()In this new heart rate example, we have the sex of each patient included with their name. Are these data tidy? No, there is more than one value per cell in the patient column and the columns a, b, c, d once again represent values.
heartrate2 <- read_csv("data/heartrate2.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate2
## # A tibble: 6 × 5
## patient a b c d
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Margaret_f 72 74 80 68
## 2 Frank_m 84 84 88 76
## 3 Hawkeye_m 64 66 68 64
## 4 Trapper_m 60 58 64 58
## 5 Radar_m 74 72 78 70
## 6 Henry_m 88 87 88 72
We need to start by separating the patient names from their sexes.
separate() needs to know which column you want to split,
the names of the new columns, and what to look for in terms of breaks in
the data.
Another way of using separate is to specify the position of the split. In this case, we want to split the patient column at the last character.
We would still need to deal with the “_” in the patient column. We can do this using the stringr package.
Re-examine heartrate2. Use separate()
for the sexes, pivot_longer() to tidy, and
arrange() to organize by patient and drug. Store this as a
new object heartrate3.
unite() is the opposite of separate(). Its syntax is
straightforward. You only need to give a new column name and then list
the columns to combine with a separation character. Give it a try below
by recombining patient and sex from heartrate3.
pivot_wider()The opposite of pivot_longer(). You use
pivot_wider() when you have an observation scattered across
multiple rows. In the example below, cases and
population represent variable names not observations.
Rules:
+ pivot_wider(names_from, values_from)
+ names_from - Values in the names_from column
will become new column names
+ values_from - Cell values will be taken from the
values_from column
tb_data <- read_csv("data/tb_data.csv")
## Rows: 12 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, key
## dbl (2): year, value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tb_data
## # A tibble: 12 × 4
## country year key value
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
When using pivot_wider() we use names_from
to identify the variables (new column names) and
values_from to identify the values associated with the new
columns.
gene_exp.csv data as a new object
gene_exp. Are these data tidy? Use
pivot_wider() to tidy the data.For the last practice example, I will use data from the R Ladies Sydney blog. This data is compiled by the NSW Office of Environment and contains the Enterococci counts in water samples obtained from Sydney beaches as part of the Beachwatch Water Quality Program.
Load the beachbugs data and have a look.
Use pivot_wider to transform the data into wide
format.
Now, use pivot_longer to transform them back to
long!
Please review the learning goals and be sure to use the code here as
a reference when completing the homework.
–>Home