Learning Goals

At the end of this exercise, you will be able to:
1. Produce clear, concise summaries using a variety of functions in dplyr and janitor.
2. Use the across() operator to produce summaries across multiple variables.

Load the libraries

library("tidyverse")
library("janitor")

Load the data

For this lab, we will use palmerpenguins. They are from the publication: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081.

To use these data, they need to be installed.

#install.packages("palmerpenguins")

Load the library.

library("palmerpenguins")
?penguins
penguins <- penguins

Review

summarize() is a function in dplyr that allows us to produce summaries of data. It’s output is a new dataframe. If we were interested in the mean body weight of Adeile penguins, we could use summarize().

penguins %>% #loads the data and pipes it to the filter function
  filter(species=="Adelie") %>% #pulls out only the Adelie penguins
  summarize(mean_body_mass=mean(body_mass_g, na.rm=T), #calculates the mean
            n=n()) #number of observations
## # A tibble: 1 × 2
##   mean_body_mass     n
##            <dbl> <int>
## 1          3701.   152

Notice that there are three species in the data.

penguins %>% 
  count(species)
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

What if we wanted to summarize basic statistics for each species? We could still use filter() but this is cumbersome.

penguins %>% 
  filter(species=="Adelie") %>%
  summarize(mean_body_mass=mean(body_mass_g, na.rm=T),
            sd_body_mass=sd(body_mass_g, na.rm=T),
            n=n())
## # A tibble: 1 × 3
##   mean_body_mass sd_body_mass     n
##            <dbl>        <dbl> <int>
## 1          3701.         459.   152
penguins %>% 
  filter(species=="Chinstrap") %>%
  summarize(mean_body_mass=mean(body_mass_g, na.rm=T),
            sd_body_mass=sd(body_mass_g, na.rm=T),
            n=n())
## # A tibble: 1 × 3
##   mean_body_mass sd_body_mass     n
##            <dbl>        <dbl> <int>
## 1          3733.         384.    68
penguins %>% 
  filter(species=="Gentoo") %>%
  summarize(mean_body_mass=mean(body_mass_g, na.rm=T),
            sd_body_mass=sd(body_mass_g, na.rm=T),
            n=n())
## # A tibble: 1 × 3
##   mean_body_mass sd_body_mass     n
##            <dbl>        <dbl> <int>
## 1          5076.         504.   124

group_by()

The summarize() function is most useful when used in conjunction with group_by(). This allows us to group data by a categorical variable(s) of interest.

penguins %>% 
  group_by(species) %>% #we are grouping by species, a categorical variable
  summarize(mean_body_mass=mean(body_mass_g, na.rm=T),
            sd_body_mass=sd(body_mass_g, na.rm=T),
            n=n())
## # A tibble: 3 × 4
##   species   mean_body_mass sd_body_mass     n
##   <fct>              <dbl>        <dbl> <int>
## 1 Adelie             3701.         459.   152
## 2 Chinstrap          3733.         384.    68
## 3 Gentoo             5076.         504.   124

Practice

Use the built-in msleep data to answer the following questions.

  1. Calculate the min, max, mean, and total number of observations for body weight by feeding ecology in the msleep data.
msleep %>%
  group_by(vore) %>% #we are grouping by feeding ecology, a categorical variable
  summarize(min_bodywt = min(bodywt),
            max_bodywt = max(bodywt),
            mean_bodywt = mean(bodywt),
            total=n())
## # A tibble: 5 × 5
##   vore    min_bodywt max_bodywt mean_bodywt total
##   <chr>        <dbl>      <dbl>       <dbl> <int>
## 1 carni        0.028      800        90.8      19
## 2 herbi        0.022     6654       367.       32
## 3 insecti      0.01        60        12.9       5
## 4 omni         0.005       86.2      12.7      20
## 5 <NA>         0.021        3.6       0.858     7
  1. Calculate mean brain weight by taxonomic order in the msleep data.
msleep %>% 
  group_by(order) %>% 
  summarize(mean_brain_wt=mean(brainwt))
## # A tibble: 19 × 2
##    order           mean_brain_wt
##    <chr>                   <dbl>
##  1 Afrosoricida         0.0026  
##  2 Artiodactyla        NA       
##  3 Carnivora           NA       
##  4 Cetacea             NA       
##  5 Chiroptera           0.000275
##  6 Cingulata            0.0459  
##  7 Didelphimorphia     NA       
##  8 Diprotodontia       NA       
##  9 Erinaceomorpha       0.00295 
## 10 Hyracoidea           0.0152  
## 11 Lagomorpha           0.0121  
## 12 Monotremata          0.025   
## 13 Perissodactyla       0.414   
## 14 Pilosa              NA       
## 15 Primates            NA       
## 16 Proboscidea          5.16    
## 17 Rodentia            NA       
## 18 Scandentia           0.0025  
## 19 Soricomorpha         0.000592
  1. Try running the code again, but this time add na.rm=TRUE. What is the problem with Cetacea?
msleep %>% 
  group_by(order) %>% 
  summarize(mean_brain_wt=mean(brainwt, na.rm=T))
## # A tibble: 19 × 2
##    order           mean_brain_wt
##    <chr>                   <dbl>
##  1 Afrosoricida         0.0026  
##  2 Artiodactyla         0.198   
##  3 Carnivora            0.0986  
##  4 Cetacea            NaN       
##  5 Chiroptera           0.000275
##  6 Cingulata            0.0459  
##  7 Didelphimorphia      0.0063  
##  8 Diprotodontia        0.0114  
##  9 Erinaceomorpha       0.00295 
## 10 Hyracoidea           0.0152  
## 11 Lagomorpha           0.0121  
## 12 Monotremata          0.025   
## 13 Perissodactyla       0.414   
## 14 Pilosa             NaN       
## 15 Primates             0.254   
## 16 Proboscidea          5.16    
## 17 Rodentia             0.00357 
## 18 Scandentia           0.0025  
## 19 Soricomorpha         0.000592
msleep %>% 
  filter(order=="Cetacea") %>% 
  select(order, genus, brainwt)
## # A tibble: 3 × 3
##   order   genus         brainwt
##   <chr>   <chr>           <dbl>
## 1 Cetacea Globicephalus      NA
## 2 Cetacea Phocoena           NA
## 3 Cetacea Tursiops           NA

Review

The summarize() and group_by() functions are powerful tools that we can use to produce clean summaries of data. Especially when used together, we can quickly group variables of interest and save time. You can also group by more than one categorical variable.

What if we are interested in the number of observations (penguins) by species and island?

penguins %>% 
  group_by(species, island) %>% 
  summarize(n=n(), .groups= 'keep') #.groups= 'keep' is used to keep the grouping variable in the output
## # A tibble: 5 × 3
## # Groups:   species, island [5]
##   species   island        n
##   <fct>     <fct>     <int>
## 1 Adelie    Biscoe       44
## 2 Adelie    Dream        56
## 3 Adelie    Torgersen    52
## 4 Chinstrap Dream        68
## 5 Gentoo    Biscoe      124

Practice

  1. How does the mean of bill_length_mm compare by island and penguin species?
x <- penguins %>% 
  group_by(island, species) %>% 
  summarize(mean_bill_length_mm=mean(bill_length_mm, na.rm=T), .groups= 'keep')
x %>% summarize(n=n())
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 3
## # Groups:   island [3]
##   island    species       n
##   <fct>     <fct>     <int>
## 1 Biscoe    Adelie        1
## 2 Biscoe    Gentoo        1
## 3 Dream     Adelie        1
## 4 Dream     Chinstrap     1
## 5 Torgersen Adelie        1
y <- penguins %>% 
  group_by(island, species) %>% 
  summarize(mean_bill_length_mm=mean(bill_length_mm, na.rm=T), .groups= 'drop')
y %>% summarize(n=n())
## # A tibble: 1 × 1
##       n
##   <int>
## 1     5
z <- penguins %>% 
  group_by(island, species) %>% 
  summarize(mean_bill_length_mm=mean(bill_length_mm, na.rm=T), .groups= 'drop_last')
z %>% summarize(n=n())
## # A tibble: 3 × 2
##   island        n
##   <fct>     <int>
## 1 Biscoe        2
## 2 Dream         2
## 3 Torgersen     1
  1. For some penguins, their sex is listed as NA. Where do these penguins occur?
penguins %>% 
  count(sex, island)
## # A tibble: 9 × 3
##   sex    island        n
##   <fct>  <fct>     <int>
## 1 female Biscoe       80
## 2 female Dream        61
## 3 female Torgersen    24
## 4 male   Biscoe       83
## 5 male   Dream        62
## 6 male   Torgersen    23
## 7 <NA>   Biscoe        5
## 8 <NA>   Dream         1
## 9 <NA>   Torgersen     5

across()

What about using filter() and select() across multiple variables? There is a function in dplyr called across() which is designed to work across multiple variables. Have a look at Rebecca Barter’s awesome blog here.

What if we wanted to apply summarize() in order to produce distinct counts over multiple variables; i.e. species, island, and sex? Although this isn’t a lot of coding you can image that with a lot of variables it would be cumbersome.

penguins %>%
  summarize(distinct_species = n_distinct(species),
            distinct_island = n_distinct(island),
            distinct_sex = n_distinct(sex))
## # A tibble: 1 × 3
##   distinct_species distinct_island distinct_sex
##              <int>           <int>        <int>
## 1                3               3            3

By using across() we can reduce the clutter and make things cleaner.

penguins %>%
  summarize(across(c(species, island, sex), n_distinct))
## # A tibble: 1 × 3
##   species island   sex
##     <int>  <int> <int>
## 1       3      3     3

The new version. The reason why I show you this is because you don’t want to leave with code that is already outdated. Remember, R is under constant revision so expect changes.

penguins %>%
  summarize(across(c(species, island, sex), \(x) n_distinct(x, na.rm = TRUE)))
## # A tibble: 1 × 3
##   species island   sex
##     <int>  <int> <int>
## 1       3      3     2
# The function takes x (each column in the specified range) and applies n_distinct(x, na.rm = TRUE)

This is very helpful for numeric variables.

penguins %>%
  summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) %>% 
  # The function takes x (each column in the specified range) and applies mean(x, na.rm = TRUE) to compute the mean while removing NA values
  select(!year) #removes the year variable
## # A tibble: 1 × 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           43.9          17.2              201.       4202.

group_by also works.

penguins %>%
  group_by(sex) %>% 
  summarize(across(contains("mm"), \(x) mean(x, na.rm = TRUE)))
## # A tibble: 3 × 4
##   sex    bill_length_mm bill_depth_mm flipper_length_mm
##   <fct>           <dbl>         <dbl>             <dbl>
## 1 female           42.1          16.4              197.
## 2 male             45.9          17.9              205.
## 3 <NA>             41.3          16.6              199

Practice

  1. Produce separate summaries of the mean and standard deviation for bill_length_mm, bill_depth_mm, and flipper_length_mm for each penguin species. Be sure to provide the number of samples.

Mean

penguins %>%
  group_by(sex) %>% 
  summarize(across(contains("mm"), \(x) mean(x, na.rm = TRUE)))
## # A tibble: 3 × 4
##   sex    bill_length_mm bill_depth_mm flipper_length_mm
##   <fct>           <dbl>         <dbl>             <dbl>
## 1 female           42.1          16.4              197.
## 2 male             45.9          17.9              205.
## 3 <NA>             41.3          16.6              199

Standard Deviation

penguins %>%
  group_by(sex) %>% 
  summarize(across(contains("mm"), \(x) sd(x, na.rm = TRUE)))
## # A tibble: 3 × 4
##   sex    bill_length_mm bill_depth_mm flipper_length_mm
##   <fct>           <dbl>         <dbl>             <dbl>
## 1 female           4.90          1.80              12.5
## 2 male             5.37          1.86              14.5
## 3 <NA>             4.63          2.24              16.5

Supply a list of functions

penguins %>% 
  group_by(species) %>% 
  summarize(across(contains("mm"), list(mean = \(x) mean(x, na.rm = TRUE), 
                                        sd = \(x) sd(x, na.rm = TRUE))),
            n=n())
## # A tibble: 3 × 8
##   species   bill_length_mm_mean bill_length_mm_sd bill_depth_mm_mean
##   <fct>                   <dbl>             <dbl>              <dbl>
## 1 Adelie                   38.8              2.66               18.3
## 2 Chinstrap                48.8              3.34               18.4
## 3 Gentoo                   47.5              3.08               15.0
## # ℹ 4 more variables: bill_depth_mm_sd <dbl>, flipper_length_mm_mean <dbl>,
## #   flipper_length_mm_sd <dbl>, n <int>

Wrap-up

Please review the learning goals and be sure to use the code here as a reference when completing the homework.
–>Home