Learning Goals

At the end of this exercise, you will be able to:
1. Use group_by to group data by categorical variables.
2. Use summarize to produce summary statistics for grouped data.

Load the libraries

library("tidyverse")
library("janitor")

Load the data

For this lab, let’s go back to the elephants data from homework 7. The data are from: Chiyo, Patrick I., Vincent Obanda, and David K. Korir. “Illegal tusk harvest and the decline of tusk size in the African elephant.” Ecology and Evolution 5, 22: 5216–5229 (2015). Data deposited at Dryad Digital Repository.

elephants <- read_csv("data/elephants.csv") %>% 
  clean_names() %>% 
  mutate(across(where(is.character), as.factor))

Review

summarize() is a verb in dplyr that allows us to produce summaries of data. It’s output is a new dataframe. In lab 7, we learned how summarize() can be used to produce quick statistical summaries.

  1. What is the average age, shoulder height, tusk length, and tusk circumference for all elephants in the data set?
elephants %>% 
  summarize(across
            (where(is.numeric), 
              ~mean(.x, na.rm=TRUE)))
## # A tibble: 1 × 4
##   estimated_age_years shoulder_height_in_cm tusk_length_in_cm
##                 <dbl>                 <dbl>             <dbl>
## 1                15.0                  210.              91.6
## # ℹ 1 more variable: tusk_circumference_in_cm <dbl>

We can include multiple functions using this same structure, but it becomes cumbersome.

elephants %>% 
  summarize(
    across(
      where(is.numeric),
      list(
        mean   = ~mean(.x, na.rm = TRUE),
        median = ~median(.x, na.rm = TRUE),
        sd     = ~sd(.x, na.rm = TRUE),
        min    = ~min(.x, na.rm = TRUE),
        max    = ~max(.x, na.rm = TRUE),
        n      = ~sum(!is.na(.x))
      )
    )
  )
## # A tibble: 1 × 24
##   estimated_age_years_mean estimated_age_years_median estimated_age_years_sd
##                      <dbl>                      <dbl>                  <dbl>
## 1                     15.0                         12                   12.3
## # ℹ 21 more variables: estimated_age_years_min <dbl>,
## #   estimated_age_years_max <dbl>, estimated_age_years_n <int>,
## #   shoulder_height_in_cm_mean <dbl>, shoulder_height_in_cm_median <dbl>,
## #   shoulder_height_in_cm_sd <dbl>, shoulder_height_in_cm_min <dbl>,
## #   shoulder_height_in_cm_max <dbl>, shoulder_height_in_cm_n <int>,
## #   tusk_length_in_cm_mean <dbl>, tusk_length_in_cm_median <dbl>,
## #   tusk_length_in_cm_sd <dbl>, tusk_length_in_cm_min <dbl>, …

group_by()

The summarize() function is often more useful when used with group_by(). group_by() allows us to group data by a categorical variable(s) of interest.

elephants %>% 
  group_by(sex)
## # A tibble: 777 × 7
## # Groups:   sex [2]
##    years_of_sample_collection elephant_id sex   estimated_age_years
##    <fct>                      <fct>       <fct>               <dbl>
##  1 1966-68                    12          f                   0.08 
##  2 1966-68                    34          f                   0.08 
##  3 1966-68                    162         f                   0.083
##  4 1966-68                    292         f                   0.083
##  5 1966-68                    11          f                   0.25 
##  6 1966-68                    152         f                   0.25 
##  7 1966-68                    264         f                   0.25 
##  8 1966-68                    263         f                   0.5  
##  9 1966-68                    266         f                   0.5  
## 10 1966-68                    217         f                   1    
## # ℹ 767 more rows
## # ℹ 3 more variables: shoulder_height_in_cm <dbl>, tusk_length_in_cm <dbl>,
## #   tusk_circumference_in_cm <dbl>

Notice that the output doesn’t look different, but two groups are indicated. Any verbs that follow group_by() will be applied to each group separately. For example, what is the average age, shoulder height, tusk length, and tusk circumference for males and females?

elephants %>% 
  group_by(sex) %>%
  summarize(across
            (where(is.numeric), 
              ~mean(.x, na.rm=TRUE)))
## # A tibble: 2 × 5
##   sex   estimated_age_years shoulder_height_in_cm tusk_length_in_cm
##   <fct>               <dbl>                 <dbl>             <dbl>
## 1 f                    17.6                  211.              88.8
## 2 m                    12.1                  209.              94.9
## # ℹ 1 more variable: tusk_circumference_in_cm <dbl>

We can group by more than one variable.

elephants %>% 
  group_by(years_of_sample_collection, sex) %>% 
  summarize(mean_estimated_age_years=mean(estimated_age_years, na.rm=TRUE),
            n=n(),
            .groups= 'keep') %>%
  arrange(sex)
## # A tibble: 4 × 4
## # Groups:   years_of_sample_collection, sex [4]
##   years_of_sample_collection sex   mean_estimated_age_years     n
##   <fct>                      <fct>                    <dbl> <int>
## 1 1966-68                    f                         17.6   323
## 2 2005-13                    f                         17.9    93
## 3 1966-68                    m                         10.8   282
## 4 2005-13                    m                         16.7    79

Do you remember this question from homework 7?

“What is the mean, median, and standard deviation for age of males and females included in the study? Separate the results by year of sample collection. Does the sampling look even between years and sexes?”

Without group_by(), we had to produce multiple code chunks- we are now more efficient!

elephants %>% 
  filter(years_of_sample_collection=="1966-68") %>%
  group_by(sex) %>%
  summarize(mean_age_m=mean(estimated_age_years, na.rm=TRUE),
            median_age_m=median(estimated_age_years, na.rm=TRUE),
            sd_age_m=sd(estimated_age_years, na.rm=TRUE),
            n=n())
## # A tibble: 2 × 5
##   sex   mean_age_m median_age_m sd_age_m     n
##   <fct>      <dbl>        <dbl>    <dbl> <int>
## 1 f           17.6           15    13.6    323
## 2 m           10.8            8     9.19   282
elephants %>% 
  filter(years_of_sample_collection=="2005-13") %>%
  group_by(sex) %>%
  summarize(mean_age_m=mean(estimated_age_years, na.rm=TRUE),
            median_age_m=median(estimated_age_years, na.rm=TRUE),
            sd_age_m=sd(estimated_age_years, na.rm=TRUE),
            n=n())
## # A tibble: 2 × 5
##   sex   mean_age_m median_age_m sd_age_m     n
##   <fct>      <dbl>        <dbl>    <dbl> <int>
## 1 f           17.9         17.5     11.0    93
## 2 m           16.7          9       13.9    79
elephants %>% 
  group_by(sex, years_of_sample_collection) %>%
  summarize(mean_age_m=mean(estimated_age_years, na.rm=TRUE),
            median_age_m=median(estimated_age_years, na.rm=TRUE),
            sd_age_m=sd(estimated_age_years, na.rm=TRUE),
            n=n(), .groups= 'keep')
## # A tibble: 4 × 6
## # Groups:   sex, years_of_sample_collection [4]
##   sex   years_of_sample_collection mean_age_m median_age_m sd_age_m     n
##   <fct> <fct>                           <dbl>        <dbl>    <dbl> <int>
## 1 f     1966-68                          17.6         15      13.6    323
## 2 f     2005-13                          17.9         17.5    11.0     93
## 3 m     1966-68                          10.8          8       9.19   282
## 4 m     2005-13                          16.7          9      13.9     79

Practice

Use the built-in msleep data to answer the following questions.

  1. Calculate the min, max, mean, and total number of observations for body weight by vore.
msleep %>%
  group_by(vore) %>%
  summarize(min_bodywt = min(bodywt),
            max_bodywt = max(bodywt),
            mean_bodywt = mean(bodywt),
            total=n())
## # A tibble: 5 × 5
##   vore    min_bodywt max_bodywt mean_bodywt total
##   <chr>        <dbl>      <dbl>       <dbl> <int>
## 1 carni        0.028      800        90.8      19
## 2 herbi        0.022     6654       367.       32
## 3 insecti      0.01        60        12.9       5
## 4 omni         0.005       86.2      12.7      20
## 5 <NA>         0.021        3.6       0.858     7
  1. Calculate mean brain weight by taxonomic order in the msleep data.
msleep %>% 
  group_by(order) %>% 
  summarize(mean_brain_wt=mean(brainwt))
## # A tibble: 19 × 2
##    order           mean_brain_wt
##    <chr>                   <dbl>
##  1 Afrosoricida         0.0026  
##  2 Artiodactyla        NA       
##  3 Carnivora           NA       
##  4 Cetacea             NA       
##  5 Chiroptera           0.000275
##  6 Cingulata            0.0459  
##  7 Didelphimorphia     NA       
##  8 Diprotodontia       NA       
##  9 Erinaceomorpha       0.00295 
## 10 Hyracoidea           0.0152  
## 11 Lagomorpha           0.0121  
## 12 Monotremata          0.025   
## 13 Perissodactyla       0.414   
## 14 Pilosa              NA       
## 15 Primates            NA       
## 16 Proboscidea          5.16    
## 17 Rodentia            NA       
## 18 Scandentia           0.0025  
## 19 Soricomorpha         0.000592
  1. Try running the code again, but this time add na.rm=TRUE. What is the problem with Cetacea?
msleep %>% 
  group_by(order) %>% 
  summarize(mean_brain_wt=mean(brainwt, na.rm=T))
## # A tibble: 19 × 2
##    order           mean_brain_wt
##    <chr>                   <dbl>
##  1 Afrosoricida         0.0026  
##  2 Artiodactyla         0.198   
##  3 Carnivora            0.0986  
##  4 Cetacea            NaN       
##  5 Chiroptera           0.000275
##  6 Cingulata            0.0459  
##  7 Didelphimorphia      0.0063  
##  8 Diprotodontia        0.0114  
##  9 Erinaceomorpha       0.00295 
## 10 Hyracoidea           0.0152  
## 11 Lagomorpha           0.0121  
## 12 Monotremata          0.025   
## 13 Perissodactyla       0.414   
## 14 Pilosa             NaN       
## 15 Primates             0.254   
## 16 Proboscidea          5.16    
## 17 Rodentia             0.00357 
## 18 Scandentia           0.0025  
## 19 Soricomorpha         0.000592
msleep %>% 
  filter(order=="Cetacea") %>% 
  select(order, genus, brainwt)
## # A tibble: 3 × 3
##   order   genus         brainwt
##   <chr>   <chr>           <dbl>
## 1 Cetacea Globicephalus      NA
## 2 Cetacea Phocoena           NA
## 3 Cetacea Tursiops           NA

–>Home