At the end of this exercise, you will be able to:
1. Use group_by to group data by categorical
variables.
2. Use summarize to produce summary statistics for grouped
data.
library("tidyverse")
library("janitor")
For this lab, let’s go back to the elephants data from homework 7. The data are from: Chiyo, Patrick I., Vincent Obanda, and David K. Korir. “Illegal tusk harvest and the decline of tusk size in the African elephant.” Ecology and Evolution 5, 22: 5216–5229 (2015). Data deposited at Dryad Digital Repository.
elephants <- read_csv("data/elephants.csv") %>%
clean_names() %>%
mutate(across(where(is.character), as.factor))
summarize() is a verb in dplyr that allows
us to produce summaries of data. It’s output is a new dataframe. In lab
7, we learned how summarize() can be used to produce quick
statistical summaries.
elephants %>%
summarize(across
(where(is.numeric),
~mean(.x, na.rm=TRUE)))
## # A tibble: 1 × 4
## estimated_age_years shoulder_height_in_cm tusk_length_in_cm
## <dbl> <dbl> <dbl>
## 1 15.0 210. 91.6
## # ℹ 1 more variable: tusk_circumference_in_cm <dbl>
We can include multiple functions using this same structure, but it becomes cumbersome.
elephants %>%
summarize(
across(
where(is.numeric),
list(
mean = ~mean(.x, na.rm = TRUE),
median = ~median(.x, na.rm = TRUE),
sd = ~sd(.x, na.rm = TRUE),
min = ~min(.x, na.rm = TRUE),
max = ~max(.x, na.rm = TRUE),
n = ~sum(!is.na(.x))
)
)
)
## # A tibble: 1 × 24
## estimated_age_years_mean estimated_age_years_median estimated_age_years_sd
## <dbl> <dbl> <dbl>
## 1 15.0 12 12.3
## # ℹ 21 more variables: estimated_age_years_min <dbl>,
## # estimated_age_years_max <dbl>, estimated_age_years_n <int>,
## # shoulder_height_in_cm_mean <dbl>, shoulder_height_in_cm_median <dbl>,
## # shoulder_height_in_cm_sd <dbl>, shoulder_height_in_cm_min <dbl>,
## # shoulder_height_in_cm_max <dbl>, shoulder_height_in_cm_n <int>,
## # tusk_length_in_cm_mean <dbl>, tusk_length_in_cm_median <dbl>,
## # tusk_length_in_cm_sd <dbl>, tusk_length_in_cm_min <dbl>, …
group_by()The summarize() function is often more useful when used
with group_by(). group_by() allows us to group
data by a categorical variable(s) of interest.
elephants %>%
group_by(sex)
## # A tibble: 777 × 7
## # Groups: sex [2]
## years_of_sample_collection elephant_id sex estimated_age_years
## <fct> <fct> <fct> <dbl>
## 1 1966-68 12 f 0.08
## 2 1966-68 34 f 0.08
## 3 1966-68 162 f 0.083
## 4 1966-68 292 f 0.083
## 5 1966-68 11 f 0.25
## 6 1966-68 152 f 0.25
## 7 1966-68 264 f 0.25
## 8 1966-68 263 f 0.5
## 9 1966-68 266 f 0.5
## 10 1966-68 217 f 1
## # ℹ 767 more rows
## # ℹ 3 more variables: shoulder_height_in_cm <dbl>, tusk_length_in_cm <dbl>,
## # tusk_circumference_in_cm <dbl>
Notice that the output doesn’t look different, but two groups are
indicated. Any verbs that follow group_by() will be applied
to each group separately. For example, what is the average age, shoulder
height, tusk length, and tusk circumference for males and females?
elephants %>%
group_by(sex) %>%
summarize(across
(where(is.numeric),
~mean(.x, na.rm=TRUE)))
## # A tibble: 2 × 5
## sex estimated_age_years shoulder_height_in_cm tusk_length_in_cm
## <fct> <dbl> <dbl> <dbl>
## 1 f 17.6 211. 88.8
## 2 m 12.1 209. 94.9
## # ℹ 1 more variable: tusk_circumference_in_cm <dbl>
We can group by more than one variable.
elephants %>%
group_by(years_of_sample_collection, sex) %>%
summarize(mean_estimated_age_years=mean(estimated_age_years, na.rm=TRUE),
n=n(),
.groups= 'keep') %>%
arrange(sex)
## # A tibble: 4 × 4
## # Groups: years_of_sample_collection, sex [4]
## years_of_sample_collection sex mean_estimated_age_years n
## <fct> <fct> <dbl> <int>
## 1 1966-68 f 17.6 323
## 2 2005-13 f 17.9 93
## 3 1966-68 m 10.8 282
## 4 2005-13 m 16.7 79
Do you remember this question from homework 7?
“What is the mean, median, and standard deviation for age of males and females included in the study? Separate the results by year of sample collection. Does the sampling look even between years and sexes?”
Without group_by(), we had to produce multiple code
chunks- we are now more efficient!
elephants %>%
filter(years_of_sample_collection=="1966-68") %>%
group_by(sex) %>%
summarize(mean_age_m=mean(estimated_age_years, na.rm=TRUE),
median_age_m=median(estimated_age_years, na.rm=TRUE),
sd_age_m=sd(estimated_age_years, na.rm=TRUE),
n=n())
## # A tibble: 2 × 5
## sex mean_age_m median_age_m sd_age_m n
## <fct> <dbl> <dbl> <dbl> <int>
## 1 f 17.6 15 13.6 323
## 2 m 10.8 8 9.19 282
elephants %>%
filter(years_of_sample_collection=="2005-13") %>%
group_by(sex) %>%
summarize(mean_age_m=mean(estimated_age_years, na.rm=TRUE),
median_age_m=median(estimated_age_years, na.rm=TRUE),
sd_age_m=sd(estimated_age_years, na.rm=TRUE),
n=n())
## # A tibble: 2 × 5
## sex mean_age_m median_age_m sd_age_m n
## <fct> <dbl> <dbl> <dbl> <int>
## 1 f 17.9 17.5 11.0 93
## 2 m 16.7 9 13.9 79
elephants %>%
group_by(sex, years_of_sample_collection) %>%
summarize(mean_age_m=mean(estimated_age_years, na.rm=TRUE),
median_age_m=median(estimated_age_years, na.rm=TRUE),
sd_age_m=sd(estimated_age_years, na.rm=TRUE),
n=n(), .groups= 'keep')
## # A tibble: 4 × 6
## # Groups: sex, years_of_sample_collection [4]
## sex years_of_sample_collection mean_age_m median_age_m sd_age_m n
## <fct> <fct> <dbl> <dbl> <dbl> <int>
## 1 f 1966-68 17.6 15 13.6 323
## 2 f 2005-13 17.9 17.5 11.0 93
## 3 m 1966-68 10.8 8 9.19 282
## 4 m 2005-13 16.7 9 13.9 79
Use the built-in msleep data to answer the following
questions.
msleep %>%
group_by(vore) %>%
summarize(min_bodywt = min(bodywt),
max_bodywt = max(bodywt),
mean_bodywt = mean(bodywt),
total=n())
## # A tibble: 5 × 5
## vore min_bodywt max_bodywt mean_bodywt total
## <chr> <dbl> <dbl> <dbl> <int>
## 1 carni 0.028 800 90.8 19
## 2 herbi 0.022 6654 367. 32
## 3 insecti 0.01 60 12.9 5
## 4 omni 0.005 86.2 12.7 20
## 5 <NA> 0.021 3.6 0.858 7
msleep %>%
group_by(order) %>%
summarize(mean_brain_wt=mean(brainwt))
## # A tibble: 19 × 2
## order mean_brain_wt
## <chr> <dbl>
## 1 Afrosoricida 0.0026
## 2 Artiodactyla NA
## 3 Carnivora NA
## 4 Cetacea NA
## 5 Chiroptera 0.000275
## 6 Cingulata 0.0459
## 7 Didelphimorphia NA
## 8 Diprotodontia NA
## 9 Erinaceomorpha 0.00295
## 10 Hyracoidea 0.0152
## 11 Lagomorpha 0.0121
## 12 Monotremata 0.025
## 13 Perissodactyla 0.414
## 14 Pilosa NA
## 15 Primates NA
## 16 Proboscidea 5.16
## 17 Rodentia NA
## 18 Scandentia 0.0025
## 19 Soricomorpha 0.000592
na.rm=TRUE. What is the problem with Cetacea?msleep %>%
group_by(order) %>%
summarize(mean_brain_wt=mean(brainwt, na.rm=T))
## # A tibble: 19 × 2
## order mean_brain_wt
## <chr> <dbl>
## 1 Afrosoricida 0.0026
## 2 Artiodactyla 0.198
## 3 Carnivora 0.0986
## 4 Cetacea NaN
## 5 Chiroptera 0.000275
## 6 Cingulata 0.0459
## 7 Didelphimorphia 0.0063
## 8 Diprotodontia 0.0114
## 9 Erinaceomorpha 0.00295
## 10 Hyracoidea 0.0152
## 11 Lagomorpha 0.0121
## 12 Monotremata 0.025
## 13 Perissodactyla 0.414
## 14 Pilosa NaN
## 15 Primates 0.254
## 16 Proboscidea 5.16
## 17 Rodentia 0.00357
## 18 Scandentia 0.0025
## 19 Soricomorpha 0.000592
msleep %>%
filter(order=="Cetacea") %>%
select(order, genus, brainwt)
## # A tibble: 3 × 3
## order genus brainwt
## <chr> <chr> <dbl>
## 1 Cetacea Globicephalus NA
## 2 Cetacea Phocoena NA
## 3 Cetacea Tursiops NA
–>Home