goup_by()
At the end of this exercise, you will be able to:
1. Produce clear, concise summaries using a variety of functions in
dplyr
and janitor.
2. Use the across()
operator to produce summaries across
multiple variables.
library("tidyverse")
library("janitor")
For this lab, we will use palmerpenguins. They are from the publication: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081.
To use these data, they need to be installed.
#install.packages("palmerpenguins")
Load the library.
library("palmerpenguins")
?penguins
penguins <- penguins
summarize()
is a function in dplyr
that
allows us to produce summaries of data. It’s output is a new dataframe.
If we were interested in the mean body weight of Adeile
penguins, we could use summarize()
.
penguins %>% #loads the data and pipes it to the filter function
filter(species=="Adelie") %>% #pulls out only the Adelie penguins
summarize(mean_body_mass=mean(body_mass_g, na.rm=T), #calculates the mean
n=n()) #number of observations
## # A tibble: 1 × 2
## mean_body_mass n
## <dbl> <int>
## 1 3701. 152
Notice that there are three species in the data.
penguins %>%
count(species)
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
What if we wanted to summarize basic statistics for each species? We
could still use filter()
but this is cumbersome.
penguins %>%
filter(species=="Adelie") %>%
summarize(mean_body_mass=mean(body_mass_g, na.rm=T),
sd_body_mass=sd(body_mass_g, na.rm=T),
n=n())
## # A tibble: 1 × 3
## mean_body_mass sd_body_mass n
## <dbl> <dbl> <int>
## 1 3701. 459. 152
penguins %>%
filter(species=="Chinstrap") %>%
summarize(mean_body_mass=mean(body_mass_g, na.rm=T),
sd_body_mass=sd(body_mass_g, na.rm=T),
n=n())
## # A tibble: 1 × 3
## mean_body_mass sd_body_mass n
## <dbl> <dbl> <int>
## 1 3733. 384. 68
penguins %>%
filter(species=="Gentoo") %>%
summarize(mean_body_mass=mean(body_mass_g, na.rm=T),
sd_body_mass=sd(body_mass_g, na.rm=T),
n=n())
## # A tibble: 1 × 3
## mean_body_mass sd_body_mass n
## <dbl> <dbl> <int>
## 1 5076. 504. 124
group_by()
The summarize()
function is most useful when used in
conjunction with group_by()
. This allows us to group data
by a categorical variable(s) of interest.
penguins %>%
group_by(species) %>% #we are grouping by species, a categorical variable
summarize(mean_body_mass=mean(body_mass_g, na.rm=T),
sd_body_mass=sd(body_mass_g, na.rm=T),
n=n())
## # A tibble: 3 × 4
## species mean_body_mass sd_body_mass n
## <fct> <dbl> <dbl> <int>
## 1 Adelie 3701. 459. 152
## 2 Chinstrap 3733. 384. 68
## 3 Gentoo 5076. 504. 124
Use the built-in msleep
data to answer the following
questions.
msleep %>%
group_by(vore) %>% #we are grouping by feeding ecology, a categorical variable
summarize(min_bodywt = min(bodywt),
max_bodywt = max(bodywt),
mean_bodywt = mean(bodywt),
total=n())
## # A tibble: 5 × 5
## vore min_bodywt max_bodywt mean_bodywt total
## <chr> <dbl> <dbl> <dbl> <int>
## 1 carni 0.028 800 90.8 19
## 2 herbi 0.022 6654 367. 32
## 3 insecti 0.01 60 12.9 5
## 4 omni 0.005 86.2 12.7 20
## 5 <NA> 0.021 3.6 0.858 7
msleep %>%
group_by(order) %>%
summarize(mean_brain_wt=mean(brainwt))
## # A tibble: 19 × 2
## order mean_brain_wt
## <chr> <dbl>
## 1 Afrosoricida 0.0026
## 2 Artiodactyla NA
## 3 Carnivora NA
## 4 Cetacea NA
## 5 Chiroptera 0.000275
## 6 Cingulata 0.0459
## 7 Didelphimorphia NA
## 8 Diprotodontia NA
## 9 Erinaceomorpha 0.00295
## 10 Hyracoidea 0.0152
## 11 Lagomorpha 0.0121
## 12 Monotremata 0.025
## 13 Perissodactyla 0.414
## 14 Pilosa NA
## 15 Primates NA
## 16 Proboscidea 5.16
## 17 Rodentia NA
## 18 Scandentia 0.0025
## 19 Soricomorpha 0.000592
na.rm=TRUE
. What is the problem with Cetacea?msleep %>%
group_by(order) %>%
summarize(mean_brain_wt=mean(brainwt, na.rm=T))
## # A tibble: 19 × 2
## order mean_brain_wt
## <chr> <dbl>
## 1 Afrosoricida 0.0026
## 2 Artiodactyla 0.198
## 3 Carnivora 0.0986
## 4 Cetacea NaN
## 5 Chiroptera 0.000275
## 6 Cingulata 0.0459
## 7 Didelphimorphia 0.0063
## 8 Diprotodontia 0.0114
## 9 Erinaceomorpha 0.00295
## 10 Hyracoidea 0.0152
## 11 Lagomorpha 0.0121
## 12 Monotremata 0.025
## 13 Perissodactyla 0.414
## 14 Pilosa NaN
## 15 Primates 0.254
## 16 Proboscidea 5.16
## 17 Rodentia 0.00357
## 18 Scandentia 0.0025
## 19 Soricomorpha 0.000592
msleep %>%
filter(order=="Cetacea") %>%
select(order, genus, brainwt)
## # A tibble: 3 × 3
## order genus brainwt
## <chr> <chr> <dbl>
## 1 Cetacea Globicephalus NA
## 2 Cetacea Phocoena NA
## 3 Cetacea Tursiops NA
The summarize() and group_by() functions are powerful tools that we can use to produce clean summaries of data. Especially when used together, we can quickly group variables of interest and save time. You can also group by more than one categorical variable.
What if we are interested in the number of observations (penguins) by species and island?
penguins %>%
group_by(species, island) %>%
summarize(n=n(), .groups= 'keep') #.groups= 'keep' is used to keep the grouping variable in the output
## # A tibble: 5 × 3
## # Groups: species, island [5]
## species island n
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 56
## 3 Adelie Torgersen 52
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 124
bill_length_mm
compare by island
and penguin species?x <- penguins %>%
group_by(island, species) %>%
summarize(mean_bill_length_mm=mean(bill_length_mm, na.rm=T), .groups= 'keep')
x %>% summarize(n=n())
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 3
## # Groups: island [3]
## island species n
## <fct> <fct> <int>
## 1 Biscoe Adelie 1
## 2 Biscoe Gentoo 1
## 3 Dream Adelie 1
## 4 Dream Chinstrap 1
## 5 Torgersen Adelie 1
y <- penguins %>%
group_by(island, species) %>%
summarize(mean_bill_length_mm=mean(bill_length_mm, na.rm=T), .groups= 'drop')
y %>% summarize(n=n())
## # A tibble: 1 × 1
## n
## <int>
## 1 5
z <- penguins %>%
group_by(island, species) %>%
summarize(mean_bill_length_mm=mean(bill_length_mm, na.rm=T), .groups= 'drop_last')
z %>% summarize(n=n())
## # A tibble: 3 × 2
## island n
## <fct> <int>
## 1 Biscoe 2
## 2 Dream 2
## 3 Torgersen 1
penguins %>%
count(sex, island)
## # A tibble: 9 × 3
## sex island n
## <fct> <fct> <int>
## 1 female Biscoe 80
## 2 female Dream 61
## 3 female Torgersen 24
## 4 male Biscoe 83
## 5 male Dream 62
## 6 male Torgersen 23
## 7 <NA> Biscoe 5
## 8 <NA> Dream 1
## 9 <NA> Torgersen 5
across()
What about using filter()
and select()
across multiple variables? There is a function in dplyr called
across()
which is designed to work across multiple
variables. Have a look at Rebecca Barter’s awesome blog here.
What if we wanted to apply summarize()
in order to
produce distinct counts over multiple variables; i.e. species, island,
and sex? Although this isn’t a lot of coding you can image that with a
lot of variables it would be cumbersome.
penguins %>%
summarize(distinct_species = n_distinct(species),
distinct_island = n_distinct(island),
distinct_sex = n_distinct(sex))
## # A tibble: 1 × 3
## distinct_species distinct_island distinct_sex
## <int> <int> <int>
## 1 3 3 3
By using across()
we can reduce the clutter and make
things cleaner.
penguins %>%
summarize(across(c(species, island, sex), n_distinct))
## # A tibble: 1 × 3
## species island sex
## <int> <int> <int>
## 1 3 3 3
The new version. The reason why I show you this is because you don’t want to leave with code that is already outdated. Remember, R is under constant revision so expect changes.
penguins %>%
summarize(across(c(species, island, sex), \(x) n_distinct(x, na.rm = TRUE)))
## # A tibble: 1 × 3
## species island sex
## <int> <int> <int>
## 1 3 3 2
# The function takes x (each column in the specified range) and applies n_distinct(x, na.rm = TRUE)
This is very helpful for numeric variables.
penguins %>%
summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) %>%
# The function takes x (each column in the specified range) and applies mean(x, na.rm = TRUE) to compute the mean while removing NA values
select(!year) #removes the year variable
## # A tibble: 1 × 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <dbl> <dbl>
## 1 43.9 17.2 201. 4202.
group_by
also works.
penguins %>%
group_by(sex) %>%
summarize(across(contains("mm"), \(x) mean(x, na.rm = TRUE)))
## # A tibble: 3 × 4
## sex bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <dbl> <dbl> <dbl>
## 1 female 42.1 16.4 197.
## 2 male 45.9 17.9 205.
## 3 <NA> 41.3 16.6 199
Mean
penguins %>%
group_by(sex) %>%
summarize(across(contains("mm"), \(x) mean(x, na.rm = TRUE)))
## # A tibble: 3 × 4
## sex bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <dbl> <dbl> <dbl>
## 1 female 42.1 16.4 197.
## 2 male 45.9 17.9 205.
## 3 <NA> 41.3 16.6 199
Standard Deviation
penguins %>%
group_by(sex) %>%
summarize(across(contains("mm"), \(x) sd(x, na.rm = TRUE)))
## # A tibble: 3 × 4
## sex bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <dbl> <dbl> <dbl>
## 1 female 4.90 1.80 12.5
## 2 male 5.37 1.86 14.5
## 3 <NA> 4.63 2.24 16.5
Supply a list of functions
penguins %>%
group_by(species) %>%
summarize(across(contains("mm"), list(mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE))),
n=n())
## # A tibble: 3 × 8
## species bill_length_mm_mean bill_length_mm_sd bill_depth_mm_mean
## <fct> <dbl> <dbl> <dbl>
## 1 Adelie 38.8 2.66 18.3
## 2 Chinstrap 48.8 3.34 18.4
## 3 Gentoo 47.5 3.08 15.0
## # ℹ 4 more variables: bill_depth_mm_sd <dbl>, flipper_length_mm_mean <dbl>,
## # flipper_length_mm_sd <dbl>, n <int>
Please review the learning goals and be sure to use the code here as
a reference when completing the homework.
–>Home