summarize() and count()
At the end of this exercise, you will be able to:
1. Produce clean summaries of data using summarize().
2. Use group_by() in combination with summarize() to produce grouped summaries of data.
library("tidyverse")
library("janitor")
For this lab, we will use the built-in data on mammal sleep patterns. From: V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.
msleep <- msleep
summarize()
summarize() will produce summary statistics for a given variable in a data frame. For example, if you are asked to calculate the mean of sleep_total for large and small mammals, you could do this using a combination of commands, but it isn’t very efficient or clean. We can do better!
For example, if we define “large” as having a bodywt greater than 200 (bodywt is recorded in kilograms), then we get the following:
large <- msleep %>%
  filter(bodywt > 200) %>%
  arrange(desc(bodywt))
large
## # A tibble: 7 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Africa… Loxo… herbi Prob… vu 3.3 NA NA 20.7
## 2 Asian … Elep… herbi Prob… en 3.9 NA NA 20.1
## 3 Giraffe Gira… herbi Arti… cd 1.9 0.4 NA 22.1
## 4 Pilot … Glob… carni Ceta… cd 2.7 0.1 NA 21.4
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
## 7 Brazil… Tapi… herbi Peri… vu 4.4 1 0.9 19.6
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
mean(large$sleep_total)
## [1] 3.3
We can accomplish the same task using the summarize() function.
Large mammals
msleep %>%
  filter(bodywt > 200) %>%
  summarize(mean_sleep_lg = mean(sleep_total))
## # A tibble: 1 × 1
## mean_sleep_lg
## <dbl>
## 1 3.3
Small mammals
msleep %>%
  filter(bodywt < 10) %>%
  summarize(mean_sleep_sm = mean(sleep_total))
## # A tibble: 1 × 1
## mean_sleep_sm
## <dbl>
## 1 12.0
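The two filtered pipelines above can also be collapsed into one grouped summary, which is the group_by() plus summarize() pattern named in the learning goals. The following is a minimal sketch, not part of the original lab: the size_class column and the "medium" label for mammals between the two cutoffs are assumptions introduced here for illustration.

```r
library(dplyr)
msleep <- ggplot2::msleep  # msleep ships with ggplot2 (loaded by tidyverse)

# Label each mammal by size, then summarize once per label.
# size_class and the "medium" label are hypothetical names for this sketch.
size_summary <- msleep %>%
  mutate(size_class = case_when(
    bodywt > 200 ~ "large",
    bodywt < 10  ~ "small",
    TRUE         ~ "medium"
  )) %>%
  group_by(size_class) %>%
  summarize(mean_sleep = mean(sleep_total),
            total = n())
size_summary
```

Each row of size_summary corresponds to one size class, so the large and small means can be compared side by side without writing a separate filter() for each.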
You can also combine several summary functions in a single summarize() call to produce multiple summaries at once.
msleep %>%
  filter(bodywt > 200) %>%
  summarize(mean_sleep_lg = mean(sleep_total),
            min_sleep_lg = min(sleep_total),
            max_sleep_lg = max(sleep_total),
            sd_sleep_lg = sd(sleep_total),
            total = n())
## # A tibble: 1 × 5
## mean_sleep_lg min_sleep_lg max_sleep_lg sd_sleep_lg total
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 3.3 1.9 4.4 0.870 7
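Maybe you want to summarize all of the numeric variables in the data frame. You can use the where() function to select only the numeric variables. Any variable that contains missing values returns NA unless you pass na.rm=TRUE to the summary function.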
msleep %>%
select(where(is.numeric)) %>%
summarize_all(mean)
## # A tibble: 1 × 6
## sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 NA NA 13.6 NA 166.
msleep %>%
select(where(is.numeric)) %>%
summarize_all(mean, na.rm=TRUE)
## # A tibble: 1 × 6
## sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 1.88 0.440 13.6 0.282 166.
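In current versions of dplyr, summarize_all() is superseded by across(), which lets you skip the separate select() step. The sketch below assumes dplyr 1.0.0 or later and should give the same six means as the summarize_all() call above; the numeric_means name is introduced here only for illustration.

```r
library(dplyr)
msleep <- ggplot2::msleep  # msleep ships with ggplot2 (loaded by tidyverse)

# Apply mean() to every numeric column, dropping missing values.
# numeric_means is a hypothetical name used only in this sketch.
numeric_means <- msleep %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
numeric_means
```

Because across() is called inside summarize(), you can add other summaries (for example total = n()) in the same call.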
What are the mean, minimum, and maximum bodywt for the taxonomic order Primates? Provide the total number of observations.
msleep %>%
  filter(order == "Primates") %>%
  summarize(mean_bodywt = mean(bodywt),
            min_bodywt = min(bodywt),
            max_bodywt = max(bodywt),
            total = n())
## # A tibble: 1 × 4
## mean_bodywt min_bodywt max_bodywt total
## <dbl> <dbl> <dbl> <int>
## 1 13.9 0.2 62 12
n_distinct() is a handy way of cleanly presenting the number of distinct observations. Notice that some genera (Equus and Panthera) appear more than once among the mammals with a body weight over 100.
msleep %>%
filter(bodywt > 100)
## # A tibble: 11 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 2 Asian… Elep… herbi Prob… en 3.9 NA NA 20.1
## 3 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
## 4 Donkey Equus herbi Peri… domesticated 3.1 0.4 NA 20.9
## 5 Giraf… Gira… herbi Arti… cd 1.9 0.4 NA 22.1
## 6 Pilot… Glob… carni Ceta… cd 2.7 0.1 NA 21.4
## 7 Afric… Loxo… herbi Prob… vu 3.3 NA NA 20.7
## 8 Tiger Pant… carni Carn… en 15.8 NA NA 8.2
## 9 Lion Pant… carni Carn… vu 13.5 NA NA 10.5
## 10 Brazi… Tapi… herbi Peri… vu 4.4 1 0.9 19.6
## 11 Bottl… Turs… carni Ceta… <NA> 5.2 NA NA 18.8
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
Here we show the number of distinct genera over 100 in body weight.
msleep %>%
filter(bodywt > 100) %>%
summarize(n_genera=n_distinct(genus))
## # A tibble: 1 × 1
## n_genera
## <int>
## 1 9
bodywt_sm <- msleep %>%
filter(bodywt > 100)
unique(bodywt_sm$genus)
## [1] "Bos" "Elephas" "Equus" "Giraffa"
## [5] "Globicephalus" "Loxodonta" "Panthera" "Tapirus"
## [9] "Tursiops"
bodywt_sm %>%
distinct(genus) %>%
pull(genus)
## [1] "Bos" "Elephas" "Equus" "Giraffa"
## [5] "Globicephalus" "Loxodonta" "Panthera" "Tapirus"
## [9] "Tursiops"
There are many other useful summary statistics, depending on your needs: sd(), min(), max(), median(), sum(), n() (returns the number of rows in the current group), first() (returns first value in a column), last() (returns last value in a column) and n_distinct() (number of distinct values in a column).
msleep %>%
summarize(n_genera=n_distinct(genus))
## # A tibble: 1 × 1
## n_genera
## <int>
## 1 77
What are the mean, minimum, and maximum sleep_total for all of the mammals? Be sure to include the total n.
msleep %>%
summarize(mean_sleep_total=mean(sleep_total),
minsleep_total=min(sleep_total),
max_sleep_total=max(sleep_total),
total=n())
## # A tibble: 1 × 4
## mean_sleep_total minsleep_total max_sleep_total total
## <dbl> <dbl> <dbl> <int>
## 1 10.4 1.9 19.9 83
Although summarize() is helpful, oftentimes we are only interested in counts. The janitor package does a lot with counts, but there are also functions that are part of dplyr that are useful.
For example, if we wanted to count the number of observations for each vore category in the msleep data frame we could do the following:
msleep %>%
count(vore)
## # A tibble: 5 × 2
## vore n
## <chr> <int>
## 1 carni 19
## 2 herbi 32
## 3 insecti 5
## 4 omni 20
## 5 <NA> 7
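It may help to see that count() is shorthand for a grouped summary. The sketch below compares the two forms and also shows the sort argument of count(), which orders the result from most to least frequent. The object names are introduced here only for illustration.

```r
library(dplyr)
msleep <- ggplot2::msleep  # msleep ships with ggplot2 (loaded by tidyverse)

# count() in one step
vore_counts <- msleep %>%
  count(vore)

# The same table built by hand with group_by() and summarize()
vore_manual <- msleep %>%
  group_by(vore) %>%
  summarize(n = n())

# sort = TRUE puts the largest category first
vore_sorted <- msleep %>%
  count(vore, sort = TRUE)
vore_sorted
```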
You can even combine multiple variables to get counts. For example, if we wanted to count the number of observations for each vore category within each order category we could do the following:
msleep %>%
count(vore, order)
## # A tibble: 32 × 3
## vore order n
## <chr> <chr> <int>
## 1 carni Carnivora 12
## 2 carni Cetacea 3
## 3 carni Cingulata 1
## 4 carni Didelphimorphia 1
## 5 carni Primates 1
## 6 carni Rodentia 1
## 7 herbi Artiodactyla 5
## 8 herbi Diprotodontia 1
## 9 herbi Hyracoidea 2
## 10 herbi Lagomorpha 1
## # ℹ 22 more rows
The tabyl() function from the janitor package is also very useful for producing counts and percentages.
msleep %>%
tabyl(vore)
## vore n percent valid_percent
## carni 19 0.22891566 0.25000000
## herbi 32 0.38554217 0.42105263
## insecti 5 0.06024096 0.06578947
## omni 20 0.24096386 0.26315789
## <NA> 7 0.08433735 NA
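To make the two percentage columns concrete, here is a sketch that rebuilds them with dplyr alone. It assumes, based on the output above, that percent divides each count by the total number of rows and that valid_percent divides by the number of rows with a non-missing vore. This is an illustration of the arithmetic, not janitor's implementation.

```r
library(dplyr)
msleep <- ggplot2::msleep  # msleep ships with ggplot2 (loaded by tidyverse)

# vore_pct is a hypothetical name used only in this sketch.
vore_pct <- msleep %>%
  count(vore) %>%
  mutate(
    # share of all rows, including those with a missing vore
    percent = n / sum(n),
    # share of rows with a known vore; left as NA for the missing category
    valid_percent = if_else(is.na(vore), NA_real_, n / sum(n[!is.na(vore)]))
  )
vore_pct
```

For carni, for example, percent is 19/83 and valid_percent is 19/76, which matches the first row of the tabyl() output.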
msleep %>%
filter(order=="Carnivora") %>%
count(conservation)
## # A tibble: 6 × 2
## conservation n
## <chr> <int>
## 1 domesticated 2
## 2 en 1
## 3 lc 2
## 4 nt 1
## 5 vu 3
## 6 <NA> 3
msleep %>%
filter(vore=="herbi") %>%
count(order)
## # A tibble: 9 × 2
## order n
## <chr> <int>
## 1 Artiodactyla 5
## 2 Diprotodontia 1
## 3 Hyracoidea 2
## 4 Lagomorpha 1
## 5 Perissodactyla 3
## 6 Pilosa 1
## 7 Primates 1
## 8 Proboscidea 2
## 9 Rodentia 16