At the end of this exercise, you will be able to:
1. Produce statistical summaries using summarize().
2. Produce counts using count().
library("tidyverse")
library("janitor")
For this lab, we will use the built-in data on mammal sleep patterns,
msleep. From: V. M. Savage and G. B. West. A
quantitative, theoretical framework for understanding mammalian sleep.
Proceedings of the National Academy of Sciences, 104 (3):1051-1056,
2007.
msleep
## # A tibble: 83 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Cheet… Acin… carni Carn… lc 12.1 NA NA 11.9
## 2 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
## 3 Mount… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
## 4 Great… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
## 7 North… Call… carni Carn… vu 8.7 1.4 0.383 15.3
## 8 Vespe… Calo… <NA> Rode… <NA> 7 NA NA 17
## 9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
## 10 Roe d… Capr… herbi Arti… lc 3 NA NA 21
## # ℹ 73 more rows
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
When first exploring data, it is often a good idea to produce some
basic summary statistics. This can help you identify potential problems
with the data (e.g., outliers, missing values, skew) as well as get a
general sense of the distribution of values for a given variable. So
far, you have learned to calculate summary statistics using base
R functions such as mean(), sd(),
min(), and max().
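As a quick refresher, these base R functions each take a numeric vector, which you can pull out of a data frame with `$`. The sketch below is only an illustration; the choice of `sleep_total` as the column is an assumption made for the example.

```r
library("tidyverse")  # msleep ships with ggplot2, which tidyverse loads

# Each function takes a numeric vector and returns a single value
mean(msleep$sleep_total)
sd(msleep$sleep_total)
min(msleep$sleep_total)
max(msleep$sleep_total)
```

Because `sleep_total` has no missing values, none of these calls needs `na.rm`.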
summary(msleep)
## name genus vore order
## Length:83 Length:83 Length:83 Length:83
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## conservation sleep_total sleep_rem sleep_cycle
## Length:83 Min. : 1.90 Min. :0.100 Min. :0.1167
## Class :character 1st Qu.: 7.85 1st Qu.:0.900 1st Qu.:0.1833
## Mode :character Median :10.10 Median :1.500 Median :0.3333
## Mean :10.43 Mean :1.875 Mean :0.4396
## 3rd Qu.:13.75 3rd Qu.:2.400 3rd Qu.:0.5792
## Max. :19.90 Max. :6.600 Max. :1.5000
## NA's :22 NA's :51
## awake brainwt bodywt
## Min. : 4.10 Min. :0.00014 Min. : 0.005
## 1st Qu.:10.25 1st Qu.:0.00290 1st Qu.: 0.174
## Median :13.90 Median :0.01240 Median : 1.670
## Mean :13.57 Mean :0.28158 Mean : 166.136
## 3rd Qu.:16.15 3rd Qu.:0.12550 3rd Qu.: 41.750
## Max. :22.10 Max. :5.71200 Max. :6654.000
## NA's :27
large <- msleep %>%
filter(bodywt > 150) %>%
arrange(desc(bodywt))
large
## # A tibble: 11 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afric… Loxo… herbi Prob… vu 3.3 NA NA 20.7
## 2 Asian… Elep… herbi Prob… en 3.9 NA NA 20.1
## 3 Giraf… Gira… herbi Arti… cd 1.9 0.4 NA 22.1
## 4 Pilot… Glob… carni Ceta… cd 2.7 0.1 NA 21.4
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
## 7 Brazi… Tapi… herbi Peri… vu 4.4 1 0.9 19.6
## 8 Donkey Equus herbi Peri… domesticated 3.1 0.4 NA 20.9
## 9 Bottl… Turs… carni Ceta… <NA> 5.2 NA NA 18.8
## 10 Tiger Pant… carni Carn… en 15.8 NA NA 8.2
## 11 Lion Pant… carni Carn… vu 13.5 NA NA 10.5
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
small <- msleep %>%
filter(bodywt<10) %>%
arrange(desc(bodywt))
small
## # A tibble: 56 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Macaq… Maca… omni Prim… <NA> 10.1 1.2 0.75 13.9
## 2 Grivet Cerc… omni Prim… lc 10 0.7 NA 14
## 3 Short… Tach… inse… Mono… <NA> 8.6 NA NA 15.4
## 4 Red f… Vulp… carni Carn… <NA> 9.8 2.4 0.35 14.2
## 5 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
## 6 Rock … Proc… <NA> Hyra… lc 5.4 0.5 NA 18.6
## 7 Long-… Dasy… carni Cing… lc 17.4 3.1 0.383 6.6
## 8 Arcti… Vulp… carni Carn… <NA> 12.5 NA NA 11.5
## 9 Domes… Felis carni Carn… domesticated 12.5 3.2 0.417 11.5
## 10 Tree … Dend… herbi Hyra… lc 5.3 0.5 NA 18.7
## # ℹ 46 more rows
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
mean(large$bodywt)
## [1] 1173.99
mean(small$bodywt)
## [1] 1.201304
summarize()
Fortunately, dplyr has some functions that make producing summary
statistics easier. The verb summarize() produces summary
statistics for a given variable in a data frame. Let’s try the examples
above using summarize().
Large mammals
msleep %>%
filter(bodywt>150) %>%
summarize(mean_bodywt_lg=mean(bodywt))
## # A tibble: 1 × 1
## mean_bodywt_lg
## <dbl>
## 1 1174.
Small mammals
msleep %>%
filter(bodywt<10) %>%
summarize(mean_bodywt_sm=mean(bodywt))
## # A tibble: 1 × 1
## mean_bodywt_sm
## <dbl>
## 1 1.20
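One difference worth noticing is the shape of the result: `mean()` returns a bare number, while `summarize()` returns a tibble with a named column. The following sketch compares the two for the large mammals; the object names `base_mean` and `dplyr_mean` are made up for this illustration.

```r
library("tidyverse")

# Base R: returns a bare numeric vector of length 1
base_mean <- mean(msleep$bodywt[msleep$bodywt > 150])

# dplyr: returns a 1 x 1 tibble with a named column
dplyr_mean <- msleep %>%
  filter(bodywt > 150) %>%
  summarize(mean_bodywt_lg = mean(bodywt))

# The values agree; only the container differs
all.equal(base_mean, dplyr_mean$mean_bodywt_lg)
```

Keeping the result in a tibble means it can be piped straight into further dplyr or ggplot2 steps.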
You can also combine several functions in a single summarize() call to produce multiple summaries of the same variable.
msleep %>%
filter(bodywt>150) %>%
summarize(mean_bodywt_lg=mean(bodywt),
min_bodywt_lg=min(bodywt),
max_bodywt_lg=max(bodywt),
sd_bodywt_lg=sd(bodywt),
total=n())
## # A tibble: 1 × 5
## mean_bodywt_lg min_bodywt_lg max_bodywt_lg sd_bodywt_lg total
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 1174. 161. 6654 1945. 11
Maybe you want to summarize all of the numeric variables in the data
frame. You can use the where() function to select only the
numeric variables.
msleep %>%
select(where(is.numeric)) %>%
summarize_all(mean)
## # A tibble: 1 × 6
## sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 NA NA 13.6 NA 166.
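To see which columns are responsible for the NA results, it can help to count the missing values in each numeric column. This is a minimal sketch using base R's `is.na()` and `colSums()`; the name `na_counts` is only for illustration. The counts should match the NA's rows reported by `summary(msleep)`.

```r
library("tidyverse")

# Count the missing values in each numeric column
na_counts <- msleep %>%
  select(where(is.numeric)) %>%
  is.na() %>%   # logical matrix: TRUE where a value is missing
  colSums()     # TRUE counts as 1, so this totals the NAs per column

na_counts
```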
mean() returns NA for any column that contains missing values, so don’t forget to remove the NA’s with na.rm=T.
msleep %>%
select(where(is.numeric)) %>%
summarize_all(mean, na.rm=T)
## # A tibble: 1 × 6
## sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 1.88 0.440 13.6 0.282 166.
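The across() function gives the same result inside a single summarize() call, without a separate select() step. summarize_all() is superseded in current versions of dplyr, so across() is the preferred form.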
msleep %>%
summarize(across(where(is.numeric),
~mean(.x, na.rm=T)))
## # A tibble: 1 × 6
## sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 1.88 0.440 13.6 0.282 166.
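`across()` can also apply more than one function at a time if you pass it a named list. The sketch below goes a step beyond the example above and is offered as an illustration only; the object name `multi_summary` is hypothetical. By default, dplyr names each output column as the original column name, an underscore, and the name from the list.

```r
library("tidyverse")

# Mean and standard deviation for every numeric column, ignoring NAs
multi_summary <- msleep %>%
  summarize(across(where(is.numeric),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd   = ~sd(.x, na.rm = TRUE))))

multi_summary
names(multi_summary)  # e.g. sleep_total_mean, sleep_total_sd, ...
```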
What are the mean, minimum, and maximum bodywt for the taxonomic order Primates? Provide the total number of observations.
msleep %>%
filter(order=="Primates") %>%
summarize(mean_bodywt=mean(bodywt),
min_bodywt=min(bodywt),
max_bodywt=max(bodywt))
## # A tibble: 1 × 3
## mean_bodywt min_bodywt max_bodywt
## <dbl> <dbl> <dbl>
## 1 13.9 0.2 62
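The summary above covers the mean, minimum, and maximum, but not the total number of observations. One way to include it, sketched here with the hypothetical object name `primates_summary`, is to add `n()` to the same `summarize()` call.

```r
library("tidyverse")

primates_summary <- msleep %>%
  filter(order == "Primates") %>%
  summarize(mean_bodywt = mean(bodywt),
            min_bodywt = min(bodywt),
            max_bodywt = max(bodywt),
            total = n())  # number of rows left after the filter

primates_summary
```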
n_distinct() is a handy way of cleanly presenting the
number of distinct observations. Notice that the 11 mammals with a body
weight over 100 kg belong to only 9 genera, because Equus and Panthera
each appear twice.
msleep %>%
filter(bodywt>100) %>%
distinct()
## # A tibble: 11 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 2 Asian… Elep… herbi Prob… en 3.9 NA NA 20.1
## 3 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
## 4 Donkey Equus herbi Peri… domesticated 3.1 0.4 NA 20.9
## 5 Giraf… Gira… herbi Arti… cd 1.9 0.4 NA 22.1
## 6 Pilot… Glob… carni Ceta… cd 2.7 0.1 NA 21.4
## 7 Afric… Loxo… herbi Prob… vu 3.3 NA NA 20.7
## 8 Tiger Pant… carni Carn… en 15.8 NA NA 8.2
## 9 Lion Pant… carni Carn… vu 13.5 NA NA 10.5
## 10 Brazi… Tapi… herbi Peri… vu 4.4 1 0.9 19.6
## 11 Bottl… Turs… carni Ceta… <NA> 5.2 NA NA 18.8
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
Here we show the number of distinct genera over 100 in body weight.
msleep %>%
filter(bodywt>100) %>%
summarize(n_genera=n_distinct(genus))
## # A tibble: 1 × 1
## n_genera
## <int>
## 1 9
Do not confuse n_distinct() with n(), which counts the total number of observations.
msleep %>%
filter(bodywt>100) %>%
summarize(total=n())
## # A tibble: 1 × 1
## total
## <int>
## 1 11
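Placing both functions in one `summarize()` call makes the difference easy to see side by side. This is a small sketch; the object name `genus_vs_rows` is made up for the example.

```r
library("tidyverse")

genus_vs_rows <- msleep %>%
  filter(bodywt > 100) %>%
  summarize(n_genera = n_distinct(genus),  # unique genus values: 9
            total = n())                   # all rows after the filter: 11

genus_vs_rows
```

The two numbers differ whenever a genus appears in more than one row.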
pull() is another useful verb that extracts a single
column from a data frame as a vector.
msleep %>%
filter(bodywt>100) %>%
distinct(genus) %>%
pull(genus)
## [1] "Bos" "Elephas" "Equus" "Giraffa"
## [5] "Globicephalus" "Loxodonta" "Panthera" "Tapirus"
## [9] "Tursiops"
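Because `pull()` returns an ordinary vector rather than a tibble, the result can be passed to any function that expects a vector. The sketch below stores the result under the hypothetical name `big_genera` and checks a few of its properties.

```r
library("tidyverse")

big_genera <- msleep %>%
  filter(bodywt > 100) %>%
  distinct(genus) %>%
  pull(genus)

class(big_genera)         # a character vector, not a tibble
length(big_genera)        # one element per distinct genus
"Equus" %in% big_genera   # membership test on the vector
```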
msleep %>%
summarize(n_genera=n_distinct(genus))
## # A tibble: 1 × 1
## n_genera
## <int>
## 1 77
Although summarize() is helpful, we are often interested only in
counts. The janitor package offers several tools for working with
counts, and dplyr has useful counting functions of its own.
For example, if we wanted to count the number of observations for
each vore category in the msleep data frame we
could do the following:
msleep %>%
count(vore, sort=T)
## # A tibble: 5 × 2
## vore n
## <chr> <int>
## 1 herbi 32
## 2 omni 20
## 3 carni 19
## 4 <NA> 7
## 5 insecti 5
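A useful sanity check is that the counts add up to the number of rows in the data frame, since every observation falls into exactly one category (including `NA`). The sketch below also uses the `name` argument of `count()` to give the count column a more descriptive label; `n_mammals` and `vore_counts` are names chosen only for this illustration.

```r
library("tidyverse")

vore_counts <- msleep %>%
  count(vore, sort = TRUE, name = "n_mammals")

vore_counts

# The category counts should account for every row in msleep
sum(vore_counts$n_mammals) == nrow(msleep)
```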
msleep %>%
count(vore) %>%
ggplot(aes(x=vore, y=n))+
geom_col() # use geom_col() when both x and y are supplied; here y is the count n
You can even combine multiple variables to get counts. For example,
if we wanted to count the number of observations for each
vore category within each order category we
could do the following:
msleep %>%
count(vore, order)
## # A tibble: 32 × 3
## vore order n
## <chr> <chr> <int>
## 1 carni Carnivora 12
## 2 carni Cetacea 3
## 3 carni Cingulata 1
## 4 carni Didelphimorphia 1
## 5 carni Primates 1
## 6 carni Rodentia 1
## 7 herbi Artiodactyla 5
## 8 herbi Diprotodontia 1
## 9 herbi Hyracoidea 2
## 10 herbi Lagomorpha 1
## # ℹ 22 more rows
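With 32 combinations, the default alphabetical order can hide the most common pairings. As a sketch, adding `sort=TRUE` brings the largest groups to the top; the object name `vore_order_counts` is hypothetical.

```r
library("tidyverse")

vore_order_counts <- msleep %>%
  count(vore, order, sort = TRUE)  # largest combinations first

head(vore_order_counts, 5)
```

The first row should be the herbivorous rodents, the largest single combination in the data.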
The tabyl() function from the janitor package is also very useful for producing counts and percentages.
msleep %>%
tabyl(vore)
## vore n percent valid_percent
## carni 19 0.22891566 0.25000000
## herbi 32 0.38554217 0.42105263
## insecti 5 0.06024096 0.06578947
## omni 20 0.24096386 0.26315789
## <NA> 7 0.08433735 NA
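In this output, `percent` is calculated over all 83 rows, while `valid_percent` leaves out the 7 rows where `vore` is `NA` (so carnivores are 19 of 76, or 0.25). The proportions can be made easier to read with janitor's `adorn_pct_formatting()`, as in the sketch below; the object name `vore_table` and the choice of one decimal place are assumptions for the example.

```r
library("tidyverse")
library("janitor")

vore_table <- msleep %>%
  tabyl(vore) %>%
  adorn_pct_formatting(digits = 1)  # convert proportions to percentage strings

vore_table
```

Note that the formatted columns become character strings, so do any arithmetic on the proportions before this step.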
msleep %>%
filter(order=="Carnivora") %>%
count(conservation, sort=T)
## # A tibble: 6 × 2
## conservation n
## <chr> <int>
## 1 vu 3
## 2 <NA> 3
## 3 domesticated 2
## 4 lc 2
## 5 en 1
## 6 nt 1
msleep %>%
filter(vore=="herbi") %>%
count(order, sort=T)
## # A tibble: 9 × 2
## order n
## <chr> <int>
## 1 Rodentia 16
## 2 Artiodactyla 5
## 3 Perissodactyla 3
## 4 Hyracoidea 2
## 5 Proboscidea 2
## 6 Diprotodontia 1
## 7 Lagomorpha 1
## 8 Pilosa 1
## 9 Primates 1