Lab 7

Learning Goals

At the end of this exercise, you will be able to:
1. Produce statistical summaries using summarize().
2. Produce counts using count().

Load the libraries

library("tidyverse")
library("janitor")

Data

For this lab, we will use the built-in data on mammal sleep patterns, msleep. From: V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.

msleep

## # A tibble: 83 × 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
##  2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
##  3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
##  4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
##  5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
##  6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
##  7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
##  8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
##  9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
## 10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
## # ℹ 73 more rows
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Summary statistics

When first exploring data, it is a good idea to produce some basic summary statistics. They help you identify potential problems with the data (e.g., outliers, missing values, skew) and give you a general sense of how the values of each variable are distributed. So far, you have learned to calculate summary statistics using base R functions such as mean(), sd(), min(), and max(). The base R function summary() reports several of these at once for every column, along with the number of NAs.

summary(msleep)

##      name              genus               vore              order          
##  Length:83          Length:83          Length:83          Length:83         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  conservation        sleep_total      sleep_rem      sleep_cycle    
##  Length:83          Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
##  Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
##  Mode  :character   Median :10.10   Median :1.500   Median :0.3333  
##                     Mean   :10.43   Mean   :1.875   Mean   :0.4396  
##                     3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
##                     Max.   :19.90   Max.   :6.600   Max.   :1.5000  
##                                     NA's   :22      NA's   :51      
##      awake          brainwt            bodywt        
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
##  Median :13.90   Median :0.01240   Median :   1.670  
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
##                  NA's   :27
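For a single variable, the base R functions mentioned above can be called directly on a column. The following is a minimal sketch; it assumes msleep comes from ggplot2, and sleep_total is picked only for illustration because that column has no missing values. The object names are made up for this example.

```r
library(ggplot2) # provides the msleep data

# Base R summary statistics for one column, selected with $
mean_sleep <- mean(msleep$sleep_total)
sd_sleep   <- sd(msleep$sleep_total)
min_sleep  <- min(msleep$sleep_total)
max_sleep  <- max(msleep$sleep_total)

# Collect the results in a named vector for easier reading
c(mean = mean_sleep, sd = sd_sleep, min = min_sleep, max = max_sleep)
```

The mean, min, and max should match the sleep_total column of the summary() output.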

Practice

Calculate the mean body weight of large and small mammals in msleep. Here, large mammals have a body weight (bodywt, in kilograms) greater than 150, while small mammals have a body weight less than 10. I did large mammals for you as an example.

large <- msleep %>% 
  filter(bodywt > 150) %>% 
  arrange(desc(bodywt))
large

## # A tibble: 11 × 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Afric… Loxo… herbi Prob… vu                   3.3      NA        NA      20.7
##  2 Asian… Elep… herbi Prob… en                   3.9      NA        NA      20.1
##  3 Giraf… Gira… herbi Arti… cd                   1.9       0.4      NA      22.1
##  4 Pilot… Glob… carni Ceta… cd                   2.7       0.1      NA      21.4
##  5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
##  6 Horse  Equus herbi Peri… domesticated         2.9       0.6       1      21.1
##  7 Brazi… Tapi… herbi Peri… vu                   4.4       1         0.9    19.6
##  8 Donkey Equus herbi Peri… domesticated         3.1       0.4      NA      20.9
##  9 Bottl… Turs… carni Ceta… <NA>                 5.2      NA        NA      18.8
## 10 Tiger  Pant… carni Carn… en                  15.8      NA        NA       8.2
## 11 Lion   Pant… carni Carn… vu                  13.5      NA        NA      10.5
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

small <- msleep %>% 
  filter(bodywt<10) %>% 
  arrange(desc(bodywt))
small

## # A tibble: 56 × 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Macaq… Maca… omni  Prim… <NA>                10.1       1.2       0.75   13.9
##  2 Grivet Cerc… omni  Prim… lc                  10         0.7      NA      14  
##  3 Short… Tach… inse… Mono… <NA>                 8.6      NA        NA      15.4
##  4 Red f… Vulp… carni Carn… <NA>                 9.8       2.4       0.35   14.2
##  5 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
##  6 Rock … Proc… <NA>  Hyra… lc                   5.4       0.5      NA      18.6
##  7 Long-… Dasy… carni Cing… lc                  17.4       3.1       0.383   6.6
##  8 Arcti… Vulp… carni Carn… <NA>                12.5      NA        NA      11.5
##  9 Domes… Felis carni Carn… domesticated        12.5       3.2       0.417  11.5
## 10 Tree … Dend… herbi Hyra… lc                   5.3       0.5      NA      18.7
## # ℹ 46 more rows
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

mean(large$bodywt)

## [1] 1173.99

mean(small$bodywt)

## [1] 1.201304

summarize()

Fortunately, dplyr has some functions that make producing summary statistics easier. The verb summarize() produces summary statistics for a given variable in a data frame. Let’s try the examples above using summarize().

Large mammals

msleep %>% 
  filter(bodywt>150) %>% 
  summarize(mean_bodywt_lg=mean(bodywt))

## # A tibble: 1 × 1
##   mean_bodywt_lg
##            <dbl>
## 1          1174.

Small mammals

msleep %>% 
  filter(bodywt<10) %>% 
  summarize(mean_bodywt_sm=mean(bodywt))

## # A tibble: 1 × 1
##   mean_bodywt_sm
##            <dbl>
## 1           1.20
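The two means can also be produced in one summarize() call. This is only a sketch of one possible approach: it uses base R subsetting with [] inside mean() in place of the separate filter() steps, and the name size_means is made up for illustration.

```r
library(dplyr)
library(ggplot2) # provides the msleep data

# One summarize() call can hold both means; subsetting with []
# restricts each calculation to the rows of interest
size_means <- msleep %>% 
  summarize(mean_bodywt_lg = mean(bodywt[bodywt > 150]),
            mean_bodywt_sm = mean(bodywt[bodywt < 10]))
size_means
```

The result is a single row with two columns, which should agree with the two separate summaries above.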

You can also compute several summary statistics in a single summarize() call. Here, n() counts the number of rows that remain after filtering.

msleep %>% 
  filter(bodywt>150) %>% 
  summarize(mean_bodywt_lg=mean(bodywt),
            min_bodywt_lg=min(bodywt),
            max_bodywt_lg=max(bodywt),
            sd_bodywt_lg=sd(bodywt),
            total=n())

## # A tibble: 1 × 5
##   mean_bodywt_lg min_bodywt_lg max_bodywt_lg sd_bodywt_lg total
##            <dbl>         <dbl>         <dbl>        <dbl> <int>
## 1          1174.          161.          6654        1945.    11

Maybe you want to summarize all of the numeric variables in the data frame. You can use the where() function to select only the numeric variables.

msleep %>% 
  select(where(is.numeric)) %>% 
  summarize_all(mean)

## # A tibble: 1 × 6
##   sleep_total sleep_rem sleep_cycle awake brainwt bodywt
##         <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
## 1        10.4        NA          NA  13.6      NA   166.

Columns that contain missing values return NA. Remove the NAs before averaging by passing na.rm=TRUE.

msleep %>% 
  select(where(is.numeric)) %>% 
  summarize_all(mean, na.rm=TRUE)

## # A tibble: 1 × 6
##   sleep_total sleep_rem sleep_cycle awake brainwt bodywt
##         <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
## 1        10.4      1.88       0.440  13.6   0.282   166.
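To see why na.rm matters, it can help to try mean() on a small vector and then count the missing values in each column. This is a sketch for illustration; count_na is a hypothetical helper defined here, not a dplyr function.

```r
library(dplyr)
library(ggplot2) # provides the msleep data

# A single NA makes mean() return NA unless na.rm = TRUE
x <- c(1, NA, 3)
mean(x)
mean(x, na.rm = TRUE)

# Hypothetical helper: number of missing values in a vector
count_na <- function(v) sum(is.na(v))

# Count the NAs in each numeric column
na_counts <- msleep %>% 
  select(where(is.numeric)) %>% 
  summarize_all(count_na)
na_counts
```

The counts should match the NA's rows reported by summary(msleep).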

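The across() function does the same thing in a single step. It is the approach dplyr currently recommends; summarize_all() still works but is superseded.

msleep %>% 
  summarize(across(where(is.numeric),
                   ~mean(.x, na.rm=TRUE)))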

## # A tibble: 1 × 6
##   sleep_total sleep_rem sleep_cycle awake brainwt bodywt
##         <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
## 1        10.4      1.88       0.440  13.6   0.282   166.
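across() also accepts a named list of functions, which gives more than one statistic per column. The following is a sketch under that assumption; the list names mean and sd and the object name mean_sd are chosen here for illustration.

```r
library(dplyr)
library(ggplot2) # provides the msleep data

# A named list of functions gives one output column per
# variable-function pair, named <column>_<function> by default
mean_sd <- msleep %>% 
  summarize(across(where(is.numeric),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd   = ~sd(.x, na.rm = TRUE))))
mean_sd
```

With six numeric columns and two functions, the result has twelve columns, such as sleep_total_mean and sleep_total_sd.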

Practice

What are the mean, min, and max bodywt for the taxonomic order Primates? Provide the total number of observations.

msleep %>% 
  filter(order=="Primates") %>% 
  summarize(mean_bodywt=mean(bodywt),
            min_bodywt=min(bodywt),
            max_bodywt=max(bodywt))

## # A tibble: 1 × 3
##   mean_bodywt min_bodywt max_bodywt
##         <dbl>      <dbl>      <dbl>
## 1        13.9        0.2         62
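The answer above does not yet report the total number of observations. One way to include it, sketched here, is to add n() to the same summarize() call; the object name primates_summary is made up for illustration.

```r
library(dplyr)
library(ggplot2) # provides the msleep data

primates_summary <- msleep %>% 
  filter(order == "Primates") %>% 
  summarize(mean_bodywt = mean(bodywt),
            min_bodywt = min(bodywt),
            max_bodywt = max(bodywt),
            total = n()) # n() counts the rows left after filter()
primates_summary
```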

n_distinct() is a handy way of counting the number of distinct values in a variable. The 11 mammals below have a body weight over 100, but they represent only 9 genera because Equus and Panthera each appear twice.

msleep %>% 
  filter(bodywt>100) %>% 
  distinct()

## # A tibble: 11 × 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
##  2 Asian… Elep… herbi Prob… en                   3.9      NA        NA      20.1
##  3 Horse  Equus herbi Peri… domesticated         2.9       0.6       1      21.1
##  4 Donkey Equus herbi Peri… domesticated         3.1       0.4      NA      20.9
##  5 Giraf… Gira… herbi Arti… cd                   1.9       0.4      NA      22.1
##  6 Pilot… Glob… carni Ceta… cd                   2.7       0.1      NA      21.4
##  7 Afric… Loxo… herbi Prob… vu                   3.3      NA        NA      20.7
##  8 Tiger  Pant… carni Carn… en                  15.8      NA        NA       8.2
##  9 Lion   Pant… carni Carn… vu                  13.5      NA        NA      10.5
## 10 Brazi… Tapi… herbi Peri… vu                   4.4       1         0.9    19.6
## 11 Bottl… Turs… carni Ceta… <NA>                 5.2      NA        NA      18.8
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Here we count the distinct genera among mammals with a body weight over 100.

msleep %>% 
  filter(bodywt>100) %>% 
  summarize(n_genera=n_distinct(genus))

## # A tibble: 1 × 1
##   n_genera
##      <int>
## 1        9

Do not confuse n_distinct() with n(), which counts the total number of observations.

msleep %>% 
  filter(bodywt>100) %>% 
  summarize(total=n())

## # A tibble: 1 × 1
##   total
##   <int>
## 1    11
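Placing n() and n_distinct() in the same summarize() call makes the difference easy to see. This is a small sketch; the name heavy_counts is made up for illustration.

```r
library(dplyr)
library(ggplot2) # provides the msleep data

heavy_counts <- msleep %>% 
  filter(bodywt > 100) %>% 
  summarize(total = n(),                   # number of rows (mammals)
            n_genera = n_distinct(genus))  # number of unique genus values
heavy_counts
```

The two columns should reproduce the 11 observations and 9 genera shown above.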

pull() is another useful verb that extracts a single column from a data frame as a vector.

msleep %>% 
  filter(bodywt>100) %>% 
  distinct(genus) %>% 
  pull(genus)

## [1] "Bos"           "Elephas"       "Equus"         "Giraffa"      
## [5] "Globicephalus" "Loxodonta"     "Panthera"      "Tapirus"      
## [9] "Tursiops"
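Because pull() returns a plain vector, its result can be passed to base R functions that expect a vector rather than a data frame. A minimal sketch, with heavy_bodywt as a made-up name:

```r
library(dplyr)
library(ggplot2) # provides the msleep data

# Extract the body weights of the heavy mammals as a numeric vector
heavy_bodywt <- msleep %>% 
  filter(bodywt > 100) %>% 
  pull(bodywt)

is.vector(heavy_bodywt) # a vector, not a tibble
mean(heavy_bodywt)      # base R functions work on it directly
```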

Practice

How many genera are represented in the msleep data frame?

msleep %>% 
  summarize(n_genera=n_distinct(genus))

## # A tibble: 1 × 1
##   n_genera
##      <int>
## 1       77

Counts

Although summarize() is helpful, we are often only interested in counts. The janitor package offers several functions for working with counts, and dplyr has useful ones of its own, such as count().

For example, if we wanted to count the number of observations for each vore category in the msleep data frame we could do the following:

msleep %>% 
  count(vore, sort=TRUE)

## # A tibble: 5 × 2
##   vore        n
##   <chr>   <int>
## 1 herbi      32
## 2 omni       20
## 3 carni      19
## 4 <NA>        7
## 5 insecti     5

The counts can be piped straight into a plot.

msleep %>% 
  count(vore) %>% 
  ggplot(aes(x=vore, y=n))+
  geom_col() # geom_col() plots values that are already computed, so it needs both an x and a y
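For comparison, geom_bar() does the counting itself, so the count() step and the y aesthetic are not needed. This is a sketch of that alternative; p_bar is a made-up name.

```r
library(ggplot2) # provides ggplot() and the msleep data

# geom_bar() counts the rows in each vore category, so only x is supplied
p_bar <- ggplot(msleep, aes(x = vore)) +
  geom_bar()
p_bar
```

Use geom_col() when the heights are already in the data and geom_bar() when they still need to be counted.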

You can even combine multiple variables to get counts. For example, if we wanted to count the number of observations for each vore category within each order category we could do the following:

msleep %>% 
  count(vore, order)

## # A tibble: 32 × 3
##    vore  order               n
##    <chr> <chr>           <int>
##  1 carni Carnivora          12
##  2 carni Cetacea             3
##  3 carni Cingulata           1
##  4 carni Didelphimorphia     1
##  5 carni Primates            1
##  6 carni Rodentia            1
##  7 herbi Artiodactyla        5
##  8 herbi Diprotodontia       1
##  9 herbi Hyracoidea          2
## 10 herbi Lagomorpha          1
## # ℹ 22 more rows
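count() has two further arguments that are useful with multiple variables: sort puts the largest groups first, and name renames the count column. A sketch under those assumptions; the names order_vore and n_mammals are chosen here for illustration.

```r
library(dplyr)
library(ggplot2) # provides the msleep data

# Count each order-vore combination, largest first,
# and call the count column n_mammals instead of n
order_vore <- msleep %>% 
  count(order, vore, sort = TRUE, name = "n_mammals")
order_vore
```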

The tabyl() function from the janitor package is also very useful for producing counts and percentages.

msleep %>% 
  tabyl(vore)

##     vore  n    percent valid_percent
##    carni 19 0.22891566    0.25000000
##    herbi 32 0.38554217    0.42105263
##  insecti  5 0.06024096    0.06578947
##     omni 20 0.24096386    0.26315789
##     <NA>  7 0.08433735            NA
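Two janitor options make this table easier to read: show_na = FALSE leaves out the NA row, and adorn_pct_formatting() turns the proportions into formatted percentages. The following is a sketch; the object names are made up for illustration.

```r
library(dplyr)
library(janitor)
library(ggplot2) # provides the msleep data

# Leave the NA category out of the table
vore_no_na <- msleep %>% 
  tabyl(vore, show_na = FALSE)
vore_no_na

# Show proportions as percentages rounded to one decimal place
vore_pct <- msleep %>% 
  tabyl(vore) %>% 
  adorn_pct_formatting(digits = 1)
vore_pct
```

Note that adorn_pct_formatting() converts the percentage columns to text, so do it last, after any calculations.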

Practice

In the taxonomic order Carnivora, count the number of observations for each conservation status.

msleep %>% 
  filter(order=="Carnivora") %>% 
  count(conservation, sort=TRUE)

## # A tibble: 6 × 2
##   conservation     n
##   <chr>        <int>
## 1 vu               3
## 2 <NA>             3
## 3 domesticated     2
## 4 lc               2
## 5 en               1
## 6 nt               1

Among herbivores, which order is most represented? Return the top 5 orders with counts, sorted in descending order.

msleep %>% 
  filter(vore=="herbi") %>% 
  count(order, sort=TRUE) %>% 
  slice_head(n=5) # keep only the first 5 rows of the sorted counts

## # A tibble: 5 × 2
##   order              n
##   <chr>          <int>
## 1 Rodentia          16
## 2 Artiodactyla       5
## 3 Perissodactyla     3
## 4 Hyracoidea         2
## 5 Proboscidea        2


2026-03-09
