Learning Goals

At the end of this exercise, you will be able to:
1. Use pipes to connect functions in dplyr.
2. Use arrange() to order dplyr outputs.
3. Use mutate() to add columns in a dataframe.

Review

At this point, you should be comfortable using the select() and filter() functions of dplyr. If you need help, please email me.

Load the tidyverse and janitor

library("tidyverse")
library("janitor")

Load the data

For this lab, we will use the following two datasets:

  1. Gaeta J., G. Sass, S. Carpenter. 2012. Biocomplexity at North Temperate Lakes LTER: Coordinated Field Studies: Large Mouth Bass Growth 2006. Environmental Data Initiative. link
fish <- read_csv("data/Gaeta_etal_CLC_data.csv")
## Rows: 4033 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): lakeid, annnumber
## dbl (4): fish_id, length, radii_length_mm, scalelength
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402. link
mammals <- read_csv("data/mammal_lifehistories_v2.csv")
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): order, family, Genus, species
## dbl (9): mass, gestation, newborn, weaning, wean mass, AFR, max. life, litte...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Rename the mammals variables using janitor.

mammals <- clean_names(mammals)

Pipes %>%

In order to start combining select(), filter(), and other functions efficiently, we need to learn pipes. Pipes feed the output from one function into the input of another function. This helps us keep our code sequential and clean.

In this example, we use a pipe to select only lakeid and scalelength then filter that output only for lakes “AL”. Notice that we only need to call the data one time.

select(fish, lakeid, scalelength)
## # A tibble: 4,033 × 2
##    lakeid scalelength
##    <chr>        <dbl>
##  1 AL            2.70
##  2 AL            2.70
##  3 AL            2.70
##  4 AL            3.02
##  5 AL            3.02
##  6 AL            3.02
##  7 AL            3.02
##  8 AL            3.34
##  9 AL            3.34
## 10 AL            3.34
## # ℹ 4,023 more rows
filter(fish, lakeid == "AL")
## # A tibble: 383 × 6
##    lakeid fish_id annnumber length radii_length_mm scalelength
##    <chr>    <dbl> <chr>      <dbl>           <dbl>       <dbl>
##  1 AL         299 EDGE         167            2.70        2.70
##  2 AL         299 2            167            2.04        2.70
##  3 AL         299 1            167            1.31        2.70
##  4 AL         300 EDGE         175            3.02        3.02
##  5 AL         300 3            175            2.67        3.02
##  6 AL         300 2            175            2.14        3.02
##  7 AL         300 1            175            1.23        3.02
##  8 AL         301 EDGE         194            3.34        3.34
##  9 AL         301 3            194            2.97        3.34
## 10 AL         301 2            194            2.29        3.34
## # ℹ 373 more rows
fish %>% 
  select(lakeid, scalelength) %>% 
  filter(lakeid == "AL")
## # A tibble: 383 × 2
##    lakeid scalelength
##    <chr>        <dbl>
##  1 AL            2.70
##  2 AL            2.70
##  3 AL            2.70
##  4 AL            3.02
##  5 AL            3.02
##  6 AL            3.02
##  7 AL            3.02
##  8 AL            3.34
##  9 AL            3.34
## 10 AL            3.34
## # ℹ 373 more rows

Here we select family, genus, and species then filter for gestation greater than 15 months.

mammals %>% 
  select("family", "genus", "species", "gestation") %>% 
  filter(gestation>=15)
## # A tibble: 8 × 4
##   family         genus         species       gestation
##   <chr>          <chr>         <chr>             <dbl>
## 1 Delphinidae    Globicephala  macrorhynchus      15.2
## 2 Physeteridae   Physeter      catodon            15.8
## 3 Rhinocerotidae Ceratotherium simum              15.9
## 4 Rhinocerotidae Diceros       bicornis           16.1
## 5 Rhinocerotidae Rhinoceros    unicornis          16.4
## 6 Rhinocerotidae Rhinoceros    sondaicus          16.5
## 7 Elephantidae   Elephas       maximus            21.1
## 8 Elephantidae   Loxodonta     africana           21.5

Practice

  1. We are interested in the fish from the lakes “AL” and “AR” with a radii length between 2 and 4. Extract this information from the fish data. Use pipes!
fish %>% # work with the fish data
  select(lakeid, radii_length_mm) %>% #pull out variables of int
  filter(lakeid=="AL" | lakeid=="AR") %>% #only these lakes
  filter(between(radii_length_mm, 2, 4)) %>% #between 2 and 4
  arrange(desc(radii_length_mm)) #sort to make easier to read
## # A tibble: 253 × 2
##    lakeid radii_length_mm
##    <chr>            <dbl>
##  1 AR                4.00
##  2 AR                3.99
##  3 AR                3.99
##  4 AR                3.98
##  5 AR                3.98
##  6 AR                3.96
##  7 AR                3.94
##  8 AR                3.94
##  9 AR                3.94
## 10 AL                3.94
## # ℹ 243 more rows

arrange()

The arrange() command is a bit like a sort command in excel. Note that the default is ascending order.

fish %>% 
  select(lakeid, scalelength) %>% 
  arrange(scalelength)
## # A tibble: 4,033 × 2
##    lakeid scalelength
##    <chr>        <dbl>
##  1 WS           0.628
##  2 WS           0.637
##  3 LSG          0.773
##  4 WS           0.832
##  5 LSG          1.15 
##  6 WS           1.41 
##  7 WS           1.41 
##  8 WS           1.41 
##  9 BO           1.43 
## 10 BO           1.43 
## # ℹ 4,023 more rows

To sort in decreasing order, wrap the variable name in desc().

fish %>% 
  select(lakeid, scalelength) %>% 
  arrange(desc(scalelength))
## # A tibble: 4,033 × 2
##    lakeid scalelength
##    <chr>        <dbl>
##  1 WS            11.0
##  2 WS            11.0
##  3 WS            11.0
##  4 WS            11.0
##  5 WS            11.0
##  6 WS            11.0
##  7 WS            11.0
##  8 WS            11.0
##  9 WS            11.0
## 10 WS            11.0
## # ℹ 4,023 more rows

arrange() can be very helpful in combination with the other functions.

fish %>% 
  select(lakeid, length, fish_id, scalelength) %>% 
  filter(lakeid=="AL") %>% 
  arrange(fish_id)
## # A tibble: 383 × 4
##    lakeid length fish_id scalelength
##    <chr>   <dbl>   <dbl>       <dbl>
##  1 AL        167     299        2.70
##  2 AL        167     299        2.70
##  3 AL        167     299        2.70
##  4 AL        175     300        3.02
##  5 AL        175     300        3.02
##  6 AL        175     300        3.02
##  7 AL        175     300        3.02
##  8 AL        194     301        3.34
##  9 AL        194     301        3.34
## 10 AL        194     301        3.34
## # ℹ 373 more rows

Practice

  1. We are interested in the radii length and scalelength of fish with the id 300. Extract this information from the fish data. Please sort the data by radii length.
fish %>% 
  select(fish_id, radii_length_mm, scalelength) %>% 
  filter(fish_id==300) %>% 
  arrange(radii_length_mm)
## # A tibble: 4 × 3
##   fish_id radii_length_mm scalelength
##     <dbl>           <dbl>       <dbl>
## 1     300            1.23        3.02
## 2     300            2.14        3.02
## 3     300            2.67        3.02
## 4     300            3.02        3.02
  1. Let’s re-do question #3 from part 1, but this time using pipes.

From the mammals data, filter all members of the family Felidae with a gestation period less than 3 months. As part of this, remove all variables except genus, family, and gestation. Lastly, remove all NA’s in the variable gestation- in this data they are represented by -999. Sort by gestation.

mammals %>% 
  select(genus, family, gestation) %>% 
  filter(family=="Felidae" & gestation<3) %>% 
  filter(gestation!="-999") %>% 
  arrange(gestation)
## # A tibble: 20 × 3
##    genus        family  gestation
##    <chr>        <chr>       <dbl>
##  1 Lynx         Felidae      1.86
##  2 Felis        Felidae      2.05
##  3 Felis        Felidae      2.15
##  4 Prionailurus Felidae      2.17
##  5 Lynx         Felidae      2.18
##  6 Lynx         Felidae      2.18
##  7 Felis        Felidae      2.2 
##  8 Felis        Felidae      2.2 
##  9 Prionailurus Felidae      2.23
## 10 Prionailurus Felidae      2.29
## 11 Herpailurus  Felidae      2.33
## 12 Lynx         Felidae      2.33
## 13 Leptailurus  Felidae      2.36
## 14 Otocolobus   Felidae      2.37
## 15 Caracal      Felidae      2.43
## 16 Oncifelis    Felidae      2.45
## 17 Leopardus    Felidae      2.51
## 18 Leopardus    Felidae      2.64
## 19 Catopuma     Felidae      2.67
## 20 Leopardus    Felidae      2.67

mutate()

Mutate allows us to create a new column from existing columns in a data frame. We are doing a small introduction here and will add some additional functions later.

Let’s convert the scalelength variable from millimeters to centimeters.

fish %>% 
  mutate(scalelength_cm = scalelength/10) %>% 
  select(lakeid, scalelength, scalelength_cm)
## # A tibble: 4,033 × 3
##    lakeid scalelength scalelength_cm
##    <chr>        <dbl>          <dbl>
##  1 AL            2.70          0.270
##  2 AL            2.70          0.270
##  3 AL            2.70          0.270
##  4 AL            3.02          0.302
##  5 AL            3.02          0.302
##  6 AL            3.02          0.302
##  7 AL            3.02          0.302
##  8 AL            3.34          0.334
##  9 AL            3.34          0.334
## 10 AL            3.34          0.334
## # ℹ 4,023 more rows

Practice

  1. In the mammals data, let’s convert the mass variable from grams to kilograms. Don’t forget to remove those pesky -999’s and arrange in descending order.
mammals %>% 
  filter(mass!="-999") %>% 
  mutate(mass_kg = mass/1000) %>% 
  select(genus, mass, mass_kg) %>% 
  arrange(desc(mass_kg))
## # A tibble: 1,355 × 3
##    genus              mass mass_kg
##    <chr>             <dbl>   <dbl>
##  1 Balaenoptera 149000000  149000 
##  2 Balaena       80000000   80000 
##  3 Balaenoptera  66800000   66800 
##  4 Megaptera     30000000   30000 
##  5 Eschrichtius  25066667.  25067.
##  6 Eubalaena     23000000   23000 
##  7 Eubalaena     23000000   23000 
##  8 Balaenoptera  20000000   20000 
##  9 Balaenoptera  16266667.  16267.
## 10 Physeter      15400000   15400 
## # ℹ 1,345 more rows
  1. Using the mammals data, compare wean mass to gestation. Create a new column that calculates the ratio of wean mass to gestation. Arrange the data by this ratio in descending order. Don’t forget to remove the -999’s from wean mass and gestation.
mammals %>% 
  select(genus, mass, wean_mass, gestation) %>% 
  mutate(mass_kg = mass/1000) %>% 
  filter(wean_mass!="-999" & gestation!="-999" & mass!="-999") %>% 
  mutate(wean_gestation_ratio = wean_mass/gestation) %>% 
  arrange(desc(wean_gestation_ratio))
## # A tibble: 372 × 6
##    genus              mass wean_mass gestation mass_kg wean_gestation_ratio
##    <chr>             <dbl>     <dbl>     <dbl>   <dbl>                <dbl>
##  1 Balaenoptera 149000000  19075000      10.7  149000              1781046.
##  2 Rhinoceros     1602333.   885812.     16.4    1602.               53882.
##  3 Hippopotamus   1258333.   237500       7.75   1258.               30645.
##  4 Loxodonta      3507000    600000      21.5    3507                27959.
##  5 Camelus         434000    253484      12.6     434                20054.
##  6 Bison           497667.   157500       8.93    498.               17637.
##  7 Odobenus        650000    200000      11.5     650                17391.
##  8 Mirounga        579400    130192.      7.63    579.               17063.
##  9 Mirounga        716667.   138037.      8.45    717.               16336.
## 10 Syncerus        504667.   166000      11.0     505.               15050.
## # ℹ 362 more rows

Wrap-up

Please review the learning goals and be sure to use the code here as a reference when completing the homework.
–>Home