Learning Goals

At the end of this exercise, you will be able to:
1. Use the filter() function of dplyr to extract observations of interest.
2. Use filter() on multiple conditions to extract observations of interest.
3. Use arrange() to sort data frames by specific variables.
4. Use pipes %>% to connect multiple functions.

Load the tidyverse

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("palmerpenguins") #load the palmerpenguins package
## 
## Attaching package: 'palmerpenguins'
## 
## The following objects are masked from 'package:datasets':
## 
##     penguins, penguins_raw

Palmerpenguins

These data are from: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

Review

In the previous lab, we used select() to extract columns of interest from a data frame. This helps us focus our attention on the variables relevant to our question. However, it doesn’t allow us to extract information in rows. The filter() function allows us to extract data in rows that meet specific criteria.

filter()

Unlike select(), filter() allows us to extract data that meet specific criteria within a row. Let’s say that we are interested only in the Adelie penguins from the penguins data frame. We can use filter() to extract these observations.

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
filter(penguins, species=="Adelie")
## # A tibble: 152 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 142 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

Similarly, if we are only interested in penguins with a body mass greater than or equal to 5500g we can use filter() to extract these observations.

filter(penguins, body_mass_g>=5500)
## # A tibble: 33 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           50            16.3               230        5700
##  2 Gentoo  Biscoe           50            15.2               218        5700
##  3 Gentoo  Biscoe           49            16.1               216        5550
##  4 Gentoo  Biscoe           48.4          14.6               213        5850
##  5 Gentoo  Biscoe           49.3          15.7               217        5850
##  6 Gentoo  Biscoe           49.2          15.2               221        6300
##  7 Gentoo  Biscoe           50.2          14.3               218        5700
##  8 Gentoo  Biscoe           47.8          15                 215        5650
##  9 Gentoo  Biscoe           50            15.3               220        5550
## 10 Gentoo  Biscoe           59.6          17                 230        6050
## # ℹ 23 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
  • filter() allows all of the expected operators; i.e. >, >=, <, <=, != (not equal), and == (equal).

Using the ! operator allows for the exclusion of specific observations.

filter(penguins, species!="Chinstrap")
## # A tibble: 276 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 266 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Using filter() with multiple observations

The %in% operator determines whether or not a value is part of a vector.

filter(penguins, body_mass_g %in% c(5250, 5500))
## # A tibble: 8 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Gentoo  Biscoe           47.3          15.3               222        5250
## 2 Gentoo  Biscoe           44.4          17.3               219        5250
## 3 Gentoo  Biscoe           51.1          16.5               225        5250
## 4 Gentoo  Biscoe           49.1          15                 228        5500
## 5 Gentoo  Biscoe           46.8          16.1               215        5500
## 6 Gentoo  Biscoe           53.4          15.8               219        5500
## 7 Gentoo  Biscoe           48.1          15.1               209        5500
## 8 Gentoo  Biscoe           51.5          16.3               230        5500
## # ℹ 2 more variables: sex <fct>, year <int>

Alternatively, you can use between if you are looking for a range of specific values.

filter(penguins, between(body_mass_g, 5250, 5500))
## # A tibble: 21 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           47.6          14.5               215        5400
##  2 Gentoo  Biscoe           48.7          15.1               222        5350
##  3 Gentoo  Biscoe           47.3          15.3               222        5250
##  4 Gentoo  Biscoe           48.4          16.3               220        5400
##  5 Gentoo  Biscoe           44.4          17.3               219        5250
##  6 Gentoo  Biscoe           48.7          15.7               208        5350
##  7 Gentoo  Biscoe           50.5          15.9               225        5400
##  8 Gentoo  Biscoe           45.2          15.8               215        5300
##  9 Gentoo  Biscoe           48.5          14.1               220        5300
## 10 Gentoo  Biscoe           46.2          14.9               221        5300
## # ℹ 11 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

How would you use %in% to get the same result?

filter(penguins, body_mass_g %in% c(5250, 5300, 5350, 5400, 5450, 5500))
## # A tibble: 21 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           47.6          14.5               215        5400
##  2 Gentoo  Biscoe           48.7          15.1               222        5350
##  3 Gentoo  Biscoe           47.3          15.3               222        5250
##  4 Gentoo  Biscoe           48.4          16.3               220        5400
##  5 Gentoo  Biscoe           44.4          17.3               219        5250
##  6 Gentoo  Biscoe           48.7          15.7               208        5350
##  7 Gentoo  Biscoe           50.5          15.9               225        5400
##  8 Gentoo  Biscoe           45.2          15.8               215        5300
##  9 Gentoo  Biscoe           48.5          14.1               220        5300
## 10 Gentoo  Biscoe           46.2          14.9               221        5300
## # ℹ 11 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

You can also extract observations “near” a certain value but you need to specify a tolerance.

filter(penguins, near(body_mass_g, 5400, tol=100))
## # A tibble: 9 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Gentoo  Biscoe           47.6          14.5               215        5400
## 2 Gentoo  Biscoe           48.7          15.1               222        5350
## 3 Gentoo  Biscoe           48.4          16.3               220        5400
## 4 Gentoo  Biscoe           48.7          15.7               208        5350
## 5 Gentoo  Biscoe           50.5          15.9               225        5400
## 6 Gentoo  Biscoe           52.5          15.6               221        5450
## 7 Gentoo  Biscoe           50            15.9               224        5350
## 8 Gentoo  Biscoe           52.2          17.1               228        5400
## 9 Gentoo  Biscoe           49.9          16.1               213        5400
## # ℹ 2 more variables: sex <fct>, year <int>

Practice

The data are from: Gaeta J., G. Sass, S. Carpenter. 2012. Biocomplexity at North Temperate Lakes LTER: Coordinated Field Studies: Large Mouth Bass Growth 2006. Environmental Data Initiative. link

  1. Load the data
fish <- read_csv("data/Gaeta_etal_CLC_data.csv")
## Rows: 4033 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): lakeid, annnumber
## dbl (4): fish_id, length, radii_length_mm, scalelength
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(fish)
## Rows: 4,033
## Columns: 6
## $ lakeid          <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ fish_id         <dbl> 299, 299, 299, 300, 300, 300, 300, 301, 301, 301, 301,…
## $ annnumber       <chr> "EDGE", "2", "1", "EDGE", "3", "2", "1", "EDGE", "3", …
## $ length          <dbl> 167, 167, 167, 175, 175, 175, 175, 194, 194, 194, 194,…
## $ radii_length_mm <dbl> 2.697443, 2.037518, 1.311795, 3.015477, 2.670733, 2.13…
## $ scalelength     <dbl> 2.697443, 2.697443, 2.697443, 3.015477, 3.015477, 3.01…
  1. Filter the fish data to include the samples from lake “BO”.
filter(fish, lakeid=="BO")
## # A tibble: 197 × 6
##    lakeid fish_id annnumber length radii_length_mm scalelength
##    <chr>    <dbl> <chr>      <dbl>           <dbl>       <dbl>
##  1 BO         389 EDGE         104           1.50         1.50
##  2 BO         389 1            104           0.736        1.50
##  3 BO         390 EDGE         105           1.59         1.59
##  4 BO         390 1            105           0.698        1.59
##  5 BO         391 EDGE         107           1.43         1.43
##  6 BO         391 1            107           0.695        1.43
##  7 BO         392 EDGE         124           2.11         2.11
##  8 BO         392 2            124           1.36         2.11
##  9 BO         392 1            124           0.792        2.11
## 10 BO         393 EDGE         141           2.16         2.16
## # ℹ 187 more rows
  1. Filter the data to include all lakes except “AR”.
filter(fish, lakeid!="AR")
## # A tibble: 3,771 × 6
##    lakeid fish_id annnumber length radii_length_mm scalelength
##    <chr>    <dbl> <chr>      <dbl>           <dbl>       <dbl>
##  1 AL         299 EDGE         167            2.70        2.70
##  2 AL         299 2            167            2.04        2.70
##  3 AL         299 1            167            1.31        2.70
##  4 AL         300 EDGE         175            3.02        3.02
##  5 AL         300 3            175            2.67        3.02
##  6 AL         300 2            175            2.14        3.02
##  7 AL         300 1            175            1.23        3.02
##  8 AL         301 EDGE         194            3.34        3.34
##  9 AL         301 3            194            2.97        3.34
## 10 AL         301 2            194            2.29        3.34
## # ℹ 3,761 more rows
  1. Filter the fish data to include all fish with a scalelength within 0.25 of 8.
filter(fish, near(scalelength, 8, tol=0.25))
## # A tibble: 236 × 6
##    lakeid fish_id annnumber length radii_length_mm scalelength
##    <chr>    <dbl> <chr>      <dbl>           <dbl>       <dbl>
##  1 AL         309 EDGE         355            7.89        7.89
##  2 AL         309 13           355            7.56        7.89
##  3 AL         309 12           355            7.36        7.89
##  4 AL         309 11           355            7.16        7.89
##  5 AL         309 10           355            6.77        7.89
##  6 AL         309 9            355            6.39        7.89
##  7 AL         309 8            355            5.96        7.89
##  8 AL         309 7            355            5.44        7.89
##  9 AL         309 6            355            4.74        7.89
## 10 AL         309 5            355            4.06        7.89
## # ℹ 226 more rows
  1. Filter the fish data to include fish with a scalelength between 2 and 4.
filter(fish, between(scalelength, 2, 4))
## # A tibble: 723 × 6
##    lakeid fish_id annnumber length radii_length_mm scalelength
##    <chr>    <dbl> <chr>      <dbl>           <dbl>       <dbl>
##  1 AL         299 EDGE         167            2.70        2.70
##  2 AL         299 2            167            2.04        2.70
##  3 AL         299 1            167            1.31        2.70
##  4 AL         300 EDGE         175            3.02        3.02
##  5 AL         300 3            175            2.67        3.02
##  6 AL         300 2            175            2.14        3.02
##  7 AL         300 1            175            1.23        3.02
##  8 AL         301 EDGE         194            3.34        3.34
##  9 AL         301 3            194            2.97        3.34
## 10 AL         301 2            194            2.29        3.34
## # ℹ 713 more rows

Using filter() on multiple conditions

You can also use filter() to extract data based on multiple conditions. Below we extract the Chinstrap penguins that have a body mass greater than 4500g.

filter(penguins, species=="Chinstrap" & body_mass_g> 4500)
## # A tibble: 2 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Chinstrap Dream            52.8          20                 205        4550
## 2 Chinstrap Dream            52            20.7               210        4800
## # ℹ 2 more variables: sex <fct>, year <int>

Notice that the | operator generates a different result. Why?

filter(penguins, species=="Chinstrap" | body_mass_g> 4500)
## # A tibble: 181 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.2          19.6               195        4675
##  2 Adelie  Dream               39.8          19.1               184        4650
##  3 Adelie  Dream               39.6          18.8               190        4600
##  4 Adelie  Torgersen           42.9          17.6               196        4700
##  5 Adelie  Biscoe              41            20                 203        4725
##  6 Adelie  Biscoe              43.2          19                 197        4775
##  7 Adelie  Biscoe              45.6          20.3               191        4600
##  8 Gentoo  Biscoe              50            16.3               230        5700
##  9 Gentoo  Biscoe              50            15.2               218        5700
## 10 Gentoo  Biscoe              47.6          14.5               215        5400
## # ℹ 171 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Rules:
+ filter(condition1, condition2) will return rows where both conditions are met. By default the , means &.
+ filter(condition1, !condition2) will return all rows where condition one is true but condition 2 is not.
+ filter(condition1 | condition2) will return rows where condition 1 or condition 2 is met.

In this case, we filter out the penguins with a body mass over 5000g and a bill length over 50mm or a bill depth over 18mm.

filter(penguins, body_mass_g>5000 & (bill_length_mm>50 | bill_depth_mm>18))
## # A tibble: 20 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           50.2          14.3               218        5700
##  2 Gentoo  Biscoe           59.6          17                 230        6050
##  3 Gentoo  Biscoe           50.5          15.9               222        5550
##  4 Gentoo  Biscoe           50.5          15.9               225        5400
##  5 Gentoo  Biscoe           50.4          15.3               224        5550
##  6 Gentoo  Biscoe           54.3          15.7               231        5650
##  7 Gentoo  Biscoe           50.7          15                 223        5550
##  8 Gentoo  Biscoe           51.1          16.3               220        6000
##  9 Gentoo  Biscoe           52.5          15.6               221        5450
## 10 Gentoo  Biscoe           50.8          17.3               228        5600
## 11 Gentoo  Biscoe           51.3          14.2               218        5300
## 12 Gentoo  Biscoe           52.1          17                 230        5550
## 13 Gentoo  Biscoe           52.2          17.1               228        5400
## 14 Gentoo  Biscoe           50.8          15.7               226        5200
## 15 Gentoo  Biscoe           51.1          16.5               225        5250
## 16 Gentoo  Biscoe           55.9          17                 228        5600
## 17 Gentoo  Biscoe           53.4          15.8               219        5500
## 18 Gentoo  Biscoe           51.5          16.3               230        5500
## 19 Gentoo  Biscoe           55.1          16                 230        5850
## 20 Gentoo  Biscoe           50.4          15.7               222        5750
## # ℹ 2 more variables: sex <fct>, year <int>

Pipes %>%

In order to start combining select(), filter(), and other functions efficiently, we need to learn pipes. Pipes feed the output from one function into the input of another function.

Let’s say we are only interested in the body mass of Gentoo penguins. We can use pipes to first select the species and body mass columns then filter that output only for Gentoo penguins. Notice that we only need to call the data one time.

penguins %>% 
  select(species, body_mass_g) %>% 
  filter(species=="Gentoo")
## # A tibble: 124 × 2
##    species body_mass_g
##    <fct>         <int>
##  1 Gentoo         4500
##  2 Gentoo         5700
##  3 Gentoo         4450
##  4 Gentoo         5700
##  5 Gentoo         5400
##  6 Gentoo         4550
##  7 Gentoo         4800
##  8 Gentoo         5200
##  9 Gentoo         4400
## 10 Gentoo         5150
## # ℹ 114 more rows

We can sort the output of filter() using the arrange() function. Here we filter for Gentoo penguins and then sort them by body mass in ascending order.

penguins %>% 
  select(species, body_mass_g) %>% 
  filter(species=="Gentoo") %>% 
  arrange(body_mass_g)
## # A tibble: 124 × 2
##    species body_mass_g
##    <fct>         <int>
##  1 Gentoo         3950
##  2 Gentoo         4100
##  3 Gentoo         4150
##  4 Gentoo         4200
##  5 Gentoo         4200
##  6 Gentoo         4200
##  7 Gentoo         4300
##  8 Gentoo         4300
##  9 Gentoo         4350
## 10 Gentoo         4375
## # ℹ 114 more rows

To sort in descending order, wrap the variable name in desc().

penguins %>% 
  select(species, body_mass_g) %>% 
  filter(species=="Gentoo") %>% 
  arrange(desc(body_mass_g))
## # A tibble: 124 × 2
##    species body_mass_g
##    <fct>         <int>
##  1 Gentoo         6300
##  2 Gentoo         6050
##  3 Gentoo         6000
##  4 Gentoo         6000
##  5 Gentoo         5950
##  6 Gentoo         5950
##  7 Gentoo         5850
##  8 Gentoo         5850
##  9 Gentoo         5850
## 10 Gentoo         5800
## # ℹ 114 more rows

slice()

penguins %>% 
  select(species, body_mass_g) %>% 
  filter(species=="Gentoo") %>% 
  slice_min(body_mass_g, n=1)#don't forget to specify the variable
## # A tibble: 1 × 2
##   species body_mass_g
##   <fct>         <int>
## 1 Gentoo         3950
?slice

Practice

Let’s use the fish data again to keep things simple.

  1. We are interested in the radii length and scalelength of fish with the id 300. Extract this information from the fish data. Please sort the data by radii length.
names(fish)
## [1] "lakeid"          "fish_id"         "annnumber"       "length"         
## [5] "radii_length_mm" "scalelength"
fish %>% 
  select(fish_id, radii_length_mm, scalelength) %>% 
  filter(fish_id==300) %>% 
  arrange(scalelength)
## # A tibble: 4 × 3
##   fish_id radii_length_mm scalelength
##     <dbl>           <dbl>       <dbl>
## 1     300            3.02        3.02
## 2     300            2.67        3.02
## 3     300            2.14        3.02
## 4     300            1.23        3.02
  1. We are interested in the fish from the lakes “AL” and “AR” with a radii length between 2 and 4. Extract this information from the fish data. Please sort the data in descending order by radii length.
fish %>% 
  select(lakeid, radii_length_mm) %>% 
  filter(lakeid=="AL" | lakeid=="AR") %>% 
  filter(between(radii_length_mm, 2, 4)) %>% 
  arrange(desc(radii_length_mm))
## # A tibble: 253 × 2
##    lakeid radii_length_mm
##    <chr>            <dbl>
##  1 AR                4.00
##  2 AR                3.99
##  3 AR                3.99
##  4 AR                3.98
##  5 AR                3.98
##  6 AR                3.96
##  7 AR                3.94
##  8 AR                3.94
##  9 AR                3.94
## 10 AL                3.94
## # ℹ 243 more rows

Wrap-up

Please review the learning goals and be sure to use the code here as a reference when completing the homework.
–>Home