At the end of this exercise, you will be able to:
1. Use the filter() function of dplyr to
extract observations of interest.
2. Use filter() on multiple conditions to extract
observations of interest.
3. Use arrange() to sort data frames by specific
variables.
4. Use pipes %>% to connect multiple functions.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("palmerpenguins") #load the palmerpenguins package
##
## Attaching package: 'palmerpenguins'
##
## The following objects are masked from 'package:datasets':
##
## penguins, penguins_raw
These data are from: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
In the previous lab, we used select() to extract columns
of interest from a data frame. This helps us focus our attention on the
variables relevant to our question. However, it doesn’t allow us to
extract information in rows. The filter() function allows
us to extract data in rows that meet specific criteria.
filter()Unlike select(), filter() allows us to
extract data that meet specific criteria within a row. Let’s say that we
are interested only in the Adelie penguins from the
penguins data frame. We can use filter() to
extract these observations.
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
filter(penguins, species=="Adelie")
## # A tibble: 152 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 142 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
Similarly, if we are only interested in penguins with a body mass
greater than or equal to 5500g we can use filter() to
extract these observations.
filter(penguins, body_mass_g>=5500)
## # A tibble: 33 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 50 16.3 230 5700
## 2 Gentoo Biscoe 50 15.2 218 5700
## 3 Gentoo Biscoe 49 16.1 216 5550
## 4 Gentoo Biscoe 48.4 14.6 213 5850
## 5 Gentoo Biscoe 49.3 15.7 217 5850
## 6 Gentoo Biscoe 49.2 15.2 221 6300
## 7 Gentoo Biscoe 50.2 14.3 218 5700
## 8 Gentoo Biscoe 47.8 15 215 5650
## 9 Gentoo Biscoe 50 15.3 220 5550
## 10 Gentoo Biscoe 59.6 17 230 6050
## # ℹ 23 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
filter() allows all of the expected operators;
i.e. >, >=, <, <=, != (not equal), and == (equal).Using the ! operator allows for the exclusion of
specific observations.
filter(penguins, species!="Chinstrap")
## # A tibble: 276 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 266 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
filter() with multiple observationsThe %in% operator
determines whether or not a value is part of a vector.
filter(penguins, body_mass_g %in% c(5250, 5500))
## # A tibble: 8 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 47.3 15.3 222 5250
## 2 Gentoo Biscoe 44.4 17.3 219 5250
## 3 Gentoo Biscoe 51.1 16.5 225 5250
## 4 Gentoo Biscoe 49.1 15 228 5500
## 5 Gentoo Biscoe 46.8 16.1 215 5500
## 6 Gentoo Biscoe 53.4 15.8 219 5500
## 7 Gentoo Biscoe 48.1 15.1 209 5500
## 8 Gentoo Biscoe 51.5 16.3 230 5500
## # ℹ 2 more variables: sex <fct>, year <int>
Alternatively, you can use between if you are looking
for a range of specific values.
filter(penguins, between(body_mass_g, 5250, 5500))
## # A tibble: 21 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 47.6 14.5 215 5400
## 2 Gentoo Biscoe 48.7 15.1 222 5350
## 3 Gentoo Biscoe 47.3 15.3 222 5250
## 4 Gentoo Biscoe 48.4 16.3 220 5400
## 5 Gentoo Biscoe 44.4 17.3 219 5250
## 6 Gentoo Biscoe 48.7 15.7 208 5350
## 7 Gentoo Biscoe 50.5 15.9 225 5400
## 8 Gentoo Biscoe 45.2 15.8 215 5300
## 9 Gentoo Biscoe 48.5 14.1 220 5300
## 10 Gentoo Biscoe 46.2 14.9 221 5300
## # ℹ 11 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
How would you use %in% to get the same result?
filter(penguins, body_mass_g %in% c(5250, 5300, 5350, 5400, 5450, 5500))
## # A tibble: 21 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 47.6 14.5 215 5400
## 2 Gentoo Biscoe 48.7 15.1 222 5350
## 3 Gentoo Biscoe 47.3 15.3 222 5250
## 4 Gentoo Biscoe 48.4 16.3 220 5400
## 5 Gentoo Biscoe 44.4 17.3 219 5250
## 6 Gentoo Biscoe 48.7 15.7 208 5350
## 7 Gentoo Biscoe 50.5 15.9 225 5400
## 8 Gentoo Biscoe 45.2 15.8 215 5300
## 9 Gentoo Biscoe 48.5 14.1 220 5300
## 10 Gentoo Biscoe 46.2 14.9 221 5300
## # ℹ 11 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
You can also extract observations “near” a certain value but you need to specify a tolerance.
filter(penguins, near(body_mass_g, 5400, tol=100))
## # A tibble: 9 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 47.6 14.5 215 5400
## 2 Gentoo Biscoe 48.7 15.1 222 5350
## 3 Gentoo Biscoe 48.4 16.3 220 5400
## 4 Gentoo Biscoe 48.7 15.7 208 5350
## 5 Gentoo Biscoe 50.5 15.9 225 5400
## 6 Gentoo Biscoe 52.5 15.6 221 5450
## 7 Gentoo Biscoe 50 15.9 224 5350
## 8 Gentoo Biscoe 52.2 17.1 228 5400
## 9 Gentoo Biscoe 49.9 16.1 213 5400
## # ℹ 2 more variables: sex <fct>, year <int>
The data are from: Gaeta J., G. Sass, S. Carpenter. 2012. Biocomplexity at North Temperate Lakes LTER: Coordinated Field Studies: Large Mouth Bass Growth 2006. Environmental Data Initiative. link
fish <- read_csv("data/Gaeta_etal_CLC_data.csv")
## Rows: 4033 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): lakeid, annnumber
## dbl (4): fish_id, length, radii_length_mm, scalelength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(fish)
## Rows: 4,033
## Columns: 6
## $ lakeid <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ fish_id <dbl> 299, 299, 299, 300, 300, 300, 300, 301, 301, 301, 301,…
## $ annnumber <chr> "EDGE", "2", "1", "EDGE", "3", "2", "1", "EDGE", "3", …
## $ length <dbl> 167, 167, 167, 175, 175, 175, 175, 194, 194, 194, 194,…
## $ radii_length_mm <dbl> 2.697443, 2.037518, 1.311795, 3.015477, 2.670733, 2.13…
## $ scalelength <dbl> 2.697443, 2.697443, 2.697443, 3.015477, 3.015477, 3.01…
fish data to include the samples from lake
“BO”.filter(fish, lakeid=="BO")
## # A tibble: 197 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 BO 389 EDGE 104 1.50 1.50
## 2 BO 389 1 104 0.736 1.50
## 3 BO 390 EDGE 105 1.59 1.59
## 4 BO 390 1 105 0.698 1.59
## 5 BO 391 EDGE 107 1.43 1.43
## 6 BO 391 1 107 0.695 1.43
## 7 BO 392 EDGE 124 2.11 2.11
## 8 BO 392 2 124 1.36 2.11
## 9 BO 392 1 124 0.792 2.11
## 10 BO 393 EDGE 141 2.16 2.16
## # ℹ 187 more rows
filter(fish, lakeid!="AR")
## # A tibble: 3,771 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 299 EDGE 167 2.70 2.70
## 2 AL 299 2 167 2.04 2.70
## 3 AL 299 1 167 1.31 2.70
## 4 AL 300 EDGE 175 3.02 3.02
## 5 AL 300 3 175 2.67 3.02
## 6 AL 300 2 175 2.14 3.02
## 7 AL 300 1 175 1.23 3.02
## 8 AL 301 EDGE 194 3.34 3.34
## 9 AL 301 3 194 2.97 3.34
## 10 AL 301 2 194 2.29 3.34
## # ℹ 3,761 more rows
filter(fish, near(scalelength, 8, tol=0.25))
## # A tibble: 236 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 309 EDGE 355 7.89 7.89
## 2 AL 309 13 355 7.56 7.89
## 3 AL 309 12 355 7.36 7.89
## 4 AL 309 11 355 7.16 7.89
## 5 AL 309 10 355 6.77 7.89
## 6 AL 309 9 355 6.39 7.89
## 7 AL 309 8 355 5.96 7.89
## 8 AL 309 7 355 5.44 7.89
## 9 AL 309 6 355 4.74 7.89
## 10 AL 309 5 355 4.06 7.89
## # ℹ 226 more rows
filter(fish, between(scalelength, 2, 4))
## # A tibble: 723 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 299 EDGE 167 2.70 2.70
## 2 AL 299 2 167 2.04 2.70
## 3 AL 299 1 167 1.31 2.70
## 4 AL 300 EDGE 175 3.02 3.02
## 5 AL 300 3 175 2.67 3.02
## 6 AL 300 2 175 2.14 3.02
## 7 AL 300 1 175 1.23 3.02
## 8 AL 301 EDGE 194 3.34 3.34
## 9 AL 301 3 194 2.97 3.34
## 10 AL 301 2 194 2.29 3.34
## # ℹ 713 more rows
filter() on multiple conditionsYou can also use filter() to extract data based on
multiple conditions. Below we extract the Chinstrap penguins that have a
body mass greater than 4500g.
filter(penguins, species=="Chinstrap" & body_mass_g> 4500)
## # A tibble: 2 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinstrap Dream 52.8 20 205 4550
## 2 Chinstrap Dream 52 20.7 210 4800
## # ℹ 2 more variables: sex <fct>, year <int>
Notice that the | operator generates a different result.
Why?
filter(penguins, species=="Chinstrap" | body_mass_g> 4500)
## # A tibble: 181 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.2 19.6 195 4675
## 2 Adelie Dream 39.8 19.1 184 4650
## 3 Adelie Dream 39.6 18.8 190 4600
## 4 Adelie Torgersen 42.9 17.6 196 4700
## 5 Adelie Biscoe 41 20 203 4725
## 6 Adelie Biscoe 43.2 19 197 4775
## 7 Adelie Biscoe 45.6 20.3 191 4600
## 8 Gentoo Biscoe 50 16.3 230 5700
## 9 Gentoo Biscoe 50 15.2 218 5700
## 10 Gentoo Biscoe 47.6 14.5 215 5400
## # ℹ 171 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Rules:
+ filter(condition1, condition2) will return rows where
both conditions are met. By default the , means &.
+ filter(condition1, !condition2) will return all rows
where condition one is true but condition 2 is not.
+ filter(condition1 | condition2) will return rows where
condition 1 or condition 2 is met.
In this case, we filter out the penguins with a body mass over 5000g and a bill length over 50mm or a bill depth over 18mm.
filter(penguins, body_mass_g>5000 & (bill_length_mm>50 | bill_depth_mm>18))
## # A tibble: 20 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 50.2 14.3 218 5700
## 2 Gentoo Biscoe 59.6 17 230 6050
## 3 Gentoo Biscoe 50.5 15.9 222 5550
## 4 Gentoo Biscoe 50.5 15.9 225 5400
## 5 Gentoo Biscoe 50.4 15.3 224 5550
## 6 Gentoo Biscoe 54.3 15.7 231 5650
## 7 Gentoo Biscoe 50.7 15 223 5550
## 8 Gentoo Biscoe 51.1 16.3 220 6000
## 9 Gentoo Biscoe 52.5 15.6 221 5450
## 10 Gentoo Biscoe 50.8 17.3 228 5600
## 11 Gentoo Biscoe 51.3 14.2 218 5300
## 12 Gentoo Biscoe 52.1 17 230 5550
## 13 Gentoo Biscoe 52.2 17.1 228 5400
## 14 Gentoo Biscoe 50.8 15.7 226 5200
## 15 Gentoo Biscoe 51.1 16.5 225 5250
## 16 Gentoo Biscoe 55.9 17 228 5600
## 17 Gentoo Biscoe 53.4 15.8 219 5500
## 18 Gentoo Biscoe 51.5 16.3 230 5500
## 19 Gentoo Biscoe 55.1 16 230 5850
## 20 Gentoo Biscoe 50.4 15.7 222 5750
## # ℹ 2 more variables: sex <fct>, year <int>
%>%In order to start combining select(),
filter(), and other functions efficiently, we need to learn
pipes. Pipes feed the output from one function into the input of another
function.
Let’s say we are only interested in the body mass of Gentoo penguins. We can use pipes to first select the species and body mass columns then filter that output only for Gentoo penguins. Notice that we only need to call the data one time.
penguins %>%
select(species, body_mass_g) %>%
filter(species=="Gentoo")
## # A tibble: 124 × 2
## species body_mass_g
## <fct> <int>
## 1 Gentoo 4500
## 2 Gentoo 5700
## 3 Gentoo 4450
## 4 Gentoo 5700
## 5 Gentoo 5400
## 6 Gentoo 4550
## 7 Gentoo 4800
## 8 Gentoo 5200
## 9 Gentoo 4400
## 10 Gentoo 5150
## # ℹ 114 more rows
We can sort the output of filter() using the
arrange() function. Here we filter for Gentoo penguins and
then sort them by body mass in ascending order.
penguins %>%
select(species, body_mass_g) %>%
filter(species=="Gentoo") %>%
arrange(body_mass_g)
## # A tibble: 124 × 2
## species body_mass_g
## <fct> <int>
## 1 Gentoo 3950
## 2 Gentoo 4100
## 3 Gentoo 4150
## 4 Gentoo 4200
## 5 Gentoo 4200
## 6 Gentoo 4200
## 7 Gentoo 4300
## 8 Gentoo 4300
## 9 Gentoo 4350
## 10 Gentoo 4375
## # ℹ 114 more rows
To sort in descending order, wrap the variable name in
desc().
penguins %>%
select(species, body_mass_g) %>%
filter(species=="Gentoo") %>%
arrange(desc(body_mass_g))
## # A tibble: 124 × 2
## species body_mass_g
## <fct> <int>
## 1 Gentoo 6300
## 2 Gentoo 6050
## 3 Gentoo 6000
## 4 Gentoo 6000
## 5 Gentoo 5950
## 6 Gentoo 5950
## 7 Gentoo 5850
## 8 Gentoo 5850
## 9 Gentoo 5850
## 10 Gentoo 5800
## # ℹ 114 more rows
slice()
penguins %>%
select(species, body_mass_g) %>%
filter(species=="Gentoo") %>%
slice_min(body_mass_g, n=1)#don't forget to specify the variable
## # A tibble: 1 × 2
## species body_mass_g
## <fct> <int>
## 1 Gentoo 3950
?slice
Let’s use the fish data again to keep things simple.
300. Extract this information from the
fish data. Please sort the data by radii length.names(fish)
## [1] "lakeid" "fish_id" "annnumber" "length"
## [5] "radii_length_mm" "scalelength"
fish %>%
select(fish_id, radii_length_mm, scalelength) %>%
filter(fish_id==300) %>%
arrange(scalelength)
## # A tibble: 4 × 3
## fish_id radii_length_mm scalelength
## <dbl> <dbl> <dbl>
## 1 300 3.02 3.02
## 2 300 2.67 3.02
## 3 300 2.14 3.02
## 4 300 1.23 3.02
fish data. Please sort the data in descending order by
radii length.fish %>%
select(lakeid, radii_length_mm) %>%
filter(lakeid=="AL" | lakeid=="AR") %>%
filter(between(radii_length_mm, 2, 4)) %>%
arrange(desc(radii_length_mm))
## # A tibble: 253 × 2
## lakeid radii_length_mm
## <chr> <dbl>
## 1 AR 4.00
## 2 AR 3.99
## 3 AR 3.99
## 4 AR 3.98
## 5 AR 3.98
## 6 AR 3.96
## 7 AR 3.94
## 8 AR 3.94
## 9 AR 3.94
## 10 AL 3.94
## # ℹ 243 more rows
Please review the learning goals and be sure to use the code here as
a reference when completing the homework.
–>Home