filter()
At the end of this exercise, you will be able to:
1. Use the functions of dplyr (filter, select, arrange) to organize and
sort data frames.
2. Use mutate()
to calculate a new column from existing
columns.
In the previous lab, we used select()
to extract columns
of interest from a data frame. This helps us focus our attention on the
variables relevant to our question. However, it doesn’t allow us to
extract observations from within the data frame. The
filter()
function allows us to extract data that meet
specific criteria. When combined with select()
, we have the
power to transform, shape, and explore data with the potential to make
new discoveries.
library("tidyverse")
fish <- read_csv("data/Gaeta_etal_CLC_data.csv")
## Rows: 4033 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): lakeid, annnumber
## dbl (4): fish_id, length, radii_length_mm, scalelength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mammals <- read_csv("data/mammal_lifehistories_v2.csv")
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): order, family, Genus, species
## dbl (9): mass, gestation, newborn, weaning, wean mass, AFR, max. life, litte...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To keep things tidy, I am going to rename the variables in the mammals data.
mammals <- rename(mammals, genus=Genus, wean_mass="wean mass", max_life="max. life", litter_size="litter size", litters_per_year="litters/year")
filter()
Unlike select()
, filter()
allows us to
extract data that meet specific criteria within a variable. Let’s say
that we are interested only in the fish that occur in lake “AL”. We can
use filter()
to extract these observations.
filter(fish, lakeid == "AL")
## # A tibble: 383 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 299 EDGE 167 2.70 2.70
## 2 AL 299 2 167 2.04 2.70
## 3 AL 299 1 167 1.31 2.70
## 4 AL 300 EDGE 175 3.02 3.02
## 5 AL 300 3 175 2.67 3.02
## 6 AL 300 2 175 2.14 3.02
## 7 AL 300 1 175 1.23 3.02
## 8 AL 301 EDGE 194 3.34 3.34
## 9 AL 301 3 194 2.97 3.34
## 10 AL 301 2 194 2.29 3.34
## # ℹ 373 more rows
Similarly, if we are only interested in fish with a length greater
than or equal to 350 we can use filter()
to extract these
observations.
filter(fish, length >= 350)
## # A tibble: 890 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 306 EDGE 350 6.94 6.94
## 2 AL 306 10 350 6.46 6.94
## 3 AL 306 9 350 6.16 6.94
## 4 AL 306 8 350 5.88 6.94
## 5 AL 306 7 350 5.42 6.94
## 6 AL 306 6 350 4.90 6.94
## 7 AL 306 5 350 4.46 6.94
## 8 AL 306 4 350 3.75 6.94
## 9 AL 306 3 350 2.93 6.94
## 10 AL 306 2 350 2.14 6.94
## # ℹ 880 more rows
filter()
allows all of the expected operators;
i.e. >, >=, <, <=, != (not equal), and == (equal).Using the !
operator allows for the exclusion of
specific observations.
filter(fish, lakeid != "AL")
## # A tibble: 3,650 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AR 269 EDGE 140 2.01 2.01
## 2 AR 269 1 140 1.48 2.01
## 3 AR 270 EDGE 193 2.66 2.66
## 4 AR 270 3 193 2.39 2.66
## 5 AR 270 2 193 2.03 2.66
## 6 AR 270 1 193 1.42 2.66
## 7 AR 271 EDGE 220 3.50 3.50
## 8 AR 271 5 220 3.13 3.50
## 9 AR 271 4 220 2.86 3.50
## 10 AR 271 3 220 2.63 3.50
## # ℹ 3,640 more rows
filter()
with multiple observationsFiltering multiple values within the same variable requires the
%in%
operator.
filter(fish, length %in% c(167, 175))
## # A tibble: 18 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 299 EDGE 167 2.70 2.70
## 2 AL 299 2 167 2.04 2.70
## 3 AL 299 1 167 1.31 2.70
## 4 AL 300 EDGE 175 3.02 3.02
## 5 AL 300 3 175 2.67 3.02
## 6 AL 300 2 175 2.14 3.02
## 7 AL 300 1 175 1.23 3.02
## 8 BO 397 EDGE 175 2.67 2.67
## 9 BO 397 3 175 2.39 2.67
## 10 BO 397 2 175 1.59 2.67
## 11 BO 397 1 175 0.830 2.67
## 12 LSG 45 EDGE 175 3.21 3.21
## 13 LSG 45 3 175 2.92 3.21
## 14 LSG 45 2 175 2.44 3.21
## 15 LSG 45 1 175 1.60 3.21
## 16 RD 103 EDGE 167 2.80 2.80
## 17 RD 103 2 167 2.10 2.80
## 18 RD 103 1 167 1.31 2.80
Alternatively, you can use between
if you are looking
for a range of specific values.
filter(fish, between(scalelength, 2.5, 3))
## # A tibble: 185 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 299 EDGE 167 2.70 2.70
## 2 AL 299 2 167 2.04 2.70
## 3 AL 299 1 167 1.31 2.70
## 4 AR 270 EDGE 193 2.66 2.66
## 5 AR 270 3 193 2.39 2.66
## 6 AR 270 2 193 2.03 2.66
## 7 AR 270 1 193 1.42 2.66
## 8 BO 396 EDGE 160 2.59 2.59
## 9 BO 396 3 160 1.97 2.59
## 10 BO 396 2 160 1.39 2.59
## # ℹ 175 more rows
You can also extract observations “near” a certain value but you need to specify a tolerance.
filter(fish, near(radii_length_mm, 2, tol = 0.2))
## # A tibble: 291 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 299 2 167 2.04 2.70
## 2 AL 300 2 175 2.14 3.02
## 3 AL 302 2 324 2.19 6.07
## 4 AL 303 2 325 2.04 6.79
## 5 AL 306 2 350 2.14 6.94
## 6 AL 308 2 355 1.86 6.67
## 7 AL 312 2 367 2.17 6.81
## 8 AL 313 2 367 2.06 6.47
## 9 AL 315 2 372 2.04 6.47
## 10 AL 316 2 372 1.82 6.35
## # ℹ 281 more rows
fish
data to include the samples from lake
“BO”.filter(fish, lakeid == "BO")
## # A tibble: 197 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 BO 389 EDGE 104 1.50 1.50
## 2 BO 389 1 104 0.736 1.50
## 3 BO 390 EDGE 105 1.59 1.59
## 4 BO 390 1 105 0.698 1.59
## 5 BO 391 EDGE 107 1.43 1.43
## 6 BO 391 1 107 0.695 1.43
## 7 BO 392 EDGE 124 2.11 2.11
## 8 BO 392 2 124 1.36 2.11
## 9 BO 392 1 124 0.792 2.11
## 10 BO 393 EDGE 141 2.16 2.16
## # ℹ 187 more rows
filter(fish, lakeid != "AR")
## # A tibble: 3,771 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 299 EDGE 167 2.70 2.70
## 2 AL 299 2 167 2.04 2.70
## 3 AL 299 1 167 1.31 2.70
## 4 AL 300 EDGE 175 3.02 3.02
## 5 AL 300 3 175 2.67 3.02
## 6 AL 300 2 175 2.14 3.02
## 7 AL 300 1 175 1.23 3.02
## 8 AL 301 EDGE 194 3.34 3.34
## 9 AL 301 3 194 2.97 3.34
## 10 AL 301 2 194 2.29 3.34
## # ℹ 3,761 more rows
filter(fish, near(scalelength, 8, tol = 0.25))
## # A tibble: 236 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 309 EDGE 355 7.89 7.89
## 2 AL 309 13 355 7.56 7.89
## 3 AL 309 12 355 7.36 7.89
## 4 AL 309 11 355 7.16 7.89
## 5 AL 309 10 355 6.77 7.89
## 6 AL 309 9 355 6.39 7.89
## 7 AL 309 8 355 5.96 7.89
## 8 AL 309 7 355 5.44 7.89
## 9 AL 309 6 355 4.74 7.89
## 10 AL 309 5 355 4.06 7.89
## # ℹ 226 more rows
filter(fish, between(scalelength, 2, 4))
## # A tibble: 723 × 6
## lakeid fish_id annnumber length radii_length_mm scalelength
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 AL 299 EDGE 167 2.70 2.70
## 2 AL 299 2 167 2.04 2.70
## 3 AL 299 1 167 1.31 2.70
## 4 AL 300 EDGE 175 3.02 3.02
## 5 AL 300 3 175 2.67 3.02
## 6 AL 300 2 175 2.14 3.02
## 7 AL 300 1 175 1.23 3.02
## 8 AL 301 EDGE 194 3.34 3.34
## 9 AL 301 3 194 2.97 3.34
## 10 AL 301 2 194 2.29 3.34
## # ℹ 713 more rows
Please review the learning goals and be sure to use the code here as
a reference when completing the homework.
–>Home