Dplyr select()

Learning Goals

At the end of this exercise, you will be able to:
1. Use summary functions to assess the structure of a data frame.
2. Us the select function of dplyr to build data frames restricted to variable of interest.
3. Use the rename() function to provide new, consistent names to variables in data frames.

Load the tidyverse

For the remainder of the quarter, we will work within the tidyverse. At the start of each lab, the library needs to be loaded as shown below.

library("tidyverse")
library("janitor")

Load the data

These data are from: Gaeta J., G. Sass, S. Carpenter. 2012. Biocomplexity at North Temperate Lakes LTER: Coordinated Field Studies: Large Mouth Bass Growth 2006. Environmental Data Initiative. link

fish <- read_csv("data/Gaeta_etal_CLC_data.csv")

## Rows: 4033 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): lakeid, annnumber
## dbl (4): fish_id, length, radii_length_mm, scalelength
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Structure

Once data have been uploaded, let’s get an idea of its structure, contents, and dimensions.

glimpse(fish)

## Rows: 4,033
## Columns: 6
## $ lakeid          <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ fish_id         <dbl> 299, 299, 299, 300, 300, 300, 300, 301, 301, 301, 301,…
## $ annnumber       <chr> "EDGE", "2", "1", "EDGE", "3", "2", "1", "EDGE", "3", …
## $ length          <dbl> 167, 167, 167, 175, 175, 175, 175, 194, 194, 194, 194,…
## $ radii_length_mm <dbl> 2.697443, 2.037518, 1.311795, 3.015477, 2.670733, 2.13…
## $ scalelength     <dbl> 2.697443, 2.697443, 2.697443, 3.015477, 3.015477, 3.01…

summary(fish)

##     lakeid             fish_id       annnumber             length     
##  Length:4033        Min.   :  1.0   Length:4033        Min.   : 58.0  
##  Class :character   1st Qu.:156.0   Class :character   1st Qu.:253.0  
##  Mode  :character   Median :267.0   Mode  :character   Median :299.0  
##                     Mean   :258.3                      Mean   :293.3  
##                     3rd Qu.:376.0                      3rd Qu.:342.0  
##                     Max.   :478.0                      Max.   :420.0  
##  radii_length_mm    scalelength     
##  Min.   : 0.4569   Min.   : 0.6282  
##  1st Qu.: 2.3252   1st Qu.: 4.2596  
##  Median : 3.5380   Median : 5.4062  
##  Mean   : 3.6589   Mean   : 5.3821  
##  3rd Qu.: 4.8229   3rd Qu.: 6.4145  
##  Max.   :11.0258   Max.   :11.0258

Tidyverse

The tidyverse is a collection of packages that make workflow in R easier. The packages operate more intuitively than base R commands and share a common organizational philosophy.

dplyr

The first package that we will use that is part of the tidyverse is dplyr. dplyr is used to transform data frames by extracting, rearranging, and summarizing data such that they are focused on a question of interest. This is very helpful, especially when wrangling large data, and makes dplyr one of most frequently used packages in the tidyverse. The two functions we will use most are select() and filter().

`select()`

Select allows you to pull out columns of interest from a dataframe. To do this, just add the names of the columns to the select() command. The order in which you add them, will determine the order in which they appear in the output.

names(fish)

## [1] "lakeid"          "fish_id"         "annnumber"       "length"         
## [5] "radii_length_mm" "scalelength"

We are only interested in lakeid and scalelength.

select(fish, "lakeid", "scalelength")

## # A tibble: 4,033 × 2
##    lakeid scalelength
##    <chr>        <dbl>
##  1 AL            2.70
##  2 AL            2.70
##  3 AL            2.70
##  4 AL            3.02
##  5 AL            3.02
##  6 AL            3.02
##  7 AL            3.02
##  8 AL            3.34
##  9 AL            3.34
## 10 AL            3.34
## # ℹ 4,023 more rows

Remember that you can save the output to a new dataframe.

fish_subset <- select(fish, "lakeid", "scalelength")

To add a range of columns use start_col:end_col.

select(fish, fish_id:length)

## # A tibble: 4,033 × 3
##    fish_id annnumber length
##      <dbl> <chr>      <dbl>
##  1     299 EDGE         167
##  2     299 2            167
##  3     299 1            167
##  4     300 EDGE         175
##  5     300 3            175
##  6     300 2            175
##  7     300 1            175
##  8     301 EDGE         194
##  9     301 3            194
## 10     301 2            194
## # ℹ 4,023 more rows

The - operator is useful in select. It allows us to select everything except the specified variables.

select(fish, -"fish_id", -"annnumber", -"length", -"radii_length_mm")

## # A tibble: 4,033 × 2
##    lakeid scalelength
##    <chr>        <dbl>
##  1 AL            2.70
##  2 AL            2.70
##  3 AL            2.70
##  4 AL            3.02
##  5 AL            3.02
##  6 AL            3.02
##  7 AL            3.02
##  8 AL            3.34
##  9 AL            3.34
## 10 AL            3.34
## # ℹ 4,023 more rows

For very large data frames with lots of variables, select() utilizes lots of different operators to make things easier. Let’s say we are only interested in the variables that deal with length.

select(fish, contains("length"))

## # A tibble: 4,033 × 3
##    length radii_length_mm scalelength
##     <dbl>           <dbl>       <dbl>
##  1    167            2.70        2.70
##  2    167            2.04        2.70
##  3    167            1.31        2.70
##  4    175            3.02        3.02
##  5    175            2.67        3.02
##  6    175            2.14        3.02
##  7    175            1.23        3.02
##  8    194            3.34        3.34
##  9    194            2.97        3.34
## 10    194            2.29        3.34
## # ℹ 4,023 more rows

When columns are sequentially named, starts_with() makes selecting columns easier.

select(fish, starts_with("radii"))

## # A tibble: 4,033 × 1
##    radii_length_mm
##              <dbl>
##  1            2.70
##  2            2.04
##  3            1.31
##  4            3.02
##  5            2.67
##  6            2.14
##  7            1.23
##  8            3.34
##  9            2.97
## 10            2.29
## # ℹ 4,023 more rows

Options to select columns based on a specific criteria include:
1. ends_with() = Select columns that end with a character string
2. contains() = Select columns that contain a character string
3. matches() = Select columns that match a regular expression
4. one_of() = Select columns names that are from a group of names

We can use these functions to select columns that end with “id”.

select(fish, ends_with("id"))

## # A tibble: 4,033 × 2
##    lakeid fish_id
##    <chr>    <dbl>
##  1 AL         299
##  2 AL         299
##  3 AL         299
##  4 AL         300
##  5 AL         300
##  6 AL         300
##  7 AL         300
##  8 AL         301
##  9 AL         301
## 10 AL         301
## # ℹ 4,023 more rows

Or, we can select columns that contain “fish”.

select(fish, contains("fish"))

## # A tibble: 4,033 × 1
##    fish_id
##      <dbl>
##  1     299
##  2     299
##  3     299
##  4     300
##  5     300
##  6     300
##  7     300
##  8     301
##  9     301
## 10     301
## # ℹ 4,023 more rows

You can also select columns based on the class of data.

glimpse(fish)

## Rows: 4,033
## Columns: 6
## $ lakeid          <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ fish_id         <dbl> 299, 299, 299, 300, 300, 300, 300, 301, 301, 301, 301,…
## $ annnumber       <chr> "EDGE", "2", "1", "EDGE", "3", "2", "1", "EDGE", "3", …
## $ length          <dbl> 167, 167, 167, 175, 175, 175, 175, 194, 194, 194, 194,…
## $ radii_length_mm <dbl> 2.697443, 2.037518, 1.311795, 3.015477, 2.670733, 2.13…
## $ scalelength     <dbl> 2.697443, 2.697443, 2.697443, 3.015477, 3.015477, 3.01…

select_if(fish, is.numeric)

## # A tibble: 4,033 × 4
##    fish_id length radii_length_mm scalelength
##      <dbl>  <dbl>           <dbl>       <dbl>
##  1     299    167            2.70        2.70
##  2     299    167            2.04        2.70
##  3     299    167            1.31        2.70
##  4     300    175            3.02        3.02
##  5     300    175            2.67        3.02
##  6     300    175            2.14        3.02
##  7     300    175            1.23        3.02
##  8     301    194            3.34        3.34
##  9     301    194            2.97        3.34
## 10     301    194            2.29        3.34
## # ℹ 4,023 more rows

To select all columns that are not a class of data, you need to add a ~.

select_if(fish, ~!is.numeric(.)) #the ! means not

## # A tibble: 4,033 × 2
##    lakeid annnumber
##    <chr>  <chr>    
##  1 AL     EDGE     
##  2 AL     2        
##  3 AL     1        
##  4 AL     EDGE     
##  5 AL     3        
##  6 AL     2        
##  7 AL     1        
##  8 AL     EDGE     
##  9 AL     3        
## 10 AL     2        
## # ℹ 4,023 more rows

Practice

For this exercise we will use life history data mammal_lifehistories_v2.csv for mammals. The data are from:
S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402.

Load the data.

mammals <- read_csv("data/mammal_lifehistories_v2.csv")

## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): order, family, Genus, species
## dbl (9): mass, gestation, newborn, weaning, wean mass, AFR, max. life, litte...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use one or more of your favorite functions to assess the structure of the data.

summary(mammals)

##     order              family             Genus             species         
##  Length:1440        Length:1440        Length:1440        Length:1440       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       mass             gestation          newborn             weaning       
##  Min.   :     -999   Min.   :-999.00   Min.   :   -999.0   Min.   :-999.00  
##  1st Qu.:       50   1st Qu.:-999.00   1st Qu.:   -999.0   1st Qu.:-999.00  
##  Median :      403   Median :   1.05   Median :      2.6   Median :   0.73  
##  Mean   :   383577   Mean   :-287.25   Mean   :   6703.1   Mean   :-427.17  
##  3rd Qu.:     7009   3rd Qu.:   4.50   3rd Qu.:     98.0   3rd Qu.:   2.00  
##  Max.   :149000000   Max.   :  21.46   Max.   :2250000.0   Max.   :  48.00  
##    wean mass             AFR            max. life       litter size      
##  Min.   :    -999   Min.   :-999.00   Min.   :-999.0   Min.   :-999.000  
##  1st Qu.:    -999   1st Qu.:-999.00   1st Qu.:-999.0   1st Qu.:   1.000  
##  Median :    -999   Median :   2.50   Median :-999.0   Median :   2.270  
##  Mean   :   16049   Mean   :-408.12   Mean   :-490.3   Mean   : -55.634  
##  3rd Qu.:      10   3rd Qu.:  15.61   3rd Qu.: 147.2   3rd Qu.:   3.835  
##  Max.   :19075000   Max.   : 210.00   Max.   :1368.0   Max.   :  14.180  
##   litters/year     
##  Min.   :-999.000  
##  1st Qu.:-999.000  
##  Median :   0.375  
##  Mean   :-477.141  
##  3rd Qu.:   1.155  
##  Max.   :   7.500

Are there any NAs? Are you sure? Try taking an average of max. life as a test.

anyNA(mammals)

## [1] FALSE

What are the names of the columns in the mammals data?

names(mammals)

##  [1] "order"        "family"       "Genus"        "species"      "mass"        
##  [6] "gestation"    "newborn"      "weaning"      "wean mass"    "AFR"         
## [11] "max. life"    "litter size"  "litters/year"

Rename any columns that have capitol letters or punctuation issues.

mammals <- clean_names(mammals)

We are only interested in the variables genus, species, and mass. Use select() to build a new dataframe mass focused on these variables.

select(mammals, "genus", "species", "mass")

## # A tibble: 1,440 × 3
##    genus       species          mass
##    <chr>       <chr>           <dbl>
##  1 Antilocapra americana      45375 
##  2 Addax       nasomaculatus 182375 
##  3 Aepyceros   melampus       41480 
##  4 Alcelaphus  buselaphus    150000 
##  5 Ammodorcas  clarkei        28500 
##  6 Ammotragus  lervia         55500 
##  7 Antidorcas  marsupialis    30000 
##  8 Antilope    cervicapra     37500 
##  9 Bison       bison         497667.
## 10 Bison       bonasus       500000 
## # ℹ 1,430 more rows

What if we only wanted to exclude order and family? Use the - operator to make the code efficient.

select(mammals, -"order", -"family")

## # A tibble: 1,440 × 11
##    genus      species   mass gestation newborn weaning wean_mass    afr max_life
##    <chr>      <chr>    <dbl>     <dbl>   <dbl>   <dbl>     <dbl>  <dbl>    <dbl>
##  1 Antilocap… americ… 4.54e4      8.13   3246.    3         8900   13.5      142
##  2 Addax      nasoma… 1.82e5      9.39   5480     6.5       -999   27.3      308
##  3 Aepyceros  melamp… 4.15e4      6.35   5093     5.63     15900   16.7      213
##  4 Alcelaphus busela… 1.5 e5      7.9   10167.    6.5       -999   23.0      240
##  5 Ammodorcas clarkei 2.85e4      6.8    -999  -999         -999 -999       -999
##  6 Ammotragus lervia  5.55e4      5.08   3810     4         -999   14.9      251
##  7 Antidorcas marsup… 3   e4      5.72   3910     4.04      -999   10.2      228
##  8 Antilope   cervic… 3.75e4      5.5    3846     2.13      -999   20.1      255
##  9 Bison      bison   4.98e5      8.93  20000    10.7     157500   29.4      300
## 10 Bison      bonasus 5   e5      9.14  23000.    6.6       -999   30.0      324
## # ℹ 1,430 more rows
## # ℹ 2 more variables: litter_size <dbl>, litters_year <dbl>

Select the columns that include “mass” as part of the name.

select(mammals, contains("mass"))

## # A tibble: 1,440 × 2
##       mass wean_mass
##      <dbl>     <dbl>
##  1  45375       8900
##  2 182375       -999
##  3  41480      15900
##  4 150000       -999
##  5  28500       -999
##  6  55500       -999
##  7  30000       -999
##  8  37500       -999
##  9 497667.    157500
## 10 500000       -999
## # ℹ 1,430 more rows

Select all of the columns that are of class character.

select_if(mammals, is.character)

## # A tibble: 1,440 × 4
##    order        family         genus       species      
##    <chr>        <chr>          <chr>       <chr>        
##  1 Artiodactyla Antilocapridae Antilocapra americana    
##  2 Artiodactyla Bovidae        Addax       nasomaculatus
##  3 Artiodactyla Bovidae        Aepyceros   melampus     
##  4 Artiodactyla Bovidae        Alcelaphus  buselaphus   
##  5 Artiodactyla Bovidae        Ammodorcas  clarkei      
##  6 Artiodactyla Bovidae        Ammotragus  lervia       
##  7 Artiodactyla Bovidae        Antidorcas  marsupialis  
##  8 Artiodactyla Bovidae        Antilope    cervicapra   
##  9 Artiodactyla Bovidae        Bison       bison        
## 10 Artiodactyla Bovidae        Bison       bonasus      
## # ℹ 1,430 more rows

Janitor

Another option is to use the janitor package. The janitor package has a function called clean_names() that will make all column names lowercase and replace spaces with underscores.

#install.packages("janitor")
library(janitor)

names(mammals)

##  [1] "order"        "family"       "genus"        "species"      "mass"        
##  [6] "gestation"    "newborn"      "weaning"      "wean_mass"    "afr"         
## [11] "max_life"     "litter_size"  "litters_year"

clean_names(mammals)

## # A tibble: 1,440 × 13
##    order  family genus species   mass gestation newborn weaning wean_mass    afr
##    <chr>  <chr>  <chr> <chr>    <dbl>     <dbl>   <dbl>   <dbl>     <dbl>  <dbl>
##  1 Artio… Antil… Anti… americ… 4.54e4      8.13   3246.    3         8900   13.5
##  2 Artio… Bovid… Addax nasoma… 1.82e5      9.39   5480     6.5       -999   27.3
##  3 Artio… Bovid… Aepy… melamp… 4.15e4      6.35   5093     5.63     15900   16.7
##  4 Artio… Bovid… Alce… busela… 1.5 e5      7.9   10167.    6.5       -999   23.0
##  5 Artio… Bovid… Ammo… clarkei 2.85e4      6.8    -999  -999         -999 -999  
##  6 Artio… Bovid… Ammo… lervia  5.55e4      5.08   3810     4         -999   14.9
##  7 Artio… Bovid… Anti… marsup… 3   e4      5.72   3910     4.04      -999   10.2
##  8 Artio… Bovid… Anti… cervic… 3.75e4      5.5    3846     2.13      -999   20.1
##  9 Artio… Bovid… Bison bison   4.98e5      8.93  20000    10.7     157500   29.4
## 10 Artio… Bovid… Bison bonasus 5   e5      9.14  23000.    6.6       -999   30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>

That’s it! Let’s take a break and then move on to part 2!

–>Home

Dplyr `select()`

2025-01-23