Learning Goals

At the end of this exercise, you will be able to:
1. Use the select() function of dplyr to build data frames restricted to variables of interest.
2. Use the rename() function to provide new, consistent names to variables in data frames.

Load the tidyverse

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("palmerpenguins") #load the palmerpenguins package
## 
## Attaching package: 'palmerpenguins'
## 
## The following objects are masked from 'package:datasets':
## 
##     penguins, penguins_raw

Palmerpenguins

These data are from: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

Data Structure

Once the data have been loaded, let’s get an idea of their structure, contents, and dimensions.

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

Tidyverse

Recall that the tidyverse is a collection of packages that make workflow in R easier. The packages operate more intuitively than base R commands and share a common organizational philosophy. In lab 4, we learned how to use ggplot2 to make visualization of data easier. In this lab, we will learn how to use the package dplyr to wrangle data.

dplyr

The first package that we will use to wrangle data is dplyr. dplyr is used to transform data frames by extracting, rearranging, and summarizing data so that they are focused on a question of interest. This is very helpful, especially when wrangling large data sets, and makes dplyr one of the most frequently used packages in the tidyverse. The two functions we will use most are select() and filter().

These functions are often called verbs, and the format used for each is the same. The output is always a new, more restricted dataframe.

select()

The verb select() allows you to pull out columns of interest from a dataframe; it does not affect the rows. To do this, just add the names of the columns of interest to the select() command. The order in which you add them determines the order in which they appear in the output.

For smaller dataframes, select() may not make much sense. However, for larger dataframes with many variables, select() is very useful. Let’s look at the penguins data again.

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

We are only interested in species and body mass. We can use select() to extract these columns.

select(penguins, species, body_mass_g)
## # A tibble: 344 × 2
##    species body_mass_g
##    <fct>         <int>
##  1 Adelie         3750
##  2 Adelie         3800
##  3 Adelie         3250
##  4 Adelie           NA
##  5 Adelie         3450
##  6 Adelie         3650
##  7 Adelie         3625
##  8 Adelie         4675
##  9 Adelie         3475
## 10 Adelie         4250
## # ℹ 334 more rows
penguins <- penguins
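
Because select() returns a new dataframe rather than changing the original, the result has to be assigned to a name if you want to keep it. The sketch below assumes dplyr and palmerpenguins are installed; the object name mass_by_species is made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# select() returns a new data frame; `penguins` itself is not modified.
# `mass_by_species` is a hypothetical name chosen for this example.
mass_by_species <- select(penguins, body_mass_g, species)

names(mass_by_species)  # columns appear in the order they were listed
ncol(penguins)          # the original still has all of its columns
```

Listing body_mass_g before species puts it first in the new dataframe, even though it comes later in penguins.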

To select a range of adjacent columns, use start_col:end_col.

select(penguins, species:flipper_length_mm)
## # A tibble: 344 × 5
##    species island    bill_length_mm bill_depth_mm flipper_length_mm
##    <fct>   <fct>              <dbl>         <dbl>             <int>
##  1 Adelie  Torgersen           39.1          18.7               181
##  2 Adelie  Torgersen           39.5          17.4               186
##  3 Adelie  Torgersen           40.3          18                 195
##  4 Adelie  Torgersen           NA            NA                  NA
##  5 Adelie  Torgersen           36.7          19.3               193
##  6 Adelie  Torgersen           39.3          20.6               190
##  7 Adelie  Torgersen           38.9          17.8               181
##  8 Adelie  Torgersen           39.2          19.6               195
##  9 Adelie  Torgersen           34.1          18.1               193
## 10 Adelie  Torgersen           42            20.2               190
## # ℹ 334 more rows

The ! operator is useful in select(). It allows us to select everything except the specified variables.

select(penguins, !body_mass_g)
## # A tibble: 344 × 7
##    species island    bill_length_mm bill_depth_mm flipper_length_mm sex     year
##    <fct>   <fct>              <dbl>         <dbl>             <int> <fct>  <int>
##  1 Adelie  Torgersen           39.1          18.7               181 male    2007
##  2 Adelie  Torgersen           39.5          17.4               186 female  2007
##  3 Adelie  Torgersen           40.3          18                 195 female  2007
##  4 Adelie  Torgersen           NA            NA                  NA <NA>    2007
##  5 Adelie  Torgersen           36.7          19.3               193 female  2007
##  6 Adelie  Torgersen           39.3          20.6               190 male    2007
##  7 Adelie  Torgersen           38.9          17.8               181 female  2007
##  8 Adelie  Torgersen           39.2          19.6               195 male    2007
##  9 Adelie  Torgersen           34.1          18.1               193 <NA>    2007
## 10 Adelie  Torgersen           42            20.2               190 <NA>    2007
## # ℹ 334 more rows

To exclude multiple columns at once, wrap them in c() and put ! in front of the whole set.

select(penguins, !c(species, island, year, sex))
## # A tibble: 344 × 4
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##             <dbl>         <dbl>             <int>       <int>
##  1           39.1          18.7               181        3750
##  2           39.5          17.4               186        3800
##  3           40.3          18                 195        3250
##  4           NA            NA                  NA          NA
##  5           36.7          19.3               193        3450
##  6           39.3          20.6               190        3650
##  7           38.9          17.8               181        3625
##  8           39.2          19.6               195        4675
##  9           34.1          18.1               193        3475
## 10           42            20.2               190        4250
## # ℹ 334 more rows
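
In older code you may also see a minus sign used for exclusion. As a small sketch, assuming dplyr and palmerpenguins are installed, the two calls below should drop the same columns; the object names with_bang and with_minus are hypothetical.

```r
library(dplyr)
library(palmerpenguins)

# Exclude four columns with the ! operator (as shown above)
with_bang <- select(penguins, !c(species, island, year, sex))

# Exclude the same four columns with a minus sign
with_minus <- select(penguins, -c(species, island, year, sex))

# Compare the two results
identical(with_bang, with_minus)
```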

For very large data frames with lots of variables, select() offers helper functions that match columns by name. Let’s say we are only interested in the variables that were measured in millimeters.

select(penguins, contains("mm"))
## # A tibble: 344 × 3
##    bill_length_mm bill_depth_mm flipper_length_mm
##             <dbl>         <dbl>             <int>
##  1           39.1          18.7               181
##  2           39.5          17.4               186
##  3           40.3          18                 195
##  4           NA            NA                  NA
##  5           36.7          19.3               193
##  6           39.3          20.6               190
##  7           38.9          17.8               181
##  8           39.2          19.6               195
##  9           34.1          18.1               193
## 10           42            20.2               190
## # ℹ 334 more rows

starts_with() matches columns whose names begin with the given text.

select(penguins, starts_with("bill"))
## # A tibble: 344 × 2
##    bill_length_mm bill_depth_mm
##             <dbl>         <dbl>
##  1           39.1          18.7
##  2           39.5          17.4
##  3           40.3          18  
##  4           NA            NA  
##  5           36.7          19.3
##  6           39.3          20.6
##  7           38.9          17.8
##  8           39.2          19.6
##  9           34.1          18.1
## 10           42            20.2
## # ℹ 334 more rows

ends_with() matches columns whose names end with the given text.

select(penguins, ends_with("mm"))
## # A tibble: 344 × 3
##    bill_length_mm bill_depth_mm flipper_length_mm
##             <dbl>         <dbl>             <int>
##  1           39.1          18.7               181
##  2           39.5          17.4               186
##  3           40.3          18                 195
##  4           NA            NA                  NA
##  5           36.7          19.3               193
##  6           39.3          20.6               190
##  7           38.9          17.8               181
##  8           39.2          19.6               195
##  9           34.1          18.1               193
## 10           42            20.2               190
## # ℹ 334 more rows
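
The helpers can also be combined. The sketch below, which assumes dplyr and palmerpenguins are installed, uses & (both conditions) and | (either condition) inside select(); the object names are hypothetical and only there so the results can be inspected.

```r
library(dplyr)
library(palmerpenguins)

# Columns that start with "bill" AND end with "mm"
bill_mm <- select(penguins, starts_with("bill") & ends_with("mm"))

# Columns whose names contain "length" (bill and flipper length only)
length_cols <- select(penguins, contains("length"))

# Columns that start with "bill" OR contain "mass"
bill_or_mass <- select(penguins, starts_with("bill") | contains("mass"))

names(bill_mm)
names(length_cols)
names(bill_or_mass)
```

Note that contains("length") leaves out bill_depth_mm, whereas contains("mm") keeps it.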

You can also select columns based on the class of data.

select(penguins, where(is.numeric))
## # A tibble: 344 × 5
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##             <dbl>         <dbl>             <int>       <int> <int>
##  1           39.1          18.7               181        3750  2007
##  2           39.5          17.4               186        3800  2007
##  3           40.3          18                 195        3250  2007
##  4           NA            NA                  NA          NA  2007
##  5           36.7          19.3               193        3450  2007
##  6           39.3          20.6               190        3650  2007
##  7           38.9          17.8               181        3625  2007
##  8           39.2          19.6               195        4675  2007
##  9           34.1          18.1               193        3475  2007
## 10           42            20.2               190        4250  2007
## # ℹ 334 more rows
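
where() works with any function that returns TRUE or FALSE for a column, not just is.numeric. As a sketch, assuming dplyr and palmerpenguins are installed, the first call keeps the factor columns and the second uses a small function written for this example to keep numeric columns that contain at least one NA. The object names are hypothetical.

```r
library(dplyr)
library(palmerpenguins)

# Keep only the factor columns
factor_cols <- select(penguins, where(is.factor))

# Keep numeric columns that have at least one missing value.
# The anonymous function is an illustration, not part of dplyr.
numeric_with_na <- select(
  penguins,
  where(function(x) is.numeric(x) && anyNA(x))
)

names(factor_cols)
names(numeric_with_na)
```

year is numeric but has no missing values, so it is not included in the second result.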

There are a few verbs that go with select() because they deal with columns. One of these is rename(). rename() allows you to rename columns in a dataframe. The format is new_name = old_name.

rename(penguins, body_mass=body_mass_g)
## # A tibble: 344 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass sex  
##    <fct>   <fct>           <dbl>         <dbl>             <int>     <int> <fct>
##  1 Adelie  Torge…           39.1          18.7               181      3750 male 
##  2 Adelie  Torge…           39.5          17.4               186      3800 fema…
##  3 Adelie  Torge…           40.3          18                 195      3250 fema…
##  4 Adelie  Torge…           NA            NA                  NA        NA <NA> 
##  5 Adelie  Torge…           36.7          19.3               193      3450 fema…
##  6 Adelie  Torge…           39.3          20.6               190      3650 male 
##  7 Adelie  Torge…           38.9          17.8               181      3625 fema…
##  8 Adelie  Torge…           39.2          19.6               195      4675 male 
##  9 Adelie  Torge…           34.1          18.1               193      3475 <NA> 
## 10 Adelie  Torge…           42            20.2               190      4250 <NA> 
## # ℹ 334 more rows
## # ℹ 1 more variable: year <int>

Alternatively, you can rename from within select.

select(penguins, species, body_mass=body_mass_g)
## # A tibble: 344 × 2
##    species body_mass
##    <fct>       <int>
##  1 Adelie       3750
##  2 Adelie       3800
##  3 Adelie       3250
##  4 Adelie         NA
##  5 Adelie       3450
##  6 Adelie       3650
##  7 Adelie       3625
##  8 Adelie       4675
##  9 Adelie       3475
## 10 Adelie       4250
## # ℹ 334 more rows
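
rename() also accepts several new_name = old_name pairs in one call, and dplyr's rename_with() applies a function to column names. The sketch below assumes dplyr and palmerpenguins are installed; the new column names and object names are hypothetical choices for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Rename two columns in a single call (new_name = old_name)
renamed <- rename(
  penguins,
  body_mass = body_mass_g,
  flipper_length = flipper_length_mm
)

# Apply toupper() to the names of the columns that start with "bill"
upper_bill <- rename_with(penguins, toupper, .cols = starts_with("bill"))

names(renamed)
names(upper_bill)
```

Unlike select(), both functions keep every column; only the names change.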

The second is relocate(). relocate() allows you to move columns to different locations within a dataframe. Let’s say we want to move year to be the first column in the dataframe.

relocate(penguins, year)
## # A tibble: 344 × 8
##     year species island    bill_length_mm bill_depth_mm flipper_length_mm
##    <int> <fct>   <fct>              <dbl>         <dbl>             <int>
##  1  2007 Adelie  Torgersen           39.1          18.7               181
##  2  2007 Adelie  Torgersen           39.5          17.4               186
##  3  2007 Adelie  Torgersen           40.3          18                 195
##  4  2007 Adelie  Torgersen           NA            NA                  NA
##  5  2007 Adelie  Torgersen           36.7          19.3               193
##  6  2007 Adelie  Torgersen           39.3          20.6               190
##  7  2007 Adelie  Torgersen           38.9          17.8               181
##  8  2007 Adelie  Torgersen           39.2          19.6               195
##  9  2007 Adelie  Torgersen           34.1          18.1               193
## 10  2007 Adelie  Torgersen           42            20.2               190
## # ℹ 334 more rows
## # ℹ 2 more variables: body_mass_g <int>, sex <fct>
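
By default relocate() moves columns to the front. Its .before and .after arguments place them relative to another column instead. A brief sketch, assuming dplyr and palmerpenguins are installed, with hypothetical object names:

```r
library(dplyr)
library(palmerpenguins)

# Move year so that it sits directly after island
year_after_island <- relocate(penguins, year, .after = island)

# Move body_mass_g so that it sits directly before bill_length_mm
mass_before_bill <- relocate(penguins, body_mass_g, .before = bill_length_mm)

names(year_after_island)
names(mass_before_bill)
```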

Practice

For these exercises, we will use the msleep dataset from the ggplot2 package. This dataset contains information about the sleep habits of various mammals. Reference: V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.

  1. What are the names in the msleep dataframe?
names(msleep)
##  [1] "name"         "genus"        "vore"         "order"        "conservation"
##  [6] "sleep_total"  "sleep_rem"    "sleep_cycle"  "awake"        "brainwt"     
## [11] "bodywt"
  2. Use glimpse() to get an idea of the structure of the msleep dataframe.
glimpse(msleep)
## Rows: 83
## Columns: 11
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…
  3. Make a new dataframe that only includes the variables order, genus, and bodywt.
select(msleep, order, genus, bodywt)
## # A tibble: 83 × 3
##    order        genus        bodywt
##    <chr>        <chr>         <dbl>
##  1 Carnivora    Acinonyx     50    
##  2 Primates     Aotus         0.48 
##  3 Rodentia     Aplodontia    1.35 
##  4 Soricomorpha Blarina       0.019
##  5 Artiodactyla Bos         600    
##  6 Pilosa       Bradypus      3.85 
##  7 Carnivora    Callorhinus  20.5  
##  8 Rodentia     Calomys       0.045
##  9 Carnivora    Canis        14    
## 10 Artiodactyla Capreolus    14.8  
## # ℹ 73 more rows
  4. What if we are only interested in the numeric variables? Make a new dataframe that is restricted to numerics.
select(msleep, where(is.numeric))
## # A tibble: 83 × 6
##    sleep_total sleep_rem sleep_cycle awake  brainwt  bodywt
##          <dbl>     <dbl>       <dbl> <dbl>    <dbl>   <dbl>
##  1        12.1      NA        NA      11.9 NA        50    
##  2        17         1.8      NA       7    0.0155    0.48 
##  3        14.4       2.4      NA       9.6 NA         1.35 
##  4        14.9       2.3       0.133   9.1  0.00029   0.019
##  5         4         0.7       0.667  20    0.423   600    
##  6        14.4       2.2       0.767   9.6 NA         3.85 
##  7         8.7       1.4       0.383  15.3 NA        20.5  
##  8         7        NA        NA      17   NA         0.045
##  9        10.1       2.9       0.333  13.9  0.07     14    
## 10         3        NA        NA      21    0.0982   14.8  
## # ℹ 73 more rows
  5. Make a dataframe that includes all variables except name.
select(msleep, !name)
## # A tibble: 83 × 10
##    genus       vore  order  conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>       <chr> <chr>  <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Acinonyx    carni Carni… lc                  12.1      NA        NA      11.9
##  2 Aotus       omni  Prima… <NA>                17         1.8      NA       7  
##  3 Aplodontia  herbi Roden… nt                  14.4       2.4      NA       9.6
##  4 Blarina     omni  Soric… lc                  14.9       2.3       0.133   9.1
##  5 Bos         herbi Artio… domesticated         4         0.7       0.667  20  
##  6 Bradypus    herbi Pilosa <NA>                14.4       2.2       0.767   9.6
##  7 Callorhinus carni Carni… vu                   8.7       1.4       0.383  15.3
##  8 Calomys     <NA>  Roden… <NA>                 7        NA        NA      17  
##  9 Canis       carni Carni… domesticated        10.1       2.9       0.333  13.9
## 10 Capreolus   herbi Artio… lc                   3        NA        NA      21  
## # ℹ 73 more rows
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

That’s it! Let’s take a break and then move on to part 2!
