At the end of this exercise, you will be able to:
1. Use the select() function of dplyr to build
data frames restricted to variables of interest.
2. Use the rename() function to provide new, consistent
names to variables in data frames.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("palmerpenguins") #load the palmerpenguins package
##
## Attaching package: 'palmerpenguins'
##
## The following objects are masked from 'package:datasets':
##
## penguins, penguins_raw
These data are from: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
Once the data have been loaded, let’s get an idea of their structure, contents, and dimensions.
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
Recall that the tidyverse is a
collection of packages that make workflow in R easier. The packages
operate more intuitively than base R commands and share a common
organizational philosophy. In lab 4, we learned how to use
ggplot2 to make visualization of data easier. In this lab,
we will learn how to use the package dplyr to wrangle
data.
The first package that we will use to wrangle data is
dplyr. dplyr is used to transform data frames
by extracting, rearranging, and summarizing data such that they are
focused on a question of interest. This is very helpful, especially when
wrangling large data, and makes dplyr one of the most frequently used
packages in the tidyverse. The two functions we will use most are
select() and filter().
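These functions are often called verbs, and the format used for each is the same: the first argument is the dataframe, and the remaining arguments describe what to do with it. The output is always a new dataframe; the original is not changed.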
select()
The verb select() allows you to pull out columns of
interest from a dataframe; it does not affect the rows. To do this, just
add the names of the columns of interest to the select()
command. The order in which you add them determines the order in
which they appear in the output.
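As a small sketch of the ordering rule, the two calls below ask for the same columns in a different order, and names() is used only to show the resulting column order.

```r
library(dplyr)
library(palmerpenguins)

# same two columns, requested in a different order
names(select(penguins, species, year))  # "species" "year"
names(select(penguins, year, species))  # "year" "species"
```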
For smaller dataframes, select() may not make much
sense. However, for larger dataframes with many variables,
select() is very useful. Let’s look at the penguins data
again.
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
We are only interested in species and body mass. We can use
select() to extract these columns.
select(penguins, species, body_mass_g)
## # A tibble: 344 × 2
## species body_mass_g
## <fct> <int>
## 1 Adelie 3750
## 2 Adelie 3800
## 3 Adelie 3250
## 4 Adelie NA
## 5 Adelie 3450
## 6 Adelie 3650
## 7 Adelie 3625
## 8 Adelie 4675
## 9 Adelie 3475
## 10 Adelie 4250
## # ℹ 334 more rows
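The call above only prints its result. A minimal sketch of keeping it, assuming you want to reuse the smaller dataframe later (the name mass_only is only an illustration), is to assign the output to a new object.

```r
library(dplyr)
library(palmerpenguins)

# store the restricted dataframe under a new, illustrative name
mass_only <- select(penguins, species, body_mass_g)

ncol(mass_only)  # 2 columns in the new dataframe
ncol(penguins)   # the original still has all 8 columns
```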
penguins <- penguins # copy the package data into the global environment
To add a range of columns use start_col:end_col.
select(penguins, species:flipper_length_mm)
## # A tibble: 344 × 5
## species island bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <fct> <dbl> <dbl> <int>
## 1 Adelie Torgersen 39.1 18.7 181
## 2 Adelie Torgersen 39.5 17.4 186
## 3 Adelie Torgersen 40.3 18 195
## 4 Adelie Torgersen NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193
## 6 Adelie Torgersen 39.3 20.6 190
## 7 Adelie Torgersen 38.9 17.8 181
## 8 Adelie Torgersen 39.2 19.6 195
## 9 Adelie Torgersen 34.1 18.1 193
## 10 Adelie Torgersen 42 20.2 190
## # ℹ 334 more rows
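A range can also be written with column positions instead of names. As a rough sketch, assuming the column order reported by names(penguins) earlier, both calls below should return the same five columns. The object names by_name and by_position are illustrative.

```r
library(dplyr)
library(palmerpenguins)

by_name <- select(penguins, species:flipper_length_mm)
by_position <- select(penguins, 1:5)  # columns 1 through 5

# compare the column names of the two results
identical(names(by_name), names(by_position))
```

Names are usually the safer choice, because positions silently point at different columns if the dataframe is reordered.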
The ! operator is useful in select. It allows us to select everything except the specified variables.
select(penguins, !body_mass_g)
## # A tibble: 344 × 7
## species island bill_length_mm bill_depth_mm flipper_length_mm sex year
## <fct> <fct> <dbl> <dbl> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 female 2007
## 3 Adelie Torgersen 40.3 18 195 female 2007
## 4 Adelie Torgersen NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 female 2007
## 6 Adelie Torgersen 39.3 20.6 190 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 female 2007
## 8 Adelie Torgersen 39.2 19.6 195 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 <NA> 2007
## 10 Adelie Torgersen 42 20.2 190 <NA> 2007
## # ℹ 334 more rows
To exclude multiple columns at once, combine ! with the c()
function.
select(penguins, !c(species, island, year, sex))
## # A tibble: 344 × 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <int> <int>
## 1 39.1 18.7 181 3750
## 2 39.5 17.4 186 3800
## 3 40.3 18 195 3250
## 4 NA NA NA NA
## 5 36.7 19.3 193 3450
## 6 39.3 20.6 190 3650
## 7 38.9 17.8 181 3625
## 8 39.2 19.6 195 4675
## 9 34.1 18.1 193 3475
## 10 42 20.2 190 4250
## # ℹ 334 more rows
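One way to check an exclusion is to compare the result against the original names. The sketch below also shows all_of(), which lets you exclude columns listed in a character vector; the names dropped, kept, and kept_again are illustrative.

```r
library(dplyr)
library(palmerpenguins)

dropped <- c("species", "island", "year", "sex")

kept <- select(penguins, !c(species, island, year, sex))
kept_again <- select(penguins, !all_of(dropped))  # same exclusion from a character vector

# the remaining names are the original names minus the dropped ones
setdiff(names(penguins), dropped)
names(kept)
names(kept_again)
```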
For very large data frames with lots of variables,
select() offers helper functions that match columns by name to make
things easier. Let’s say we are only interested in the variables that
are measured in millimeters.
select(penguins, contains("mm"))
## # A tibble: 344 × 3
## bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <dbl> <int>
## 1 39.1 18.7 181
## 2 39.5 17.4 186
## 3 40.3 18 195
## 4 NA NA NA
## 5 36.7 19.3 193
## 6 39.3 20.6 190
## 7 38.9 17.8 181
## 8 39.2 19.6 195
## 9 34.1 18.1 193
## 10 42 20.2 190
## # ℹ 334 more rows
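Because contains() matches any part of a column name, the text you pass controls how narrow the match is. As a short sketch, matching on "length" instead of "mm" leaves out bill_depth_mm.

```r
library(dplyr)
library(palmerpenguins)

# only the two length measurements are matched; bill_depth_mm is not
names(select(penguins, contains("length")))
```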
starts_with()
select(penguins, starts_with("bill"))
## # A tibble: 344 × 2
## bill_length_mm bill_depth_mm
## <dbl> <dbl>
## 1 39.1 18.7
## 2 39.5 17.4
## 3 40.3 18
## 4 NA NA
## 5 36.7 19.3
## 6 39.3 20.6
## 7 38.9 17.8
## 8 39.2 19.6
## 9 34.1 18.1
## 10 42 20.2
## # ℹ 334 more rows
ends_with()
select(penguins, ends_with("mm"))
## # A tibble: 344 × 3
## bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <dbl> <int>
## 1 39.1 18.7 181
## 2 39.5 17.4 186
## 3 40.3 18 195
## 4 NA NA NA
## 5 36.7 19.3 193
## 6 39.3 20.6 190
## 7 38.9 17.8 181
## 8 39.2 19.6 195
## 9 34.1 18.1 193
## 10 42 20.2 190
## # ℹ 334 more rows
You can also select columns based on the class of data.
select(penguins, where(is.numeric))
## # A tibble: 344 × 5
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <dbl> <dbl> <int> <int> <int>
## 1 39.1 18.7 181 3750 2007
## 2 39.5 17.4 186 3800 2007
## 3 40.3 18 195 3250 2007
## 4 NA NA NA NA 2007
## 5 36.7 19.3 193 3450 2007
## 6 39.3 20.6 190 3650 2007
## 7 38.9 17.8 181 3625 2007
## 8 39.2 19.6 195 4675 2007
## 9 34.1 18.1 193 3475 2007
## 10 42 20.2 190 4250 2007
## # ℹ 334 more rows
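The same pattern works with other class checks. A minimal sketch, using is.factor in place of is.numeric, pulls out the categorical columns instead.

```r
library(dplyr)
library(palmerpenguins)

# keep only the columns stored as factors
factors_only <- select(penguins, where(is.factor))
names(factors_only)  # "species" "island" "sex"
```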
There are a few verbs that go with select() because they
deal with columns. One of these is rename().
rename() allows you to rename columns in a dataframe. The
format is new_name = old_name.
rename(penguins, body_mass=body_mass_g)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## 7 Adelie Torge… 38.9 17.8 181 3625 fema…
## 8 Adelie Torge… 39.2 19.6 195 4675 male
## 9 Adelie Torge… 34.1 18.1 193 3475 <NA>
## 10 Adelie Torge… 42 20.2 190 4250 <NA>
## # ℹ 334 more rows
## # ℹ 1 more variable: year <int>
Alternatively, you can rename from within select.
select(penguins, species, body_mass=body_mass_g)
## # A tibble: 344 × 2
## species body_mass
## <fct> <int>
## 1 Adelie 3750
## 2 Adelie 3800
## 3 Adelie 3250
## 4 Adelie NA
## 5 Adelie 3450
## 6 Adelie 3650
## 7 Adelie 3625
## 8 Adelie 4675
## 9 Adelie 3475
## 10 Adelie 4250
## # ℹ 334 more rows
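The two approaches differ in what they keep. As a brief sketch (the object names renamed and selected are illustrative), rename() returns every column, while select() returns only the columns you list.

```r
library(dplyr)
library(palmerpenguins)

renamed <- rename(penguins, body_mass = body_mass_g)
selected <- select(penguins, species, body_mass = body_mass_g)

ncol(renamed)   # 8: all columns kept, one renamed
ncol(selected)  # 2: only the listed columns kept
```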
The second is relocate(). relocate() allows
you to move columns to different locations within a dataframe. Let’s say
we want to move year to be the first column in the
dataframe.
relocate(penguins, year)
## # A tibble: 344 × 8
## year species island bill_length_mm bill_depth_mm flipper_length_mm
## <int> <fct> <fct> <dbl> <dbl> <int>
## 1 2007 Adelie Torgersen 39.1 18.7 181
## 2 2007 Adelie Torgersen 39.5 17.4 186
## 3 2007 Adelie Torgersen 40.3 18 195
## 4 2007 Adelie Torgersen NA NA NA
## 5 2007 Adelie Torgersen 36.7 19.3 193
## 6 2007 Adelie Torgersen 39.3 20.6 190
## 7 2007 Adelie Torgersen 38.9 17.8 181
## 8 2007 Adelie Torgersen 39.2 19.6 195
## 9 2007 Adelie Torgersen 34.1 18.1 193
## 10 2007 Adelie Torgersen 42 20.2 190
## # ℹ 334 more rows
## # ℹ 2 more variables: body_mass_g <int>, sex <fct>
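By default relocate() moves columns to the front. As a small sketch, its .after argument (there is also a .before argument) places the column next to another one instead; the object name moved is illustrative.

```r
library(dplyr)
library(palmerpenguins)

# place year immediately after island instead of at the front
moved <- relocate(penguins, year, .after = island)
names(moved)[1:3]  # "species" "island" "year"
```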
For these exercises, we will use the msleep dataset from
the ggplot2 package. This dataset contains information
about the sleep habits of various mammals. Reference:
V. M. Savage and G. B. West. A quantitative, theoretical framework for
understanding mammalian sleep. Proceedings of the National Academy of
Sciences, 104 (3):1051-1056, 2007.
What are the names of the variables in the msleep dataframe?
names(msleep)
## [1] "name" "genus" "vore" "order" "conservation"
## [6] "sleep_total" "sleep_rem" "sleep_cycle" "awake" "brainwt"
## [11] "bodywt"
Use glimpse() to get an idea of the structure of the
msleep dataframe.
glimpse(msleep)
## Rows: 83
## Columns: 11
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…
Build a dataframe restricted to the variables order, genus, and bodywt.
select(msleep, "order", "genus", "bodywt")
## # A tibble: 83 × 3
## order genus bodywt
## <chr> <chr> <dbl>
## 1 Carnivora Acinonyx 50
## 2 Primates Aotus 0.48
## 3 Rodentia Aplodontia 1.35
## 4 Soricomorpha Blarina 0.019
## 5 Artiodactyla Bos 600
## 6 Pilosa Bradypus 3.85
## 7 Carnivora Callorhinus 20.5
## 8 Rodentia Calomys 0.045
## 9 Carnivora Canis 14
## 10 Artiodactyla Capreolus 14.8
## # ℹ 73 more rows
Build a dataframe restricted to the numeric variables.
select(msleep, where(is.numeric))
## # A tibble: 83 × 6
## sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12.1 NA NA 11.9 NA 50
## 2 17 1.8 NA 7 0.0155 0.48
## 3 14.4 2.4 NA 9.6 NA 1.35
## 4 14.9 2.3 0.133 9.1 0.00029 0.019
## 5 4 0.7 0.667 20 0.423 600
## 6 14.4 2.2 0.767 9.6 NA 3.85
## 7 8.7 1.4 0.383 15.3 NA 20.5
## 8 7 NA NA 17 NA 0.045
## 9 10.1 2.9 0.333 13.9 0.07 14
## 10 3 NA NA 21 0.0982 14.8
## # ℹ 73 more rows
Select all of the variables except name.
select(msleep, !name)
## # A tibble: 83 × 10
## genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Acinonyx carni Carni… lc 12.1 NA NA 11.9
## 2 Aotus omni Prima… <NA> 17 1.8 NA 7
## 3 Aplodontia herbi Roden… nt 14.4 2.4 NA 9.6
## 4 Blarina omni Soric… lc 14.9 2.3 0.133 9.1
## 5 Bos herbi Artio… domesticated 4 0.7 0.667 20
## 6 Bradypus herbi Pilosa <NA> 14.4 2.2 0.767 9.6
## 7 Callorhinus carni Carni… vu 8.7 1.4 0.383 15.3
## 8 Calomys <NA> Roden… <NA> 7 NA NA 17
## 9 Canis carni Carni… domesticated 10.1 2.9 0.333 13.9
## 10 Capreolus herbi Artio… lc 3 NA NA 21
## # ℹ 73 more rows
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>