mutate() and if_else()
At the end of this exercise, you will be able to:
1. Use mutate() to add columns in a dataframe.
2. Use mutate() and if_else() to replace values in a dataframe.
library("tidyverse")
library("janitor")
For this lab, we will use the following dataset:
S. K. Morgan Ernest. 2003. Life history characteristics of placental
non-volant mammals. Ecology 84:3402.
%>%
Recall that we use pipes to connect the output of code to a
subsequent function. This makes our code cleaner and more efficient. One
way we can use pipes is to attach the clean_names() function from
janitor to the read_csv() output.
mammals <- read_csv("data/mammal_lifehistories_v2.csv") %>% clean_names()
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): order, family, Genus, species
## dbl (9): mass, gestation, newborn, weaning, wean mass, AFR, max. life, litte...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
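To see what the pipe is doing, the following sketch compares a nested call with its piped equivalent. The data frame toy is made up for illustration and is not part of the lab data; both forms are expected to give the same result.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate the pipe
toy <- tibble(name = c("a", "b", "c"), value = c(3, 1, 2))

# Nested form: read from the inside out
nested <- head(arrange(toy, value), 2)

# Piped form: read from top to bottom
piped <- toy %>%
  arrange(value) %>%
  head(2)

# Compare the two results
identical(nested, piped)
```

The piped form lists the steps in the order they happen, which is why it is usually easier to read once a chain has more than two functions.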
mutate()
Recall that mutate() allows us to create a new column from existing
columns in a data frame. Use mutate() to make a new column that
converts gestation from months to years. Which animal has the longest
gestation period?
mammals %>%
  select(genus, species, gestation) %>%
  mutate(gestation_years = gestation/12) %>%
  arrange(-gestation_years)
## # A tibble: 1,440 × 4
## genus species gestation gestation_years
## <chr> <chr> <dbl> <dbl>
## 1 Loxodonta africana 21.5 1.79
## 2 Elephas maximus 21.1 1.76
## 3 Rhinoceros sondaicus 16.5 1.38
## 4 Rhinoceros unicornis 16.4 1.37
## 5 Diceros bicornis 16.1 1.34
## 6 Ceratotherium simum 15.9 1.32
## 7 Physeter catodon 15.8 1.32
## 8 Globicephala macrorhynchus 15.2 1.27
## 9 Pseudorca crassidens 14.9 1.24
## 10 Giraffa camelopardalis 14.9 1.24
## # ℹ 1,430 more rows
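The same pattern can be tried on a small made-up table. This is a minimal sketch: the toy data frame and its species names are hypothetical, and gestation is assumed to be in months as in the lab data. It uses desc() instead of the minus sign for the descending sort; desc() also works on character columns, while the minus sign only works on numbers.

```r
library(dplyr)

# Hypothetical toy data: gestation in months
toy <- tibble(
  species   = c("sp_a", "sp_b", "sp_c"),
  gestation = c(6, 21, 12)
)

toy_years <- toy %>%
  mutate(gestation_years = gestation / 12) %>%  # new column from an existing one
  arrange(desc(gestation_years))                # longest gestation first

toy_years
```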
mutate() and across()
across() is especially helpful when cleaning data. With “wild” data, there are often mixed entries (upper and lowercase), blank spaces, odd characters, etc. These all need to be dealt with before analysis.
Here is an example that changes all entries to lowercase (if present). Note that applying tolower() to every column also converts the numeric columns to character, as the <chr> column types in the output show.
mammals
## # A tibble: 1,440 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Artio… Antil… Anti… americ… 4.54e4 8.13 3246. 3 8900 13.5
## 2 Artio… Bovid… Addax nasoma… 1.82e5 9.39 5480 6.5 -999 27.3
## 3 Artio… Bovid… Aepy… melamp… 4.15e4 6.35 5093 5.63 15900 16.7
## 4 Artio… Bovid… Alce… busela… 1.5 e5 7.9 10167. 6.5 -999 23.0
## 5 Artio… Bovid… Ammo… clarkei 2.85e4 6.8 -999 -999 -999 -999
## 6 Artio… Bovid… Ammo… lervia 5.55e4 5.08 3810 4 -999 14.9
## 7 Artio… Bovid… Anti… marsup… 3 e4 5.72 3910 4.04 -999 10.2
## 8 Artio… Bovid… Anti… cervic… 3.75e4 5.5 3846 2.13 -999 20.1
## 9 Artio… Bovid… Bison bison 4.98e5 8.93 20000 10.7 157500 29.4
## 10 Artio… Bovid… Bison bonasus 5 e5 9.14 23000. 6.6 -999 30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>
mammals %>%
  mutate(across(everything(), tolower))
## # A tibble: 1,440 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 artioda… antil… anti… americ… 45375 8.13 3246.36 3 8900 13.53
## 2 artioda… bovid… addax nasoma… 1823… 9.39 5480 6.5 -999 27.27
## 3 artioda… bovid… aepy… melamp… 41480 6.35 5093 5.63 15900 16.66
## 4 artioda… bovid… alce… busela… 1500… 7.9 10166.… 6.5 -999 23.02
## 5 artioda… bovid… ammo… clarkei 28500 6.8 -999 -999 -999 -999
## 6 artioda… bovid… ammo… lervia 55500 5.08 3810 4 -999 14.89
## 7 artioda… bovid… anti… marsup… 30000 5.72 3910 4.04 -999 10.23
## 8 artioda… bovid… anti… cervic… 37500 5.5 3846 2.13 -999 20.13
## 9 artioda… bovid… bison bison 4976… 8.93 20000 10.71 157500 29.45
## 10 artioda… bovid… bison bonasus 5e+05 9.14 23000.… 6.6 -999 29.99
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <chr>, litter_size <chr>, litters_year <chr>
Using across(), we can instead specify individual columns, which leaves the other columns, including the numeric ones, unchanged.
mammals %>%
  mutate(across(c("order", "family"), tolower))
## # A tibble: 1,440 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 artio… antil… Anti… americ… 4.54e4 8.13 3246. 3 8900 13.5
## 2 artio… bovid… Addax nasoma… 1.82e5 9.39 5480 6.5 -999 27.3
## 3 artio… bovid… Aepy… melamp… 4.15e4 6.35 5093 5.63 15900 16.7
## 4 artio… bovid… Alce… busela… 1.5 e5 7.9 10167. 6.5 -999 23.0
## 5 artio… bovid… Ammo… clarkei 2.85e4 6.8 -999 -999 -999 -999
## 6 artio… bovid… Ammo… lervia 5.55e4 5.08 3810 4 -999 14.9
## 7 artio… bovid… Anti… marsup… 3 e4 5.72 3910 4.04 -999 10.2
## 8 artio… bovid… Anti… cervic… 3.75e4 5.5 3846 2.13 -999 20.1
## 9 artio… bovid… Bison bison 4.98e5 8.93 20000 10.7 157500 29.4
## 10 artio… bovid… Bison bonasus 5 e5 9.14 23000. 6.6 -999 30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>
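A third option, sketched here on made-up data, is to select columns by type rather than by name. The helper where(is.character) restricts tolower() to the character columns, so the numeric column keeps its type. The toy data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data with mixed-case text and one numeric column
toy <- tibble(
  order = c("Rodentia", "CARNIVORA"),
  genus = c("Mus", "Felis"),
  mass  = c(20, 4000)
)

toy_lower <- toy %>%
  mutate(across(where(is.character), tolower))  # only character columns are changed

toy_lower
```

This avoids having to list every text column by hand when a data frame has many of them.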
if_else()
We will briefly introduce if_else() here because it allows us to use
mutate() without affecting every value in a column in the same way. In
a sense, this can function like find and replace in a spreadsheet
program. With if_else(), you first specify a logical statement, then
what should happen if the statement returns TRUE, and lastly what
should happen if it returns FALSE. Base R's ifelse(), which the code
below uses, takes the same three arguments in the same order.
Have a look at the data from mammals below. Notice that the values
for newborn include -999.00. This is sometimes used as a placeholder
for NA, but it is a bad idea because R treats the placeholder as a real
measurement in any calculation. We can use if_else() to replace -999.00
with NA.
mammals %>%
  select(genus, species, newborn) %>%
  arrange(newborn)
## # A tibble: 1,440 × 3
## genus species newborn
## <chr> <chr> <dbl>
## 1 Ammodorcas clarkei -999
## 2 Bos javanicus -999
## 3 Bubalus depressicornis -999
## 4 Bubalus mindorensis -999
## 5 Capra falconeri -999
## 6 Cephalophus niger -999
## 7 Cephalophus nigrifrons -999
## 8 Cephalophus natalensis -999
## 9 Cephalophus leucogaster -999
## 10 Cephalophus ogilbyi -999
## # ℹ 1,430 more rows
mammals %>%
  select(genus, species, newborn) %>%
  mutate(newborn_new = ifelse(newborn == -999.00, NA, newborn)) %>%
  arrange(newborn)
## # A tibble: 1,440 × 4
## genus species newborn newborn_new
## <chr> <chr> <dbl> <dbl>
## 1 Ammodorcas clarkei -999 NA
## 2 Bos javanicus -999 NA
## 3 Bubalus depressicornis -999 NA
## 4 Bubalus mindorensis -999 NA
## 5 Capra falconeri -999 NA
## 6 Cephalophus niger -999 NA
## 7 Cephalophus nigrifrons -999 NA
## 8 Cephalophus natalensis -999 NA
## 9 Cephalophus leucogaster -999 NA
## 10 Cephalophus ogilbyi -999 NA
## # ℹ 1,430 more rows
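For comparison, here is a minimal sketch of the same replacement written with dplyr's if_else() and with na_if(), on a made-up table. The toy data frame is hypothetical. NA_real_ is used in if_else() so that both branches are numeric, which older dplyr versions require; na_if() is a shortcut for the common case of turning one specific value into NA.

```r
library(dplyr)

# Hypothetical toy data with -999 standing in for missing values
toy <- tibble(
  species = c("sp_a", "sp_b", "sp_c"),
  newborn = c(-999, 120, 45.5)
)

toy_clean <- toy %>%
  mutate(
    newborn_if   = if_else(newborn == -999, NA_real_, newborn),  # condition, TRUE value, FALSE value
    newborn_naif = na_if(newborn, -999)                          # same result for a single placeholder
  )

toy_clean
```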
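Use mutate() to make a new column called max_life_new in which the -999
values of max_life are replaced with NA. Then convert max_life_new into
years. Finally, sort the data in descending order by max_life_new.
Which mammal has the longest life span?
mammals %>%
  select(family, genus, species, max_life) %>%
  mutate(max_life_new = ifelse(max_life == -999, NA, max_life)) %>%
  mutate(max_life_new = max_life_new/12) %>%  # months to years
  na.omit() %>%
  arrange(max_life)  # ascending, as in the output below; use arrange(desc(max_life_new)) to list the longest-lived mammal first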
## # A tibble: 599 × 5
## family genus species max_life max_life_new
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Muridae Myopus schisticolor 12 1
## 2 Soricidae Sorex longirostris 14 1.17
## 3 Muridae Microtus longicaudus 14 1.17
## 4 Soricidae Myosorex varius 16 1.33
## 5 Muridae Microtus pennsylvanicus 16 1.33
## 6 Soricidae Sorex fumeus 17 1.42
## 7 Soricidae Sorex arcticus 18 1.5
## 8 Soricidae Sorex ornatus 18 1.5
## 9 Soricidae Sorex monticolus 18 1.5
## 10 Soricidae Sorex trowbridgii 18 1.5
## # ℹ 589 more rows
Make a new data frame called msleep24 from the msleep data that
contains the name and vore variables along with a new column called
sleep_total_24, which is the amount of time a species sleeps expressed
as a proportion of a 24-hour day. Restrict the sleep_total_24 values to
less than or equal to 0.3. Arrange the output in descending order.
msleep24 <- msleep %>%
  mutate(sleep_total_24 = sleep_total/24) %>%
  select(name, vore, sleep_total_24, sleep_total) %>%
  filter(sleep_total_24 <= 0.3) %>%
  arrange(desc(sleep_total_24))
msleep24
## # A tibble: 20 × 4
## name vore sleep_total_24 sleep_total
## <chr> <chr> <dbl> <dbl>
## 1 Vesper mouse <NA> 0.292 7
## 2 Gray hyrax herbi 0.262 6.3
## 3 Genet carni 0.262 6.3
## 4 Gray seal carni 0.258 6.2
## 5 Common porpoise carni 0.233 5.6
## 6 Rock hyrax <NA> 0.225 5.4
## 7 Goat herbi 0.221 5.3
## 8 Tree hyrax herbi 0.221 5.3
## 9 Bottle-nosed dolphin carni 0.217 5.2
## 10 Brazilian tapir herbi 0.183 4.4
## 11 Cow herbi 0.167 4
## 12 Asian elephant herbi 0.162 3.9
## 13 Sheep herbi 0.158 3.8
## 14 Caspian seal carni 0.146 3.5
## 15 African elephant herbi 0.137 3.3
## 16 Donkey herbi 0.129 3.1
## 17 Roe deer herbi 0.125 3
## 18 Horse herbi 0.121 2.9
## 19 Pilot whale carni 0.112 2.7
## 20 Giraffe herbi 0.0792 1.9
Did dplyr do what we expected? How do we check our output? Remember,
just because your code runs doesn't mean that it did what you intended.
summary(msleep24)
## name vore sleep_total_24 sleep_total
## Length:20 Length:20 Min. :0.07917 Min. :1.900
## Class :character Class :character 1st Qu.:0.13542 1st Qu.:3.250
## Mode :character Mode :character Median :0.17500 Median :4.200
## Mean :0.18563 Mean :4.455
## 3rd Qu.:0.22708 3rd Qu.:5.450
## Max. :0.29167 Max. :7.000
Histograms are also a quick way to check the output.
hist(msleep24$sleep_total)
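Checks like these can also be written as explicit assertions that stop with an error if the output is not what we intended. The sketch below repeats the same steps on a made-up table (toy is hypothetical) and then tests the filter and the sort order with stopifnot().

```r
library(dplyr)

# Hypothetical toy data: hours of sleep per day
toy <- tibble(
  name        = c("a", "b", "c", "d"),
  sleep_total = c(2.4, 6, 12, 19.2)
)

toy24 <- toy %>%
  mutate(sleep_total_24 = sleep_total / 24) %>%
  filter(sleep_total_24 <= 0.3) %>%
  arrange(desc(sleep_total_24))

# The filter worked: no proportion is above 0.3
stopifnot(all(toy24$sleep_total_24 <= 0.3))

# The sort worked: reversing the column gives ascending order
stopifnot(!is.unsorted(rev(toy24$sleep_total_24)))

# A quick look at the smallest and largest values
range(toy24$sleep_total_24)
```

If either condition fails, stopifnot() raises an error that names the failed check, which is harder to overlook than a summary table.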