Learning Goals

At the end of this exercise, you will be able to:
1. Use distinct() to find unique observations in rows.
2. Use mutate() to create new columns from existing columns.
3. Use mutate() with across() and where() to transform multiple columns that meet specific criteria.
4. Use if_else() to conditionally change values in a column.
5. Clean data using janitor and mutate().

Load the tidyverse

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("janitor")
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library("palmerpenguins") #load the palmerpenguins package
## 
## Attaching package: 'palmerpenguins'
## 
## The following objects are masked from 'package:datasets':
## 
##     penguins, penguins_raw
options(scipen=999) #turn off scientific notation

Palmerpenguins

These data are from: Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

Review & Practice

Recall that the verbs select() and filter() are used to extract columns and rows from a data frame. We use the pipe operator %>% to connect multiple functions together.

  1. Select species, island, and body mass from the penguins data. Arrange results by body mass.
penguins %>% 
  select(species, island, body_mass_g) %>% 
  arrange(body_mass_g)
## # A tibble: 344 × 3
##    species   island    body_mass_g
##    <fct>     <fct>           <int>
##  1 Chinstrap Dream            2700
##  2 Adelie    Biscoe           2850
##  3 Adelie    Biscoe           2850
##  4 Adelie    Biscoe           2900
##  5 Adelie    Dream            2900
##  6 Adelie    Torgersen        2900
##  7 Chinstrap Dream            2900
##  8 Adelie    Biscoe           2925
##  9 Adelie    Dream            2975
## 10 Adelie    Dream            3000
## # ℹ 334 more rows
  2. Filter the penguins data to only include observations from Biscoe and Dream islands.
penguins %>% 
  filter(island=="Biscoe" | island=="Dream")
## # A tibble: 292 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # ℹ 282 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
penguins %>% 
  filter(island==c("Biscoe", "Dream")) #caution: == recycles the vector, so matching rows are dropped
## # A tibble: 146 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           35.9          19.2               189        3800
##  3 Adelie  Biscoe           38.8          17.2               180        3800
##  4 Adelie  Biscoe           40.6          18.6               183        3550
##  5 Adelie  Biscoe           37.9          18.6               172        3150
##  6 Adelie  Dream            37.2          18.1               178        3900
##  7 Adelie  Dream            40.9          18.9               184        3900
##  8 Adelie  Dream            39.2          21.1               196        4150
##  9 Adelie  Dream            42.2          18.5               180        3550
## 10 Adelie  Dream            39.8          19.1               184        4650
## # ℹ 136 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
penguins %>% 
  filter(island %in% c("Biscoe", "Dream"))
## # A tibble: 292 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # ℹ 282 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
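The second call above returns only 146 rows instead of 292 because == compares element by element and recycles the shorter vector, while %in% tests each value for membership. The sketch below illustrates the difference on a small made-up vector (the island labels and object names are hypothetical, not taken from the penguins data).

```r
# Hypothetical island labels, used only to illustrate vector recycling
island <- c("Biscoe", "Biscoe", "Dream", "Dream", "Torgersen", "Dream")
targets <- c("Biscoe", "Dream")

# == recycles targets to Biscoe, Dream, Biscoe, Dream, Biscoe, Dream
# and compares position by position, so a "Biscoe" in an even position is missed
recycled <- island == targets

# %in% asks, for each element, whether it appears anywhere in targets
membership <- island %in% targets

sum(recycled)   #3 rows would be kept
sum(membership) #5 rows would be kept
```

For this reason, %in% (or an explicit | between conditions) is the safer choice whenever a column is compared against more than one value.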
  3. Make a plot that shows the relationship between body mass and flipper length. How does this compare among different species?
penguins %>% 
  ggplot(aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

distinct()

distinct() returns the unique rows of a data frame. This is a little tricky: when you name columns, it can look like it is working column-wise, but it is still comparing rows, restricted to the columns you named.

One helpful approach to new data is to find any duplicated rows. If we first look at the dimensions of the penguins data, we see it has 344 rows and 8 columns.

dim(penguins)
## [1] 344   8

Calling distinct() with no arguments compares entire rows. The result still has 344 rows, so there are no duplicates: every row is a unique combination of values across all variables.

penguins %>% 
  distinct()
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

But if we only look at species, we can see that there are only 3 unique species in the data.

penguins %>% 
  distinct(species)
## # A tibble: 3 × 1
##   species  
##   <fct>    
## 1 Adelie   
## 2 Gentoo   
## 3 Chinstrap

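What if we want to know which islands each species occurs on? We can pass both columns to distinct(). Adding .keep_all=TRUE keeps the remaining columns, filled with the values from the first row of each species and island combination.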

penguins %>% 
  distinct(species, island, .keep_all=TRUE)
## # A tibble: 5 × 8
##   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie    Torgersen           39.1          18.7               181        3750
## 2 Adelie    Biscoe              37.8          18.3               174        3400
## 3 Adelie    Dream               39.5          16.7               178        3250
## 4 Gentoo    Biscoe              46.1          13.2               211        4500
## 5 Chinstrap Dream               46.5          17.9               192        3500
## # ℹ 2 more variables: sex <fct>, year <int>
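To make the behavior of distinct() with several columns easier to see, here is a minimal sketch on a toy tibble. The data and the object names (birds, combos, first_rows, n_per_combo) are hypothetical and chosen only for illustration; count() is included as one possible way to also see how many rows fall in each combination.

```r
library(dplyr)

# Hypothetical toy data, not the penguins data
birds <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  island  = c("Torgersen", "Torgersen", "Biscoe", "Biscoe", "Biscoe"),
  mass_g  = c(3750, 3800, 3400, 4500, 5700)
)

# Only the named columns are returned: one row per combination
combos <- birds %>% 
  distinct(species, island)

# .keep_all=TRUE keeps the other columns, taken from the first row of each combination
first_rows <- birds %>% 
  distinct(species, island, .keep_all=TRUE)

# count() finds the same combinations and adds the number of rows in each
n_per_combo <- birds %>% 
  count(species, island)
```

Here combos has three rows, and first_rows carries the mass of the first bird seen for each combination rather than any summary of the group.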

mutate()

mutate() is another verb that acts on columns. It allows us to create new columns from existing columns in a data frame. By default, mutate() adds the new columns to the end of the data frame. Let’s create a new column that converts body mass from grams to kilograms.

penguins %>%
  mutate(body_mass_kg = body_mass_g/1000) %>% 
  select(species, body_mass_g, body_mass_kg) %>% 
  arrange(body_mass_kg)
## # A tibble: 344 × 3
##    species   body_mass_g body_mass_kg
##    <fct>           <int>        <dbl>
##  1 Chinstrap        2700         2.7 
##  2 Adelie           2850         2.85
##  3 Adelie           2850         2.85
##  4 Adelie           2900         2.9 
##  5 Adelie           2900         2.9 
##  6 Adelie           2900         2.9 
##  7 Chinstrap        2900         2.9 
##  8 Adelie           2925         2.92
##  9 Adelie           2975         2.98
## 10 Adelie           3000         3   
## # ℹ 334 more rows

mutate() and across()

We use across() within mutate() to apply a function to multiple columns. This is especially helpful when cleaning data. For example, let’s say we want to convert all columns that end with mm from millimeters to centimeters. We can use across() to do this, and then rename the converted columns inside select() so the names match the new units.

penguins %>%
  mutate(across(ends_with("mm"), ~./10)) %>%
  select(species, 
         bill_length_cm=bill_length_mm, 
         bill_depth_cm=bill_depth_mm, 
         flipper_length_cm=flipper_length_mm)
## # A tibble: 344 × 4
##    species bill_length_cm bill_depth_cm flipper_length_cm
##    <fct>            <dbl>         <dbl>             <dbl>
##  1 Adelie            3.91          1.87              18.1
##  2 Adelie            3.95          1.74              18.6
##  3 Adelie            4.03          1.8               19.5
##  4 Adelie           NA            NA                 NA  
##  5 Adelie            3.67          1.93              19.3
##  6 Adelie            3.93          2.06              19  
##  7 Adelie            3.89          1.78              18.1
##  8 Adelie            3.92          1.96              19.5
##  9 Adelie            3.41          1.81              19.3
## 10 Adelie            4.2           2.02              19  
## # ℹ 334 more rows

What does the ~./10 mean? The ~ starts a formula, which across() treats as a short anonymous (lambda) function. The . represents the current column being processed. So, ./10 means “take the current column and divide it by 10”. This operation is applied to all columns that end with mm.
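The same function can be written in several ways. The sketch below applies four equivalent spellings to a hypothetical toy tibble (bills and the v1 to v4 names are made up for illustration) and checks that they give the same result. The \(x) form assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data with two columns ending in "mm"
bills <- tibble(
  id             = c("a", "b"),
  bill_length_mm = c(39.1, 46.5),
  bill_depth_mm  = c(18.7, 17.9)
)

# Formula shorthand, as used above
v1 <- bills %>% mutate(across(ends_with("mm"), ~./10))

# Same shorthand with the more explicit .x pronoun
v2 <- bills %>% mutate(across(ends_with("mm"), ~.x/10))

# Anonymous function shorthand (needs R 4.1 or later)
v3 <- bills %>% mutate(across(ends_with("mm"), \(x) x/10))

# Ordinary anonymous function
v4 <- bills %>% mutate(across(ends_with("mm"), function(x) x/10))

# All four versions should produce identical tibbles
identical(v1, v2) && identical(v2, v3) && identical(v3, v4)
```

Which spelling to use is mostly a matter of taste; the id column is untouched in every version because it does not match ends_with("mm").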

Cleaning Data

Cleaning raw data is an essential but tedious step in data analysis. It’s impossible to predict every scenario that you will come across, but there are some common issues that we can learn to address.

We already learned how to use rename() to change column names. We also learned how to rename columns from within select(). But this is inefficient when a dataset has many columns.

Let’s have a look at some new data focused on mammal life histories. The data are from: S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402.

mammals <- read_csv("data/mammal_lifehistories_v2.csv")
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): order, family, Genus, species
## dbl (9): mass, gestation, newborn, weaning, wean mass, AFR, max. life, litte...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What is the structure of the data? Are there any NA’s or other issues?

glimpse(mammals)
## Rows: 1,440
## Columns: 13
## $ order          <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Artiod…
## $ family         <chr> "Antilocapridae", "Bovidae", "Bovidae", "Bovidae", "Bov…
## $ Genus          <chr> "Antilocapra", "Addax", "Aepyceros", "Alcelaphus", "Amm…
## $ species        <chr> "americana", "nasomaculatus", "melampus", "buselaphus",…
## $ mass           <dbl> 45375.0, 182375.0, 41480.0, 150000.0, 28500.0, 55500.0,…
## $ gestation      <dbl> 8.13, 9.39, 6.35, 7.90, 6.80, 5.08, 5.72, 5.50, 8.93, 9…
## $ newborn        <dbl> 3246.36, 5480.00, 5093.00, 10166.67, -999.00, 3810.00, …
## $ weaning        <dbl> 3.00, 6.50, 5.63, 6.50, -999.00, 4.00, 4.04, 2.13, 10.7…
## $ `wean mass`    <dbl> 8900, -999, 15900, -999, -999, -999, -999, -999, 157500…
## $ AFR            <dbl> 13.53, 27.27, 16.66, 23.02, -999.00, 14.89, 10.23, 20.1…
## $ `max. life`    <dbl> 142, 308, 213, 240, -999, 251, 228, 255, 300, 324, 300,…
## $ `litter size`  <dbl> 1.85, 1.00, 1.00, 1.00, 1.00, 1.37, 1.00, 1.00, 1.00, 1…
## $ `litters/year` <dbl> 1.00, 0.99, 0.95, -999.00, -999.00, 2.00, -999.00, 1.89…

One thing to notice is that the column names are inconsistent: some are capitalized (Genus, AFR) and others contain spaces or punctuation (wean mass, max. life, litters/year). This is going to cause problems for us down the line. We could rename each column, one at a time, using rename(), but that would be tedious. Instead, we can use the clean_names() function from the janitor package to fix all of the column names at once.

mammals <- mammals %>% 
  clean_names()
glimpse(mammals)
## Rows: 1,440
## Columns: 13
## $ order        <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Artiodac…
## $ family       <chr> "Antilocapridae", "Bovidae", "Bovidae", "Bovidae", "Bovid…
## $ genus        <chr> "Antilocapra", "Addax", "Aepyceros", "Alcelaphus", "Ammod…
## $ species      <chr> "americana", "nasomaculatus", "melampus", "buselaphus", "…
## $ mass         <dbl> 45375.0, 182375.0, 41480.0, 150000.0, 28500.0, 55500.0, 3…
## $ gestation    <dbl> 8.13, 9.39, 6.35, 7.90, 6.80, 5.08, 5.72, 5.50, 8.93, 9.1…
## $ newborn      <dbl> 3246.36, 5480.00, 5093.00, 10166.67, -999.00, 3810.00, 39…
## $ weaning      <dbl> 3.00, 6.50, 5.63, 6.50, -999.00, 4.00, 4.04, 2.13, 10.71,…
## $ wean_mass    <dbl> 8900, -999, 15900, -999, -999, -999, -999, -999, 157500, …
## $ afr          <dbl> 13.53, 27.27, 16.66, 23.02, -999.00, 14.89, 10.23, 20.13,…
## $ max_life     <dbl> 142, 308, 213, 240, -999, 251, 228, 255, 300, 324, 300, 3…
## $ litter_size  <dbl> 1.85, 1.00, 1.00, 1.00, 1.00, 1.37, 1.00, 1.00, 1.00, 1.0…
## $ litters_year <dbl> 1.00, 0.99, 0.95, -999.00, -999.00, 2.00, -999.00, 1.89, …
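To see the renaming on its own, the sketch below passes the original names that changed (copied from the first glimpse() output above) to make_clean_names(), the janitor function that works on a character vector rather than a data frame. The object names old_names and new_names are made up for illustration.

```r
library(janitor)

# Original column names that were changed, copied from the first glimpse() output
old_names <- c("Genus", "wean mass", "AFR", "max. life", "litter size", "litters/year")

# make_clean_names() applies the same cleaning to a plain character vector
new_names <- make_clean_names(old_names)

# Side-by-side comparison of old and new names
data.frame(old_names, new_names)
```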

Notice that clean_names() has converted all column names to lowercase and replaced spaces and punctuation with underscores. But no adjustments were made to the data itself. What if we want to change observations from upper case to lower case?

mammals %>% 
  mutate(across(c("order", "family"), tolower)) #specific columns
## # A tibble: 1,440 × 13
##    order  family genus species   mass gestation newborn weaning wean_mass    afr
##    <chr>  <chr>  <chr> <chr>    <dbl>     <dbl>   <dbl>   <dbl>     <dbl>  <dbl>
##  1 artio… antil… Anti… americ… 4.54e4      8.13   3246.    3         8900   13.5
##  2 artio… bovid… Addax nasoma… 1.82e5      9.39   5480     6.5       -999   27.3
##  3 artio… bovid… Aepy… melamp… 4.15e4      6.35   5093     5.63     15900   16.7
##  4 artio… bovid… Alce… busela… 1.5 e5      7.9   10167.    6.5       -999   23.0
##  5 artio… bovid… Ammo… clarkei 2.85e4      6.8    -999  -999         -999 -999  
##  6 artio… bovid… Ammo… lervia  5.55e4      5.08   3810     4         -999   14.9
##  7 artio… bovid… Anti… marsup… 3   e4      5.72   3910     4.04      -999   10.2
##  8 artio… bovid… Anti… cervic… 3.75e4      5.5    3846     2.13      -999   20.1
##  9 artio… bovid… Bison bison   4.98e5      8.93  20000    10.7     157500   29.4
## 10 artio… bovid… Bison bonasus 5   e5      9.14  23000.    6.6       -999   30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>

We can also apply tolower() to every column with everything(). But notice what happens to the numeric columns: they are all converted to character.

mammals %>% 
  mutate(across(everything(), tolower)) #all columns
## # A tibble: 1,440 × 13
##    order    family genus species mass  gestation newborn weaning wean_mass afr  
##    <chr>    <chr>  <chr> <chr>   <chr> <chr>     <chr>   <chr>   <chr>     <chr>
##  1 artioda… antil… anti… americ… 45375 8.13      3246.36 3       8900      13.53
##  2 artioda… bovid… addax nasoma… 1823… 9.39      5480    6.5     -999      27.27
##  3 artioda… bovid… aepy… melamp… 41480 6.35      5093    5.63    15900     16.66
##  4 artioda… bovid… alce… busela… 1500… 7.9       10166.… 6.5     -999      23.02
##  5 artioda… bovid… ammo… clarkei 28500 6.8       -999    -999    -999      -999 
##  6 artioda… bovid… ammo… lervia  55500 5.08      3810    4       -999      14.89
##  7 artioda… bovid… anti… marsup… 30000 5.72      3910    4.04    -999      10.23
##  8 artioda… bovid… anti… cervic… 37500 5.5       3846    2.13    -999      20.13
##  9 artioda… bovid… bison bison   4976… 8.93      20000   10.71   157500    29.45
## 10 artioda… bovid… bison bonasus 5000… 9.14      23000.… 6.6     -999      29.99
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <chr>, litter_size <chr>, litters_year <chr>

For this reason, it is better to use where() so that only the character columns are changed.

mammals <- mammals %>%
  mutate(across(where(is.character), tolower)) #all character columns

if_else()

We briefly introduce if_else() here because it lets us use mutate() without changing every value in a column in the same way. You first specify a logical condition, then the value to use where the condition is TRUE, and lastly the value to use where it is FALSE. The examples below use base R’s ifelse(), which takes the same three arguments; dplyr’s if_else() is a stricter version that checks that both outcomes have compatible types.

Have a look at the data from mammals below. Notice that the values for newborn include -999.00. A value like this is sometimes used as a placeholder for NA, which is a really bad idea because it is treated as real data in calculations. We can use ifelse() or if_else() to replace -999.00 with NA.

mammals %>% 
  select(genus, species, newborn) %>% 
  arrange(newborn)
## # A tibble: 1,440 × 3
##    genus       species        newborn
##    <chr>       <chr>            <dbl>
##  1 ammodorcas  clarkei           -999
##  2 bos         javanicus         -999
##  3 bubalus     depressicornis    -999
##  4 bubalus     mindorensis       -999
##  5 capra       falconeri         -999
##  6 cephalophus niger             -999
##  7 cephalophus nigrifrons        -999
##  8 cephalophus natalensis        -999
##  9 cephalophus leucogaster       -999
## 10 cephalophus ogilbyi           -999
## # ℹ 1,430 more rows
mammals %>% 
  select(genus, species, newborn) %>%
  mutate(newborn_new = ifelse(newborn == -999.00, NA, newborn)) %>% 
  arrange(newborn)
## # A tibble: 1,440 × 4
##    genus       species        newborn newborn_new
##    <chr>       <chr>            <dbl>       <dbl>
##  1 ammodorcas  clarkei           -999          NA
##  2 bos         javanicus         -999          NA
##  3 bubalus     depressicornis    -999          NA
##  4 bubalus     mindorensis       -999          NA
##  5 capra       falconeri         -999          NA
##  6 cephalophus niger             -999          NA
##  7 cephalophus nigrifrons        -999          NA
##  8 cephalophus natalensis        -999          NA
##  9 cephalophus leucogaster       -999          NA
## 10 cephalophus ogilbyi           -999          NA
## # ℹ 1,430 more rows
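As a minimal sketch of how the alternatives compare, the code below replaces -999 in a hypothetical toy tibble three ways: with base R’s ifelse(), with dplyr’s if_else(), and with dplyr’s na_if(), which covers the common case of turning one specific value into NA. The tibble and the new column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data using -999 as a missing-value placeholder
toy <- tibble(
  species = c("a", "b", "c"),
  newborn = c(3246.36, -999, 5093)
)

cleaned <- toy %>% 
  mutate(
    # base R version, as used above
    newborn_base  = ifelse(newborn == -999, NA, newborn),
    # dplyr version; NA_real_ keeps both outcomes numeric
    newborn_dplyr = if_else(newborn == -999, NA_real_, newborn),
    # na_if() replaces every value equal to -999 with NA
    newborn_na_if = na_if(newborn, -999)
  )

cleaned
```

All three new columns should hold the same values, with NA in the second row only.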

Practice

  1. Following the example above, convert all -999 values in the mammals dataframe to NA.
mammals <- mammals %>%
  mutate(across(c(mass, wean_mass, gestation, max_life, newborn, weaning, litter_size, afr, litters_year),
                ~ifelse(. == -999, NA, .)))
summary(mammals)
##     order              family             genus             species         
##  Length:1440        Length:1440        Length:1440        Length:1440       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       mass             gestation          newborn              weaning      
##  Min.   :        2   Min.   : 0.4900   Min.   :      0.21   Min.   : 0.300  
##  1st Qu.:       61   1st Qu.: 0.9925   1st Qu.:      4.40   1st Qu.: 0.920  
##  Median :      606   Median : 2.1100   Median :     43.70   Median : 1.690  
##  Mean   :   407701   Mean   : 3.8630   Mean   :  12126.55   Mean   : 3.967  
##  3rd Qu.:     8554   3rd Qu.: 6.0000   3rd Qu.:    542.50   3rd Qu.: 4.840  
##  Max.   :149000000   Max.   :21.4600   Max.   :2250000.00   Max.   :48.000  
##  NA's   :85          NA's   :418       NA's   :595          NA's   :619     
##    wean_mass               afr            max_life     litter_size    
##  Min.   :       2.1   Min.   :  0.70   Min.   :  12   Min.   : 1.000  
##  1st Qu.:      20.1   1st Qu.:  4.50   1st Qu.:  84   1st Qu.: 1.018  
##  Median :     102.6   Median : 12.00   Median : 192   Median : 2.500  
##  Mean   :   60220.5   Mean   : 22.44   Mean   : 224   Mean   : 2.805  
##  3rd Qu.:    2000.0   3rd Qu.: 28.24   3rd Qu.: 288   3rd Qu.: 4.000  
##  Max.   :19075000.0   Max.   :210.00   Max.   :1368   Max.   :14.180  
##  NA's   :1039         NA's   :607      NA's   :841    NA's   :84      
##   litters_year  
##  Min.   :0.140  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.636  
##  3rd Qu.:2.000  
##  Max.   :7.500  
##  NA's   :689
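Listing every numeric column by name works, but it is easy to miss one. One possible alternative, sketched below on a hypothetical toy tibble (toy and toy_clean are made-up names), is to let where(is.numeric) pick the columns and then count the NA's per column to confirm the replacement.

```r
library(dplyr)

# Hypothetical toy data: one character column, two numeric columns
toy <- tibble(
  genus     = c("bison", "addax", "capra"),
  mass      = c(497666.7, 182375, -999),
  gestation = c(8.93, -999, -999)
)

# where(is.numeric) selects every numeric column, so none are listed by name
toy_clean <- toy %>% 
  mutate(across(where(is.numeric), ~ifelse(. == -999, NA, .)))

# Count the NA's in each column to confirm the replacement
na_counts <- colSums(is.na(toy_clean))
na_counts
```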
  2. In the mammals data, make a new column mass_kg that converts mass from grams to kilograms. Select the columns genus, species, mass, and mass_kg, and arrange the data by mass_kg in descending order. What is the common name for the species with the highest mass?
    blue whale
mammals %>% 
  mutate(mass_kg = mass/1000) %>% 
  select(genus, species, mass, mass_kg) %>% 
  arrange(desc(mass_kg))
## # A tibble: 1,440 × 4
##    genus        species             mass mass_kg
##    <chr>        <chr>              <dbl>   <dbl>
##  1 balaenoptera musculus      149000000  149000 
##  2 balaena      mysticetus     80000000   80000 
##  3 balaenoptera physalus       66800000   66800 
##  4 megaptera    novaeangliae   30000000   30000 
##  5 eschrichtius robustus       25066667.  25067.
##  6 eubalaena    australis      23000000   23000 
##  7 eubalaena    glacialis      23000000   23000 
##  8 balaenoptera edeni          20000000   20000 
##  9 balaenoptera acutorostrata  16266667.  16267.
## 10 physeter     catodon        15400000   15400 
## # ℹ 1,430 more rows
  3. What is the relationship between gestation and newborn mass?
mammals %>% 
  mutate(newborn_gestation_ratio = log10(newborn/gestation)) %>% #log10 of newborn mass divided by gestation length
  select(genus, species, newborn_gestation_ratio) %>% 
  arrange(desc(newborn_gestation_ratio))
## # A tibble: 1,440 × 3
##    genus        species      newborn_gestation_ratio
##    <chr>        <chr>                          <dbl>
##  1 balaenoptera musculus                        5.32
##  2 balaenoptera physalus                        5.23
##  3 megaptera    novaeangliae                    5.06
##  4 physeter     catodon                         4.78
##  5 balaenoptera borealis                        4.76
##  6 eschrichtius robustus                        4.63
##  7 orcinus      orca                            4.03
##  8 kogia        breviceps                       3.87
##  9 globicephala melas                           3.87
## 10 monodon      monoceros                       3.73
## # ℹ 1,430 more rows
mammals %>% 
  ggplot(aes(x=gestation, y=log10(newborn))) +
  geom_point()+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 673 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 673 rows containing missing values or values outside the scale range
## (`geom_point()`).

  4. Which mammal has the longest life span in years?
    fin whale
mammals %>% 
  select(family, genus, species, max_life) %>% 
  mutate(max_life_new = max_life/12) %>% #max_life is in months, so divide by 12 for years
  arrange(desc(max_life_new))
## # A tibble: 1,440 × 5
##    family          genus        species      max_life max_life_new
##    <chr>           <chr>        <chr>           <dbl>        <dbl>
##  1 balaenopteridae balaenoptera physalus         1368          114
##  2 balaenopteridae balaenoptera musculus         1320          110
##  3 balaenidae      balaena      mysticetus       1200          100
##  4 delphinidae     orcinus      orca             1080           90
##  5 ziphiidae       berardius    bairdii          1008           84
##  6 elephantidae    elephas      maximus           960           80
##  7 balaenopteridae megaptera    novaeangliae      924           77
##  8 physeteridae    physeter     catodon           924           77
##  9 balaenopteridae balaenoptera borealis          888           74
## 10 dugongidae      dugong       dugon             876           73
## # ℹ 1,430 more rows
  5. Do larger mammals live longer?
mammals %>% 
  ggplot(aes(x=log10(mass), y=max_life)) +
  geom_point()+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 848 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 848 rows containing missing values or values outside the scale range
## (`geom_point()`).
