Learning Goals

At the end of this exercise, you will be able to:
1. Understand and apply the syntax of building plots using ggplot2.
2. Build a boxplot using ggplot2.
3. Build a scatterplot using ggplot2.
4. Build a barplot using ggplot2 and show the difference between stat="count" and stat="identity".

Where have we been, and where are we going?

At this point you should feel comfortable working in RStudio and using dplyr and tidyr. You also know how to produce statistical summaries of data and deal with NAs. It is OK if you need to go back through the labs to find bits of code that work for you; that’s what most people do!

Resources
- ggplot2 cheatsheet

Libraries

library(tidyverse)
library(naniar)
library(janitor)

Grammar of Graphics

The ability to quickly produce and edit graphs and charts is a strength of R. These data visualizations are produced with the package ggplot2, which is a core part of the tidyverse. The syntax for using ggplot is specific and common to all of the plots. It follows what Leland Wilkinson called the Grammar of Graphics, which Hadley Wickham adapted into the layered grammar used by ggplot2. The “gg” in ggplot stands for grammar of graphics.

Philosophy

What makes a good chart? In my opinion a good chart is elegant in its simplicity. It provides a clean, clear visual of the data without being overwhelming to the reader. This can be hard to do and takes some careful thinking. Always keep in mind that the reader will almost never know the data as well as you do so you need to be mindful about presenting the facts.

Data Types

We first need to define some of the data types we will use to build plots.

  • discrete: quantitative data that only contains integers
  • continuous: quantitative data that can take any numerical value
  • categorical: qualitative data that can take on a limited number of values
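
As a rough illustration of these three types, the built-in iris data can be inspected with base R. Treating the per-species tallies as the "discrete" example is an assumption made for this sketch; the measurements are stored as doubles and the species as a factor.

```r
# Continuous: measurements stored as doubles
class(iris$Sepal.Length)

# Categorical: a factor with a limited set of levels
class(iris$Species)
levels(iris$Species)

# Discrete: counts are whole numbers (here, flowers per species)
species_counts <- table(iris$Species)
species_counts
```

Knowing which type each variable is helps when choosing a geom_, since each geom_ works with specific types of data.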

Basics

The syntax used by ggplot takes some practice to get used to, especially for customizing plots, but the basic elements are the same. It is helpful to think of plots as being built up in layers.

In short, plot= data + geom_ + aesthetics.

We start by calling the ggplot function, identifying the data, and specifying the axes. We then add the geom type to describe how we want our data represented. Each geom_ works with specific types of data and R is capable of building plots of single variables, multiple variables, and even maps. Lastly, we add aesthetics.

Example

To make things easy, let’s start with some built in data.

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

To make a plot, we need to first specify the data and map the aesthetics. The aesthetics include how each variable in our data set will be used. In the example below, I am using the aes() function to identify the x and y variables in the plot.

ggplot(data=iris, #specify the data
       mapping=aes(x=Species, y=Petal.Length)) #map the aesthetics

Notice that we have a nice background, labeled axes, and even a value range for our variable on the y-axis, but no plot. This is because we need to tell ggplot how we want our data represented. This is called the geometry, or geom. There are many geom_ functions; see the ggplot2 cheatsheet.

Here we specify that we want a boxplot, indicated by geom_boxplot().

ggplot(data=iris, #specify the data
       mapping=aes(x=Species, y=Petal.Length))+ #map the aesthetics
  geom_boxplot() #add the plot type
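
The idea of building plots in layers can also be sketched by storing each step in an object. The object names base, boxes, and dots are hypothetical and chosen only for this illustration.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only (axes, but nothing drawn yet)
base <- ggplot(data = iris, mapping = aes(x = Species, y = Petal.Length))

# Step 2: add a geom to decide how the data are represented
boxes <- base + geom_boxplot()

# The same base can be reused with a different geom
dots <- base + geom_point()

boxes
```

Printing base on its own gives the empty panel described above; printing boxes or dots gives the same axes with a geometry drawn on top.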

Practice

  1. Use the iris data to build a scatterplot that compares sepal length vs. sepal width. Use the cheat sheet for help to find the correct geom_ for a scatterplot.
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
ggplot(data=iris, 
       mapping=aes(x=Sepal.Width, y=Sepal.Length))+
  geom_point()

Scatterplots and barplots

Now that we have a general idea of the syntax, let’s start by working with two common plots: 1) scatter plots and 2) bar plots.

Data

Database of vertebrate home range sizes.
Reference: Tamburello N, Cote IM, Dulvy NK (2015) Energy and the scaling of animal space use. The American Naturalist 186(2):196-211. http://dx.doi.org/10.1086/682070.
Data: http://datadryad.org/resource/doi:10.5061/dryad.q5j65/1

homerange <- read_csv("data/Tamburelloetal_HomeRangeDatabase.csv")
## Rows: 569 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): taxon, common.name, class, order, family, genus, species, primarym...
## dbl  (8): mean.mass.g, log10.mass, mean.hra.m2, log10.hra, dimension, preyma...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Practice

  1. What is the structure of the homerange data? Does it have any NA’s? Is it tidy? Do a quick exploratory analysis of your choice below.
glimpse(homerange)
## Rows: 569
## Columns: 24
## $ taxon                      <chr> "lake fishes", "river fishes", "river fishe…
## $ common.name                <chr> "american eel", "blacktail redhorse", "cent…
## $ class                      <chr> "actinopterygii", "actinopterygii", "actino…
## $ order                      <chr> "anguilliformes", "cypriniformes", "cyprini…
## $ family                     <chr> "anguillidae", "catostomidae", "cyprinidae"…
## $ genus                      <chr> "anguilla", "moxostoma", "campostoma", "cli…
## $ species                    <chr> "rostrata", "poecilura", "anomalum", "fundu…
## $ primarymethod              <chr> "telemetry", "mark-recapture", "mark-recapt…
## $ N                          <chr> "16", NA, "20", "26", "17", "5", "2", "2", …
## $ mean.mass.g                <dbl> 887.00, 562.00, 34.00, 4.00, 4.00, 3525.00,…
## $ log10.mass                 <dbl> 2.9479236, 2.7497363, 1.5314789, 0.6020600,…
## $ alternative.mass.reference <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mean.hra.m2                <dbl> 282750.00, 282.10, 116.11, 125.50, 87.10, 3…
## $ log10.hra                  <dbl> 5.4514026, 2.4504031, 2.0648696, 2.0986437,…
## $ hra.reference              <chr> "Minns, C. K. 1995. Allometry of home range…
## $ realm                      <chr> "aquatic", "aquatic", "aquatic", "aquatic",…
## $ thermoregulation           <chr> "ectotherm", "ectotherm", "ectotherm", "ect…
## $ locomotion                 <chr> "swimming", "swimming", "swimming", "swimmi…
## $ trophic.guild              <chr> "carnivore", "carnivore", "carnivore", "car…
## $ dimension                  <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3…
## $ preymass                   <dbl> NA, NA, NA, NA, NA, NA, 1.39, NA, NA, NA, N…
## $ log10.preymass             <dbl> NA, NA, NA, NA, NA, NA, 0.1430148, NA, NA, …
## $ PPMR                       <dbl> NA, NA, NA, NA, NA, NA, 530, NA, NA, NA, NA…
## $ prey.size.reference        <chr> NA, NA, NA, NA, NA, NA, "Brose U, et al. 20…
naniar::miss_var_summary(homerange)
## # A tibble: 24 × 3
##    variable                   n_miss pct_miss
##    <chr>                       <int>    <num>
##  1 alternative.mass.reference    561   98.6  
##  2 preymass                      502   88.2  
##  3 log10.preymass                502   88.2  
##  4 PPMR                          502   88.2  
##  5 prey.size.reference           502   88.2  
##  6 N                             375   65.9  
##  7 primarymethod                   1    0.176
##  8 taxon                           0    0    
##  9 common.name                     0    0    
## 10 class                           0    0    
## # ℹ 14 more rows

1. Scatter Plots

Scatter plots are good at revealing relationships that are not readily visible in the raw data. For now, we will not add regression (“best fit”) lines or calculate any R² values.

In the case below, we are exploring whether or not there is a relationship between animal mass and home range. We are using the log transformed values because there is a large difference in mass and home range among the different species in the data.
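
To see what the log transformation does, here is a small sketch using the first four mean.mass.g values shown in the glimpse() output above. The vector names are hypothetical; the values are copied from that output.

```r
# First four body masses (grams) shown in the glimpse() output above
mean_mass_g <- c(887, 562, 34, 4)

# log10 compresses the range: each unit is a tenfold change in mass
log10_mass <- log10(mean_mass_g)
round(log10_mass, 7)

# Range on the raw scale versus the log scale
range(mean_mass_g)
range(log10_mass)
```

The results agree with the log10.mass column, which suggests that column is simply log10() applied to mean.mass.g.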

names(homerange)
##  [1] "taxon"                      "common.name"               
##  [3] "class"                      "order"                     
##  [5] "family"                     "genus"                     
##  [7] "species"                    "primarymethod"             
##  [9] "N"                          "mean.mass.g"               
## [11] "log10.mass"                 "alternative.mass.reference"
## [13] "mean.hra.m2"                "log10.hra"                 
## [15] "hra.reference"              "realm"                     
## [17] "thermoregulation"           "locomotion"                
## [19] "trophic.guild"              "dimension"                 
## [21] "preymass"                   "log10.preymass"            
## [23] "PPMR"                       "prey.size.reference"
ggplot(data=homerange, #specify the data
       mapping=aes(x=log10.mass, y=log10.hra))+ #map the aesthetics
  geom_point() #add the plot type

In big data sets with lots of overlapping values, overplotting can be an issue. geom_jitter() is similar to geom_point(), but it helps with overplotting by adding a small amount of random noise to the position of each point, which separates points that would otherwise sit on top of each other.

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra))+
  geom_jitter()
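
The effect of jittering is easiest to see on data where many points share exactly the same coordinates. The sketch below uses a made-up data frame (toy) rather than the homerange data, and sets width and height to control the maximum shift in each direction.

```r
library(ggplot2)

# Toy data (made up for illustration): 20 identical points at each location
toy <- data.frame(x = rep(1:3, each = 20), y = rep(1:3, each = 20))

# geom_point() draws the 20 overlapping points at each location as one dot
overlapping <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# geom_jitter() shifts each point by a small random amount;
# width and height set the maximum shift in each direction
jittered <- ggplot(toy, aes(x = x, y = y)) +
  geom_jitter(width = 0.2, height = 0.2)

jittered
```

Because the noise is random, the plot looks slightly different each time it is drawn. Keep the amount of jitter small so that the points still reflect the underlying values.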

To add a regression (best fit) line, we just add another layer.

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra))+
  geom_point()+
  geom_smooth(method="lm", se=TRUE) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'

Practice

  1. What is the relationship between log10.hra and log10.preymass? What do you notice about how ggplot treats NA’s?
names(homerange)
##  [1] "taxon"                      "common.name"               
##  [3] "class"                      "order"                     
##  [5] "family"                     "genus"                     
##  [7] "species"                    "primarymethod"             
##  [9] "N"                          "mean.mass.g"               
## [11] "log10.mass"                 "alternative.mass.reference"
## [13] "mean.hra.m2"                "log10.hra"                 
## [15] "hra.reference"              "realm"                     
## [17] "thermoregulation"           "locomotion"                
## [19] "trophic.guild"              "dimension"                 
## [21] "preymass"                   "log10.preymass"            
## [23] "PPMR"                       "prey.size.reference"
ggplot(homerange, mapping=aes(x=log10.hra, y=log10.preymass))+
  geom_point(na.rm=TRUE)+ #silently drop rows with NA
  geom_smooth(method="lm", se=FALSE, na.rm=FALSE) #warn about dropped rows
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 502 rows containing non-finite outside the scale range
## (`stat_smooth()`).
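
The 502 rows in the warning match the 502 missing values that miss_var_summary() reported for log10.preymass. One way to anticipate this before plotting is to count the incomplete rows yourself. The sketch below uses a made-up data frame (toy) and base R only.

```r
# Toy data (made up for illustration) with missing values in y
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, NA, 3.9, NA, 6.2))

# Number of rows that would be dropped from a plot of y against x
n_missing <- sum(!complete.cases(toy[, c("x", "y")]))
n_missing

# Dropping the incomplete rows first makes the removal explicit
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

Whether to filter first or rely on na.rm is a matter of preference; filtering first keeps the number of plotted observations visible in the code.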

Bar Plot: geom_bar()

The simplest type of bar plot counts the number of observations in each category of a categorical variable. In this case, we want to know how many observations there are for each value of the variable trophic.guild. Notice that we do not specify a y-axis because geom_bar() plots the count by default.

names(homerange)
##  [1] "taxon"                      "common.name"               
##  [3] "class"                      "order"                     
##  [5] "family"                     "genus"                     
##  [7] "species"                    "primarymethod"             
##  [9] "N"                          "mean.mass.g"               
## [11] "log10.mass"                 "alternative.mass.reference"
## [13] "mean.hra.m2"                "log10.hra"                 
## [15] "hra.reference"              "realm"                     
## [17] "thermoregulation"           "locomotion"                
## [19] "trophic.guild"              "dimension"                 
## [21] "preymass"                   "log10.preymass"            
## [23] "PPMR"                       "prey.size.reference"
homerange %>% 
  count(trophic.guild)
## # A tibble: 2 × 2
##   trophic.guild     n
##   <chr>         <int>
## 1 carnivore       342
## 2 herbivore       227

Also notice that we can use pipes! The pipe supplies the data, and aes() is then matched to the mapping= argument by position, so mapping= is often left out.

homerange %>% 
  ggplot(aes(x=trophic.guild)) + 
  geom_bar() #good for counts
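
As a quick check that the piped form and the fully named form are equivalent, the sketch below builds the same bar plot both ways using the built-in iris data. The object names explicit and piped are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Fully named arguments
explicit <- ggplot(data = iris, mapping = aes(x = Species)) +
  geom_bar()

# Piped data, with aes() matched to mapping= by position
piped <- iris %>%
  ggplot(aes(x = Species)) +
  geom_bar()

# Both specifications produce the same counts
layer_data(explicit)$count
layer_data(piped)$count
```

layer_data() returns the data ggplot2 computed for a layer, which is a convenient way to compare two plots without looking at them side by side.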

Bar Plot: geom_col()

Unlike geom_bar(), which counts observations by default, geom_col() takes both an x and a y variable and draws each bar at the value given in the data.

homerange %>% 
  filter(family=="salmonidae") %>%
  select(common.name, log10.mass) %>% 
  ggplot(aes(y=common.name, x=log10.mass))+ #notice the switch in x and y
  geom_col()

geom_bar() with stat="identity"

stat="identity" allows us to map a variable to the y-axis so that we aren’t restricted to counts.

homerange %>% 
  filter(family=="salmonidae") %>% 
  ggplot(aes(x=common.name, y=log10.mass))+
  geom_bar(stat="identity")
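
The difference between stat="count" and stat="identity" can be sketched with a small made-up data frame (obs). In the first plot ggplot2 tallies the rows; in the second we tally them ourselves and pass the result to y. All object names here are hypothetical.

```r
library(ggplot2)

# Toy data (made up for illustration): one row per observation
obs <- data.frame(guild = c("carnivore", "carnivore", "carnivore",
                            "herbivore", "herbivore"))

# stat = "count" (the geom_bar() default): ggplot2 tallies the rows itself
counted_by_ggplot <- ggplot(obs, aes(x = guild)) +
  geom_bar()

# stat = "identity": we tally first, then map the result to y
tally <- as.data.frame(table(guild = obs$guild), responseName = "n")
counted_by_us <- ggplot(tally, aes(x = guild, y = n)) +
  geom_bar(stat = "identity")   # equivalent to geom_col()

# Both plots draw bars of the same height
layer_data(counted_by_ggplot)$y
layer_data(counted_by_us)$y
```

In practice, use geom_bar() when the data have one row per observation and geom_col() (or stat="identity") when the heights have already been calculated.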

Practice

  1. Filter the homerange data to include mammals only.
names(homerange)
##  [1] "taxon"                      "common.name"               
##  [3] "class"                      "order"                     
##  [5] "family"                     "genus"                     
##  [7] "species"                    "primarymethod"             
##  [9] "N"                          "mean.mass.g"               
## [11] "log10.mass"                 "alternative.mass.reference"
## [13] "mean.hra.m2"                "log10.hra"                 
## [15] "hra.reference"              "realm"                     
## [17] "thermoregulation"           "locomotion"                
## [19] "trophic.guild"              "dimension"                 
## [21] "preymass"                   "log10.preymass"            
## [23] "PPMR"                       "prey.size.reference"
homerange %>% 
  filter(class=="mammalia")
## # A tibble: 238 × 24
##    taxon   common.name      class order family genus species primarymethod N    
##    <chr>   <chr>            <chr> <chr> <chr>  <chr> <chr>   <chr>         <chr>
##  1 mammals giant golden mo… mamm… afro… chrys… chry… trevel… telemetry*    <NA> 
##  2 mammals Grant's golden … mamm… afro… chrys… erem… granti  telemetry*    <NA> 
##  3 mammals pronghorn        mamm… arti… antil… anti… americ… telemetry*    <NA> 
##  4 mammals impala           mamm… arti… bovid… aepy… melamp… telemetry*    <NA> 
##  5 mammals hartebeest       mamm… arti… bovid… alce… busela… telemetry*    <NA> 
##  6 mammals barbary sheep    mamm… arti… bovid… ammo… lervia  telemetry*    <NA> 
##  7 mammals American bison   mamm… arti… bovid… bison bison   telemetry*    <NA> 
##  8 mammals European bison   mamm… arti… bovid… bison bonasus telemetry*    <NA> 
##  9 mammals goat             mamm… arti… bovid… capra hircus  telemetry*    <NA> 
## 10 mammals Spanish ibex     mamm… arti… bovid… capra pyrena… telemetry*    <NA> 
## # ℹ 228 more rows
## # ℹ 15 more variables: mean.mass.g <dbl>, log10.mass <dbl>,
## #   alternative.mass.reference <chr>, mean.hra.m2 <dbl>, log10.hra <dbl>,
## #   hra.reference <chr>, realm <chr>, thermoregulation <chr>, locomotion <chr>,
## #   trophic.guild <chr>, dimension <dbl>, preymass <dbl>, log10.preymass <dbl>,
## #   PPMR <dbl>, prey.size.reference <chr>
  2. Are there more herbivores or carnivores in mammals? Make a bar plot that shows their relative numbers.
homerange %>% 
  filter(class=="mammalia") %>% 
  count(trophic.guild) 
## # A tibble: 2 × 2
##   trophic.guild     n
##   <chr>         <int>
## 1 carnivore        80
## 2 herbivore       158
homerange %>% 
  filter(class=="mammalia") %>% 
  count(trophic.guild) %>% 
  ggplot(aes(x=trophic.guild, y=n))+
  geom_col()

  3. Make a bar plot that shows the masses of the 10 smallest mammals.
homerange %>%
  filter(class=="mammalia") %>% 
  top_n(-10, log10.mass) %>% 
  ggplot(aes(x=common.name, y=log10.mass))+
  geom_col()+
  coord_flip()

Wrap-up

Please review the learning goals and be sure to use the code here as a reference when completing the homework.