Learning Goals

At the end of this exercise, you will be able to:
1. Understand and apply the syntax for building plots using ggplot2.
2. Build a boxplot using ggplot2.
3. Build a scatterplot using ggplot2.
4. Build a barplot using ggplot2.

Resources

- R for Data Science 2e
- ggplot2 cheatsheet

Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
## 
## Attaching package: 'palmerpenguins'
## 
## The following objects are masked from 'package:datasets':
## 
##     penguins, penguins_raw

Grammar of Graphics

The ability to quickly produce and customize graphs is a strength of R. Data visualizations in the tidyverse are produced with ggplot2, one of its core packages. The syntax for using ggplot is specific and common to all types of plots. It follows what Leland Wilkinson called the Grammar of Graphics, which Hadley Wickham adapted into the layered grammar that ggplot2 implements. The “gg” in ggplot stands for grammar of graphics.

Philosophy

What makes a good chart? In my opinion a good chart is elegant in its simplicity. It provides a clean, clear visual of the data without being overwhelming to the reader. This can be hard to do and takes some careful thinking. Always keep in mind that the reader will almost never know the data as well as you do so you need to be mindful about how you present the facts.

Data Types

We first need to define some of the data types we will use to build plots.

  • discrete: quantitative data that only contains integers
  • continuous: quantitative data that can take any numerical value
  • categorical: qualitative data that can take on a limited number of values
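
As a rough illustration of how these types show up in R, the sketch below checks how a few columns of the penguins data (from the palmerpenguins package loaded above) are stored. Pairing each column with a data type is an interpretation for illustration, not a rule enforced by R.

```r
library(palmerpenguins)

# categorical: stored as a factor with a fixed set of levels
class(penguins$species)
levels(penguins$species)

# continuous: stored as a double (decimal values allowed)
class(penguins$bill_length_mm)

# discrete: stored as an integer (whole numbers only)
class(penguins$year)
```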

Basics

The syntax used by ggplot takes some practice to get used to, especially for customizing plots, but the basic elements are the same. It is helpful to think of plots as being built up in layers.

In short: plot = data + geom_ + aesthetics.

We start by calling the ggplot function, identifying the data, and specifying the axes. We then add the geom type to describe how we want our data represented. Each geom_ works with specific types of data, and ggplot2 can build plots of single variables, multiple variables, and even maps. Lastly, we add further aesthetics, such as color and shape.

Example

To make things easy, let’s start with some built-in data: the penguins data set from the palmerpenguins package.

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
glimpse(penguins) #notice that we have a mix of categorical and continuous data
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Let’s start by asking the question: How does body mass vary among penguin species?

To make a plot, we need to first specify the data and map the aesthetics. The aesthetics include how each variable in our data set will be used. In the example below, I am using the aes() function to identify the x and y variables in the plot.

ggplot(data=penguins, #specify the data
       mapping=aes(x=species, y=body_mass_g)) #map the aesthetics

Notice that we have a nice background, labeled axes, and even the value range of our variable on the y-axis, but no plot. This is because we need to tell ggplot how we want our data represented. This is called the geometry, or geom, and each one is added with a geom_*() function. There are many types of geom; see the ggplot2 cheatsheet.

Here we want a boxplot, specified by geom_boxplot(). We will explore boxplots in more detail later, but for now we just need an example.

ggplot(data=penguins, #specify the data
       mapping=aes(x=species, y=body_mass_g))+ #map the aesthetics
  geom_boxplot() #add the plot type
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
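
Because plots are built in layers, the first step can be saved as an object and a geom added to it later. The following is a minimal sketch of that idea. The object names base_plot and box_plot are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# step 1: data and aesthetics only (draws an empty panel)
base_plot <- ggplot(data = penguins,
                    mapping = aes(x = species, y = body_mass_g))

# step 2: add a geom layer to the saved object
box_plot <- base_plot +
  geom_boxplot(na.rm = TRUE) # na.rm=TRUE drops the NAs without a warning

box_plot # typing the name draws the plot
```

Saving the base this way makes it easy to try different geoms on the same data and aesthetics.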

Practice

  1. How does flipper length vary among penguin species?
names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
ggplot(data=penguins, 
       mapping=aes(x=species, y=flipper_length_mm))+
  geom_boxplot()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Scatterplots and barplots

Now that we have a general idea of the syntax, let’s explore two common plots: 1) scatter plots and 2) bar plots.

1. Scatter Plots

Scatter plots are good at revealing relationships that are not readily visible in the raw data.

Let’s ask the question: Is there a relationship between body mass and flipper length?

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
ggplot(data=penguins, #specify the data
       mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
  geom_point() #add the plot type
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Notice the warning! R is telling us that there are some missing values in our data. This is common in real-world data sets. ggplot2 automatically drops these rows when plotting. We can also deal with the NAs explicitly by setting the na.rm=TRUE argument, which drops them without the warning.

ggplot(data=penguins, #specify the data
       mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
  geom_point(na.rm=TRUE) #add the plot type, silently drop NAs
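
Another option, sketched here, is to remove the incomplete rows before plotting so that the plot code itself produces no warning. This assumes dplyr's filter() (loaded with the tidyverse). The object name penguins_complete is made up for illustration.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# count the missing values in each variable we want to plot
sum(is.na(penguins$body_mass_g))
sum(is.na(penguins$flipper_length_mm))

# keep only the rows where both variables are present
penguins_complete <- penguins %>%
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm))

ggplot(data = penguins_complete,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()
```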

To add a regression line (line of best fit), we add another layer.

ggplot(data=penguins, #specify the data
       mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
  geom_point()+ #add the plot type
  geom_smooth(method=lm, se=T) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
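
With method=lm, geom_smooth() fits a linear model of y on x (the formula 'y ~ x' in the message above). As a rough check, the same kind of model can be fit directly with lm(). This sketch only prints the coefficients, and the name fit is made up for illustration.

```r
library(palmerpenguins)

# linear model of flipper length on body mass
fit <- lm(flipper_length_mm ~ body_mass_g, data = penguins) # rows with NA are dropped by default

coef(fit) # intercept and slope of the regression line
```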

This graph is fine, but it doesn’t distinguish between species. It might be helpful for the reader to see how each species is represented. We can do this by mapping the color aesthetic to species.

ggplot(data=penguins, #specify the data
       mapping=aes(x=body_mass_g, y=flipper_length_mm, color=species))+ #map the aesthetics, including color
  geom_point()+ #add the plot type
  geom_smooth(method=lm, se=T) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

The plot above looks good, but I think it’s a bit messy having a regression line for each species. When we map the color aesthetic inside ggplot(), it is passed down to all layers, so geom_smooth() fits a separate line for each species. To fix this, we can move the color aesthetic to just the geom_point layer.

ggplot(data=penguins, #specify the data
       mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
  geom_point(mapping = aes(color = species))+ #add the plot type, map color to species
  geom_smooth(method=lm, se=T) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
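
One way to confirm how the color mapping is inherited is to count the groups that geom_smooth() draws in each version of the plot. This is a sketch using ggplot2's layer_data() helper. The object names global_color and local_color are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# color mapped in ggplot(): inherited by both layers
global_color <- ggplot(penguins,
                       aes(x = body_mass_g, y = flipper_length_mm, color = species)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# color mapped in geom_point() only: geom_smooth() does not see it
local_color <- ggplot(penguins,
                      aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(color = species), na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# number of separate lines drawn by the second layer (geom_smooth)
length(unique(layer_data(global_color, 2)$group))
length(unique(layer_data(local_color, 2)$group))
```

The first count should show one group per species, and the second a single group for the whole data set.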

Another helpful aesthetic is shape, which can be used to distinguish points by shape instead of, or in addition to, color. A common approach is to map both color and shape to the same variable.

ggplot(data=penguins, #specify the data
       mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
  geom_point(mapping = aes(color = species, shape=species))+ #add the plot type, map color and shape to species
  geom_smooth(method=lm, se=T) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

But don’t we want a title? We do this using the labs() function.

ggplot(data=penguins, #specify the data
       mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
  geom_point(mapping = aes(color = species, shape=species))+ #add the plot type, map color and shape to species
  geom_smooth(method=lm, se=T)+ #add a regression line
  labs(title = "Body mass (g) vs. Flipper length (mm)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
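
The labs() function can also set the axis labels and legend titles. The sketch below extends the plot above. The label text and the object name labeled_plot are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

labeled_plot <- ggplot(data = penguins,
                       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(mapping = aes(color = species, shape = species), na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, na.rm = TRUE) +
  labs(title = "Body mass (g) vs. Flipper length (mm)",
       x = "Body mass (g)",       # x-axis label
       y = "Flipper length (mm)", # y-axis label
       color = "Species",         # legend title for color
       shape = "Species")         # same title, so color and shape share one legend

labeled_plot
```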

Practice

  1. Is there a relationship between bill length and bill depth?
names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
ggplot(data=penguins,
       mapping=aes(x=bill_length_mm, y=bill_depth_mm))+
  geom_point(mapping=aes(color=species))+
  geom_smooth(method=lm, se=T)+
  labs(title="Bill length (mm) vs. Bill depth (mm)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

2. Bar Plots: geom_bar()

The simplest type of bar plot counts the number of observations in each category of a categorical variable. In this case, we want to know how many observations there are for each species. Notice that we do not specify a y variable because geom_bar() counts the rows for us by default.

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
ggplot(data=penguins, #specify the data
       mapping=aes(x=species))+ #map the aesthetics
  geom_bar() #good for counts
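
To see what geom_bar() is counting, the same bars can be built by hand: count the rows first, then plot the counts with geom_col(), which uses the y values as given. This is a sketch, and the object name species_counts is made up for illustration.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# count the rows for each species ourselves
species_counts <- penguins %>%
  count(species) # the counts go in a column named n

species_counts

# geom_col() plots the values in y as given instead of counting rows
ggplot(data = species_counts,
       mapping = aes(x = species, y = n)) +
  geom_col()
```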

What if we want to use the color aesthetic like we did for geom_point above? Let’s try…

ggplot(data=penguins, #specify the data
       mapping=aes(x=species))+ #map the aesthetics
  geom_bar(mapping=aes(color=species)) #map color to species

This doesn’t do what we want because, for bars, the color aesthetic only controls the outline. Instead, we need to use fill to color the inside of the bars.

ggplot(data=penguins, #specify the data
       mapping=aes(x=species))+ #map the aesthetics
  geom_bar(mapping=aes(fill=species)) #fill colors the inside of the bars

Practice

  1. Make a bar plot showing the number of penguins on each island.
names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
ggplot(data=penguins, 
       mapping=aes(x=island))+
  geom_bar(mapping=aes(fill=island))

  2. Make a bar plot showing the number of penguins of each species on each island. (Hint: use fill to distinguish species)
ggplot(data=penguins,
       mapping=aes(x=island))+
  geom_bar(mapping=aes(fill=species), position="dodge")
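
The position argument controls how bars that share an x value are arranged. As a sketch for comparison, the code below draws the same counts three ways. The object name island_plot is made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# shared data and aesthetics for all three versions
island_plot <- ggplot(data = penguins,
                      mapping = aes(x = island, fill = species))

island_plot + geom_bar()                   # default position="stack": bars stacked on top of each other
island_plot + geom_bar(position = "dodge") # bars placed side by side
island_plot + geom_bar(position = "fill")  # stacked and scaled so each island's bar has height 1 (proportions)
```

Dodged bars make it easier to compare counts across species, while filled bars compare proportions within each island.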

Wrap-up

Please review the learning goals and be sure to use the code here as a reference when completing the homework.
