At the end of this exercise, you will be able to:
1. Understand and apply the syntax of building plots using
ggplot2.
2. Build a boxplot using ggplot2.
3. Build a scatterplot using ggplot2.
4. Build a barplot using ggplot2.
## Resources
- R for Data Science 2e
- ggplot2 cheatsheet
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
##
## Attaching package: 'palmerpenguins'
##
## The following objects are masked from 'package:datasets':
##
## penguins, penguins_raw
The ability to quickly produce and customize graphs is a strength of
R. Data visualizations are produced by the package ggplot2,
which is a core part of the tidyverse. The syntax for using ggplot is
specific and common to all types of plots. It follows what Hadley Wickham
calls a layered Grammar of Graphics; the “gg” in ggplot stands for
grammar of graphics.
What makes a good chart? In my opinion, a good chart is elegant in its simplicity. It provides a clean, clear visual of the data without overwhelming the reader. This can be hard to do and takes some careful thinking. Always keep in mind that the reader will almost never know the data as well as you do, so you need to be mindful about how you present the facts.
We first need to define some of the data types we will use to build plots.
- discrete: quantitative data that only contains integers
- continuous: quantitative data that can take any numerical value
- categorical: qualitative data that can take on a limited number of values

The syntax used by ggplot takes some practice to get used to, especially for customizing plots, but the basic elements are the same. It is helpful to think of plots as being built up in layers.
In short, plot= data + geom_ + aesthetics.
We start by calling the ggplot function, identifying the data, and
specifying the axes. We then add the geom type to describe
how we want our data represented. Each geom_ works with
specific types of data and R is capable of building plots of single
variables, multiple variables, and even maps. Lastly, we add
aesthetics.
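One way to picture the layers is to store each stage in its own object. This is only a sketch using the penguins data loaded above; the object names `base`, `with_geom`, and `with_aes` are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Stage 1: data and axes only (this draws an empty canvas)
base <- ggplot(data = penguins,
               mapping = aes(x = species, y = body_mass_g))

# Stage 2: add a geom so the data are actually drawn
with_geom <- base + geom_boxplot(na.rm = TRUE)

# Stage 3: add another aesthetic (fill) inside the geom
with_aes <- base + geom_boxplot(mapping = aes(fill = species), na.rm = TRUE)

with_aes # printing the object draws the plot
```

Each `+` adds to what came before, so `base` can be reused with a different geom without retyping the data and axes.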
To make things easy, let’s start with some built-in data.
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
glimpse(penguins) #notice that we have a mix of categorical and continuous data
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
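If you want to confirm which variables are categorical and which are numeric without reading the `glimpse()` output by eye, a short check like the one below may help. It is a sketch; `col_types` is just an illustrative name. Note that a variable such as body_mass_g is stored as an integer even though we usually treat mass as continuous.

```r
library(palmerpenguins)

# Class of every column: factors are categorical,
# integers and doubles ("numeric") are quantitative
col_types <- sapply(penguins, class)
col_types

# Names of the categorical (factor) columns
names(penguins)[sapply(penguins, is.factor)]
```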
Let’s start by asking the question: How does body mass vary among penguin species?
To make a plot, we need to first specify the data and map the aesthetics. The aesthetics include how each variable in our data set will be used. In the example below, I am using the aes() function to identify the x and y variables in the plot.
ggplot(data=penguins, #specify the data
mapping=aes(x=species, y=body_mass_g)) #map the aesthetics
Notice that we have a nice background, labeled axes, and even a value
range for our variable on the y-axis, but no plot. This is because we
need to tell ggplot how we want our data represented. This is called the
geometry, or geom, and it is added with a geom_ function. There are many
types of geom; see the ggplot cheatsheet.
Here we want a boxplot, specified by geom_boxplot(). We
will explore boxplots in more detail later, but for now we just need an
example.
ggplot(data=penguins, #specify the data
mapping=aes(x=species, y=body_mass_g))+ #map the aesthetics
geom_boxplot() #add the plot type
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
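To get a feel for what the boxes represent, it can help to compute the quartiles yourself. The sketch below assumes dplyr is available (it is loaded with the tidyverse); `box_stats` is an illustrative name, and the values should roughly match the lower edge, middle line, and upper edge of each box.

```r
library(dplyr)
library(palmerpenguins)

box_stats <- penguins %>%
  filter(!is.na(body_mass_g)) %>%        # drop the rows the plot warned about
  group_by(species) %>%
  summarize(
    q1 = quantile(body_mass_g, 0.25),    # lower edge of the box
    median = median(body_mass_g),        # line inside the box
    q3 = quantile(body_mass_g, 0.75)     # upper edge of the box
  )
box_stats
```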
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
ggplot(data=penguins,
mapping=aes(x=species, y=flipper_length_mm))+
geom_boxplot()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Now that we have a general idea of the syntax, let’s explore two common plots: 1) scatter plots and 2) bar plots.
Scatter plots are good at revealing relationships that are not readily visible in the raw data.
Let’s ask the question: Is there a relationship between body mass and flipper length?
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
ggplot(data=penguins, #specify the data
mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
geom_point() #add the plot type
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Notice the warning! R is telling us that there are some missing
values in our data. This is common in real world data sets. R
automatically omits these missing values when plotting and warns us that
it did so. We can also deal with the NA’s explicitly by passing the
na.rm=T argument to the geom, which removes them without the warning.
ggplot(data=penguins, #specify the data
mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
geom_point(na.rm=T) #add the plot type, disregard NA's
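Another option is to check for the missing values and remove them before plotting. This is a sketch rather than the only way to do it; `penguins_complete` is an illustrative name.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# How many values are missing in each plotted variable?
sum(is.na(penguins$body_mass_g))
sum(is.na(penguins$flipper_length_mm))

# Keep only rows where both variables are present
penguins_complete <- penguins %>%
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm))

# No warning this time, because nothing needs to be dropped
ggplot(data = penguins_complete,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()
```

Filtering first makes the decision to drop rows visible in your code instead of leaving it to the plotting function.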
To add a regression line (line of best fit), we add another layer.
ggplot(data=penguins, #specify the data
mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
geom_point()+ #add the plot type
geom_smooth(method=lm, se=T) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
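The line drawn by `geom_smooth(method=lm)` is a linear model of y on x. If you want the numbers behind the line, you could fit the same kind of model directly with base R's `lm()`. This is a sketch; `fit` is an illustrative name.

```r
library(palmerpenguins)

# Same formula that geom_smooth() reports: y ~ x
# lm() drops rows with missing values by default
fit <- lm(flipper_length_mm ~ body_mass_g, data = penguins)

# Intercept and slope of the regression line
coef(fit)
```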
This graph is fine, but it doesn’t distinguish between species. It might be helpful for the reader to see how each species is represented. We can do this by mapping the color aesthetic to species.
ggplot(data=penguins, #specify the data
mapping=aes(x=body_mass_g, y=flipper_length_mm, color=species))+ #map the aesthetics, including color
geom_point()+ #add the plot type
geom_smooth(method=lm, se=T) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The plot above looks good, but I think it’s a bit messy having a regression line for each species. When we add the color aesthetic inside ggplot(), it is passed down to all layers, so geom_smooth() also splits the data by species. To fix this, we can move the color aesthetic to just the geom_point layer.
ggplot(data=penguins, #specify the data
mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
geom_point(mapping = aes(color = species))+ #add the plot type, map color to species
geom_smooth(method=lm, se=T) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
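A different way to get the same result, sketched below, is to leave color in `ggplot()` and tell the smoothing layer not to inherit it using the `inherit.aes` argument. Because nothing is inherited, that layer then needs its own x and y mapping. The name `p` is illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(data = penguins,
            mapping = aes(x = body_mass_g, y = flipper_length_mm,
                          color = species)) +
  geom_point(na.rm = TRUE) +             # inherits x, y, and color
  geom_smooth(mapping = aes(x = body_mass_g, y = flipper_length_mm),
              inherit.aes = FALSE,       # ignore the global color mapping
              method = lm, se = TRUE, na.rm = TRUE)
p
```

Moving color into `geom_point()` is usually less typing; `inherit.aes` is mostly useful when a single layer needs to opt out of several global aesthetics at once.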
Another helpful aesthetic is shape which can be used to
distinguish points by shape instead of or in addition to color. A common
approach is to map both color and shape to the same variable.
ggplot(data=penguins, #specify the data
mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
geom_point(mapping = aes(color = species, shape=species))+ #add the plot type, map color and shape to species
geom_smooth(method=lm, se=T) #add a regression line
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
But don’t we want a title? We do this using the labs()
function.
ggplot(data=penguins, #specify the data
mapping=aes(x=body_mass_g, y=flipper_length_mm))+ #map the aesthetics
geom_point(mapping = aes(color = species, shape=species))+ #add the plot type, map color to species
geom_smooth(method=lm, se=T)+ #add a regression line
labs(title = "Body mass (g) vs. Flipper length (mm)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
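`labs()` can also rename the axes and the legend. The label text below is my own choice for illustration, not something the data set requires. Giving the color and shape legends the same name keeps them combined into one legend.

```r
library(ggplot2)
library(palmerpenguins)

labeled <- ggplot(data = penguins,
                  mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(mapping = aes(color = species, shape = species), na.rm = TRUE) +
  geom_smooth(method = lm, se = TRUE, na.rm = TRUE) +
  labs(title = "Body mass (g) vs. Flipper length (mm)",
       x = "Body mass (g)",          # x-axis label
       y = "Flipper length (mm)",    # y-axis label
       color = "Species",            # legend title for color
       shape = "Species")            # same title, so the legends stay merged
labeled
```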
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
ggplot(data=penguins,
mapping=aes(x=bill_length_mm, y=bill_depth_mm))+
geom_point(mapping=aes(color=species))+
geom_smooth(method=lm, se=T)+
labs(title="Bill length (mm) vs. Bill depth (mm)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## geom_bar()
The simplest type of bar plot counts the number of observations in each
level of a categorical variable. In this case, we want to know how many
observations are present for each species. Notice that we do not specify
a y-axis because geom_bar() plots the count by default.
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
ggplot(data=penguins, #specify the data
mapping=aes(x=species))+ #map the aesthetics
geom_bar() #good for counts
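The heights of the bars are the same numbers you would get by counting the rows for each species yourself. A quick sketch with dplyr's `count()` (the name `species_counts` is illustrative):

```r
library(dplyr)
library(palmerpenguins)

# One row per species, with n = number of observations
species_counts <- penguins %>%
  count(species)
species_counts
```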
What if we want to use the color aesthetic like we did for
geom_point above? Let’s try…
ggplot(data=penguins, #specify the data
mapping=aes(x=species))+ #map the aesthetics
geom_bar(mapping=aes(color=species)) #good for counts
This doesn’t give us what we want because, for bars, the color aesthetic
is applied to the outline of each bar. Instead, we need to use fill to
color the inside of the bars.
ggplot(data=penguins, #specify the data
mapping=aes(x=species))+ #map the aesthetics
geom_bar(mapping=aes(fill=species)) #good for counts
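The two can be combined. In the sketch below, fill is mapped to the data inside `aes()`, while color is set to a single fixed value outside `aes()` so every bar gets the same outline. The choice of "black" and the name `outlined` are just for illustration.

```r
library(ggplot2)
library(palmerpenguins)

outlined <- ggplot(data = penguins, mapping = aes(x = species)) +
  geom_bar(mapping = aes(fill = species), # fill varies with the data
           color = "black")               # outline is fixed for every bar
outlined
```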
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
ggplot(data=penguins,
mapping=aes(x=island))+
geom_bar(mapping=aes(fill=island))
fill to distinguish species)
ggplot(data=penguins,
mapping=aes(x=island))+
geom_bar(mapping=aes(fill=species), position="dodge")
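The `position` argument controls how bars that share an x value are arranged. As a sketch, the code below first tabulates the counts behind the plot and then tries `position="fill"`, which stacks the bars and rescales each island to proportions. The names `island_species` and `proportions` and the y-axis label are illustrative.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Counts for each island and species combination that actually occurs
island_species <- penguins %>%
  count(island, species)
island_species

# Same data, shown as the share of each species within an island
proportions <- ggplot(data = penguins, mapping = aes(x = island)) +
  geom_bar(mapping = aes(fill = species), position = "fill") +
  labs(y = "Proportion")
proportions
```

Leaving `position` out gives stacked counts, `"dodge"` places the bars side by side, and `"fill"` is handy when the islands have very different totals.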
Please review the learning goals and be sure to use the code here as a reference when completing the homework.