ggplot part 1
At the end of this exercise, you will be able to:
1. Understand and apply the syntax of building plots using `ggplot2`.
2. Build a boxplot using `ggplot2`.
3. Build a scatterplot using `ggplot2`.
4. Build a barplot using `ggplot2` and show the difference between `stat="count"` and `stat="identity"`.
At this point you should feel comfortable working in RStudio and using `dplyr` and `tidyr`. You also know how to produce statistical summaries of data and deal with NA’s. It is OK if you need to go back through the labs and find bits of code that work for you; that’s what most people do!
## Resources
- ggplot2 cheatsheet
library(tidyverse)
library(naniar)
library(janitor)
The ability to quickly produce and edit graphs and charts is a strength of R. These data visualizations are produced by the package `ggplot2`, which is a core part of the tidyverse. The syntax for using ggplot is specific and common to all of the plots. This is what Hadley Wickham calls a layered Grammar of Graphics, building on Leland Wilkinson's book The Grammar of Graphics. The “gg” in ggplot stands for grammar of graphics.
What makes a good chart? In my opinion, a good chart is elegant in its simplicity. It provides a clean, clear visual of the data without overwhelming the reader. This can be hard to do and takes some careful thinking. Always keep in mind that the reader will almost never know the data as well as you do, so you need to be mindful about presenting the facts.
We first need to define some of the data types we will use to build plots.
- discrete: quantitative data that only contains integers
- continuous: quantitative data that can take any numerical value
- categorical: qualitative data that can take on a limited number of values

The syntax used by ggplot takes some practice to get used to, especially for customizing plots, but the basic elements are the same. It is helpful to think of plots as being built up in layers.
In short, plot= data + geom_ + aesthetics.
We start by calling the `ggplot()` function, identifying the data, and specifying the axes. We then add the `geom_` type to describe how we want our data represented. Each `geom_` works with specific types of data, and R is capable of building plots of single variables, multiple variables, and even maps. Lastly, we add aesthetics.
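The layering idea can be sketched by storing each step in its own object. This is a minimal illustration, assuming R's built-in `iris` data; the object names `base_plot` and `box_plot` are made up for the example.

```r
library(ggplot2)

# Step 1: data + aesthetic mapping only; this draws axes but no data
base_plot <- ggplot(data = iris, mapping = aes(x = Species, y = Petal.Length))

# Step 2: add a geom layer to choose how the data are drawn
box_plot <- base_plot + geom_boxplot()

# The base plot has no layers yet; adding a geom adds one
length(base_plot$layers)
length(box_plot$layers)
```

Typing `base_plot` or `box_plot` at the console draws the corresponding plot, which makes it easy to see what each `+` contributes.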
To make things easy, let’s start with some built in data.
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
To make a plot, we need to first specify the data and map the aesthetics. The aesthetics include how each variable in our data set will be used. In the example below, I am using the aes() function to identify the x and y variables in the plot.
ggplot(data=iris, #specify the data
mapping=aes(x=Species, y=Petal.Length)) #map the aesthetics
Notice that we have a nice background, labeled axes, and even a value range for our variable on the y-axis, but no plot. This is because we need to tell ggplot how we want our data represented. This is called the geometry, or geom. There are many types of geom; see the ggplot2 cheatsheet. Here we specify that we want a boxplot, indicated by `geom_boxplot()`.
ggplot(data=iris, #specify the data
mapping=aes(x=Species, y=Petal.Length))+ #map the aesthetics
geom_boxplot() #add the plot type
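To see what the boxplot summarizes, it can help to compute the same statistics by hand. The sketch below is one way to do that with `dplyr`, assuming the box spans the first to third quartile and the middle line marks the median; the column names `q1`, `median`, and `q3` are made up for the example.

```r
library(dplyr)

# Quartiles of Petal.Length for each species
box_stats <- iris %>%
  group_by(Species) %>%
  summarize(
    q1     = quantile(Petal.Length, 0.25, names = FALSE),
    median = median(Petal.Length),
    q3     = quantile(Petal.Length, 0.75, names = FALSE)
  )
box_stats
```

Comparing this table with the plot is a quick check that the boxes sit where you expect them to.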
Changing the `geom_` changes the plot type. Here we use `geom_point()` for a scatterplot.
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
ggplot(data=iris,
mapping=aes(x=Sepal.Width, y=Sepal.Length))+
geom_point()
Now that we have a general idea of the syntax, let’s start by working with two common plots: 1) scatter plots and 2) bar plots.
Database of vertebrate home range sizes.
Reference: Tamburello N, Cote IM, Dulvy NK (2015) Energy and the scaling
of animal space use. The American Naturalist 186(2):196-211. http://dx.doi.org/10.1086/682070.
Data: http://datadryad.org/resource/doi:10.5061/dryad.q5j65/1
homerange <- read_csv("data/Tamburelloetal_HomeRangeDatabase.csv")
## Rows: 569 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): taxon, common.name, class, order, family, genus, species, primarym...
## dbl (8): mean.mass.g, log10.mass, mean.hra.m2, log10.hra, dimension, preyma...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Explore the `homerange` data. Does it have any NA’s? Is it tidy? Do a quick exploratory analysis of your choice below.
glimpse(homerange)
## Rows: 569
## Columns: 24
## $ taxon <chr> "lake fishes", "river fishes", "river fishe…
## $ common.name <chr> "american eel", "blacktail redhorse", "cent…
## $ class <chr> "actinopterygii", "actinopterygii", "actino…
## $ order <chr> "anguilliformes", "cypriniformes", "cyprini…
## $ family <chr> "anguillidae", "catostomidae", "cyprinidae"…
## $ genus <chr> "anguilla", "moxostoma", "campostoma", "cli…
## $ species <chr> "rostrata", "poecilura", "anomalum", "fundu…
## $ primarymethod <chr> "telemetry", "mark-recapture", "mark-recapt…
## $ N <chr> "16", NA, "20", "26", "17", "5", "2", "2", …
## $ mean.mass.g <dbl> 887.00, 562.00, 34.00, 4.00, 4.00, 3525.00,…
## $ log10.mass <dbl> 2.9479236, 2.7497363, 1.5314789, 0.6020600,…
## $ alternative.mass.reference <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mean.hra.m2 <dbl> 282750.00, 282.10, 116.11, 125.50, 87.10, 3…
## $ log10.hra <dbl> 5.4514026, 2.4504031, 2.0648696, 2.0986437,…
## $ hra.reference <chr> "Minns, C. K. 1995. Allometry of home range…
## $ realm <chr> "aquatic", "aquatic", "aquatic", "aquatic",…
## $ thermoregulation <chr> "ectotherm", "ectotherm", "ectotherm", "ect…
## $ locomotion <chr> "swimming", "swimming", "swimming", "swimmi…
## $ trophic.guild <chr> "carnivore", "carnivore", "carnivore", "car…
## $ dimension <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3…
## $ preymass <dbl> NA, NA, NA, NA, NA, NA, 1.39, NA, NA, NA, N…
## $ log10.preymass <dbl> NA, NA, NA, NA, NA, NA, 0.1430148, NA, NA, …
## $ PPMR <dbl> NA, NA, NA, NA, NA, NA, 530, NA, NA, NA, NA…
## $ prey.size.reference <chr> NA, NA, NA, NA, NA, NA, "Brose U, et al. 20…
naniar::miss_var_summary(homerange)
## # A tibble: 24 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 alternative.mass.reference 561 98.6
## 2 preymass 502 88.2
## 3 log10.preymass 502 88.2
## 4 PPMR 502 88.2
## 5 prey.size.reference 502 88.2
## 6 N 375 65.9
## 7 primarymethod 1 0.176
## 8 taxon 0 0
## 9 common.name 0 0
## 10 class 0 0
## # ℹ 14 more rows
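If `naniar` is not available, a similar per-column count of NA's can be sketched with `dplyr` alone. The tibble `toy` below is a hypothetical stand-in with made-up values, not the real `homerange` data.

```r
library(dplyr)

# Hypothetical toy data standing in for a few homerange columns
toy <- tibble(
  common.name = c("american eel", "blacktail redhorse", "central stoneroller"),
  N           = c("16", NA, "20"),
  preymass    = c(NA, NA, 1.39)
)

# Count the NA's in each column
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# Total number of NA's in the whole table
total_na <- sum(is.na(toy))
total_na
```

The same two calls should work on `homerange` itself once it has been read in.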
Scatter plots are good at revealing relationships that are not readily visible in the raw data. We start with the points alone; a regression (“line of best fit”) is added as a separate layer further down, and we will not calculate any r² values.
In the case below, we are exploring whether or not there is a relationship between animal mass and home range. We use the log10-transformed values because mass and home range differ by orders of magnitude among the species in the data.
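A small sketch of why the transformation helps, using the first four `mean.mass.g` values shown in the `glimpse()` output above. The vector names are made up for the example.

```r
# Four mean.mass.g values shown in the glimpse() output above
mass_g <- c(887, 562, 34, 4)

# The raw values span more than two orders of magnitude
range(mass_g)

# On the log10 scale the same values are spread far more evenly
log10_mass <- log10(mass_g)
round(log10_mass, 4)
```

These results should agree with the corresponding `log10.mass` values in the data.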
names(homerange)
## [1] "taxon" "common.name"
## [3] "class" "order"
## [5] "family" "genus"
## [7] "species" "primarymethod"
## [9] "N" "mean.mass.g"
## [11] "log10.mass" "alternative.mass.reference"
## [13] "mean.hra.m2" "log10.hra"
## [15] "hra.reference" "realm"
## [17] "thermoregulation" "locomotion"
## [19] "trophic.guild" "dimension"
## [21] "preymass" "log10.preymass"
## [23] "PPMR" "prey.size.reference"
ggplot(data=homerange, #specify the data
mapping=aes(x=log10.mass, y=log10.hra))+ #map the aesthetics
geom_point() #add the plot type
In big data sets with lots of overlapping values, overplotting can be an issue. `geom_jitter()` is similar to `geom_point()`, but it helps with overplotting by adding some random noise to the data and separating some of the individual points.
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra))+
geom_jitter()
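The amount of noise can be controlled. The sketch below assumes the built-in `iris` data and uses the `width` and `height` arguments of `geom_jitter()`; the value 0.05 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Sepal measurements are recorded to 0.1 cm, so many points overlap exactly
jitter_plot <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_jitter(width = 0.05, height = 0.05)

# layer_data() returns the positions ggplot2 will actually draw
jitter_data <- layer_data(jitter_plot)
max(abs(jitter_data$x - iris$Sepal.Width))
```

Keeping the jitter small relative to the measurement precision separates overlapping points without moving them far from their true values.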
To add a regression line (line of best fit), we just add another layer.
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra))+
  geom_point()+
  geom_smooth(method="lm", se=TRUE) #add a regression line with its confidence band
## `geom_smooth()` using formula = 'y ~ x'
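The message `y ~ x` is the formula of the linear model being fitted. As a rough sketch of what that means, the same kind of line can be fitted directly with `lm()`. This assumes the built-in `iris` data rather than `homerange`, and the names `fit` and `smooth_plot` are made up for the example.

```r
library(ggplot2)

# Fit a straight line of Sepal.Length on Sepal.Width
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
coef(fit)  # intercept and slope of the fitted line

# Draw the points with the fitted line; se = FALSE hides the confidence band
smooth_plot <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
smooth_plot
```

Fitting the model separately is useful when you want the slope and intercept as numbers rather than just as a line on the plot.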
names(homerange)
## [1] "taxon" "common.name"
## [3] "class" "order"
## [5] "family" "genus"
## [7] "species" "primarymethod"
## [9] "N" "mean.mass.g"
## [11] "log10.mass" "alternative.mass.reference"
## [13] "mean.hra.m2" "log10.hra"
## [15] "hra.reference" "realm"
## [17] "thermoregulation" "locomotion"
## [19] "trophic.guild" "dimension"
## [21] "preymass" "log10.preymass"
## [23] "PPMR" "prey.size.reference"
ggplot(homerange, mapping=aes(x=log10.hra, y=log10.preymass))+
  geom_point(na.rm=TRUE)+ #silently drop rows with NA's
  geom_smooth(method="lm", se=FALSE, na.rm=FALSE) #na.rm=FALSE warns about dropped rows
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 502 rows containing non-finite outside the scale range
## (`stat_smooth()`).
geom_bar()
The simplest type of bar plot counts the number of observations in each category of a categorical variable. In this case, we want to know how many observations fall into each category of the variable `trophic.guild`. Notice that we do not specify a y-axis because `geom_bar()` plots the count by default.
names(homerange)
## [1] "taxon" "common.name"
## [3] "class" "order"
## [5] "family" "genus"
## [7] "species" "primarymethod"
## [9] "N" "mean.mass.g"
## [11] "log10.mass" "alternative.mass.reference"
## [13] "mean.hra.m2" "log10.hra"
## [15] "hra.reference" "realm"
## [17] "thermoregulation" "locomotion"
## [19] "trophic.guild" "dimension"
## [21] "preymass" "log10.preymass"
## [23] "PPMR" "prey.size.reference"
homerange %>%
count(trophic.guild)
## # A tibble: 2 × 2
## trophic.guild n
## <chr> <int>
## 1 carnivore 342
## 2 herbivore 227
Also notice that we can use pipes! The argument name `mapping=` is implied when `aes()` is passed to `ggplot()`, so it is often left out.
homerange %>%
ggplot(aes(x=trophic.guild)) +
geom_bar() #good for counts
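The counts that `geom_bar()` draws should match what `count()` returns. A minimal sketch of that check, using a hypothetical toy tibble rather than the real data:

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: one row per observation
toy <- tibble(
  trophic.guild = c("carnivore", "herbivore", "carnivore", "carnivore", "herbivore")
)

# Count observations per category by hand
guild_counts <- toy %>%
  count(trophic.guild)

# geom_bar() computes the counts itself
bar_plot <- toy %>%
  ggplot(aes(x = trophic.guild)) +
  geom_bar()

# layer_data() exposes the values ggplot2 computed for the layer
bar_counts <- layer_data(bar_plot)$count
bar_counts
```

Both approaches order the categories alphabetically here, so the two sets of counts line up position by position.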
geom_col()
Unlike `geom_bar()`, `geom_col()` allows us to specify both an x-axis and a y-axis, so the bar lengths come from a variable in the data rather than from counts.
homerange %>%
filter(family=="salmonidae") %>%
select(common.name, log10.mass) %>%
ggplot(aes(y=common.name, x=log10.mass))+ #notice the switch in x and y
geom_col()
`geom_bar()` with `stat="identity"`
`stat="identity"` allows us to map a variable to the y-axis so that we aren’t restricted to counts.
homerange %>%
filter(family=="salmonidae") %>%
ggplot(aes(x=common.name, y=log10.mass))+
geom_bar(stat="identity")
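As a sketch of how the two approaches relate, the example below draws the same hypothetical toy data with `geom_col()` and with `geom_bar(stat="identity")` and compares the bar heights. The species names and values are made up.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: one row per bar, with the bar height already known
toy <- tibble(
  common.name = c("fish a", "fish b", "fish c"),
  log10.mass  = c(1.5, 2.7, 3.1)
)

col_plot <- ggplot(toy, aes(x = common.name, y = log10.mass)) +
  geom_col()

identity_plot <- ggplot(toy, aes(x = common.name, y = log10.mass)) +
  geom_bar(stat = "identity")

# Both layers use the y values as given, so the bar heights match
col_heights      <- layer_data(col_plot)$y
identity_heights <- layer_data(identity_plot)$y
```

In practice `geom_col()` is the shorter way to write this, while plain `geom_bar()` with its default `stat="count"` is the one to use when the bars should show counts.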
Filter the `homerange` data to include mammals only.
names(homerange)
## [1] "taxon" "common.name"
## [3] "class" "order"
## [5] "family" "genus"
## [7] "species" "primarymethod"
## [9] "N" "mean.mass.g"
## [11] "log10.mass" "alternative.mass.reference"
## [13] "mean.hra.m2" "log10.hra"
## [15] "hra.reference" "realm"
## [17] "thermoregulation" "locomotion"
## [19] "trophic.guild" "dimension"
## [21] "preymass" "log10.preymass"
## [23] "PPMR" "prey.size.reference"
homerange %>%
filter(class=="mammalia")
## # A tibble: 238 × 24
## taxon common.name class order family genus species primarymethod N
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 mammals giant golden mo… mamm… afro… chrys… chry… trevel… telemetry* <NA>
## 2 mammals Grant's golden … mamm… afro… chrys… erem… granti telemetry* <NA>
## 3 mammals pronghorn mamm… arti… antil… anti… americ… telemetry* <NA>
## 4 mammals impala mamm… arti… bovid… aepy… melamp… telemetry* <NA>
## 5 mammals hartebeest mamm… arti… bovid… alce… busela… telemetry* <NA>
## 6 mammals barbary sheep mamm… arti… bovid… ammo… lervia telemetry* <NA>
## 7 mammals American bison mamm… arti… bovid… bison bison telemetry* <NA>
## 8 mammals European bison mamm… arti… bovid… bison bonasus telemetry* <NA>
## 9 mammals goat mamm… arti… bovid… capra hircus telemetry* <NA>
## 10 mammals Spanish ibex mamm… arti… bovid… capra pyrena… telemetry* <NA>
## # ℹ 228 more rows
## # ℹ 15 more variables: mean.mass.g <dbl>, log10.mass <dbl>,
## # alternative.mass.reference <chr>, mean.hra.m2 <dbl>, log10.hra <dbl>,
## # hra.reference <chr>, realm <chr>, thermoregulation <chr>, locomotion <chr>,
## # trophic.guild <chr>, dimension <dbl>, preymass <dbl>, log10.preymass <dbl>,
## # PPMR <dbl>, prey.size.reference <chr>
homerange %>%
filter(class=="mammalia") %>%
count(trophic.guild)
## # A tibble: 2 × 2
## trophic.guild n
## <chr> <int>
## 1 carnivore 80
## 2 herbivore 158
homerange %>%
filter(class=="mammalia") %>%
count(trophic.guild) %>%
ggplot(aes(x=trophic.guild, y=n))+
geom_col()
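homerange %>%
  filter(class=="mammalia") %>%
  slice_min(log10.mass, n=10) %>% #the 10 smallest mammals; replaces the superseded top_n(-10, log10.mass)
  ggplot(aes(x=common.name, y=log10.mass))+
  geom_col()+
  coord_flip() #flip the axes so the names are readable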
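The last pipeline keeps only the mammals with the smallest `log10.mass` before plotting. A minimal sketch of that selection step on a hypothetical toy tibble, using `slice_min()`, which the `dplyr` documentation recommends in place of `top_n()`. The animal names and values are made up.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for the filtered mammal rows
toy <- tibble(
  common.name = c("shrew", "mouse", "rabbit", "fox", "deer"),
  log10.mass  = c(0.6, 1.3, 3.2, 3.7, 4.9)
)

# Keep the 3 rows with the smallest log10.mass
smallest <- toy %>%
  slice_min(log10.mass, n = 3)

# coord_flip() swaps the axes so the names read horizontally
smallest_plot <- smallest %>%
  ggplot(aes(x = common.name, y = log10.mass)) +
  geom_col() +
  coord_flip()
smallest_plot
```

Checking the intermediate table `smallest` before plotting is a quick way to confirm that the right rows were kept.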
Please review the learning goals and be sure to use the code here as
a reference when completing the homework.