At the end of this exercise, you will be able to:
1. Define NA and describe how NA’s are treated in R.
2. Produce summaries of the number of NA’s in a data set.
3. Replace values with NA in a data set.
The naniar package includes some useful tools to manage NA’s.
#install.packages("naniar")
library("tidyverse")
library("naniar")
library("janitor")
When working with “wild” data, dealing with NA’s is a fundamental part of the data cleaning process. Data scientists spend much of their time cleaning and transforming data, including managing NA’s. No single approach always works, so be careful about applying a replacement strategy across an entire data set. As data sets become larger, NA’s become trickier to deal with.
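Before turning to the data, here is a minimal base R sketch of how NA behaves. The vector is made up for illustration; it is not part of the mammal data.

```r
# A made-up vector with one missing value
x <- c(4, NA, 10)

is.na(x)               # TRUE where a value is missing
sum(is.na(x))          # number of missing values
mean(x)                # NA: one unknown value makes the mean unknown
mean(x, na.rm = TRUE)  # remove NA's before computing the mean

# Comparing with NA returns NA, so test for missingness with is.na(), not ==
NA == NA
```

Most summary functions in R return NA when any input is NA unless you ask them to drop missing values with na.rm = TRUE.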
For the following, we will use life history data for mammals. The data are
from:
S. K. Morgan Ernest. 2003. Life history characteristics of
placental non-volant mammals. Ecology 84:3402.
life_history <- read_csv("data/mammal_lifehistories_v3.csv") %>% clean_names()
Sometimes using one or more of the summary functions can give us clues to how the authors have represented missing data. This doesn’t always work, but it is a good place to start.
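As a rough sketch of this idea, the chunk below runs summary() on a small invented data frame. The object toy and its values are assumptions made for illustration; with the real data you would run the same call on life_history.

```r
library("tidyverse")

# Toy data (invented values) that mimics a -999 placeholder
toy <- tibble(
  species   = c("a", "b", "c", "d"),
  mass      = c(1200, -999, 85, 40),
  gestation = c(5.2, 1.1, -999, -999)
)

# A minimum of -999 for mass or gestation is biologically impossible,
# which hints that -999 is a placeholder for missing data
summary(toy)

# The same check restricted to the numeric columns
toy_mins <- toy %>%
  summarize(across(where(is.numeric), min))
toy_mins
```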
Counting the NA’s in each variable gives a quick summary. Notice that, at least for now, most variables appear to have no NA’s because missing values are still represented by -999.
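One way to count NA’s per variable is with summarize() and across(). This is a sketch on invented data, and it may not be the exact chunk used in the original lab.

```r
library("tidyverse")

# Toy data (invented): one true NA, plus -999 placeholders
toy <- tibble(
  mass      = c(1200, -999, 85, NA),
  gestation = c(5.2, 1.1, -999, -999)
)

# Count NA's in every column; the -999 values are not counted
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts
```

Here gestation reports zero NA’s even though half of its values are placeholders, which is the situation described above.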
Here we look for all of the -999’s.
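A possible way to count the -999’s is sketched below, again on invented data rather than the real file.

```r
library("tidyverse")

# Toy data (invented values)
toy <- tibble(
  species   = c("a", "b", "c", "d"),
  mass      = c(1200, -999, 85, NA),
  gestation = c(5.2, 1.1, -999, -999)
)

# Count how many times -999 appears in each numeric column;
# na.rm = TRUE keeps a true NA from turning the whole count into NA
placeholder_counts <- toy %>%
  summarize(across(where(is.numeric), ~ sum(.x == -999, na.rm = TRUE)))
placeholder_counts
```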
When -999 stands for NA, we can have read_csv() convert these values as the data are imported.
But I only do this once I understand how NA’s are represented in the
data. Sometimes they are represented in multiple ways.
life_history_no_nas <-
read_csv("data/mammal_lifehistories_v3.csv", na="-999") %>%
clean_names()
Rechecking for NA’s. Now we see that there are NA’s in many of the variables.
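When missing values are coded in more than one way, the na argument of read_csv() accepts a character vector. The sketch below writes a tiny invented CSV to a temporary file so that it runs on its own; the file contents are assumptions, not the mammal data. Note that supplying na replaces the default of c("", "NA"), so the defaults are listed again here.

```r
library("tidyverse")

# Write a small invented CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "species,mass,newborn",
  "a,1200,45.5",
  "b,-999,not measured",
  "c,85,-999"
), tmp)

# List every placeholder, plus the defaults "" and "NA"
toy <- read_csv(tmp, na = c("", "NA", "-999", "not measured"))

# Recheck the number of NA's in each variable
toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
```

Because "not measured" is now read as NA, newborn is parsed as a number rather than as text.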
Sometimes it can be helpful to do a quick scan using view() or
glimpse() to see if there are any other odd values that might represent
NA’s. Notice that “not measured” is used in
newborn.
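A quick scan might look like the sketch below. The data are invented, and the filter() step is just one possible way to list text values hiding in a column that should be numeric.

```r
library("tidyverse")

# Toy data (invented): newborn mixes numbers and text
toy <- tibble(
  species = c("a", "b", "c", "d"),
  mass    = c(1200, NA, 85, 40),
  newborn = c("45.5", "not measured", "3.1", "not measured")
)

# newborn shows up as <chr>, a hint that it holds non-numeric text
glimpse(toy)

# List the values in newborn that cannot be read as numbers
odd_values <- toy %>%
  filter(is.na(suppressWarnings(as.numeric(newborn)))) %>%
  count(newborn)
odd_values
```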
Notice that max_life has no NA’s. Does that make sense?
How likely is it that we know the lifespan for all of the species in the
data set?
Let’s use mutate() with na_if() to
replace 0’s with NA’s in max_life. This approach lets us
address problems in a single variable.
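A minimal sketch of this step on invented data is shown below; with the real data, the same mutate() call would be applied to the imported data frame.

```r
library("tidyverse")

# Toy data (invented): a lifespan of 0 is really a missing value
toy <- tibble(
  species  = c("a", "b", "c"),
  max_life = c(240, 0, 0),
  mass     = c(1200, 0.5, 85)
)

# na_if() turns every value equal to 0 into NA, in max_life only
toy_fixed <- toy %>%
  mutate(max_life = na_if(max_life, 0))
toy_fixed
```

Only max_life is changed, so a legitimate small value in another variable such as mass is left alone.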
naniar is a package that manages NA’s. Many of the functions it performs can also be performed using tidyverse functions, but it provides some nice alternatives.
miss_var_summary provides a summary of NA’s across the
data frame.
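For example, on a small invented data frame (toy is an assumption for illustration):

```r
library("tidyverse")
library("naniar")

# Toy data (invented values)
toy <- tibble(
  mass      = c(1200, NA, 85, 40),
  wean_mass = c(NA, NA, 12, NA),
  gestation = c(5.2, 1.1, 2.0, 0.8)
)

# One row per variable, with the number (n_miss) and percent (pct_miss) of NA's
na_summary <- miss_var_summary(toy)
na_summary
```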
A unique feature of naniar is that it can produce
visuals to help evaluate NA’s. Here we use gg_miss_var to
visualize the number of NA’s in each variable.
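A sketch of the call on invented data follows; with the real data you would pass the imported data frame instead of toy.

```r
library("tidyverse")
library("naniar")

# Toy data (invented values)
toy <- tibble(
  mass      = c(1200, NA, 85, 40),
  wean_mass = c(NA, NA, 12, NA),
  gestation = c(5.2, 1.1, 2.0, 0.8)
)

# Plot the number of NA's in each variable
p <- gg_miss_var(toy)
p
```

The result is a ggplot object, so it can be adjusted further with the usual ggplot2 layers such as labs().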
We can also use geom_miss_point() to visualize where
NA’s are located in a scatter plot. Here we see that there are NA’s in
wean_mass across the range of mass.
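The chunk below sketches the same kind of plot on invented data. geom_miss_point() is used in place of geom_point(), and it draws rows with an NA in x or y slightly below the observed range so they stay visible.

```r
library("tidyverse")
library("naniar")

# Toy data (invented): NA's in both mass and wean_mass
toy <- tibble(
  mass      = c(1200, 300, 85, 40, 5, NA),
  wean_mass = c(NA, 90, 12, NA, 1.5, 20)
)

# Scatter plot in which missing and non-missing points are colored differently
p <- ggplot(toy, aes(x = mass, y = wean_mass)) +
  geom_miss_point()
p
```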
We can also use miss_var_summary with
group_by(). This helps us better evaluate where NA’s are in
the data.
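As a sketch, grouping an invented data frame by taxonomic order gives one summary per group. The grouping variable and the values are assumptions chosen for illustration.

```r
library("tidyverse")
library("naniar")

# Toy data (invented values)
toy <- tibble(
  order     = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  mass      = c(40, NA, 1200, 850),
  wean_mass = c(NA, NA, 300, NA)
)

# Summarize NA's for each variable within each order
by_order <- toy %>%
  group_by(order) %>%
  miss_var_summary() %>%
  ungroup()
by_order
```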
naniar has nice replace functions which will allow you
to precisely control which values you want replaced with NA’s in each
variable. This is a nice alternative to mutate() and
na_if().
#life_history %>% #going back to the original data
# replace_with_na(replace = list(newborn = "not measured",
# weaning= -999,
# wean_mass= -999,
# afr= -999,
# max_life= 0,
# litter_size= -999,
# gestation= -999,
# mass= -999)) %>%
#miss_var_summary()
You can also use naniar to replace a specific value (like -999) with NA across the entire data set.
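One way to do this is with replace_with_na_all(), which takes a condition written as a formula where .x stands for each value. The data below are invented for illustration.

```r
library("tidyverse")
library("naniar")

# Toy data (invented values)
toy <- tibble(
  species   = c("a", "b", "c"),
  mass      = c(1200, -999, 85),
  gestation = c(-999, 1.1, -999)
)

# Replace -999 with NA in every column
toy_na <- toy %>%
  replace_with_na_all(condition = ~ .x == -999)
toy_na
```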
Finally, naniar has some built-in examples of common values or character strings used to represent NA’s. The chunk below will use these built-in parameters to replace matching values with NA across the entire data set.
common_na_strings
## [1] "missing" "NA" "N A" "N/A" "#N/A" "NA " " NA"
## [8] "N /A" "N / A" " N / A" "N / A " "na" "n a" "n/a"
## [15] "na " " na" "n /a" "n / a" " a / a" "n / a " "NULL"
## [22] "null" "" "\\?" "\\*" "\\."
common_na_numbers
## [1] -9 -99 -999 -9999 9999 66 77 88
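These built-in vectors can be passed to replace_with_na_all(). The sketch below combines both vectors in one condition, which is an assumption about how they might be used rather than the exact chunk from the lab, and the data are invented.

```r
library("tidyverse")
library("naniar")

# Toy data (invented): 77 is meant to be a real mass, not a placeholder
toy <- tibble(
  species = c("a", "b", "c", "d"),
  newborn = c("45.5", "N/A", "missing", "3.1"),
  mass    = c(1200, -999, 85, 77)
)

# Replace any value found in either built-in vector with NA
toy_na <- toy %>%
  replace_with_na_all(condition = ~ .x %in% c(common_na_strings, common_na_numbers))
toy_na
```

Use this with care: common_na_numbers includes 66, 77, and 88, so the legitimate mass of 77 above is also turned into NA.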
Let’s practice evaluating NA’s in a large data set. The data are compiled from CITES, the Convention on International Trade in Endangered Species of Wild Fauna and Flora, which tracks trade in endangered wildlife. You can find information about the data here.
Some key information:
country codes
Import the data and do a little exploration. Be sure to clean the names if necessary.
Use naniar to summarize the NA’s in each
variable.
Try using group_by() with naniar. Look
specifically at class and
exporter_reported_quantity. For which taxonomic classes do
we have the highest number of missing export data?