At the end of this exercise, you will be able to:
1. Define NA and describe how they are treated in R.
2. Produce summaries of the number of NA’s in a data set.
3. Replace values with NA in a data set.
naniarThe naniar package includes some useful tools to manage NA’s.
#install.packages("naniar")
library("tidyverse")
library("naniar")
library("janitor")
When working with “wild” data, dealing with NA’s is a fundamental part of the data cleaning process. Data scientists spend much of their time cleaning and transforming data- including managing NA’s. There isn’t a single approach that will always work so you need to be careful about using replacement strategies across an entire data set. And, as the data sets become larger NA’s can become trickier to deal with.
For the following, we will use life history data for mammals. The data are
from:
S. K. Morgan Ernest. 2003. Life history characteristics of
placental non-volant mammals. Ecology 84:3402.
life_history <- read_csv("data/mammal_lifehistories_v3.csv") %>% clean_names()
Sometimes using one or more of the summary functions can give us clues to how the authors have represented missing data. This doesn’t always work, but it is a good place to start.
glimpse(life_history)
## Rows: 1,440
## Columns: 13
## $ order <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Artiodac…
## $ family <chr> "Antilocapridae", "Bovidae", "Bovidae", "Bovidae", "Bovid…
## $ genus <chr> "Antilocapra", "Addax", "Aepyceros", "Alcelaphus", "Ammod…
## $ species <chr> "americana", "nasomaculatus", "melampus", "buselaphus", "…
## $ mass <dbl> 45375.0, 182375.0, 41480.0, 150000.0, 28500.0, 55500.0, 3…
## $ gestation <dbl> 8.13, 9.39, 6.35, 7.90, 6.80, 5.08, 5.72, 5.50, 8.93, 9.1…
## $ newborn <chr> "3246.36", "5480", "5093", "10166.67", "not measured", "3…
## $ weaning <dbl> 3.00, 6.50, 5.63, 6.50, -999.00, 4.00, 4.04, 2.13, 10.71,…
## $ wean_mass <dbl> 8900, -999, 15900, -999, -999, -999, -999, -999, 157500, …
## $ afr <dbl> 13.53, 27.27, 16.66, 23.02, -999.00, 14.89, 10.23, 20.13,…
## $ max_life <dbl> 142, 308, 213, 240, 0, 251, 228, 255, 300, 324, 300, 314,…
## $ litter_size <dbl> 1.85, 1.00, 1.00, 1.00, 1.00, 1.37, 1.00, 1.00, 1.00, 1.0…
## $ litters_year <dbl> 1.00, 0.99, 0.95, NA, NA, 2.00, NA, 1.89, 1.00, 1.00, 0.7…
summary(life_history)
## order family genus species
## Length:1440 Length:1440 Length:1440 Length:1440
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## mass gestation newborn weaning
## Min. : -999 Min. :-999.00 Length:1440 Min. :-999.00
## 1st Qu.: 50 1st Qu.:-999.00 Class :character 1st Qu.:-999.00
## Median : 403 Median : 1.05 Mode :character Median : 0.73
## Mean : 383577 Mean :-287.25 Mean :-427.17
## 3rd Qu.: 7009 3rd Qu.: 4.50 3rd Qu.: 2.00
## Max. :149000000 Max. : 21.46 Max. : 48.00
##
## wean_mass afr max_life litter_size
## Min. : -999 Min. :-999.00 Min. : 0.00 Min. :-999.000
## 1st Qu.: -999 1st Qu.:-999.00 1st Qu.: 0.00 1st Qu.: 1.000
## Median : -999 Median : 2.50 Median : 0.00 Median : 2.270
## Mean : 16049 Mean :-408.12 Mean : 93.19 Mean : -55.634
## 3rd Qu.: 10 3rd Qu.: 15.61 3rd Qu.: 147.25 3rd Qu.: 3.835
## Max. :19075000 Max. : 210.00 Max. :1368.00 Max. : 14.180
##
## litters_year
## Min. :0.140
## 1st Qu.:1.000
## Median :1.000
## Mean :1.636
## 3rd Qu.:2.000
## Max. :7.500
## NA's :689
This will give you a quick summary of the number of NA’s in each variable. Notice that, at least for now, it doesn’t look like there are NA’s in most variables because they are still represented by -999.
life_history %>%
summarize(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 0 0
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>
Here we look for all of the -999’s.
life_history %>%
summarize(across(everything(), ~sum(.==-999)))
## # A tibble: 1 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 85 418 0 619 1039 607
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>
read_csv()In the case of -999 as NA, we can use
read_csv to replace these values when we import the data.
But, I only do this once I understand how NA’s are represented in the
data. Sometimes, they are represented in multiple ways.
life_history_no_nas <-
read_csv("data/mammal_lifehistories_v3.csv", na="-999") %>%
clean_names()
Rechecking for NA’s. Now we see that there are NA’s in many of the variables.
life_history_no_nas %>%
summarize(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 85 418 0 619 1039 607
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>
Sometimes it can be helpful to do a quick scan using view() or
glimpse() to see if there are any other odd values that might represent
NA’s. Notice that not measured is used in
newborn.
glimpse(life_history_no_nas)
## Rows: 1,440
## Columns: 13
## $ order <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Artiodac…
## $ family <chr> "Antilocapridae", "Bovidae", "Bovidae", "Bovidae", "Bovid…
## $ genus <chr> "Antilocapra", "Addax", "Aepyceros", "Alcelaphus", "Ammod…
## $ species <chr> "americana", "nasomaculatus", "melampus", "buselaphus", "…
## $ mass <dbl> 45375.0, 182375.0, 41480.0, 150000.0, 28500.0, 55500.0, 3…
## $ gestation <dbl> 8.13, 9.39, 6.35, 7.90, 6.80, 5.08, 5.72, 5.50, 8.93, 9.1…
## $ newborn <chr> "3246.36", "5480", "5093", "10166.67", "not measured", "3…
## $ weaning <dbl> 3.00, 6.50, 5.63, 6.50, NA, 4.00, 4.04, 2.13, 10.71, 6.60…
## $ wean_mass <dbl> 8900, NA, 15900, NA, NA, NA, NA, NA, 157500, NA, NA, NA, …
## $ afr <dbl> 13.53, 27.27, 16.66, 23.02, NA, 14.89, 10.23, 20.13, 29.4…
## $ max_life <dbl> 142, 308, 213, 240, 0, 251, 228, 255, 300, 324, 300, 314,…
## $ litter_size <dbl> 1.85, 1.00, 1.00, 1.00, 1.00, 1.37, 1.00, 1.00, 1.00, 1.0…
## $ litters_year <dbl> 1.00, 0.99, 0.95, NA, NA, 2.00, NA, 1.89, 1.00, 1.00, 0.7…
summary(life_history_no_nas)
## order family genus species
## Length:1440 Length:1440 Length:1440 Length:1440
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## mass gestation newborn weaning
## Min. : 2 Min. : 0.4900 Length:1440 Min. : 0.300
## 1st Qu.: 61 1st Qu.: 0.9925 Class :character 1st Qu.: 0.920
## Median : 606 Median : 2.1100 Mode :character Median : 1.690
## Mean : 407701 Mean : 3.8630 Mean : 3.967
## 3rd Qu.: 8554 3rd Qu.: 6.0000 3rd Qu.: 4.840
## Max. :149000000 Max. :21.4600 Max. :48.000
## NA's :85 NA's :418 NA's :619
## wean_mass afr max_life litter_size
## Min. :2.100e+00 Min. : 0.70 Min. : 0.00 Min. : 1.000
## 1st Qu.:2.010e+01 1st Qu.: 4.50 1st Qu.: 0.00 1st Qu.: 1.018
## Median :1.026e+02 Median : 12.00 Median : 0.00 Median : 2.500
## Mean :6.022e+04 Mean : 22.44 Mean : 93.19 Mean : 2.805
## 3rd Qu.:2.000e+03 3rd Qu.: 28.24 3rd Qu.: 147.25 3rd Qu.: 4.000
## Max. :1.908e+07 Max. :210.00 Max. :1368.00 Max. :14.180
## NA's :1039 NA's :607 NA's :84
## litters_year
## Min. :0.140
## 1st Qu.:1.000
## Median :1.000
## Mean :1.636
## 3rd Qu.:2.000
## Max. :7.500
## NA's :689
Notice that max_life has no NA’s. Does that make sense?
How likely is it that we know the lifespan for all of the species in the
data set?
hist(life_history_no_nas$max_life)
life_history_no_nas %>%
filter(max_life==0) %>%
select(order, family, genus, species, max_life)
## # A tibble: 841 × 5
## order family genus species max_life
## <chr> <chr> <chr> <chr> <dbl>
## 1 Artiodactyla Bovidae Ammodorcas clarkei 0
## 2 Artiodactyla Bovidae Capra caucasica 0
## 3 Artiodactyla Bovidae Capra ibex 0
## 4 Artiodactyla Bovidae Cephalophus niger 0
## 5 Artiodactyla Bovidae Cephalophus natalensis 0
## 6 Artiodactyla Bovidae Cephalophus leucogaster 0
## 7 Artiodactyla Bovidae Cephalophus ogilbyi 0
## 8 Artiodactyla Bovidae Cephalophus zebra 0
## 9 Artiodactyla Bovidae Cephalophus rufilatus 0
## 10 Artiodactyla Bovidae Cephalophus dorsalis 0
## # ℹ 831 more rows
Let’s use mutate() and use na_if() to
replace 0’s with NA’s in max_life. This chunk allows us to
address problems in a single variable.
life_history_no_nas <- life_history_no_nas %>%
mutate(max_life=na_if(max_life, 0))
naniarnaniar is a package that manages NA’s. Many of the functions it performs can also be performed using tidyverse functions, but it provides some nice alternatives.
miss_var_summary provides a summary of NA’s across the
data frame.
miss_var_summary(life_history_no_nas)
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 wean_mass 1039 72.2
## 2 max_life 841 58.4
## 3 litters_year 689 47.8
## 4 weaning 619 43.0
## 5 afr 607 42.2
## 6 gestation 418 29.0
## 7 mass 85 5.90
## 8 litter_size 84 5.83
## 9 order 0 0
## 10 family 0 0
## 11 genus 0 0
## 12 species 0 0
## 13 newborn 0 0
A unique feature of naniar is that it can produce
visuals to help evaluate NA’s. Here we use gg_miss_var to
visualize the number of NA’s in each variable.
life_history_no_nas %>%
gg_miss_var()
We can also use geom_miss_point() to visualize where
NA’s are located in a scatter plot. Here we see that there are NA’s in
wean_mass across the range of mass.
life_history_no_nas %>%
ggplot(aes(x=log10(mass), y=log10(wean_mass)))+
geom_point()+
geom_miss_point()
## Warning: Removed 1044 rows containing missing values or values outside the scale range
## (`geom_point()`).
We can also use miss_var_summary with
group_by(). This helps us better evaluate where NA’s are in
the data.
life_history_no_nas %>%
group_by(order) %>%
miss_var_summary(order=T)
## # A tibble: 204 × 4
## # Groups: order [17]
## order variable n_miss pct_miss
## <chr> <chr> <int> <num>
## 1 Artiodactyla wean_mass 134 83.2
## 2 Artiodactyla litters_year 77 47.8
## 3 Artiodactyla weaning 68 42.2
## 4 Artiodactyla max_life 62 38.5
## 5 Artiodactyla afr 32 19.9
## 6 Artiodactyla gestation 11 6.83
## 7 Artiodactyla litter_size 5 3.11
## 8 Artiodactyla mass 4 2.48
## 9 Artiodactyla family 0 0
## 10 Artiodactyla genus 0 0
## # ℹ 194 more rows
naniar has nice replace functions which will allow you
to precisely control which values you want replaced with NA’s in each
variable. This is a nice alternative to mutate() and
na_if().
life_history %>% #going back to the original data
replace_with_na(replace = list(newborn = "not measured",
weaning= -999,
wean_mass= -999,
afr= -999,
max_life= 0,
litter_size= -999,
gestation= -999,
mass= -999)) %>%
miss_var_summary()
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 wean_mass 1039 72.2
## 2 max_life 841 58.4
## 3 litters_year 689 47.8
## 4 weaning 619 43.0
## 5 afr 607 42.2
## 6 newborn 595 41.3
## 7 gestation 418 29.0
## 8 mass 85 5.90
## 9 litter_size 84 5.83
## 10 order 0 0
## 11 family 0 0
## 12 genus 0 0
## 13 species 0 0
You can also use naniar to replace a specific value (like -999) with NA across the entire data set.
life_history %>%
replace_with_na_all(condition = ~.x ==-999) %>%
miss_var_summary()
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 wean_mass 1039 72.2
## 2 litters_year 689 47.8
## 3 weaning 619 43.0
## 4 afr 607 42.2
## 5 gestation 418 29.0
## 6 mass 85 5.90
## 7 litter_size 84 5.83
## 8 order 0 0
## 9 family 0 0
## 10 genus 0 0
## 11 species 0 0
## 12 newborn 0 0
## 13 max_life 0 0
Finally, naniar has some built-in examples of common values or character strings used to represent NA’s. The chunk below will use these built-in parameters to replace NA’s across the entire data set.
common_na_strings
## [1] "missing" "NA" "N A" "N/A" "#N/A" "NA " " NA"
## [8] "N /A" "N / A" " N / A" "N / A " "na" "n a" "n/a"
## [15] "na " " na" "n /a" "n / a" " a / a" "n / a " "NULL"
## [22] "null" "" "\\?" "\\*" "\\."
common_na_numbers
## [1] -9 -99 -999 -9999 9999 66 77 88
life_history %>%
replace_with_na_all(condition = ~.x %in% c(common_na_numbers, common_na_strings)) %>%
mutate(newborn=na_if(newborn, "not measured"))
## # A tibble: 1,440 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Artiod… Antil… Anti… americ… 4.54e4 8.13 3246.36 3 8900 13.5
## 2 Artiod… Bovid… Addax nasoma… 1.82e5 9.39 5480 6.5 NA 27.3
## 3 Artiod… Bovid… Aepy… melamp… 4.15e4 6.35 5093 5.63 15900 16.7
## 4 Artiod… Bovid… Alce… busela… 1.5 e5 7.9 10166.… 6.5 NA 23.0
## 5 Artiod… Bovid… Ammo… clarkei 2.85e4 6.8 <NA> NA NA NA
## 6 Artiod… Bovid… Ammo… lervia 5.55e4 5.08 3810 4 NA 14.9
## 7 Artiod… Bovid… Anti… marsup… 3 e4 5.72 3910 4.04 NA 10.2
## 8 Artiod… Bovid… Anti… cervic… 3.75e4 5.5 3846 2.13 NA 20.1
## 9 Artiod… Bovid… Bison bison 4.98e5 8.93 20000 10.7 157500 29.4
## 10 Artiod… Bovid… Bison bonasus 5 e5 9.14 23000.… 6.6 NA 30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>
Let’s practice evaluating NA’s in a large data set. The data are compiled from CITES. This is the international organization that tracks trade in endangered wildlife. You can find information about the data here.
Some key information:
country
codes
cites <- read_csv("data/cites.csv") %>%
clean_names()
glimpse(cites)
## Rows: 67,161
## Columns: 16
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ app <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I"…
## $ taxon <chr> "Aquila heliaca", "Aquila heliaca", "Haliae…
## $ class <chr> "Aves", "Aves", "Aves", "Aves", "Aves", "Av…
## $ order <chr> "Falconiformes", "Falconiformes", "Falconif…
## $ family <chr> "Accipitridae", "Accipitridae", "Accipitrid…
## $ genus <chr> "Aquila", "Aquila", "Haliaeetus", "Haliaeet…
## $ importer <chr> "TR", "XV", "BE", "BE", "DK", "XV", "BR", "…
## $ exporter <chr> "NL", "RS", "NO", "NO", "IS", "RS", "FR", "…
## $ origin <chr> "CZ", "RS", NA, NA, NA, "RS", NA, NA, NA, N…
## $ importer_reported_quantity <dbl> NA, NA, NA, NA, 700, NA, NA, NA, NA, NA, 10…
## $ exporter_reported_quantity <dbl> 1, 1, 43, 43, NA, 1, 12, 4, 2, 4, NA, 2, 1,…
## $ term <chr> "bodies", "bodies", "feathers", "specimens"…
## $ unit <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "g"…
## $ purpose <chr> "T", "Q", "S", "S", "S", "Q", "S", "S", "S"…
## $ source <chr> "C", "O", "W", "W", "W", "O", "C", "U", "W"…
naniar to summarize the NA’s in each variable.miss_var_summary(cites)
## # A tibble: 16 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 unit 60759 90.5
## 2 origin 41518 61.8
## 3 importer_reported_quantity 35295 52.6
## 4 exporter_reported_quantity 23140 34.5
## 5 class 20224 30.1
## 6 purpose 6059 9.02
## 7 genus 1459 2.17
## 8 exporter 573 0.853
## 9 source 544 0.810
## 10 family 461 0.686
## 11 importer 71 0.106
## 12 order 57 0.0849
## 13 year 0 0
## 14 app 0 0
## 15 taxon 0 0
## 16 term 0 0
group_by() with naniar. Look
specifically at class and
exporter_reported_quantity. For which taxonomic classes do
we have the highest number of missing export data?cites %>%
select(class, exporter_reported_quantity) %>%
group_by(class) %>%
miss_var_summary() %>%
arrange(desc(n_miss))
## # A tibble: 17 × 4
## # Groups: class [17]
## class variable n_miss pct_miss
## <chr> <chr> <int> <num>
## 1 <NA> exporter_reported_quantity 7002 34.6
## 2 Reptilia exporter_reported_quantity 5323 28.9
## 3 Anthozoa exporter_reported_quantity 3858 43.9
## 4 Mammalia exporter_reported_quantity 3731 43.9
## 5 Aves exporter_reported_quantity 1792 26.1
## 6 Actinopteri exporter_reported_quantity 726 26.3
## 7 Amphibia exporter_reported_quantity 190 45.2
## 8 Bivalvia exporter_reported_quantity 165 61.3
## 9 Gastropoda exporter_reported_quantity 104 54.5
## 10 Insecta exporter_reported_quantity 74 23.9
## 11 Hydrozoa exporter_reported_quantity 61 33.7
## 12 Elasmobranchii exporter_reported_quantity 58 51.3
## 13 Arachnida exporter_reported_quantity 32 47.8
## 14 Hirudinoidea exporter_reported_quantity 11 32.4
## 15 Holothuroidea exporter_reported_quantity 10 100
## 16 Dipneusti exporter_reported_quantity 3 75
## 17 Coelacanthi exporter_reported_quantity 0 0
–>Home