At the end of this exercise, you will be able to:
1. Define NA
and describe how they are treated in R.
2. Produce summaries of the number of NA’s in a data set.
3. Replace values with NA
in a data set as appropriate.
naniar
#install.packages("naniar")
library("tidyverse")
library("naniar")
library("janitor")
When working with “wild” data, dealing with NA’s is a fundamental part of the data cleaning process. Data scientists spend most of their time cleaning and transforming data- including managing NA’s. There isn’t a single approach that will always work so you need to be careful about using replacement strategies across an entire data set. And, as the data sets become larger NA’s can become trickier to deal with.
For the following, we will use life history data for mammals. The data are
from:
S. K. Morgan Ernest. 2003. Life history characteristics of
placental non-volant mammals. Ecology 84:3402.
life_history <- read_csv("data/mammal_lifehistories_v3.csv") %>% clean_names()
Sometimes using one or more of the summary functions can give us clues to how the authors have represented missing data. This doesn’t always work, but it is a good place to start.
summary(life_history)
## order family genus species
## Length:1440 Length:1440 Length:1440 Length:1440
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## mass gestation newborn weaning
## Min. : -999 Min. :-999.00 Length:1440 Min. :-999.00
## 1st Qu.: 50 1st Qu.:-999.00 Class :character 1st Qu.:-999.00
## Median : 403 Median : 1.05 Mode :character Median : 0.73
## Mean : 383577 Mean :-287.25 Mean :-427.17
## 3rd Qu.: 7009 3rd Qu.: 4.50 3rd Qu.: 2.00
## Max. :149000000 Max. : 21.46 Max. : 48.00
##
## wean_mass afr max_life litter_size
## Min. : -999 Min. :-999.00 Min. : 0.00 Min. :-999.000
## 1st Qu.: -999 1st Qu.:-999.00 1st Qu.: 0.00 1st Qu.: 1.000
## Median : -999 Median : 2.50 Median : 0.00 Median : 2.270
## Mean : 16049 Mean :-408.12 Mean : 93.19 Mean : -55.634
## 3rd Qu.: 10 3rd Qu.: 15.61 3rd Qu.: 147.25 3rd Qu.: 3.835
## Max. :19075000 Max. : 210.00 Max. :1368.00 Max. : 14.180
##
## litters_year
## Min. :0.140
## 1st Qu.:1.000
## Median :1.000
## Mean :1.636
## 3rd Qu.:2.000
## Max. :7.500
## NA's :689
This will give you a quick summary of the number of NA’s in each variable. Notice that, at least for now, it doesn’t look like there are NA’s in most variables. But, given the size of the data this is unlikely.
life_history %>%
summarize(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 0 0
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>
life_history_no_nas <-
read_csv("data/mammal_lifehistories_v3.csv", na = c("NA", " ", ".", "-999")) %>%
clean_names() # notice that I am creating a new dataframe that doesn't have any NA's
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): order, family, Genus, species, newborn
## dbl (8): mass, gestation, weaning, wean mass, AFR, max. life, litter size, l...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Recheck
life_history_no_nas %>%
summarize(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 85 418 0 619 1039 607
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>
naniar
naniar is a package that is built to manage NA’s. Many of the functions it performs can also be performed using tidyverse functions, but it does provide some interesting alternatives.
miss_var_summary
provides a summary of NA’s across the
data frame.
miss_var_summary(life_history_no_nas)
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 wean_mass 1039 72.2
## 2 litters_year 689 47.8
## 3 weaning 619 43.0
## 4 afr 607 42.2
## 5 gestation 418 29.0
## 6 mass 85 5.90
## 7 litter_size 84 5.83
## 8 order 0 0
## 9 family 0 0
## 10 genus 0 0
## 11 species 0 0
## 12 newborn 0 0
## 13 max_life 0 0
Notice that max_life
has no NA’s. Does that make sense
in the biological context of this data?
hist(life_history_no_nas$max_life) #we found another way that NA's are represented
life_history_no_nas %>%
filter(max_life==0) %>%
select(order, family, genus, species, max_life)
## # A tibble: 841 × 5
## order family genus species max_life
## <chr> <chr> <chr> <chr> <dbl>
## 1 Artiodactyla Bovidae Ammodorcas clarkei 0
## 2 Artiodactyla Bovidae Capra caucasica 0
## 3 Artiodactyla Bovidae Capra ibex 0
## 4 Artiodactyla Bovidae Cephalophus niger 0
## 5 Artiodactyla Bovidae Cephalophus natalensis 0
## 6 Artiodactyla Bovidae Cephalophus leucogaster 0
## 7 Artiodactyla Bovidae Cephalophus ogilbyi 0
## 8 Artiodactyla Bovidae Cephalophus zebra 0
## 9 Artiodactyla Bovidae Cephalophus rufilatus 0
## 10 Artiodactyla Bovidae Cephalophus dorsalis 0
## # ℹ 831 more rows
Let’s use mutate()
and use na_if()
to
replace 0’s with NA’s in max_life
. This chunk allows us to
address problems in a single variable.
life_history_no_nas <- life_history_no_nas %>%
mutate(max_life=na_if(max_life, 0))
We can also use miss_var_summary
with
group_by()
. This helps us better evaluate where NA’s are in
the data.
life_history_no_nas %>%
group_by(order) %>%
miss_var_summary(order=T)
## # A tibble: 204 × 4
## # Groups: order [17]
## order variable n_miss pct_miss
## <chr> <chr> <int> <num>
## 1 Artiodactyla wean_mass 134 83.2
## 2 Artiodactyla litters_year 77 47.8
## 3 Artiodactyla weaning 68 42.2
## 4 Artiodactyla max_life 62 38.5
## 5 Artiodactyla afr 32 19.9
## 6 Artiodactyla gestation 11 6.83
## 7 Artiodactyla litter_size 5 3.11
## 8 Artiodactyla mass 4 2.48
## 9 Artiodactyla family 0 0
## 10 Artiodactyla genus 0 0
## # ℹ 194 more rows
naniar
also has a nice replace functions which will
allow you to precisely control which values you want replaced with NA’s
in each variable. This is a nice alternative to mutate()
and na_if()
.
life_history %>% #going back to the original data
replace_with_na(replace = list(newborn = "not measured",
weaning= -999,
wean_mass= -999,
afr= -999,
max_life= 0,
litter_size= -999,
gestation= -999,
mass= -999)) %>%
miss_var_summary()
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 wean_mass 1039 72.2
## 2 max_life 841 58.4
## 3 litters_year 689 47.8
## 4 weaning 619 43.0
## 5 afr 607 42.2
## 6 newborn 595 41.3
## 7 gestation 418 29.0
## 8 mass 85 5.90
## 9 litter_size 84 5.83
## 10 order 0 0
## 11 family 0 0
## 12 genus 0 0
## 13 species 0 0
You can also use naniar to replace a specific value (like -999) with NA across the entire data set.
life_history %>% #going back to the original data
replace_with_na_all(condition = ~.x == -999)%>%
miss_var_summary()
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 wean_mass 1039 72.2
## 2 litters_year 689 47.8
## 3 weaning 619 43.0
## 4 afr 607 42.2
## 5 gestation 418 29.0
## 6 mass 85 5.90
## 7 litter_size 84 5.83
## 8 order 0 0
## 9 family 0 0
## 10 genus 0 0
## 11 species 0 0
## 12 newborn 0 0
## 13 max_life 0 0
Finally, naniar has some built-in examples of common values or character strings used to represent NA’s. The chunk below will use these built-in parameters to replace NA’s across the entire data set.
common_na_strings
## [1] "missing" "NA" "N A" "N/A" "#N/A" "NA " " NA"
## [8] "N /A" "N / A" " N / A" "N / A " "na" "n a" "n/a"
## [15] "na " " na" "n /a" "n / a" " a / a" "n / a " "NULL"
## [22] "null" "" "\\?" "\\*" "\\."
common_na_numbers
## [1] -9 -99 -999 -9999 9999 66 77 88
life_history %>% #going back to the original data
replace_with_na_all(condition = ~.x %in% c(common_na_strings, common_na_numbers)) %>%
mutate(newborn=na_if(newborn, "not measured"))
## # A tibble: 1,440 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Artiod… Antil… Anti… americ… 4.54e4 8.13 3246.36 3 8900 13.5
## 2 Artiod… Bovid… Addax nasoma… 1.82e5 9.39 5480 6.5 NA 27.3
## 3 Artiod… Bovid… Aepy… melamp… 4.15e4 6.35 5093 5.63 15900 16.7
## 4 Artiod… Bovid… Alce… busela… 1.5 e5 7.9 10166.… 6.5 NA 23.0
## 5 Artiod… Bovid… Ammo… clarkei 2.85e4 6.8 <NA> NA NA NA
## 6 Artiod… Bovid… Ammo… lervia 5.55e4 5.08 3810 4 NA 14.9
## 7 Artiod… Bovid… Anti… marsup… 3 e4 5.72 3910 4.04 NA 10.2
## 8 Artiod… Bovid… Anti… cervic… 3.75e4 5.5 3846 2.13 NA 20.1
## 9 Artiod… Bovid… Bison bison 4.98e5 8.93 20000 10.7 157500 29.4
## 10 Artiod… Bovid… Bison bonasus 5 e5 9.14 23000.… 6.6 NA 30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>
Let’s practice evaluating NA’s in a large data set. The data are compiled from CITES. This is the international organization that tracks trade in endangered wildlife. You can find information about the data here.
Some key information:
country
codes
cites <- read_csv("data/cites.csv") %>% clean_names()
## Rows: 67161 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): App., Taxon, Class, Order, Family, Genus, Importer, Exporter, Orig...
## dbl (3): Year, Importer reported quantity, Exporter reported quantity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
naniar
to summarize the NA’s in each variable.miss_var_summary(cites)
## # A tibble: 16 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 unit 60759 90.5
## 2 origin 41518 61.8
## 3 importer_reported_quantity 35295 52.6
## 4 exporter_reported_quantity 23140 34.5
## 5 class 20224 30.1
## 6 purpose 6059 9.02
## 7 genus 1459 2.17
## 8 exporter 573 0.853
## 9 source 544 0.810
## 10 family 461 0.686
## 11 importer 71 0.106
## 12 order 57 0.0849
## 13 year 0 0
## 14 app 0 0
## 15 taxon 0 0
## 16 term 0 0
group_by()
with naniar
. Look
specifically at class
and
exporter_reported_quantity
. For which taxonomic classes do
we have the highest number of missing export data?cites %>%
select(class, exporter_reported_quantity) %>%
group_by(class) %>%
miss_var_summary() %>%
arrange(desc(n_miss))
## # A tibble: 17 × 4
## # Groups: class [17]
## class variable n_miss pct_miss
## <chr> <chr> <int> <num>
## 1 <NA> exporter_reported_quantity 7002 34.6
## 2 Reptilia exporter_reported_quantity 5323 28.9
## 3 Anthozoa exporter_reported_quantity 3858 43.9
## 4 Mammalia exporter_reported_quantity 3731 43.9
## 5 Aves exporter_reported_quantity 1792 26.1
## 6 Actinopteri exporter_reported_quantity 726 26.3
## 7 Amphibia exporter_reported_quantity 190 45.2
## 8 Bivalvia exporter_reported_quantity 165 61.3
## 9 Gastropoda exporter_reported_quantity 104 54.5
## 10 Insecta exporter_reported_quantity 74 23.9
## 11 Hydrozoa exporter_reported_quantity 61 33.7
## 12 Elasmobranchii exporter_reported_quantity 58 51.3
## 13 Arachnida exporter_reported_quantity 32 47.8
## 14 Hirudinoidea exporter_reported_quantity 11 32.4
## 15 Holothuroidea exporter_reported_quantity 10 100
## 16 Dipneusti exporter_reported_quantity 3 75
## 17 Coelacanthi exporter_reported_quantity 0 0
Once you are sure that you know how NA’s are treated in the data,
then you can deal with them in advance using na()
as part
of the readr
package. Notice that you still need to deal
with some variables seperately. For example, you don’t want to replace
all 0’s with NA, just the max_life
variable.
read_csv(file = "data/mammal_lifehistories_v3.csv", na = c("NA", " ", ".", "-999", "not measured")) %>% clean_names()
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): order, family, Genus, species
## dbl (9): mass, gestation, newborn, weaning, wean mass, AFR, max. life, litte...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1,440 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Artiod… Antil… Anti… americ… 4.54e4 8.13 3246. 3 8900 13.5
## 2 Artiod… Bovid… Addax nasoma… 1.82e5 9.39 5480 6.5 NA 27.3
## 3 Artiod… Bovid… Aepy… melamp… 4.15e4 6.35 5093 5.63 15900 16.7
## 4 Artiod… Bovid… Alce… busela… 1.5 e5 7.9 10167. 6.5 NA 23.0
## 5 Artiod… Bovid… Ammo… clarkei 2.85e4 6.8 NA NA NA NA
## 6 Artiod… Bovid… Ammo… lervia 5.55e4 5.08 3810 4 NA 14.9
## 7 Artiod… Bovid… Anti… marsup… 3 e4 5.72 3910 4.04 NA 10.2
## 8 Artiod… Bovid… Anti… cervic… 3.75e4 5.5 3846 2.13 NA 20.1
## 9 Artiod… Bovid… Bison bison 4.98e5 8.93 20000 10.7 157500 29.4
## 10 Artiod… Bovid… Bison bonasus 5 e5 9.14 23000. 6.6 NA 30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>
–>Home