Learning Goals

At the end of this exercise, you will be able to:
1. Define NA and describe how they are treated in R.
2. Produce summaries of the number of NA’s in a data set.
3. Replace values with NA in a data set.

Install the package naniar

The naniar package includes some useful tools to manage NA’s.

#install.packages("naniar")

Load the libraries

library("tidyverse")
library("naniar")
library("janitor")

Review

When working with “wild” data, dealing with NA’s is a fundamental part of the data cleaning process. Data scientists spend much of their time cleaning and transforming data- including managing NA’s. There isn’t a single approach that will always work so you need to be careful about using replacement strategies across an entire data set. And, as the data sets become larger NA’s can become trickier to deal with.

For the following, we will use life history data for mammals. The data are from:
S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402.

Load the mammals life history data and clean the names

life_history <- read_csv("data/mammal_lifehistories_v3.csv") %>% clean_names()

Are there any NA’s?

Sometimes using one or more of the summary functions can give us clues to how the authors have represented missing data. This doesn’t always work, but it is a good place to start.

glimpse(life_history)
## Rows: 1,440
## Columns: 13
## $ order        <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Artiodac…
## $ family       <chr> "Antilocapridae", "Bovidae", "Bovidae", "Bovidae", "Bovid…
## $ genus        <chr> "Antilocapra", "Addax", "Aepyceros", "Alcelaphus", "Ammod…
## $ species      <chr> "americana", "nasomaculatus", "melampus", "buselaphus", "…
## $ mass         <dbl> 45375.0, 182375.0, 41480.0, 150000.0, 28500.0, 55500.0, 3…
## $ gestation    <dbl> 8.13, 9.39, 6.35, 7.90, 6.80, 5.08, 5.72, 5.50, 8.93, 9.1…
## $ newborn      <chr> "3246.36", "5480", "5093", "10166.67", "not measured", "3…
## $ weaning      <dbl> 3.00, 6.50, 5.63, 6.50, -999.00, 4.00, 4.04, 2.13, 10.71,…
## $ wean_mass    <dbl> 8900, -999, 15900, -999, -999, -999, -999, -999, 157500, …
## $ afr          <dbl> 13.53, 27.27, 16.66, 23.02, -999.00, 14.89, 10.23, 20.13,…
## $ max_life     <dbl> 142, 308, 213, 240, 0, 251, 228, 255, 300, 324, 300, 314,…
## $ litter_size  <dbl> 1.85, 1.00, 1.00, 1.00, 1.00, 1.37, 1.00, 1.00, 1.00, 1.0…
## $ litters_year <dbl> 1.00, 0.99, 0.95, NA, NA, 2.00, NA, 1.89, 1.00, 1.00, 0.7…
summary(life_history)
##     order              family             genus             species         
##  Length:1440        Length:1440        Length:1440        Length:1440       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       mass             gestation         newborn             weaning       
##  Min.   :     -999   Min.   :-999.00   Length:1440        Min.   :-999.00  
##  1st Qu.:       50   1st Qu.:-999.00   Class :character   1st Qu.:-999.00  
##  Median :      403   Median :   1.05   Mode  :character   Median :   0.73  
##  Mean   :   383577   Mean   :-287.25                      Mean   :-427.17  
##  3rd Qu.:     7009   3rd Qu.:   4.50                      3rd Qu.:   2.00  
##  Max.   :149000000   Max.   :  21.46                      Max.   :  48.00  
##                                                                            
##    wean_mass             afr             max_life        litter_size      
##  Min.   :    -999   Min.   :-999.00   Min.   :   0.00   Min.   :-999.000  
##  1st Qu.:    -999   1st Qu.:-999.00   1st Qu.:   0.00   1st Qu.:   1.000  
##  Median :    -999   Median :   2.50   Median :   0.00   Median :   2.270  
##  Mean   :   16049   Mean   :-408.12   Mean   :  93.19   Mean   : -55.634  
##  3rd Qu.:      10   3rd Qu.:  15.61   3rd Qu.: 147.25   3rd Qu.:   3.835  
##  Max.   :19075000   Max.   : 210.00   Max.   :1368.00   Max.   :  14.180  
##                                                                           
##   litters_year  
##  Min.   :0.140  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.636  
##  3rd Qu.:2.000  
##  Max.   :7.500  
##  NA's   :689

Where are the NA’s?

This will give you a quick summary of the number of NA’s in each variable. Notice that, at least for now, it doesn’t look like there are NA’s in most variables because they are still represented by -999.

life_history %>% 
  summarize(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 13
##   order family genus species  mass gestation newborn weaning wean_mass   afr
##   <int>  <int> <int>   <int> <int>     <int>   <int>   <int>     <int> <int>
## 1     0      0     0       0     0         0       0       0         0     0
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>

Here we look for all of the -999’s.

life_history %>% 
  summarize(across(everything(), ~sum(.==-999)))
## # A tibble: 1 × 13
##   order family genus species  mass gestation newborn weaning wean_mass   afr
##   <int>  <int> <int>   <int> <int>     <int>   <int>   <int>     <int> <int>
## 1     0      0     0       0    85       418       0     619      1039   607
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>

Dealing with NA’s using read_csv()

In the case of -999 as NA, we can use read_csv to replace these values when we import the data. But, I only do this once I understand how NA’s are represented in the data. Sometimes, they are represented in multiple ways.

life_history_no_nas <- 
  read_csv("data/mammal_lifehistories_v3.csv", na="-999") %>% 
  clean_names()

Rechecking for NA’s. Now we see that there are NA’s in many of the variables.

life_history_no_nas %>% 
  summarize(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 13
##   order family genus species  mass gestation newborn weaning wean_mass   afr
##   <int>  <int> <int>   <int> <int>     <int>   <int>   <int>     <int> <int>
## 1     0      0     0       0    85       418       0     619      1039   607
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>

Did we catch them all?

Sometimes it can be helpful to do a quick scan using view() or glimpse() to see if there are any other odd values that might represent NA’s. Notice that not measured is used in newborn.

glimpse(life_history_no_nas)
## Rows: 1,440
## Columns: 13
## $ order        <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Artiodac…
## $ family       <chr> "Antilocapridae", "Bovidae", "Bovidae", "Bovidae", "Bovid…
## $ genus        <chr> "Antilocapra", "Addax", "Aepyceros", "Alcelaphus", "Ammod…
## $ species      <chr> "americana", "nasomaculatus", "melampus", "buselaphus", "…
## $ mass         <dbl> 45375.0, 182375.0, 41480.0, 150000.0, 28500.0, 55500.0, 3…
## $ gestation    <dbl> 8.13, 9.39, 6.35, 7.90, 6.80, 5.08, 5.72, 5.50, 8.93, 9.1…
## $ newborn      <chr> "3246.36", "5480", "5093", "10166.67", "not measured", "3…
## $ weaning      <dbl> 3.00, 6.50, 5.63, 6.50, NA, 4.00, 4.04, 2.13, 10.71, 6.60…
## $ wean_mass    <dbl> 8900, NA, 15900, NA, NA, NA, NA, NA, 157500, NA, NA, NA, …
## $ afr          <dbl> 13.53, 27.27, 16.66, 23.02, NA, 14.89, 10.23, 20.13, 29.4…
## $ max_life     <dbl> 142, 308, 213, 240, 0, 251, 228, 255, 300, 324, 300, 314,…
## $ litter_size  <dbl> 1.85, 1.00, 1.00, 1.00, 1.00, 1.37, 1.00, 1.00, 1.00, 1.0…
## $ litters_year <dbl> 1.00, 0.99, 0.95, NA, NA, 2.00, NA, 1.89, 1.00, 1.00, 0.7…
summary(life_history_no_nas)
##     order              family             genus             species         
##  Length:1440        Length:1440        Length:1440        Length:1440       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       mass             gestation         newborn             weaning      
##  Min.   :        2   Min.   : 0.4900   Length:1440        Min.   : 0.300  
##  1st Qu.:       61   1st Qu.: 0.9925   Class :character   1st Qu.: 0.920  
##  Median :      606   Median : 2.1100   Mode  :character   Median : 1.690  
##  Mean   :   407701   Mean   : 3.8630                      Mean   : 3.967  
##  3rd Qu.:     8554   3rd Qu.: 6.0000                      3rd Qu.: 4.840  
##  Max.   :149000000   Max.   :21.4600                      Max.   :48.000  
##  NA's   :85          NA's   :418                          NA's   :619     
##    wean_mass              afr            max_life        litter_size    
##  Min.   :2.100e+00   Min.   :  0.70   Min.   :   0.00   Min.   : 1.000  
##  1st Qu.:2.010e+01   1st Qu.:  4.50   1st Qu.:   0.00   1st Qu.: 1.018  
##  Median :1.026e+02   Median : 12.00   Median :   0.00   Median : 2.500  
##  Mean   :6.022e+04   Mean   : 22.44   Mean   :  93.19   Mean   : 2.805  
##  3rd Qu.:2.000e+03   3rd Qu.: 28.24   3rd Qu.: 147.25   3rd Qu.: 4.000  
##  Max.   :1.908e+07   Max.   :210.00   Max.   :1368.00   Max.   :14.180  
##  NA's   :1039        NA's   :607                        NA's   :84      
##   litters_year  
##  Min.   :0.140  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.636  
##  3rd Qu.:2.000  
##  Max.   :7.500  
##  NA's   :689

Notice that max_life has no NA’s. Does that make sense? How likely is it that we know the lifespan for all of the species in the data set?

hist(life_history_no_nas$max_life)

life_history_no_nas %>% 
  filter(max_life==0) %>% 
  select(order, family, genus, species, max_life)
## # A tibble: 841 × 5
##    order        family  genus       species     max_life
##    <chr>        <chr>   <chr>       <chr>          <dbl>
##  1 Artiodactyla Bovidae Ammodorcas  clarkei            0
##  2 Artiodactyla Bovidae Capra       caucasica          0
##  3 Artiodactyla Bovidae Capra       ibex               0
##  4 Artiodactyla Bovidae Cephalophus niger              0
##  5 Artiodactyla Bovidae Cephalophus natalensis         0
##  6 Artiodactyla Bovidae Cephalophus leucogaster        0
##  7 Artiodactyla Bovidae Cephalophus ogilbyi            0
##  8 Artiodactyla Bovidae Cephalophus zebra              0
##  9 Artiodactyla Bovidae Cephalophus rufilatus          0
## 10 Artiodactyla Bovidae Cephalophus dorsalis           0
## # ℹ 831 more rows

Let’s use mutate() and use na_if() to replace 0’s with NA’s in max_life. This chunk allows us to address problems in a single variable.

life_history_no_nas <- life_history_no_nas %>% 
  mutate(max_life=na_if(max_life, 0))

naniar

naniar is a package that manages NA’s. Many of the functions it performs can also be performed using tidyverse functions, but it provides some nice alternatives.

miss_var_summary provides a summary of NA’s across the data frame.

miss_var_summary(life_history_no_nas)
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <num>
##  1 wean_mass      1039    72.2 
##  2 max_life        841    58.4 
##  3 litters_year    689    47.8 
##  4 weaning         619    43.0 
##  5 afr             607    42.2 
##  6 gestation       418    29.0 
##  7 mass             85     5.90
##  8 litter_size      84     5.83
##  9 order             0     0   
## 10 family            0     0   
## 11 genus             0     0   
## 12 species           0     0   
## 13 newborn           0     0

A unique feature of naniar is that it can produce visuals to help evaluate NA’s. Here we use gg_miss_var to visualize the number of NA’s in each variable.

life_history_no_nas %>% 
  gg_miss_var()

We can also use geom_miss_point() to visualize where NA’s are located in a scatter plot. Here we see that there are NA’s in wean_mass across the range of mass.

life_history_no_nas %>% 
  ggplot(aes(x=log10(mass), y=log10(wean_mass)))+
  geom_point()+
  geom_miss_point()
## Warning: Removed 1044 rows containing missing values or values outside the scale range
## (`geom_point()`).

We can also use miss_var_summary with group_by(). This helps us better evaluate where NA’s are in the data.

life_history_no_nas %>% 
  group_by(order) %>% 
  miss_var_summary(order=T)
## # A tibble: 204 × 4
## # Groups:   order [17]
##    order        variable     n_miss pct_miss
##    <chr>        <chr>         <int>    <num>
##  1 Artiodactyla wean_mass       134    83.2 
##  2 Artiodactyla litters_year     77    47.8 
##  3 Artiodactyla weaning          68    42.2 
##  4 Artiodactyla max_life         62    38.5 
##  5 Artiodactyla afr              32    19.9 
##  6 Artiodactyla gestation        11     6.83
##  7 Artiodactyla litter_size       5     3.11
##  8 Artiodactyla mass              4     2.48
##  9 Artiodactyla family            0     0   
## 10 Artiodactyla genus             0     0   
## # ℹ 194 more rows

naniar has nice replace functions which will allow you to precisely control which values you want replaced with NA’s in each variable. This is a nice alternative to mutate() and na_if().

life_history %>% #going back to the original data
  replace_with_na(replace = list(newborn = "not measured", 
                                 weaning= -999, 
                                 wean_mass= -999, 
                                 afr= -999, 
                                 max_life= 0, 
                                 litter_size= -999, 
                                 gestation= -999, 
                                 mass= -999)) %>% 
miss_var_summary()
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <num>
##  1 wean_mass      1039    72.2 
##  2 max_life        841    58.4 
##  3 litters_year    689    47.8 
##  4 weaning         619    43.0 
##  5 afr             607    42.2 
##  6 newborn         595    41.3 
##  7 gestation       418    29.0 
##  8 mass             85     5.90
##  9 litter_size      84     5.83
## 10 order             0     0   
## 11 family            0     0   
## 12 genus             0     0   
## 13 species           0     0

You can also use naniar to replace a specific value (like -999) with NA across the entire data set.

life_history %>% 
  replace_with_na_all(condition = ~.x ==-999) %>% 
  miss_var_summary()
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <num>
##  1 wean_mass      1039    72.2 
##  2 litters_year    689    47.8 
##  3 weaning         619    43.0 
##  4 afr             607    42.2 
##  5 gestation       418    29.0 
##  6 mass             85     5.90
##  7 litter_size      84     5.83
##  8 order             0     0   
##  9 family            0     0   
## 10 genus             0     0   
## 11 species           0     0   
## 12 newborn           0     0   
## 13 max_life          0     0

Finally, naniar has some built-in examples of common values or character strings used to represent NA’s. The chunk below will use these built-in parameters to replace NA’s across the entire data set.

common_na_strings
##  [1] "missing" "NA"      "N A"     "N/A"     "#N/A"    "NA "     " NA"    
##  [8] "N /A"    "N / A"   " N / A"  "N / A "  "na"      "n a"     "n/a"    
## [15] "na "     " na"     "n /a"    "n / a"   " a / a"  "n / a "  "NULL"   
## [22] "null"    ""        "\\?"     "\\*"     "\\."
common_na_numbers
## [1]    -9   -99  -999 -9999  9999    66    77    88
life_history %>% 
  replace_with_na_all(condition = ~.x %in% c(common_na_numbers, common_na_strings)) %>% 
  mutate(newborn=na_if(newborn, "not measured"))
## # A tibble: 1,440 × 13
##    order   family genus species   mass gestation newborn weaning wean_mass   afr
##    <chr>   <chr>  <chr> <chr>    <dbl>     <dbl> <chr>     <dbl>     <dbl> <dbl>
##  1 Artiod… Antil… Anti… americ… 4.54e4      8.13 3246.36    3         8900  13.5
##  2 Artiod… Bovid… Addax nasoma… 1.82e5      9.39 5480       6.5         NA  27.3
##  3 Artiod… Bovid… Aepy… melamp… 4.15e4      6.35 5093       5.63     15900  16.7
##  4 Artiod… Bovid… Alce… busela… 1.5 e5      7.9  10166.…    6.5         NA  23.0
##  5 Artiod… Bovid… Ammo… clarkei 2.85e4      6.8  <NA>      NA           NA  NA  
##  6 Artiod… Bovid… Ammo… lervia  5.55e4      5.08 3810       4           NA  14.9
##  7 Artiod… Bovid… Anti… marsup… 3   e4      5.72 3910       4.04        NA  10.2
##  8 Artiod… Bovid… Anti… cervic… 3.75e4      5.5  3846       2.13        NA  20.1
##  9 Artiod… Bovid… Bison bison   4.98e5      8.93 20000     10.7     157500  29.4
## 10 Artiod… Bovid… Bison bonasus 5   e5      9.14 23000.…    6.6         NA  30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>

Practice

Let’s practice evaluating NA’s in a large data set. The data are compiled from CITES. This is the international organization that tracks trade in endangered wildlife. You can find information about the data here.

Some key information:
country codes

  1. Import the data and do a little exploration. Be sure to clean the names if necessary.
cites <- read_csv("data/cites.csv") %>% 
  clean_names()
glimpse(cites)
## Rows: 67,161
## Columns: 16
## $ year                       <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ app                        <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I"…
## $ taxon                      <chr> "Aquila heliaca", "Aquila heliaca", "Haliae…
## $ class                      <chr> "Aves", "Aves", "Aves", "Aves", "Aves", "Av…
## $ order                      <chr> "Falconiformes", "Falconiformes", "Falconif…
## $ family                     <chr> "Accipitridae", "Accipitridae", "Accipitrid…
## $ genus                      <chr> "Aquila", "Aquila", "Haliaeetus", "Haliaeet…
## $ importer                   <chr> "TR", "XV", "BE", "BE", "DK", "XV", "BR", "…
## $ exporter                   <chr> "NL", "RS", "NO", "NO", "IS", "RS", "FR", "…
## $ origin                     <chr> "CZ", "RS", NA, NA, NA, "RS", NA, NA, NA, N…
## $ importer_reported_quantity <dbl> NA, NA, NA, NA, 700, NA, NA, NA, NA, NA, 10…
## $ exporter_reported_quantity <dbl> 1, 1, 43, 43, NA, 1, 12, 4, 2, 4, NA, 2, 1,…
## $ term                       <chr> "bodies", "bodies", "feathers", "specimens"…
## $ unit                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "g"…
## $ purpose                    <chr> "T", "Q", "S", "S", "S", "Q", "S", "S", "S"…
## $ source                     <chr> "C", "O", "W", "W", "W", "O", "C", "U", "W"…
  1. Use naniar to summarize the NA’s in each variable.
miss_var_summary(cites)
## # A tibble: 16 × 3
##    variable                   n_miss pct_miss
##    <chr>                       <int>    <num>
##  1 unit                        60759  90.5   
##  2 origin                      41518  61.8   
##  3 importer_reported_quantity  35295  52.6   
##  4 exporter_reported_quantity  23140  34.5   
##  5 class                       20224  30.1   
##  6 purpose                      6059   9.02  
##  7 genus                        1459   2.17  
##  8 exporter                      573   0.853 
##  9 source                        544   0.810 
## 10 family                        461   0.686 
## 11 importer                       71   0.106 
## 12 order                          57   0.0849
## 13 year                            0   0     
## 14 app                             0   0     
## 15 taxon                           0   0     
## 16 term                            0   0
  1. Try using group_by() with naniar. Look specifically at class and exporter_reported_quantity. For which taxonomic classes do we have the highest number of missing export data?
cites %>% 
  select(class, exporter_reported_quantity) %>% 
  group_by(class) %>% 
  miss_var_summary() %>% 
  arrange(desc(n_miss))
## # A tibble: 17 × 4
## # Groups:   class [17]
##    class          variable                   n_miss pct_miss
##    <chr>          <chr>                       <int>    <num>
##  1 <NA>           exporter_reported_quantity   7002     34.6
##  2 Reptilia       exporter_reported_quantity   5323     28.9
##  3 Anthozoa       exporter_reported_quantity   3858     43.9
##  4 Mammalia       exporter_reported_quantity   3731     43.9
##  5 Aves           exporter_reported_quantity   1792     26.1
##  6 Actinopteri    exporter_reported_quantity    726     26.3
##  7 Amphibia       exporter_reported_quantity    190     45.2
##  8 Bivalvia       exporter_reported_quantity    165     61.3
##  9 Gastropoda     exporter_reported_quantity    104     54.5
## 10 Insecta        exporter_reported_quantity     74     23.9
## 11 Hydrozoa       exporter_reported_quantity     61     33.7
## 12 Elasmobranchii exporter_reported_quantity     58     51.3
## 13 Arachnida      exporter_reported_quantity     32     47.8
## 14 Hirudinoidea   exporter_reported_quantity     11     32.4
## 15 Holothuroidea  exporter_reported_quantity     10    100  
## 16 Dipneusti      exporter_reported_quantity      3     75  
## 17 Coelacanthi    exporter_reported_quantity      0      0

–>Home