Learning Goals

At the end of this exercise, you will be able to:
1. Define NA and describe how they are treated in R.
2. Produce summaries of the number of NA’s in a data set.
3. Replace values with NA in a data set as appropriate.

Install the package naniar

#install.packages("naniar")

Load the libraries

library("tidyverse")
library("naniar")
library("janitor")

Review

When working with “wild” data, dealing with NA’s is a fundamental part of the data cleaning process. Data scientists spend most of their time cleaning and transforming data- including managing NA’s. There isn’t a single approach that will always work so you need to be careful about using replacement strategies across an entire data set. And, as the data sets become larger NA’s can become trickier to deal with.

For the following, we will use life history data for mammals. The data are from:
S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402.

Load the mammals life history data and clean the names

life_history <- read_csv("data/mammal_lifehistories_v3.csv") %>% clean_names()

Are there any NA’s?

Sometimes using one or more of the summary functions can give us clues to how the authors have represented missing data. This doesn’t always work, but it is a good place to start.

summary(life_history)
##     order              family             genus             species         
##  Length:1440        Length:1440        Length:1440        Length:1440       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       mass             gestation         newborn             weaning       
##  Min.   :     -999   Min.   :-999.00   Length:1440        Min.   :-999.00  
##  1st Qu.:       50   1st Qu.:-999.00   Class :character   1st Qu.:-999.00  
##  Median :      403   Median :   1.05   Mode  :character   Median :   0.73  
##  Mean   :   383577   Mean   :-287.25                      Mean   :-427.17  
##  3rd Qu.:     7009   3rd Qu.:   4.50                      3rd Qu.:   2.00  
##  Max.   :149000000   Max.   :  21.46                      Max.   :  48.00  
##                                                                            
##    wean_mass             afr             max_life        litter_size      
##  Min.   :    -999   Min.   :-999.00   Min.   :   0.00   Min.   :-999.000  
##  1st Qu.:    -999   1st Qu.:-999.00   1st Qu.:   0.00   1st Qu.:   1.000  
##  Median :    -999   Median :   2.50   Median :   0.00   Median :   2.270  
##  Mean   :   16049   Mean   :-408.12   Mean   :  93.19   Mean   : -55.634  
##  3rd Qu.:      10   3rd Qu.:  15.61   3rd Qu.: 147.25   3rd Qu.:   3.835  
##  Max.   :19075000   Max.   : 210.00   Max.   :1368.00   Max.   :  14.180  
##                                                                           
##   litters_year  
##  Min.   :0.140  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.636  
##  3rd Qu.:2.000  
##  Max.   :7.500  
##  NA's   :689

This will give you a quick summary of the number of NA’s in each variable. Notice that, at least for now, it doesn’t look like there are NA’s in most variables. But, given the size of the data this is unlikely.

life_history %>% 
  summarize(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 13
##   order family genus species  mass gestation newborn weaning wean_mass   afr
##   <int>  <int> <int>   <int> <int>     <int>   <int>   <int>     <int> <int>
## 1     0      0     0       0     0         0       0       0         0     0
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>

Practice

  1. Can we use a single approach to deal with NA’s in this data set? Given what you learned in the previous lab, how would you manage the NA values?
life_history_no_nas <- 
  read_csv("data/mammal_lifehistories_v3.csv", na = c("NA", " ", ".", "-999")) %>% 
  clean_names() # notice that I am creating a new dataframe that doesn't have any NA's
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): order, family, Genus, species, newborn
## dbl (8): mass, gestation, weaning, wean mass, AFR, max. life, litter size, l...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Recheck

life_history_no_nas %>% 
  summarize(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 13
##   order family genus species  mass gestation newborn weaning wean_mass   afr
##   <int>  <int> <int>   <int> <int>     <int>   <int>   <int>     <int> <int>
## 1     0      0     0       0    85       418       0     619      1039   607
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>

naniar

naniar is a package that is built to manage NA’s. Many of the functions it performs can also be performed using tidyverse functions, but it does provide some interesting alternatives.

miss_var_summary provides a summary of NA’s across the data frame.

miss_var_summary(life_history_no_nas)
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <num>
##  1 wean_mass      1039    72.2 
##  2 litters_year    689    47.8 
##  3 weaning         619    43.0 
##  4 afr             607    42.2 
##  5 gestation       418    29.0 
##  6 mass             85     5.90
##  7 litter_size      84     5.83
##  8 order             0     0   
##  9 family            0     0   
## 10 genus             0     0   
## 11 species           0     0   
## 12 newborn           0     0   
## 13 max_life          0     0

Notice that max_life has no NA’s. Does that make sense in the biological context of this data?

hist(life_history_no_nas$max_life) #we found another way that NA's are represented

life_history_no_nas %>%
  filter(max_life==0) %>% 
  select(order, family, genus, species, max_life)
## # A tibble: 841 × 5
##    order        family  genus       species     max_life
##    <chr>        <chr>   <chr>       <chr>          <dbl>
##  1 Artiodactyla Bovidae Ammodorcas  clarkei            0
##  2 Artiodactyla Bovidae Capra       caucasica          0
##  3 Artiodactyla Bovidae Capra       ibex               0
##  4 Artiodactyla Bovidae Cephalophus niger              0
##  5 Artiodactyla Bovidae Cephalophus natalensis         0
##  6 Artiodactyla Bovidae Cephalophus leucogaster        0
##  7 Artiodactyla Bovidae Cephalophus ogilbyi            0
##  8 Artiodactyla Bovidae Cephalophus zebra              0
##  9 Artiodactyla Bovidae Cephalophus rufilatus          0
## 10 Artiodactyla Bovidae Cephalophus dorsalis           0
## # ℹ 831 more rows

Let’s use mutate() and use na_if() to replace 0’s with NA’s in max_life. This chunk allows us to address problems in a single variable.

life_history_no_nas <- life_history_no_nas %>% 
  mutate(max_life=na_if(max_life, 0))

We can also use miss_var_summary with group_by(). This helps us better evaluate where NA’s are in the data.

life_history_no_nas %>%
  group_by(order) %>%
  miss_var_summary(order=T)
## # A tibble: 204 × 4
## # Groups:   order [17]
##    order        variable     n_miss pct_miss
##    <chr>        <chr>         <int>    <num>
##  1 Artiodactyla wean_mass       134    83.2 
##  2 Artiodactyla litters_year     77    47.8 
##  3 Artiodactyla weaning          68    42.2 
##  4 Artiodactyla max_life         62    38.5 
##  5 Artiodactyla afr              32    19.9 
##  6 Artiodactyla gestation        11     6.83
##  7 Artiodactyla litter_size       5     3.11
##  8 Artiodactyla mass              4     2.48
##  9 Artiodactyla family            0     0   
## 10 Artiodactyla genus             0     0   
## # ℹ 194 more rows

naniar also has a nice replace functions which will allow you to precisely control which values you want replaced with NA’s in each variable. This is a nice alternative to mutate() and na_if().

life_history %>% #going back to the original data
  replace_with_na(replace = list(newborn = "not measured", 
                                 weaning= -999, 
                                 wean_mass= -999, 
                                 afr= -999, 
                                 max_life= 0, 
                                 litter_size= -999, 
                                 gestation= -999, 
                                 mass= -999)) %>% 
miss_var_summary()
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <num>
##  1 wean_mass      1039    72.2 
##  2 max_life        841    58.4 
##  3 litters_year    689    47.8 
##  4 weaning         619    43.0 
##  5 afr             607    42.2 
##  6 newborn         595    41.3 
##  7 gestation       418    29.0 
##  8 mass             85     5.90
##  9 litter_size      84     5.83
## 10 order             0     0   
## 11 family            0     0   
## 12 genus             0     0   
## 13 species           0     0

You can also use naniar to replace a specific value (like -999) with NA across the entire data set.

life_history %>% #going back to the original data
  replace_with_na_all(condition = ~.x == -999)%>% 
miss_var_summary()
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <num>
##  1 wean_mass      1039    72.2 
##  2 litters_year    689    47.8 
##  3 weaning         619    43.0 
##  4 afr             607    42.2 
##  5 gestation       418    29.0 
##  6 mass             85     5.90
##  7 litter_size      84     5.83
##  8 order             0     0   
##  9 family            0     0   
## 10 genus             0     0   
## 11 species           0     0   
## 12 newborn           0     0   
## 13 max_life          0     0

Finally, naniar has some built-in examples of common values or character strings used to represent NA’s. The chunk below will use these built-in parameters to replace NA’s across the entire data set.

common_na_strings
##  [1] "missing" "NA"      "N A"     "N/A"     "#N/A"    "NA "     " NA"    
##  [8] "N /A"    "N / A"   " N / A"  "N / A "  "na"      "n a"     "n/a"    
## [15] "na "     " na"     "n /a"    "n / a"   " a / a"  "n / a "  "NULL"   
## [22] "null"    ""        "\\?"     "\\*"     "\\."
common_na_numbers
## [1]    -9   -99  -999 -9999  9999    66    77    88
life_history %>% #going back to the original data
  replace_with_na_all(condition = ~.x %in% c(common_na_strings, common_na_numbers)) %>% 
  mutate(newborn=na_if(newborn, "not measured"))
## # A tibble: 1,440 × 13
##    order   family genus species   mass gestation newborn weaning wean_mass   afr
##    <chr>   <chr>  <chr> <chr>    <dbl>     <dbl> <chr>     <dbl>     <dbl> <dbl>
##  1 Artiod… Antil… Anti… americ… 4.54e4      8.13 3246.36    3         8900  13.5
##  2 Artiod… Bovid… Addax nasoma… 1.82e5      9.39 5480       6.5         NA  27.3
##  3 Artiod… Bovid… Aepy… melamp… 4.15e4      6.35 5093       5.63     15900  16.7
##  4 Artiod… Bovid… Alce… busela… 1.5 e5      7.9  10166.…    6.5         NA  23.0
##  5 Artiod… Bovid… Ammo… clarkei 2.85e4      6.8  <NA>      NA           NA  NA  
##  6 Artiod… Bovid… Ammo… lervia  5.55e4      5.08 3810       4           NA  14.9
##  7 Artiod… Bovid… Anti… marsup… 3   e4      5.72 3910       4.04        NA  10.2
##  8 Artiod… Bovid… Anti… cervic… 3.75e4      5.5  3846       2.13        NA  20.1
##  9 Artiod… Bovid… Bison bison   4.98e5      8.93 20000     10.7     157500  29.4
## 10 Artiod… Bovid… Bison bonasus 5   e5      9.14 23000.…    6.6         NA  30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>

Practice

Let’s practice evaluating NA’s in a large data set. The data are compiled from CITES. This is the international organization that tracks trade in endangered wildlife. You can find information about the data here.

Some key information:
country codes

  1. Import the data and do a little exploration. Be sure to clean the names if necessary.
cites <- read_csv("data/cites.csv") %>% clean_names()
## Rows: 67161 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): App., Taxon, Class, Order, Family, Genus, Importer, Exporter, Orig...
## dbl  (3): Year, Importer reported quantity, Exporter reported quantity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Use naniar to summarize the NA’s in each variable.
miss_var_summary(cites)
## # A tibble: 16 × 3
##    variable                   n_miss pct_miss
##    <chr>                       <int>    <num>
##  1 unit                        60759  90.5   
##  2 origin                      41518  61.8   
##  3 importer_reported_quantity  35295  52.6   
##  4 exporter_reported_quantity  23140  34.5   
##  5 class                       20224  30.1   
##  6 purpose                      6059   9.02  
##  7 genus                        1459   2.17  
##  8 exporter                      573   0.853 
##  9 source                        544   0.810 
## 10 family                        461   0.686 
## 11 importer                       71   0.106 
## 12 order                          57   0.0849
## 13 year                            0   0     
## 14 app                             0   0     
## 15 taxon                           0   0     
## 16 term                            0   0
  1. Try using group_by() with naniar. Look specifically at class and exporter_reported_quantity. For which taxonomic classes do we have the highest number of missing export data?
cites %>%
  select(class, exporter_reported_quantity) %>% 
  group_by(class) %>%
  miss_var_summary() %>% 
  arrange(desc(n_miss))
## # A tibble: 17 × 4
## # Groups:   class [17]
##    class          variable                   n_miss pct_miss
##    <chr>          <chr>                       <int>    <num>
##  1 <NA>           exporter_reported_quantity   7002     34.6
##  2 Reptilia       exporter_reported_quantity   5323     28.9
##  3 Anthozoa       exporter_reported_quantity   3858     43.9
##  4 Mammalia       exporter_reported_quantity   3731     43.9
##  5 Aves           exporter_reported_quantity   1792     26.1
##  6 Actinopteri    exporter_reported_quantity    726     26.3
##  7 Amphibia       exporter_reported_quantity    190     45.2
##  8 Bivalvia       exporter_reported_quantity    165     61.3
##  9 Gastropoda     exporter_reported_quantity    104     54.5
## 10 Insecta        exporter_reported_quantity     74     23.9
## 11 Hydrozoa       exporter_reported_quantity     61     33.7
## 12 Elasmobranchii exporter_reported_quantity     58     51.3
## 13 Arachnida      exporter_reported_quantity     32     47.8
## 14 Hirudinoidea   exporter_reported_quantity     11     32.4
## 15 Holothuroidea  exporter_reported_quantity     10    100  
## 16 Dipneusti      exporter_reported_quantity      3     75  
## 17 Coelacanthi    exporter_reported_quantity      0      0

Dealing with NA’s in advance- nuclear option

Once you are sure that you know how NA’s are treated in the data, then you can deal with them in advance using na() as part of the readr package. Notice that you still need to deal with some variables seperately. For example, you don’t want to replace all 0’s with NA, just the max_life variable.

read_csv(file = "data/mammal_lifehistories_v3.csv", na = c("NA", " ", ".", "-999", "not measured")) %>% clean_names()
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): order, family, Genus, species
## dbl (9): mass, gestation, newborn, weaning, wean mass, AFR, max. life, litte...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1,440 × 13
##    order   family genus species   mass gestation newborn weaning wean_mass   afr
##    <chr>   <chr>  <chr> <chr>    <dbl>     <dbl>   <dbl>   <dbl>     <dbl> <dbl>
##  1 Artiod… Antil… Anti… americ… 4.54e4      8.13   3246.    3         8900  13.5
##  2 Artiod… Bovid… Addax nasoma… 1.82e5      9.39   5480     6.5         NA  27.3
##  3 Artiod… Bovid… Aepy… melamp… 4.15e4      6.35   5093     5.63     15900  16.7
##  4 Artiod… Bovid… Alce… busela… 1.5 e5      7.9   10167.    6.5         NA  23.0
##  5 Artiod… Bovid… Ammo… clarkei 2.85e4      6.8      NA    NA           NA  NA  
##  6 Artiod… Bovid… Ammo… lervia  5.55e4      5.08   3810     4           NA  14.9
##  7 Artiod… Bovid… Anti… marsup… 3   e4      5.72   3910     4.04        NA  10.2
##  8 Artiod… Bovid… Anti… cervic… 3.75e4      5.5    3846     2.13        NA  20.1
##  9 Artiod… Bovid… Bison bison   4.98e5      8.93  20000    10.7     157500  29.4
## 10 Artiod… Bovid… Bison bonasus 5   e5      9.14  23000.    6.6         NA  30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>

That’s it! Let’s take a break and then move on to part 2!

–>Home