Lab 13.2

Learning Goals

At the end of this exercise, you will be able to:
1. Join data frames in R.
2. Format dates in a data frame.

Load the libraries

library("tidyverse")
library("janitor")
library("lubridate") #this will help us manage dates

Joining Files

Sometimes data we need is stored in a separate file or becomes available later and we need to join it to our existing data in order to work with it. Let’s use an easy example to explore some of the different types of joins. This section was based on the excellent example provided by Holly Emblem.

Let’s assume we have a database of customers and their purchases.

table_A <- read_csv("data/table_A.csv")
table_B <- read_csv("data/table_B.csv")

table_A

## # A tibble: 3 × 5
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>          
## 1           1        5                500              100 12/7/20        
## 2           2       10                240               24 1/2/18         
## 3           3        4                 40               10 3/6/20

table_B

## # A tibble: 3 × 4
##   customer_ID   age gender favorite
##         <dbl> <dbl> <chr>  <chr>   
## 1           1    50 M      t_shirts
## 2           2    44 F      leggings
## 3           4    30 F      jumpers

The general syntax for joins is…

#join_type(firstTable, secondTable, by=columnTojoinOn)

inner_join
Where did customers 3 and 4 go? Since they are not on both tables, they were dropped. So, inner_join only keeps the rows that have matching values in both tables; it is very strict about matching.

inner_join(table_A, table_B, by="customer_ID")

## # A tibble: 2 × 8
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date   age
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>           <dbl>
## 1           1        5                500              100 12/7/20            50
## 2           2       10                240               24 1/2/18             44
## # ℹ 2 more variables: gender <chr>, favorite <chr>

left_join Where did customer 4 go? Notice the missing data for customer 3. So, left_join keeps all the rows in the first table and only the matching rows in the second table. If there is no match, it fills in with NA.

left_join(table_A, table_B, by="customer_ID")

## # A tibble: 3 × 8
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date   age
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>           <dbl>
## 1           1        5                500              100 12/7/20            50
## 2           2       10                240               24 1/2/18             44
## 3           3        4                 40               10 3/6/20             NA
## # ℹ 2 more variables: gender <chr>, favorite <chr>

right_join Where did customer 3 go? Notice the missing data for customer 4. right_join does the opposite of left_join. It keeps all the rows in the second table and only the matching rows in the first table. If there is no match, it fills in with NA.

right_join(table_A, table_B, by="customer_ID")

## # A tibble: 3 × 8
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date   age
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>           <dbl>
## 1           1        5                500              100 12/7/20            50
## 2           2       10                240               24 1/2/18             44
## 3           4       NA                 NA               NA <NA>               30
## # ℹ 2 more variables: gender <chr>, favorite <chr>

full_join full_join keeps all the rows in both tables and fills in with NA where there is no match.

full_join(table_A, table_B, by="customer_ID")

## # A tibble: 4 × 8
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date   age
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>           <dbl>
## 1           1        5                500              100 12/7/20            50
## 2           2       10                240               24 1/2/18             44
## 3           3        4                 40               10 3/6/20             NA
## 4           4       NA                 NA               NA <NA>               30
## # ℹ 2 more variables: gender <chr>, favorite <chr>

anti_join Provides the rows in the first table for which there are not matching values in the second table.

anti_join(table_A, table_B, by="customer_ID") #use to troubleshoot problematic joins

## # A tibble: 1 × 5
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>          
## 1           3        4                 40               10 3/6/20

Dates

The lubridate package was created to deal with dates of all types. There are many useful functions in lubridate, most of which are designed to make working with dates easier. For example, you can easily get the current date and time with today() and now().

today()

## [1] "2026-03-09"

now()

## [1] "2026-03-09 10:17:56 PDT"

There are many different options for working with dates and date-times in lubridate including ymd, mdy, dmy, mdy_hms, ymd_hms. The formats below are common examples of how you might see dates in your data.

dateformat1 <- "20200922"
dateformat2 <- "09-22-2020"
dateformat3 <- "22/09/2020"
dateformat4 <- "09-22-2020 17:00:00"
dateformat5 <- "20200922 170000"

Converting dates

We can convert each of the examples above using lubridate.

ymd(dateformat1)

## [1] "2020-09-22"

mdy(dateformat2)

## [1] "2020-09-22"

dmy(dateformat3)

## [1] "2020-09-22"

mdy_hms(dateformat4)

## [1] "2020-09-22 17:00:00 UTC"

ymd_hms(dateformat5)

## [1] "2020-09-22 17:00:00 UTC"

Practice

For this practice, we will use data from my research on Usofila spiders. These spiders live mostly in caves and the data were collected by building a database of all the known museum specimens in the US. There are two files: 1. spider_records.csv which contains all the records of spiders and 2. spiders_locations.csv which contains the lat/long for each record. An accession number is a unique identifier for each specimen.

Load the two files as separate dataframes. Make sure to clean the column names.

spider_records <- read_csv("data/spider_records.csv") %>% clean_names()

spider_locs <- read_csv("data/spiders_locations.csv") %>% clean_names()

Join the files using an appropriate key and join type.

spiders_with_locs <- 
  left_join(spider_records, spider_locs, by="accession")

names(spiders_with_locs) #check to see if it worked

##  [1] "accession" "family"    "genus"     "country"   "state"     "county"   
##  [7] "locality"  "date"      "collector" "latitude"  "longitude"

Convert the date column to a date format using lubridate.

spiders_with_locs <-spiders_with_locs %>% 
  mutate(date=dmy(date))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `date = dmy(date)`.
## Caused by warning:
## !  2 failed to parse.

Save your final data to a .csv.

write.csv(spiders_with_locs, file="spiders_with_locs.csv", row.names=FALSE)

That’s it, let’s take a break!

–>Home