Learning Goals

At the end of this exercise, you will be able to:
1. Join data frames in R.
2. Format dates in a data frame.

Load the libraries

library("tidyverse")
library("janitor")
library("lubridate") #this will help us manage dates

Data

For this tutorial, we will use data from my research on Usofila spiders. These spiders live mostly in caves and the data were collected by building a database of all the known museum specimens in the US.

spider_records <- read_csv("data/spider_records.csv")
glimpse(spider_records) #let's have a quick look
## Rows: 270
## Columns: 9
## $ Accession <dbl> 9038521, 9038522, 9038523, 9038524, 9038525, 9038526, 903852…
## $ Family    <chr> "Telemidae", "Telemidae", "Telemidae", "Telemidae", "Telemid…
## $ Genus     <chr> "Usofila", "Usofila", "Usofila", "Usofila", "Usofila", "Usof…
## $ Country   <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA…
## $ State     <chr> "California", "California", "California", "California", "Cal…
## $ County    <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Alam…
## $ Locality  <chr> "Berkeley", "Castro Valley", "Niles, off Niles Cyn. on Palom…
## $ Date      <chr> "2/3/1958", "24/3/1941", "2/1/1964", "18/2/1953", "25/1/1953…
## $ Collector <chr> "LM Smith", "WM Pearce", "V Roth", "WG Bentinck", "R Schuste…

Joining Files

Sometimes data we need is stored in a separate file or becomes available later and we need to join it to our existing data in order to work with it. Let’s use an easy example to explore some of the different types of joins. This section was based on the excellent example provided by Holly Emblem.

Let’s assume we have a database of customers and their purchases.

table_A <- read_csv("data/table_A.csv")
## Rows: 3 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): last_order_date
## dbl (4): customer_ID, n_orders, total_value_orders, ave_value_orders
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
table_B <- read_csv("data/table_B.csv")
## Rows: 3 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gender, favorite
## dbl (2): customer_ID, age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
table_A
## # A tibble: 3 × 5
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>          
## 1           1        5                500              100 12/7/20        
## 2           2       10                240               24 1/2/18         
## 3           3        4                 40               10 3/6/20
table_B
## # A tibble: 3 × 4
##   customer_ID   age gender favorite
##         <dbl> <dbl> <chr>  <chr>   
## 1           1    50 M      t_shirts
## 2           2    44 F      leggings
## 3           4    30 F      jumpers

The general syntax for joins is…

#join_type(firstTable, secondTable, by=columnTojoinOn)

inner_join
Where did customers 3 and 4 go? Since they are not on both tables, they were dropped.

inner_join(table_A, table_B, by="customer_ID")
## # A tibble: 2 × 8
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date   age
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>           <dbl>
## 1           1        5                500              100 12/7/20            50
## 2           2       10                240               24 1/2/18             44
## # ℹ 2 more variables: gender <chr>, favorite <chr>

left_join Where did customer 4 go? Notice the missing data for customer 3.

left_join(table_A, table_B, by="customer_ID")
## # A tibble: 3 × 8
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date   age
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>           <dbl>
## 1           1        5                500              100 12/7/20            50
## 2           2       10                240               24 1/2/18             44
## 3           3        4                 40               10 3/6/20             NA
## # ℹ 2 more variables: gender <chr>, favorite <chr>

right_join Where did customer 3 go? Notice the missing data for customer 4.

right_join(table_A, table_B, by="customer_ID")
## # A tibble: 3 × 8
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date   age
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>           <dbl>
## 1           1        5                500              100 12/7/20            50
## 2           2       10                240               24 1/2/18             44
## 3           4       NA                 NA               NA <NA>               30
## # ℹ 2 more variables: gender <chr>, favorite <chr>

full_join

full_join(table_A, table_B, by="customer_ID")
## # A tibble: 4 × 8
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date   age
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>           <dbl>
## 1           1        5                500              100 12/7/20            50
## 2           2       10                240               24 1/2/18             44
## 3           3        4                 40               10 3/6/20             NA
## 4           4       NA                 NA               NA <NA>               30
## # ℹ 2 more variables: gender <chr>, favorite <chr>

anti_join Provides the rows in the first table for which there are not matching values in the second table.

anti_join(table_A, table_B, by="customer_ID")
## # A tibble: 1 × 5
##   customer_ID n_orders total_value_orders ave_value_orders last_order_date
##         <dbl>    <dbl>              <dbl>            <dbl> <chr>          
## 1           3        4                 40               10 3/6/20

Joining the spider data

Here, the latitude and longitude for each spider was recorded from the records at a later date, and now we need to join it to our spiders_records data frame. The lat/long were recorded into one single file for each observation. Let’s read in the lat/long data.

spiders_locs <- read_csv("data/spiders_locations.csv")

We will use a join here to merge lat/long to our data frame. Both files contain a unique identifier called Accession which we will use to join.

Practice

  1. Which type of join is most appropriate in this case?
spiders_with_locs <- 
  left_join(spider_records, spiders_locs, by = "Accession")
names(spiders_with_locs) #make sure it worked
##  [1] "Accession" "Family"    "Genus"     "Country"   "State"     "County"   
##  [7] "Locality"  "Date"      "Collector" "Latitude"  "Longitude"

As a side note, joining data can highlight problems or typos with the data when the join does not go as expected.

Formatting Dates

Did you notice the date column? It seems to be in the format Day/Month/Year and is of class character. We need to change this to a date format that R will recognize.

glimpse(spiders_with_locs)
## Rows: 270
## Columns: 11
## $ Accession <dbl> 9038521, 9038522, 9038523, 9038524, 9038525, 9038526, 903852…
## $ Family    <chr> "Telemidae", "Telemidae", "Telemidae", "Telemidae", "Telemid…
## $ Genus     <chr> "Usofila", "Usofila", "Usofila", "Usofila", "Usofila", "Usof…
## $ Country   <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA…
## $ State     <chr> "California", "California", "California", "California", "Cal…
## $ County    <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Alam…
## $ Locality  <chr> "Berkeley", "Castro Valley", "Niles, off Niles Cyn. on Palom…
## $ Date      <chr> "2/3/1958", "24/3/1941", "2/1/1964", "18/2/1953", "25/1/1953…
## $ Collector <chr> "LM Smith", "WM Pearce", "V Roth", "WG Bentinck", "R Schuste…
## $ Latitude  <dbl> 37.87, 37.73, 37.60, 37.80, 37.80, 37.80, 37.80, 37.82, 37.8…
## $ Longitude <dbl> -122.24, -122.07, -121.95, -122.16, -122.16, -122.16, -122.1…

The lubridate package was created to deal with dates of all types. There are many useful functions in lubridate for working with dates. Please see the excellent tutorial provided by RichardOnData.

day <- today()
day
## [1] "2025-02-25"
str(day)
##  Date[1:1], format: "2025-02-25"
datetime <- now()
datetime
## [1] "2025-02-25 14:34:16 PST"

We want to change our date column to the standard “YEAR-MO-DA” format that R will recognize as a date. Right now it is listed as day-month-year. We can use the lubridate function dmy for this.

spiders_with_locs <- spiders_with_locs %>% 
  mutate(Date = dmy(Date))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Date = dmy(Date)`.
## Caused by warning:
## !  2 failed to parse.

There are many different options for working with dates and datetimes in lubridate including ymd, mdy, dmy, mdy_hms, ymd_hms.

dateformat1 <- "20200922"
dateformat2 <- "09-22-2020"
dateformat3 <- "22/09/2020"
dateformat4 <- "09-22-2020 17:00:00"
dateformat5 <- "20200922 170000"

Practice

  1. Convert each of the examples above using lubridate.
ymd(dateformat1)
## [1] "2020-09-22"
mdy(dateformat2)
## [1] "2020-09-22"
dmy(dateformat3)
## [1] "2020-09-22"
mdy_hms(dateformat4)
## [1] "2020-09-22 17:00:00 UTC"
ymd_hms(dateformat5)
## [1] "2020-09-22 17:00:00 UTC"

Let’s save our final data to a .csv.

write.csv(spiders_with_locs, file = "spiders_with_locs.csv", row.names = FALSE)

That’s it, let’s take a break!

–>Home