At the end of this exercise, you will be able to:
1. Join data frames in R.
2. Format dates in a data frame.
library("tidyverse")
library("janitor")
library("lubridate") #this will help us manage dates
For this tutorial, we will use data from my research on Usofila spiders. These spiders live mostly in caves and the data were collected by building a database of all the known museum specimens in the US.
spider_records <- read_csv("data/spider_records.csv")
glimpse(spider_records) #let's have a quick look
## Rows: 270
## Columns: 9
## $ Accession <dbl> 9038521, 9038522, 9038523, 9038524, 9038525, 9038526, 903852…
## $ Family <chr> "Telemidae", "Telemidae", "Telemidae", "Telemidae", "Telemid…
## $ Genus <chr> "Usofila", "Usofila", "Usofila", "Usofila", "Usofila", "Usof…
## $ Country <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA…
## $ State <chr> "California", "California", "California", "California", "Cal…
## $ County <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Alam…
## $ Locality <chr> "Berkeley", "Castro Valley", "Niles, off Niles Cyn. on Palom…
## $ Date <chr> "2/3/1958", "24/3/1941", "2/1/1964", "18/2/1953", "25/1/1953…
## $ Collector <chr> "LM Smith", "WM Pearce", "V Roth", "WG Bentinck", "R Schuste…
Sometimes data we need is stored in a separate file or becomes available later and we need to join it to our existing data in order to work with it. Let’s use an easy example to explore some of the different types of joins. This section was based on the excellent example provided by Holly Emblem.
Let’s assume we have a database of customers and their purchases.
table_A <- read_csv("data/table_A.csv")
## Rows: 3 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): last_order_date
## dbl (4): customer_ID, n_orders, total_value_orders, ave_value_orders
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
table_B <- read_csv("data/table_B.csv")
## Rows: 3 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gender, favorite
## dbl (2): customer_ID, age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
table_A
## # A tibble: 3 × 5
## customer_ID n_orders total_value_orders ave_value_orders last_order_date
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 5 500 100 12/7/20
## 2 2 10 240 24 1/2/18
## 3 3 4 40 10 3/6/20
table_B
## # A tibble: 3 × 4
## customer_ID age gender favorite
## <dbl> <dbl> <chr> <chr>
## 1 1 50 M t_shirts
## 2 2 44 F leggings
## 3 4 30 F jumpers
The general syntax for joins is…
#join_type(firstTable, secondTable, by=columnTojoinOn)
inner_join
Where did customers 3 and 4 go? Since they are not on both tables, they
were dropped.
inner_join(table_A, table_B, by="customer_ID")
## # A tibble: 2 × 8
## customer_ID n_orders total_value_orders ave_value_orders last_order_date age
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 5 500 100 12/7/20 50
## 2 2 10 240 24 1/2/18 44
## # ℹ 2 more variables: gender <chr>, favorite <chr>
left_join
Where did customer 4 go? Notice the missing
data for customer 3.
left_join(table_A, table_B, by="customer_ID")
## # A tibble: 3 × 8
## customer_ID n_orders total_value_orders ave_value_orders last_order_date age
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 5 500 100 12/7/20 50
## 2 2 10 240 24 1/2/18 44
## 3 3 4 40 10 3/6/20 NA
## # ℹ 2 more variables: gender <chr>, favorite <chr>
right_join
Where did customer 3 go? Notice the missing
data for customer 4.
right_join(table_A, table_B, by="customer_ID")
## # A tibble: 3 × 8
## customer_ID n_orders total_value_orders ave_value_orders last_order_date age
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 5 500 100 12/7/20 50
## 2 2 10 240 24 1/2/18 44
## 3 4 NA NA NA <NA> 30
## # ℹ 2 more variables: gender <chr>, favorite <chr>
full_join
full_join(table_A, table_B, by="customer_ID")
## # A tibble: 4 × 8
## customer_ID n_orders total_value_orders ave_value_orders last_order_date age
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 5 500 100 12/7/20 50
## 2 2 10 240 24 1/2/18 44
## 3 3 4 40 10 3/6/20 NA
## 4 4 NA NA NA <NA> 30
## # ℹ 2 more variables: gender <chr>, favorite <chr>
anti_join
Provides the rows in the first table for which
there are not matching values in the second table.
anti_join(table_A, table_B, by="customer_ID")
## # A tibble: 1 × 5
## customer_ID n_orders total_value_orders ave_value_orders last_order_date
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 3 4 40 10 3/6/20
Here, the latitude and longitude for each spider was recorded from
the records at a later date, and now we need to join it to our
spiders_records
data frame. The lat/long were recorded into
one single file for each observation. Let’s read in the lat/long
data.
spiders_locs <- read_csv("data/spiders_locations.csv")
We will use a join here to merge lat/long to our data frame. Both
files contain a unique identifier called Accession
which we
will use to join.
spiders_with_locs <-
left_join(spider_records, spiders_locs, by = "Accession")
names(spiders_with_locs) #make sure it worked
## [1] "Accession" "Family" "Genus" "Country" "State" "County"
## [7] "Locality" "Date" "Collector" "Latitude" "Longitude"
As a side note, joining data can highlight problems or typos with the data when the join does not go as expected.
Did you notice the date column? It seems to be in the format Day/Month/Year and is of class character. We need to change this to a date format that R will recognize.
glimpse(spiders_with_locs)
## Rows: 270
## Columns: 11
## $ Accession <dbl> 9038521, 9038522, 9038523, 9038524, 9038525, 9038526, 903852…
## $ Family <chr> "Telemidae", "Telemidae", "Telemidae", "Telemidae", "Telemid…
## $ Genus <chr> "Usofila", "Usofila", "Usofila", "Usofila", "Usofila", "Usof…
## $ Country <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA…
## $ State <chr> "California", "California", "California", "California", "Cal…
## $ County <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Alam…
## $ Locality <chr> "Berkeley", "Castro Valley", "Niles, off Niles Cyn. on Palom…
## $ Date <chr> "2/3/1958", "24/3/1941", "2/1/1964", "18/2/1953", "25/1/1953…
## $ Collector <chr> "LM Smith", "WM Pearce", "V Roth", "WG Bentinck", "R Schuste…
## $ Latitude <dbl> 37.87, 37.73, 37.60, 37.80, 37.80, 37.80, 37.80, 37.82, 37.8…
## $ Longitude <dbl> -122.24, -122.07, -121.95, -122.16, -122.16, -122.16, -122.1…
The lubridate
package was created to deal with dates of
all types. There are many useful functions in lubridate
for
working with dates. Please see the excellent tutorial provided by RichardOnData.
day <- today()
day
## [1] "2025-02-25"
str(day)
## Date[1:1], format: "2025-02-25"
datetime <- now()
datetime
## [1] "2025-02-25 14:34:16 PST"
We want to change our date column to the standard “YEAR-MO-DA” format
that R will recognize as a date. Right now it is listed as
day-month-year. We can use the lubridate function dmy
for
this.
spiders_with_locs <- spiders_with_locs %>%
mutate(Date = dmy(Date))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Date = dmy(Date)`.
## Caused by warning:
## ! 2 failed to parse.
There are many different options for working with dates and datetimes
in lubridate including ymd
, mdy
,
dmy
, mdy_hms
, ymd_hms
.
dateformat1 <- "20200922"
dateformat2 <- "09-22-2020"
dateformat3 <- "22/09/2020"
dateformat4 <- "09-22-2020 17:00:00"
dateformat5 <- "20200922 170000"
ymd(dateformat1)
## [1] "2020-09-22"
mdy(dateformat2)
## [1] "2020-09-22"
dmy(dateformat3)
## [1] "2020-09-22"
mdy_hms(dateformat4)
## [1] "2020-09-22 17:00:00 UTC"
ymd_hms(dateformat5)
## [1] "2020-09-22 17:00:00 UTC"
Let’s save our final data to a .csv
.
write.csv(spiders_with_locs, file = "spiders_with_locs.csv", row.names = FALSE)
–>Home