At the end of this exercise, you will be able to:
1. Join data frames in R.
2. Format dates in a data frame.
library("tidyverse")
library("janitor")
library("lubridate") #this will help us manage dates
Sometimes data we need is stored in a separate file or becomes available later and we need to join it to our existing data in order to work with it. Let’s use an easy example to explore some of the different types of joins. This section was based on the excellent example provided by Holly Emblem.
Let’s assume we have a database of customers and their purchases.
table_A <- read_csv("data/table_A.csv")
table_B <- read_csv("data/table_B.csv")
table_A
## # A tibble: 3 × 5
## customer_ID n_orders total_value_orders ave_value_orders last_order_date
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 5 500 100 12/7/20
## 2 2 10 240 24 1/2/18
## 3 3 4 40 10 3/6/20
table_B
## # A tibble: 3 × 4
## customer_ID age gender favorite
## <dbl> <dbl> <chr> <chr>
## 1 1 50 M t_shirts
## 2 2 44 F leggings
## 3 4 30 F jumpers
The general syntax for joins is…
#join_type(firstTable, secondTable, by=columnTojoinOn)
inner_join
Where did customers 3 and 4 go? Since they are not on both tables, they
were dropped. So, inner_join only keeps the rows that have
matching values in both tables; it is very strict about matching.
inner_join(table_A, table_B, by="customer_ID")
## # A tibble: 2 × 8
## customer_ID n_orders total_value_orders ave_value_orders last_order_date age
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 5 500 100 12/7/20 50
## 2 2 10 240 24 1/2/18 44
## # ℹ 2 more variables: gender <chr>, favorite <chr>
left_join Where did customer 4 go? Notice the missing
data for customer 3. So, left_join keeps all the rows in
the first table and only the matching rows in the second table. If there
is no match, it fills in with NA.
left_join(table_A, table_B, by="customer_ID")
## # A tibble: 3 × 8
## customer_ID n_orders total_value_orders ave_value_orders last_order_date age
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 5 500 100 12/7/20 50
## 2 2 10 240 24 1/2/18 44
## 3 3 4 40 10 3/6/20 NA
## # ℹ 2 more variables: gender <chr>, favorite <chr>
right_join Where did customer 3 go? Notice the missing
data for customer 4. right_join does the opposite of
left_join. It keeps all the rows in the second table and
only the matching rows in the first table. If there is no match, it
fills in with NA.
right_join(table_A, table_B, by="customer_ID")
## # A tibble: 3 × 8
## customer_ID n_orders total_value_orders ave_value_orders last_order_date age
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 5 500 100 12/7/20 50
## 2 2 10 240 24 1/2/18 44
## 3 4 NA NA NA <NA> 30
## # ℹ 2 more variables: gender <chr>, favorite <chr>
full_join full_join keeps all the rows in
both tables and fills in with NA where there is no match.
full_join(table_A, table_B, by="customer_ID")
## # A tibble: 4 × 8
## customer_ID n_orders total_value_orders ave_value_orders last_order_date age
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 5 500 100 12/7/20 50
## 2 2 10 240 24 1/2/18 44
## 3 3 4 40 10 3/6/20 NA
## 4 4 NA NA NA <NA> 30
## # ℹ 2 more variables: gender <chr>, favorite <chr>
anti_join Provides the rows in the first table for which
there are not matching values in the second table.
anti_join(table_A, table_B, by="customer_ID") #use to troubleshoot problematic joins
## # A tibble: 1 × 5
## customer_ID n_orders total_value_orders ave_value_orders last_order_date
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 3 4 40 10 3/6/20
The lubridate package was created to deal with dates of
all types. There are many useful functions in lubridate,
most of which are designed to make working with dates easier. For
example, you can easily get the current date and time with
today() and now().
today()
## [1] "2026-03-09"
now()
## [1] "2026-03-09 10:17:56 PDT"
There are many different options for working with dates and
date-times in lubridate including ymd, mdy,
dmy, mdy_hms, ymd_hms. The
formats below are common examples of how you might see dates in your
data.
dateformat1 <- "20200922"
dateformat2 <- "09-22-2020"
dateformat3 <- "22/09/2020"
dateformat4 <- "09-22-2020 17:00:00"
dateformat5 <- "20200922 170000"
We can convert each of the examples above using lubridate.
ymd(dateformat1)
## [1] "2020-09-22"
mdy(dateformat2)
## [1] "2020-09-22"
dmy(dateformat3)
## [1] "2020-09-22"
mdy_hms(dateformat4)
## [1] "2020-09-22 17:00:00 UTC"
ymd_hms(dateformat5)
## [1] "2020-09-22 17:00:00 UTC"
For this practice, we will use data from my research on Usofila spiders. These spiders live mostly in caves and the data were collected by building a database of all the known museum specimens in the US. There are two files: 1. spider_records.csv which contains all the records of spiders and 2. spiders_locations.csv which contains the lat/long for each record. An accession number is a unique identifier for each specimen.
spider_records <- read_csv("data/spider_records.csv") %>% clean_names()
spider_locs <- read_csv("data/spiders_locations.csv") %>% clean_names()
spiders_with_locs <-
left_join(spider_records, spider_locs, by="accession")
names(spiders_with_locs) #check to see if it worked
## [1] "accession" "family" "genus" "country" "state" "county"
## [7] "locality" "date" "collector" "latitude" "longitude"
spiders_with_locs <-spiders_with_locs %>%
mutate(date=dmy(date))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `date = dmy(date)`.
## Caused by warning:
## ! 2 failed to parse.
.csv.write.csv(spiders_with_locs, file="spiders_with_locs.csv", row.names=FALSE)
–>Home