At the end of this exercise, you will be able to:
1. Join data frames in R.
2. Format dates in a data frame.
library("tidyverse")
library("janitor")
library("lubridate") #this will help us manage dates
Sometimes data we need is stored in a separate file or becomes available later and we need to join it to our existing data in order to work with it. Let’s use an easy example to explore some of the different types of joins. This section was based on the excellent example provided by Holly Emblem.
Let’s assume we have a database of customers and their purchases.
The general syntax for joins is…
#join_type(firstTable, secondTable, by=columnTojoinOn)
inner_join
Where did customers 3 and 4 go? Since they are not on both tables, they
were dropped. So, inner_join only keeps the rows that have
matching values in both tables; it is very strict about matching.
left_join Where did customer 4 go? Notice the missing
data for customer 3. So, left_join keeps all the rows in
the first table and only the matching rows in the second table. If there
is no match, it fills in with NA.
right_join Where did customer 3 go? Notice the missing
data for customer 4. right_join does the opposite of
left_join. It keeps all the rows in the second table and
only the matching rows in the first table. If there is no match, it
fills in with NA.
full_join full_join keeps all the rows in
both tables and fills in with NA where there is no match.
anti_join Provides the rows in the first table for which
there are not matching values in the second table.
The lubridate package was created to deal with dates of
all types. There are many useful functions in lubridate,
most of which are designed to make working with dates easier. For
example, you can easily get the current date and time with
today() and now().
There are many different options for working with dates and
date-times in lubridate including ymd, mdy,
dmy, mdy_hms, ymd_hms. The
formats below are common examples of how you might see dates in your
data.
dateformat1 <- "20200922"
dateformat2 <- "09-22-2020"
dateformat3 <- "22/09/2020"
dateformat4 <- "09-22-2020 17:00:00"
dateformat5 <- "20200922 170000"
We can convert each of the examples above using lubridate.
For this practice, we will use data from my research on Usofila spiders. These spiders live mostly in caves and the data were collected by building a database of all the known museum specimens in the US. There are two files: 1. spider_records.csv which contains all the records of spiders and 2. spiders_locations.csv which contains the lat/long for each record. An accession number is a unique identifier for each specimen.
Load the two files as separate dataframes. Make sure to clean the column names.
Join the files using an appropriate key and join type.
Convert the date column to a date format using lubridate.
Save your final data to a .csv.
–>Home