Answer the following questions and/or complete the exercises in
RMarkdown. Please embed all of your code and push the final work to your
repository. Your report should be organized, clean, and run free from
errors. Remember, you must remove the # for any included
code chunks to run.
library("tidyverse")
library("janitor")
For this assignment, we will use data from a study on elephants and the effects of poaching on tusk size.
Reference: Chiyo, Patrick I., Vincent Obanda, and David K. Korir. “Illegal tusk harvest and the decline of tusk size in the African elephant.” Ecology and Evolution 5, 22: 5216–5229 (2015). Data deposited at Dryad Digital Repository.
1. Before starting data analysis, read the abstract of the paper to get an idea of the questions being asked. In 2-3 sentences, describe what the study is testing and the variables involved.
The study evaluates whether poaching has affected tusk size in African elephants. Because poachers disproportionately target large-tusked males, the authors hypothesize that average tusk size in the population has declined over time. They compare populations sampled before and after the onset of poaching.
2. Load elephants.csv and store it as a new
object called elephants.
elephants <- read_csv("data/elephants.csv")
## Rows: 777 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Years of sample collection, Elephant ID, Sex
## dbl (4): Estimated Age (years), shoulder Height in cm, Tusk Length in cm, T...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
3. Clean the data by converting variable names to lowercase with no spaces or special characters.
elephants <- elephants %>%
clean_names()
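To make the effect of clean_names() concrete, here is a minimal sketch on a small made-up tibble. The object `raw` and its values are hypothetical; only the three column headers are copied from the read_csv() message above.

```r
library(tibble)
library(janitor)

# Hypothetical two-row tibble; the headers mimic those reported by read_csv()
raw <- tibble(
  `Years of sample collection` = c("1966-68", "2005-13"),
  `Elephant ID` = c("12", "34"),
  `Estimated Age (years)` = c(0.08, 12)
)

# clean_names() lowercases the headers and replaces spaces and
# punctuation with underscores (snake_case by default)
cleaned <- clean_names(raw)
names(cleaned)
```

The resulting names should match the corresponding columns shown by glimpse() in the next step.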
4. Use one or more of the summary functions you have learned to get an idea of the structure of the data.
glimpse(elephants)
## Rows: 777
## Columns: 7
## $ years_of_sample_collection <chr> "1966-68", "1966-68", "1966-68", "1966-68",…
## $ elephant_id <chr> "12", "34", "162", "292", "11", "152", "264…
## $ sex <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f"…
## $ estimated_age_years <dbl> 0.080, 0.080, 0.083, 0.083, 0.250, 0.250, 0…
## $ shoulder_height_in_cm <dbl> 102, 89, 89, 92, 133, 100, 93, 108, 108, 12…
## $ tusk_length_in_cm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tusk_circumference_in_cm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
5. Use mutate() to change the variables
years_of_sample_collection, elephant_id, and
sex to factors. Be sure to store the output as a new
dataframe and use it for the remaining questions.
elephants <- elephants %>%
mutate(
years_of_sample_collection = as.factor(years_of_sample_collection),
elephant_id = as.factor(elephant_id),
sex = as.factor(sex))
glimpse(elephants)
## Rows: 777
## Columns: 7
## $ years_of_sample_collection <fct> 1966-68, 1966-68, 1966-68, 1966-68, 1966-68…
## $ elephant_id <fct> 12, 34, 162, 292, 11, 152, 264, 263, 266, 2…
## $ sex <fct> f, f, f, f, f, f, f, f, f, f, f, f, f, f, f…
## $ estimated_age_years <dbl> 0.080, 0.080, 0.083, 0.083, 0.250, 0.250, 0…
## $ shoulder_height_in_cm <dbl> 102, 89, 89, 92, 133, 100, 93, 108, 108, 12…
## $ tusk_length_in_cm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tusk_circumference_in_cm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
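As an alternative to listing each conversion separately, the same change can be sketched with across(), which applies one function to several columns at once. The tibble `toy` below is a hypothetical stand-in for the cleaned data, not the study data.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned elephants data
toy <- tibble(
  years_of_sample_collection = c("1966-68", "2005-13"),
  elephant_id = c("12", "34"),
  sex = c("f", "m"),
  estimated_age_years = c(0.08, 25)
)

# Convert the three identifier columns to factors in one step;
# the numeric column is left untouched
toy_factors <- toy %>%
  mutate(across(c(years_of_sample_collection, elephant_id, sex), as.factor))

sapply(toy_factors, class)
```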
6. From which years were data collected? Show the sample periods below.
elephants %>%
distinct(years_of_sample_collection)
## # A tibble: 2 × 1
## years_of_sample_collection
## <fct>
## 1 1966-68
## 2 2005-13
7. How many males and females were sampled in this study?
elephants %>%
count(sex)
## # A tibble: 2 × 2
## sex n
## <fct> <int>
## 1 f 416
## 2 m 361
8. What are the mean, median, and standard deviation of age for the males and females included in the study? Separate the results by year of sample collection. Does the sampling look even between years and sexes?
elephants %>%
filter(sex=="m", years_of_sample_collection=="1966-68") %>%
summarize(mean_age_m=mean(estimated_age_years, na.rm=TRUE),
median_age_m=median(estimated_age_years, na.rm=TRUE),
sd_age_m=sd(estimated_age_years, na.rm=TRUE),
n=n())
## # A tibble: 1 × 4
## mean_age_m median_age_m sd_age_m n
## <dbl> <dbl> <dbl> <int>
## 1 10.8 8 9.19 282
elephants %>%
filter(sex=="m", years_of_sample_collection=="2005-13") %>%
summarize(mean_age_m=mean(estimated_age_years, na.rm=TRUE),
median_age_m=median(estimated_age_years, na.rm=TRUE),
sd_age_m=sd(estimated_age_years, na.rm=TRUE),
n=n())
## # A tibble: 1 × 4
## mean_age_m median_age_m sd_age_m n
## <dbl> <dbl> <dbl> <int>
## 1 16.7 9 13.9 79
elephants %>%
filter(sex=="f", years_of_sample_collection=="1966-68") %>%
summarize(mean_age_f=mean(estimated_age_years, na.rm=TRUE),
median_age_f=median(estimated_age_years, na.rm=TRUE),
sd_age_f=sd(estimated_age_years, na.rm=TRUE),
n=n())
## # A tibble: 1 × 4
## mean_age_f median_age_f sd_age_f n
## <dbl> <dbl> <dbl> <int>
## 1 17.6 15 13.6 323
elephants %>%
filter(sex=="f", years_of_sample_collection=="2005-13") %>%
summarize(mean_age_f=mean(estimated_age_years, na.rm=TRUE),
median_age_f=median(estimated_age_years, na.rm=TRUE),
sd_age_f=sd(estimated_age_years, na.rm=TRUE),
n=n())
## # A tibble: 1 × 4
## mean_age_f median_age_f sd_age_f n
## <dbl> <dbl> <dbl> <int>
## 1 17.9 17.5 11.0 93
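The four filtered summaries can also be produced in one pipeline, which puts the group sizes side by side. This is a minimal sketch on a hypothetical tibble `toy` rather than the study data; the column names follow the cleaned dataset.

```r
library(dplyr)

# Hypothetical ages; one value is missing to show the effect of na.rm = TRUE
toy <- tibble(
  years_of_sample_collection = c("1966-68", "1966-68", "1966-68",
                                 "2005-13", "2005-13", "2005-13"),
  sex = c("f", "f", "m", "f", "m", "m"),
  estimated_age_years = c(10, 20, 8, 15, 9, NA)
)

# One row per period-by-sex combination
age_summary <- toy %>%
  group_by(years_of_sample_collection, sex) %>%
  summarize(
    mean_age = mean(estimated_age_years, na.rm = TRUE),
    median_age = median(estimated_age_years, na.rm = TRUE),
    sd_age = sd(estimated_age_years, na.rm = TRUE),
    n = n(),              # counts rows, including those with a missing age
    .groups = "drop"      # return an ungrouped tibble
  )

age_summary
```

In the outputs above, the 1966-68 period contributes 282 males and 323 females, against 79 males and 93 females in 2005-13, so the sampling is not even between periods.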
9. Is age (independent variable) a positive predictor of tusk length (dependent variable)? Create a plot that shows the relationship between these variables and add a linear model fit line.
elephants %>%
ggplot(aes(x=estimated_age_years, y=tusk_length_in_cm))+
geom_point()+
geom_smooth(method="lm")+
labs(x="Age (years)",
y="Tusk Length (cm)",
title="Relationship between Age and Tusk Length in Elephants")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 182 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 182 rows containing missing values or values outside the scale range
## (`geom_point()`).
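The sign of the relationship can also be read from the slope of the fitted model. geom_smooth(method="lm") reports using the formula y ~ x, so a matching lm() call gives the same line. The sketch below uses a hypothetical data frame `toy` with invented values, not the study data.

```r
# Hypothetical ages (years) and tusk lengths (cm), not the study data
toy <- data.frame(
  estimated_age_years = c(5, 10, 20, 30, 40),
  tusk_length_in_cm = c(30, 50, 90, 130, 170)
)

# Same model form as the plot: tusk length as a linear function of age
fit <- lm(tusk_length_in_cm ~ estimated_age_years, data = toy)

# A positive slope means tusk length increases with age
slope <- coef(fit)[["estimated_age_years"]]
slope > 0
```

On the real data, rows with a missing tusk length are dropped by lm() by default, much as the plot layers removed them with a warning.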
10. Is shoulder height (independent variable) a positive predictor of tusk length (dependent variable)? Create a plot that shows the relationship between these variables and add a linear model fit line.
Yes, shoulder height is a positive predictor of tusk length as shown by the upward trend in the scatter plot and the positive slope of the linear model fit line.
elephants %>%
ggplot(aes(x=shoulder_height_in_cm, y=tusk_length_in_cm))+
geom_point()+
geom_smooth(method="lm")+
labs(x="Shoulder Height (cm)",
y="Tusk Length (cm)",
title="Relationship between Shoulder Height and Tusk Length in Elephants")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 181 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 181 rows containing missing values or values outside the scale range
## (`geom_point()`).
11. The authors argue that because poachers preferentially target elephants with large tusks, this has resulted in a decrease in average tusk length. Is this supported by the data? Show your code and calculations below.
Females
elephants %>%
filter(years_of_sample_collection == "1966-68", sex == "f") %>%
summarize(mean_tusk_length_f_pre = mean(tusk_length_in_cm, na.rm = TRUE))
## # A tibble: 1 × 1
## mean_tusk_length_f_pre
## <dbl>
## 1 95.9
elephants %>%
filter(years_of_sample_collection == "2005-13", sex == "f") %>%
summarize(mean_tusk_length_f_post = mean(tusk_length_in_cm, na.rm = TRUE))
## # A tibble: 1 × 1
## mean_tusk_length_f_post
## <dbl>
## 1 71.2
Males
elephants %>%
filter(years_of_sample_collection == "1966-68", sex == "m") %>%
summarize(mean_tusk_length_m_pre = mean(tusk_length_in_cm, na.rm = TRUE))
## # A tibble: 1 × 1
## mean_tusk_length_m_pre
## <dbl>
## 1 98.0
elephants %>%
filter(years_of_sample_collection == "2005-13", sex == "m") %>%
summarize(mean_tusk_length_m_post = mean(tusk_length_in_cm, na.rm = TRUE))
## # A tibble: 1 × 1
## mean_tusk_length_m_post
## <dbl>
## 1 85.5
Using group_by()
elephants %>%
group_by(years_of_sample_collection, sex) %>%
summarize(mean_tusk_length = mean(tusk_length_in_cm, na.rm = TRUE)) %>%
arrange(sex)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by years_of_sample_collection and sex.
## ℹ Output is grouped by years_of_sample_collection.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(years_of_sample_collection, sex))` for per-operation
## grouping (`?dplyr::dplyr_by`) instead.
## # A tibble: 4 × 3
## # Groups: years_of_sample_collection [2]
## years_of_sample_collection sex mean_tusk_length
## <fct> <fct> <dbl>
## 1 1966-68 f 95.9
## 2 2005-13 f 71.2
## 3 1966-68 m 98.0
## 4 2005-13 m 85.5
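To put the difference between periods on a common scale, the change can be expressed as a percentage of the 1966-68 mean. The sketch below hard-codes the rounded means from the output above, so the results are approximate; the data frame `tusk_means` and its column names are introduced here for illustration.

```r
# Mean tusk lengths (cm), copied from the rounded output above
tusk_means <- data.frame(
  sex = c("f", "m"),
  pre = c(95.9, 98.0),   # 1966-68
  post = c(71.2, 85.5)   # 2005-13
)

# Absolute and relative change between sample periods
tusk_means$abs_change_cm <- tusk_means$post - tusk_means$pre
tusk_means$pct_change <- 100 * tusk_means$abs_change_cm / tusk_means$pre

tusk_means
```

These means pool elephants of all ages, and the age summaries above differ between periods, so the comparison is only a rough first look.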
12. Male elephants reach effective sexual maturity at 25 years while females are sexually mature at 12 years. Make a new dataframe that extracts only the males and females at sexual maturity. Then, make a plot that shows the range of tusk length between the two sample periods for these mature elephants.
mature_elephants <- elephants %>%
filter((estimated_age_years >= 25 & sex=="m") | (estimated_age_years>=12 & sex=="f"))
mature_elephants %>%
ggplot(aes(x=sex,
y=tusk_length_in_cm,
fill=years_of_sample_collection))+
geom_boxplot()+
labs(x="Sex",
y="Tusk Length (cm)",
title="Tusk Length of Mature Elephants by Sex")
## Warning: Removed 44 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
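The range shown by the boxplot can also be tabulated as a minimum and maximum per group. This is a minimal sketch on a hypothetical tibble `toy`; the maturity thresholds (25 years for males, 12 for females) come from the question, and the values are invented.

```r
library(dplyr)

# Hypothetical elephants, not the study data
toy <- tibble(
  years_of_sample_collection = c("1966-68", "1966-68", "2005-13",
                                 "2005-13", "1966-68", "2005-13"),
  sex = c("f", "f", "f", "m", "m", "m"),
  estimated_age_years = c(12, 30, 11, 25, 40, 24),
  tusk_length_in_cm = c(80, 120, 60, 150, 200, 140)
)

# Keep only sexually mature animals: males >= 25 years, females >= 12 years
mature <- toy %>%
  filter((sex == "m" & estimated_age_years >= 25) |
         (sex == "f" & estimated_age_years >= 12))

# Minimum and maximum tusk length per period and sex
tusk_range <- mature %>%
  group_by(years_of_sample_collection, sex) %>%
  summarize(
    min_tusk = min(tusk_length_in_cm, na.rm = TRUE),
    max_tusk = max(tusk_length_in_cm, na.rm = TRUE),
    .groups = "drop"
  )

tusk_range
```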
Please knit your work as an .html file and upload to Canvas. Homework is due before the start of the next lab. No late work is accepted. Make sure to use the formatting conventions of RMarkdown to make your report neat and clean!
elephants %>%
filter(years_of_sample_collection == "1966-68") %>%
group_by(sex) %>%
summarize(across(where(is.numeric),
~ mean(.x, na.rm = TRUE),
.names = "mean_{.col}"))
## # A tibble: 2 × 5
## sex mean_estimated_age_years mean_shoulder_height_i…¹ mean_tusk_length_in_cm
## <fct> <dbl> <dbl> <dbl>
## 1 f 17.6 206. 95.9
## 2 m 10.8 202. 98.0
## # ℹ abbreviated name: ¹mean_shoulder_height_in_cm
## # ℹ 1 more variable: mean_tusk_circumference_in_cm <dbl>
elephants %>%
filter(years_of_sample_collection == "2005-13") %>%
group_by(sex) %>%
summarize(across(where(is.numeric),
~ mean(.x, na.rm = TRUE),
.names = "mean_{.col}"))
## # A tibble: 2 × 5
## sex mean_estimated_age_years mean_shoulder_height_i…¹ mean_tusk_length_in_cm
## <fct> <dbl> <dbl> <dbl>
## 1 f 17.9 229. 71.2
## 2 m 16.7 233. 85.5
## # ℹ abbreviated name: ¹mean_shoulder_height_in_cm
## # ℹ 1 more variable: mean_tusk_circumference_in_cm <dbl>
elephants %>%
group_by(years_of_sample_collection, sex) %>%
summarize(across(where(is.numeric),
~ mean(.x, na.rm = TRUE),
.names = "mean_{.col}")) %>%
arrange(sex)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by years_of_sample_collection and sex.
## ℹ Output is grouped by years_of_sample_collection.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(years_of_sample_collection, sex))` for per-operation
## grouping (`?dplyr::dplyr_by`) instead.
## # A tibble: 4 × 6
## # Groups: years_of_sample_collection [2]
## years_of_sample_collection sex mean_estimated_age_y…¹ mean_shoulder_height…²
## <fct> <fct> <dbl> <dbl>
## 1 1966-68 f 17.6 206.
## 2 2005-13 f 17.9 229.
## 3 1966-68 m 10.8 202.
## 4 2005-13 m 16.7 233.
## # ℹ abbreviated names: ¹mean_estimated_age_years, ²mean_shoulder_height_in_cm
## # ℹ 2 more variables: mean_tusk_length_in_cm <dbl>,
## # mean_tusk_circumference_in_cm <dbl>
elephants %>%
filter(estimated_age_years >= 12 & sex=="f") %>%
ggplot(aes(x=shoulder_height_in_cm, y=tusk_length_in_cm, color=years_of_sample_collection))+
geom_point()+
geom_smooth(method="lm", se=TRUE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 31 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 31 rows containing missing values or values outside the scale range
## (`geom_point()`).