Answer the following questions and complete the exercises in
RMarkdown. Please embed all of your code and push your final work to
your repository. Your final lab report should be organized, clean, and
run free from errors. Remember, you must remove the #
for
the included code chunks to run. Be sure to add your name to the author
header above.
Make sure to use the formatting conventions of RMarkdown to make your report neat and clean!
library(tidyverse)
library(janitor)
library(naniar)
For this homework, we will take a departure from biological data and
use data about California colleges. These data are a subset of the
national college scorecard (https://collegescorecard.ed.gov/data/). Load the
ca_college_data.csv
as a new object called
colleges
.
colleges <- readr::read_csv("data/ca_college_data.csv") %>% janitor::clean_names()
## Rows: 341 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): INSTNM, CITY, STABBR, ZIP
## dbl (6): ADM_RATE, SAT_AVG, PCIP26, COSTT4_A, C150_4_POOLED, PFTFTUG1_EF
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The variables are a bit hard to decipher, here is a key:
INSTNM: Institution name
CITY: California city
STABBR: Location state
ZIP: Zip code
ADM_RATE: Admission rate
SAT_AVG: SAT average score
PCIP26: Percentage of degrees awarded in Biological And Biomedical
Sciences
COSTT4_A: Annual cost of attendance
C150_4_POOLED: 4-year completion rate
PFTFTUG1_EF: Percentage of undergraduate students who are first-time,
full-time degree/certificate-seeking undergraduate students
glimpse(colleges)
## Rows: 341
## Columns: 10
## $ instnm <chr> "Grossmont College", "College of the Sequoias", "College…
## $ city <chr> "El Cajon", "Visalia", "San Mateo", "Ventura", "Oxnard",…
## $ stabbr <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "C…
## $ zip <chr> "92020-1799", "93277-2214", "94402-3784", "93003-3872", …
## $ adm_rate <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ sat_avg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pcip26 <dbl> 0.0016, 0.0066, 0.0038, 0.0035, 0.0085, 0.0151, 0.0000, …
## $ costt4_a <dbl> 7956, 8109, 8278, 8407, 8516, 8577, 8580, 9181, 9281, 93…
## $ c150_4_pooled <dbl> NA, NA, NA, NA, NA, NA, 0.2334, NA, NA, NA, NA, 0.1704, …
## $ pftftug1_ef <dbl> 0.3546, 0.5413, 0.3567, 0.3824, 0.2753, 0.4286, 0.2307, …
names(colleges)
## [1] "instnm" "city" "stabbr" "zip"
## [5] "adm_rate" "sat_avg" "pcip26" "costt4_a"
## [9] "c150_4_pooled" "pftftug1_ef"
naniar::miss_var_summary(colleges)
## # A tibble: 10 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 sat_avg 276 80.9
## 2 adm_rate 240 70.4
## 3 c150_4_pooled 221 64.8
## 4 costt4_a 124 36.4
## 5 pftftug1_ef 53 15.5
## 6 pcip26 35 10.3
## 7 instnm 0 0
## 8 city 0 0
## 9 stabbr 0 0
## 10 zip 0 0
colleges %>%
count(city) %>%
arrange(desc(n))
## # A tibble: 161 × 2
## city n
## <chr> <int>
## 1 Los Angeles 24
## 2 San Diego 18
## 3 San Francisco 15
## 4 Sacramento 10
## 5 Berkeley 9
## 6 Oakland 9
## 7 Claremont 7
## 8 Pasadena 6
## 9 Fresno 5
## 10 Irvine 5
## # ℹ 151 more rows
colleges %>%
count(city) %>%
top_n(10, n) %>%
ggplot(aes(x=reorder(city,n), y=n))+
geom_col()+
coord_flip()
COSTT4_A
is the annual cost of each
institution. Which city has the highest average cost? Where is it
located?colleges %>%
group_by(city) %>%
summarize(mean_cost_yr=mean(costt4_a, na.rm=TRUE),
total=n()) %>%
arrange(desc(mean_cost_yr))
## # A tibble: 161 × 3
## city mean_cost_yr total
## <chr> <dbl> <int>
## 1 Claremont 66498 7
## 2 Malibu 66152 1
## 3 Valencia 64686 1
## 4 Orange 64501 3
## 5 Redlands 61542 1
## 6 Moraga 61095 1
## 7 Atherton 56035 1
## 8 Thousand Oaks 54373 1
## 9 Rancho Palos Verdes 50758 1
## 10 La Verne 50603 1
## # ℹ 151 more rows
colleges %>%
filter(instnm== "University of California-Davis" | city=="Claremont" & costt4_a!="NA") %>%
ggplot(aes(x=instnm, y=costt4_a)) +
geom_col()+
coord_flip()
ADM_RATE
is the admissions rate by college
and C150_4_POOLED
is the four-year completion rate. Use a
scatterplot to show the relationship between these two variables. What
do you think this means?colleges %>%
ggplot(aes(x=adm_rate, y=c150_4_pooled))+
geom_point(na.rm=T)+
geom_smooth(method=lm, se=F, na.rm=T)
## `geom_smooth()` using formula = 'y ~ x'
colleges %>%
ggplot(aes(x=costt4_a, y=c150_4_pooled))+
geom_point(na.rm=T)+
geom_smooth(method=lm, se=F, na.rm=T)
## `geom_smooth()` using formula = 'y ~ x'
INSTNM
is the institution name. We
are only interested in the University of California colleges. Make a new
data frame that is restricted to UC institutions. You can remove
Hastings College of Law
and UC San Francisco
as we are only interested in undergraduate institutions.colleges %>%
filter(if_any(everything(), ~ str_detect(.x, "University of California")))
## # A tibble: 10 × 10
## instnm city stabbr zip adm_rate sat_avg pcip26 costt4_a c150_4_pooled
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 University… La J… CA 92093 0.357 1324 0.216 31043 0.872
## 2 University… Irvi… CA 92697 0.406 1206 0.107 31198 0.876
## 3 University… Rive… CA 92521 0.663 1078 0.149 31494 0.73
## 4 University… Los … CA 9009… 0.180 1334 0.155 33078 0.911
## 5 University… Davis CA 9561… 0.423 1218 0.198 33904 0.850
## 6 University… Sant… CA 9506… 0.578 1201 0.193 34608 0.776
## 7 University… Berk… CA 94720 0.169 1422 0.105 34924 0.916
## 8 University… Sant… CA 93106 0.358 1281 0.108 34998 0.816
## 9 University… San … CA 9410… NA NA NA NA NA
## 10 University… San … CA 9414… NA NA NA NA NA
## # ℹ 1 more variable: pftftug1_ef <dbl>
univ_calif <- colleges %>%
filter(str_detect(instnm, "University of California"))
univ_calif
## # A tibble: 10 × 10
## instnm city stabbr zip adm_rate sat_avg pcip26 costt4_a c150_4_pooled
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 University… La J… CA 92093 0.357 1324 0.216 31043 0.872
## 2 University… Irvi… CA 92697 0.406 1206 0.107 31198 0.876
## 3 University… Rive… CA 92521 0.663 1078 0.149 31494 0.73
## 4 University… Los … CA 9009… 0.180 1334 0.155 33078 0.911
## 5 University… Davis CA 9561… 0.423 1218 0.198 33904 0.850
## 6 University… Sant… CA 9506… 0.578 1201 0.193 34608 0.776
## 7 University… Berk… CA 94720 0.169 1422 0.105 34924 0.916
## 8 University… Sant… CA 93106 0.358 1281 0.108 34998 0.816
## 9 University… San … CA 9410… NA NA NA NA NA
## 10 University… San … CA 9414… NA NA NA NA NA
## # ℹ 1 more variable: pftftug1_ef <dbl>
Remove Hastings College of Law
and
UC San Francisco
and store the final data frame as a new
object univ_calif_final
.
univ_calif_final <- univ_calif %>%
filter(city!="San Francisco")
univ_calif_final
## # A tibble: 8 × 10
## instnm city stabbr zip adm_rate sat_avg pcip26 costt4_a c150_4_pooled
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 University … La J… CA 92093 0.357 1324 0.216 31043 0.872
## 2 University … Irvi… CA 92697 0.406 1206 0.107 31198 0.876
## 3 University … Rive… CA 92521 0.663 1078 0.149 31494 0.73
## 4 University … Los … CA 9009… 0.180 1334 0.155 33078 0.911
## 5 University … Davis CA 9561… 0.423 1218 0.198 33904 0.850
## 6 University … Sant… CA 9506… 0.578 1201 0.193 34608 0.776
## 7 University … Berk… CA 94720 0.169 1422 0.105 34924 0.916
## 8 University … Sant… CA 93106 0.358 1281 0.108 34998 0.816
## # ℹ 1 more variable: pftftug1_ef <dbl>
Use separate()
to separate institution name into two new
columns “UNIV” and “CAMPUS”.
univ_calif_final <- univ_calif_final %>%
separate(instnm, c("univ", "campus"), sep="-")
univ_calif_final
## # A tibble: 8 × 11
## univ campus city stabbr zip adm_rate sat_avg pcip26 costt4_a c150_4_pooled
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Univ… San D… La J… CA 92093 0.357 1324 0.216 31043 0.872
## 2 Univ… Irvine Irvi… CA 92697 0.406 1206 0.107 31198 0.876
## 3 Univ… River… Rive… CA 92521 0.663 1078 0.149 31494 0.73
## 4 Univ… Los A… Los … CA 9009… 0.180 1334 0.155 33078 0.911
## 5 Univ… Davis Davis CA 9561… 0.423 1218 0.198 33904 0.850
## 6 Univ… Santa… Sant… CA 9506… 0.578 1201 0.193 34608 0.776
## 7 Univ… Berke… Berk… CA 94720 0.169 1422 0.105 34924 0.916
## 8 Univ… Santa… Sant… CA 93106 0.358 1281 0.108 34998 0.816
## # ℹ 1 more variable: pftftug1_ef <dbl>
ADM_RATE
is the admissions rate by campus.
Which UC has the lowest and highest admissions rates? Produce a
numerical summary and an appropriate plot.univ_calif_final %>%
select(campus, adm_rate) %>%
arrange(adm_rate)
## # A tibble: 8 × 2
## campus adm_rate
## <chr> <dbl>
## 1 Berkeley 0.169
## 2 Los Angeles 0.180
## 3 San Diego 0.357
## 4 Santa Barbara 0.358
## 5 Irvine 0.406
## 6 Davis 0.423
## 7 Santa Cruz 0.578
## 8 Riverside 0.663
univ_calif_final %>%
select(campus, adm_rate) %>%
top_n(-1, adm_rate)
## # A tibble: 1 × 2
## campus adm_rate
## <chr> <dbl>
## 1 Berkeley 0.169
univ_calif_final %>%
ggplot(aes(x=reorder(campus, adm_rate), y=adm_rate))+
geom_col()+
coord_flip()
univ_calif_final %>%
select(campus, pcip26) %>%
arrange(pcip26)
## # A tibble: 8 × 2
## campus pcip26
## <chr> <dbl>
## 1 Berkeley 0.105
## 2 Irvine 0.107
## 3 Santa Barbara 0.108
## 4 Riverside 0.149
## 5 Los Angeles 0.155
## 6 Santa Cruz 0.193
## 7 Davis 0.198
## 8 San Diego 0.216
univ_calif_final %>%
ggplot(aes(x=reorder(campus, pcip26), y=pcip26))+
geom_col()