About

Author

I’m Chen Zhang (AlexZ), an undergraduate Data Science specialist at the University of Toronto. I built this project as my final project for JSC370 (Data Science II) in the Winter 2026 term, with feedback and supervision from the JSC370 course team.

Project Context

JSC370 final projects extend the student’s midterm study with modeling, interactive visualizations, and a GitHub-hosted website. This project takes the Option C midterm topic — Clinical Trial Access Gap for Type 2 Diabetes — and layers predictive modeling (Elastic Net, Random Forest, XGBoost with 5-fold cross-validation and SHAP) on top of the descriptive state- and county-level analyses performed at the midterm stage.
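As a rough illustration of the cross-validated model comparison mentioned above, here is a minimal sketch using scikit-learn only. It is not the project's actual pipeline: the data are synthetic (generated with make_regression), the regression framing, the hyperparameters, and the names X, y, models, and cv_rmse are all assumptions made for the example, and XGBoost and SHAP are left out to keep the sketch dependency-light.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a county-level feature table (hypothetical data,
# not the project's real covariates or outcome).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# 5-fold cross-validation with shuffling and a fixed seed for repeatability.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Illustrative hyperparameters only; Elastic Net is scaled inside the pipeline
# so the scaler is fit on the training folds alone.
models = {
    "elastic_net": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5)),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn returns negated RMSE so that higher is better; flip the sign.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = float(-scores.mean())
    print(f"{name}: mean 5-fold RMSE = {cv_rmse[name]:.2f}")
```

Putting the scaler inside the pipeline keeps each fold's held-out data out of the preprocessing step, which is the main reason to cross-validate the whole pipeline rather than a pre-scaled matrix.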

Data Sources

All data were acquired via public APIs. No human-subject or restricted data are used.

  • ClinicalTrials.gov v2 API — U.S. Type 2 diabetes studies (cumulative registry snapshot).
  • CDC PLACES (Socrata) — state and county age-adjusted diabetes prevalence (2022 release; BRFSS 2020–2022).
  • American Community Survey 5-Year 2022 — state and county socioeconomic covariates.
  • 2022 Census Gazetteer + FCC Census API — geocoding city/state strings to 5-digit county FIPS.
  • NPI Registry — endocrinologist density by county; state-level infrastructure counts.
  • Census Rurality (2020 Decennial) — rural population share by state.
  • Medicaid expansion status — static policy lookup as of January 2024.

Acknowledgements

Built with Quarto and Python (pandas, scikit-learn, XGBoost, SHAP, Plotly), and hosted on GitHub Pages. Thank you to the JSC370 teaching team for feedback on the midterm draft that shaped this final project.

Reproducibility

This project ships with a conda environment specification (jsc370.full.yml), a .env template, and a reproducible Quarto pipeline. See the README for step-by-step reproduction instructions.