Clinical Trial Access Gap for Type 2 Diabetes
A state- and county-level study of U.S. diabetes research access
Project Summary
Type 2 diabetes affects tens of millions of Americans, and where the burden falls hardest depends strongly on socioeconomic structure. Clinical trials are the pipeline for developing new therapies, but trial sites are placed where sponsors and hosting institutions already have infrastructure, not where patient need is greatest. This project joins the ClinicalTrials.gov registry with CDC PLACES burden estimates, ACS socioeconomic data, rurality, Medicaid expansion status, and healthcare-infrastructure proxies. It asks two questions: where is U.S. diabetes-trial access thinnest relative to burden, and do trial-access features add predictive signal for county-level diabetes burden once socioeconomic structure is already in the model?
Headline findings
- 3,646 U.S. Type 2 diabetes studies and 47,118 site records were flattened from the ClinicalTrials.gov registry and geocoded to county FIPS codes.
- Most counties don’t host a trial. 73.4% of U.S. counties have no diabetes trial site, and among those counties the median distance to the nearest site is 58.4 km.
- State trial density doesn’t track state diabetes burden. A coverage residual (observed density minus the density expected for the state’s burden decile) flags systematic under- and over-coverage. Nationally, industry-sponsored density runs about 6.9× non-industry density.
- Trial-access features add little once socioeconomic structure is in the model. In a matched comparison across Elastic Net, Random Forest, and XGBoost (5-fold CV on 3,221 counties), adding log trial count and log distance-to-nearest-site on top of the SES baseline moves cross-validated R² by at most about 0.003.
- Every feature-importance view agrees. Standardized Elastic Net coefficients, Random Forest permutation importance, and XGBoost SHAP values all place trial access below every major socioeconomic block.
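The coverage residual mentioned in the findings above can be illustrated with a minimal sketch. It is not the project's pipeline code. It assumes that burden deciles are rank-based with equal counts, and that the "expected" density for a decile is the mean density of the states in it; the report may define either differently. The function name, field names, and toy values are all hypothetical.

```python
from statistics import mean


def coverage_residuals(states):
    """Return {state: observed density - expected density for its burden decile}.

    `states` is a list of dicts with hypothetical keys 'state', 'burden',
    and 'density'. Assumption: 'expected' is the mean density of all states
    sharing the same rank-based burden decile.
    """
    n = len(states)
    ranked = sorted(states, key=lambda s: s["burden"])
    # Rank position i maps to decile 0..9 (equal-count bins).
    decile_of = {s["state"]: (i * 10) // n for i, s in enumerate(ranked)}

    # Collect observed densities per decile, then average them.
    by_decile = {}
    for s in states:
        by_decile.setdefault(decile_of[s["state"]], []).append(s["density"])
    expected = {d: mean(values) for d, values in by_decile.items()}

    return {
        s["state"]: s["density"] - expected[decile_of[s["state"]]]
        for s in states
    }


# Toy data (made up, not from the report): 20 placeholder states whose
# burden rises steadily while trial density alternates between 1.0 and 3.0.
toy = [
    {"state": f"S{i:02d}", "burden": 8.0 + 0.25 * i, "density": 1.0 + 2.0 * (i % 2)}
    for i in range(20)
]
residuals = coverage_residuals(toy)
under_covered = sorted(name for name, r in residuals.items() if r < 0)
```

A negative residual marks a state with fewer trials per capita than its burden peers, and a positive one marks over-coverage. Under this definition the residuals within each decile sum to zero, so the measure describes relative rather than absolute coverage.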
Explore the project
- Read the full report — Introduction, methods, results, conclusions.
- Download the report (PDF) — downloadable, self-contained version of the full report.
- Interactive visualizations — three interactive figures (HW5 deliverable): state choropleth, county distance histogram, and county scatter.
- Watch the 5-minute presentation — video walkthrough of the website and main findings.
- GitHub repository — full source, data, and pipeline.
- Full computational pipeline — the end-to-end Quarto notebook that produces the model-ready datasets and all figures lives on GitHub at
option-c-trial-access/product/option-c.qmd. To keep the site focused, it is not served here; graders can view or re-run it directly from the repo.