Available at: https://digitalcommons.calpoly.edu/theses/3014
Date of Award
6-2025
Degree Name
MS in Statistics
Department/Program
Statistics
College
College of Science and Mathematics
Advisor
Charlotte Mann
Advisor Department
Statistics
Advisor College
College of Science and Mathematics
Abstract
Randomized controlled trials (RCTs) are typically the gold standard for evaluating treatment efficacy in the medical field, yet they can have small sample sizes due to logistical, ethical, and financial constraints. This limitation can result in imprecise treatment effect estimates. Recent methods have sought to enhance the precision of RCT estimates by incorporating information from large, observational, “auxiliary” datasets. An auxiliary dataset includes units that were not randomized in the trial itself, but may be similar to the RCT sample. By leveraging predictive models trained on these auxiliary data, researchers can adjust for potentially more powerful covariates, thereby reducing variance in treatment effect estimation without compromising the integrity of the randomization. Previous applications of this approach have shown its efficacy in educational experiments, where the auxiliary data originate from the same data source as the RCT. To test whether the approach remains effective across domains, this thesis applies it to a medical RCT, using publicly available auxiliary data from an entirely different source than the trial.
We analyzed the CHOICES (CTN-0055) RCT, a 51-participant trial that investigated the feasibility and acceptability of extended-release naltrexone (XR-NTX) as a treatment for HIV-infected individuals with opioid or alcohol use disorders. We supplemented this analysis with data from the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey that collects extensive health and nutrition data from thousands of adults and children across the United States, making it an ideal large-scale auxiliary dataset for our analysis. We used the NHANES data to develop an auxiliary model that predicts recent alcohol use. We then compared estimators that integrate experimental and auxiliary data through these model predictions with more standard estimators of the effect of XR-NTX on alcohol use. Our findings did not demonstrate improved precision from incorporating auxiliary model predictions, highlighting potential challenges when the auxiliary data come from an external source. This case study provides insights into the practical limitations and considerations of using auxiliary data for precision enhancement in small-sample medical RCTs.