Agentic Trial Emulation to Learn Health System-specific Drug Effects At Scale
Kauffman, J.; Duan, L.; Gelman, S.; Klang, E.; Sakhuja, A.; Bhatt, D. L.; Reddy, V. Y. Y.; Charney, A.; Nadkarni, G.; Qu, Y.; Huang, K.; Lampert, J.; Glicksberg, B. S.
Objective: Electronic Health Record (EHR)-based trial emulation can support translation of randomized clinical trial (RCT) evidence into practice, yet emulations often diverge from published RCT results. We hypothesized that these discrepancies are structured and learnable properties of a health system's data-generating process, and that autonomous agentic workflows can generate discrepancies at the scale required for cumulative learning.
Materials and Methods: We developed an agentic trial emulation framework that (1) uses an autonomous LLM agent (Biomni) to execute an end-to-end, instruction-driven emulation pipeline against an OMOP CDM database and (2) calibrates EHR estimates to RCT results with a Bayesian hierarchical model. Biomni performed protocol parsing, OMOP concept set construction, cohort building, confounder adjustment, and treatment effect estimation; it also synthesized literature-derived, comparison-specific priors for expected EHR-RCT disagreement. Five atrial fibrillation anticoagulation trials were emulated using Mount Sinai's OMOP-mapped EHR, with three independent runs per trial to quantify agent-induced analytic variability. Discrepancies between EHR-derived and published log-hazard ratios were modeled as the sum of a literature-informed reproducibility expectation, an institution-specific systematic shift, and residual heterogeneity. Performance was assessed using leave-one-out cross-validation across four in-domain DOAC-versus-warfarin trials, with one out-of-distribution evaluation (apixaban versus aspirin).
Results: In pooled leave-one-out validation, calibration reduced mean absolute error from 0.567 to 0.224 log-hazard ratio (60.5% reduction) and achieved 100% empirical coverage of 95% posterior predictive intervals across held-out trials (4/4). The posterior institution-specific shift was consistently positive across folds (median 0.364-0.580), indicating systematic attenuation of DOAC benefit in the local EHR beyond literature-expected disagreement; residual heterogeneity was moderate (median 0.199-0.264). For the out-of-distribution AVERROES trial (apixaban versus aspirin), calibrated error decreased from 0.379 to 0.051 (86.5% reduction), with the published effect within the 95% credible interval.
Discussion and Conclusion: Autonomous emulation with agents enables repeated, standardized trial replications that convert EHR-RCT disagreement into data for learning institution-level transport properties. Separating comparison-specific reproducibility expectations from system-level shifts yields calibrated, uncertainty-aware local interpretations of trial evidence.
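The additive discrepancy structure and the reported error reductions can be illustrated with a minimal Python sketch. It is not the authors' Bayesian hierarchical model: it uses point values only, the function names are hypothetical, and it assumes the discrepancy is defined as the EHR log-hazard ratio minus the published RCT log-hazard ratio (a sign convention consistent with a positive shift meaning attenuated benefit, but not stated explicitly in the abstract). The only numbers used are the error values reported above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percent reduction in error when moving from `before` to `after`."""
    return 100.0 * (before - after) / before


def calibrated_log_hr(ehr_log_hr: float,
                      literature_expectation: float,
                      institution_shift: float) -> float:
    """Point-estimate illustration of the additive decomposition.

    Assumption (not from the source): discrepancy = EHR log-HR - RCT log-HR,
    modeled as literature_expectation + institution_shift + residual.
    Subtracting the two systematic terms gives a calibrated estimate;
    the residual term and all posterior uncertainty are ignored here.
    """
    return ehr_log_hr - (literature_expectation + institution_shift)


# Error values reported in the abstract (log-hazard ratio scale).
pooled_reduction = relative_reduction(0.567, 0.224)    # pooled leave-one-out
averroes_reduction = relative_reduction(0.379, 0.051)  # out-of-distribution

print(f"Pooled MAE reduction:     {pooled_reduction:.1f}%")
print(f"AVERROES error reduction: {averroes_reduction:.1f}%")
```

Rounded to one decimal place, the two printed values reproduce the 60.5% and 86.5% reductions quoted in the Results.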