
Agentic Trial Emulation to Learn Health System-specific Drug Effects At Scale

Kauffman, J.; Duan, L.; Gelman, S.; Klang, E.; Sakhuja, A.; Bhatt, D. L.; Reddy, V. Y. Y.; Charney, A.; Nadkarni, G.; Qu, Y.; Huang, K.; Lampert, J.; Glicksberg, B. S.

2026-02-20 health informatics
10.64898/2026.02.19.26346539 medRxiv

Objective: Electronic Health Record (EHR)-based trial emulation can support translation of randomized clinical trial (RCT) evidence into practice, yet emulations often diverge from published RCT results. We hypothesized that these discrepancies are structured and learnable properties of a health system's data-generating process, and that autonomous agentic workflows can generate discrepancies at the scale required for cumulative learning.

Materials and Methods: We developed an agentic trial emulation framework that (1) uses an autonomous LLM agent (Biomni) to execute an end-to-end, instruction-driven emulation pipeline against an OMOP CDM database and (2) calibrates EHR estimates to RCT results with a Bayesian hierarchical model. Biomni performed protocol parsing, OMOP concept set construction, cohort building, confounder adjustment, and treatment effect estimation; it also synthesized literature-derived, comparison-specific priors for expected EHR-RCT disagreement. Five atrial fibrillation anticoagulation trials were emulated using Mount Sinai's OMOP-mapped EHR, with three independent runs per trial to quantify agent-induced analytic variability. Discrepancies between EHR-derived and published log-hazard ratios were modeled as the sum of a literature-informed reproducibility expectation, an institution-specific systematic shift, and residual heterogeneity. Performance was assessed using leave-one-out cross-validation across four in-domain DOAC-versus-warfarin trials, with one out-of-distribution evaluation (apixaban versus aspirin).

Results: In pooled leave-one-out validation, calibration reduced mean absolute error from 0.567 to 0.224 log-hazard ratio (60.5% reduction) and achieved 100% empirical coverage of 95% posterior predictive intervals across held-out trials (4/4). The posterior institution-specific shift was consistently positive across folds (median 0.364-0.580), indicating systematic attenuation of DOAC benefit in the local EHR beyond literature-expected disagreement; residual heterogeneity was moderate (median 0.199-0.264). For the out-of-distribution AVERROES trial, calibrated error decreased from 0.379 to 0.051 (86.5% reduction), with the published effect within the 95% credible interval.

Discussion and Conclusion: Autonomous emulation with agents enables repeated, standardized trial replications that convert EHR-RCT disagreement into data for learning institution-level transport properties. Separating comparison-specific reproducibility expectations from system-level shifts yields calibrated, uncertainty-aware local interpretations of trial evidence.
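
As a reading aid, the additive discrepancy structure and the reported error reductions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Bayesian hierarchical model: the function names, the point-estimate form of the calibration, and the sign convention (discrepancy = EHR log-hazard ratio minus RCT log-hazard ratio) are assumptions made here. Only the four error values are taken from the abstract.

```python
def calibrate(log_hr_ehr: float, expected_discrepancy: float, institution_shift: float) -> float:
    """Point-estimate calibration of an EHR log-hazard ratio (hypothetical helper).

    Assumes discrepancy = log_hr_ehr - log_hr_rct, decomposed as
    expected_discrepancy + institution_shift + residual, and removes the
    two systematic components. The residual term is left in the estimate.
    """
    return log_hr_ehr - expected_discrepancy - institution_shift


def relative_reduction(before: float, after: float) -> float:
    """Percent reduction in absolute error when moving from `before` to `after`."""
    return 100.0 * (before - after) / before


# Error values reported in the abstract (log-hazard ratio scale).
pooled = relative_reduction(0.567, 0.224)    # pooled leave-one-out MAE
averroes = relative_reduction(0.379, 0.051)  # out-of-distribution AVERROES trial

print(f"pooled: {pooled:.1f}%, AVERROES: {averroes:.1f}%")
# prints "pooled: 60.5%, AVERROES: 86.5%"
```

Under the assumed sign convention, a positive institution-specific shift means the local EHR estimate sits above the trial estimate on the log scale, which matches the abstract's description of attenuated DOAC benefit.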

Matching journals

The top 2 journals together account for more than 50% of the predicted probability mass (38.3% + 22.8% = 61.1%).

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 38.3% |
| 2 | npj Digital Medicine | 97 | Top 0.2% | 22.8% |
| 3 | JAMIA Open | 37 | Top 0.5% | 2.6% |
| 4 | Journal of Biomedical Informatics | 45 | Top 0.6% | 2.4% |
| 5 | Nature Communications | 4913 | Top 46% | 2.1% |
| 6 | BMC Medical Research Methodology | 43 | Top 0.6% | 1.7% |
| 7 | European Heart Journal - Digital Health | 15 | Top 0.4% | 1.5% |
| 8 | PLOS Digital Health | 91 | Top 2% | 1.5% |
| 9 | Clinical and Translational Science | 21 | Top 0.5% | 1.5% |
| 10 | Journal of Medical Internet Research | 85 | Top 3% | 1.3% |
| 11 | The Lancet Digital Health | 25 | Top 0.5% | 1.3% |
| 12 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.2% |
| 13 | PLOS Computational Biology | 1633 | Top 20% | 1.1% |
| 14 | BMJ Health & Care Informatics | 13 | Top 0.6% | 1.1% |
| 15 | PLOS ONE | 4510 | Top 61% | 1.1% |
| 16 | JMIR Medical Informatics | 17 | Top 1% | 1.0% |
| 17 | BMC Medicine | 163 | Top 6% | 0.9% |
| 18 | Journal of Clinical Epidemiology | 28 | Top 0.5% | 0.9% |
| 19 | Trials | 25 | Top 1% | 0.9% |
| 20 | Scientific Reports | 3102 | Top 70% | 0.9% |
| 21 | JAMA Network Open | 127 | Top 4% | 0.8% |
| 22 | JCO Clinical Cancer Informatics | 18 | Top 1.0% | 0.7% |
| 23 | BMJ | 49 | Top 1% | 0.7% |
| 24 | Annals of Internal Medicine | 27 | Top 1% | 0.7% |
| 25 | Scientific Data | 174 | Top 3% | 0.5% |
| 26 | BMJ Open | 554 | Top 14% | 0.5% |
| 27 | Philosophical Transactions of the Royal Society B | 51 | Top 7% | 0.5% |
| 28 | European Respiratory Journal | 54 | Top 3% | 0.5% |
| 29 | Frontiers in Public Health | 140 | Top 10% | 0.5% |
| 30 | JMIR Public Health and Surveillance | 45 | Top 5% | 0.5% |

Ranks 1 and 2 lie above the 50% probability-mass cutoff.