
From Study Design to Executable Code: Automating Target Trial Emulation with Large Language Models

Kim, H.; Kim, M.; Kim, S.; You, S. C.

2026-03-14 | health informatics
medRxiv preprint | DOI: 10.64898/2026.03.13.26348306

Introduction: Implementing target trial emulation (TTE) study methods as end-to-end executable analytic code is technically demanding, and producing standardized, reproducible scripts consistently across research teams remains a persistent challenge. We aimed to develop a framework that translates free-text study descriptions into standardized analytic specifications and executable Strategus R scripts for the Observational Health Data Sciences and Informatics (OHDSI) ecosystem.

Methods: We developed THESEUS (Text-guided Health-study Estimation and Specification Engine Using Strategus), which operates in two sequential steps. Large language models (LLMs) first map study descriptions into a constrained JavaScript Object Notation (JSON) schema (standardization step); the structured specifications are then converted into R scripts with a self-auditing loop for error correction (code generation step). We evaluated eight proprietary LLMs on texts extracted from the methods sections of 15 OHDSI-based TTE studies and externally validated the framework on texts from 5 non-OHDSI studies, across three input settings: primary analysis text only, full analyses text, and full methods sections. Standardization was evaluated at the study level (whether all parameters in a study were correctly extracted) and at the field level (sensitivity and false-positive rate per individual parameter), with field-level evaluation applied to the full analyses text and full methods sections input settings. Code generation was assessed by the executability of the produced R scripts before and after self-auditing.

Results: In the standardization step, study-level accuracy across models ranged from 0.91 to 0.98 for primary analysis, 0.67 to 0.87 for full analyses, and 0.67 to 0.85 for full methods sections in OHDSI studies; the corresponding ranges were 0.73 to 0.93, 0.60 to 0.87, and 0.27 to 0.47 in non-OHDSI studies. At the field level, sensitivity across models under the full analyses text input setting ranged from 0.73 to 0.90 with 0.27 to 0.67 false positives per study in OHDSI studies, and from 0.71 to 0.90 with 0.20 to 1.00 false positives per study in non-OHDSI studies. For code generation, first-run executability ranged from 0.80 to 1.00 for OHDSI studies and improved to 0.93 to 1.00 after self-auditing. In non-OHDSI studies, first-run executability ranged from 0.60 to 1.00, improving to 1.00 after self-auditing.

Discussion: THESEUS demonstrates that pairing a standardized data model with a structured analysis framework enables reliable LLM-powered automation of the coding step in observational research. It supports the translation of natural-language study descriptions into executable, shareable code in standardized observational research settings, and has the potential to lower the technical barriers to participation in observational research for a broader range of investigators.
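The two-step pipeline described in the Methods (free text mapped to a constrained JSON specification, then converted to a script and repaired through a self-auditing execution loop) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, prompt strings, schema fields, and retry limit are all assumptions, and the real system targets Strategus R rather than the generic `run_script` callback used here.

```python
# Hypothetical sketch of a standardize-then-generate pipeline with a
# self-auditing loop. Schema fields and prompts are illustrative only.
import json

# Assumed specification fields; the actual THESEUS schema is not shown here.
REQUIRED_FIELDS = {"target_cohort", "comparator_cohort", "outcome_cohort",
                   "time_at_risk", "ps_settings"}

def standardize(study_text, llm):
    """Step 1: map free-text methods into a constrained JSON specification."""
    raw = llm(f"Extract TTE parameters as JSON with fields "
              f"{sorted(REQUIRED_FIELDS)}:\n{study_text}")
    spec = json.loads(raw)
    missing = REQUIRED_FIELDS - spec.keys()
    if missing:
        raise ValueError(f"specification incomplete: {missing}")
    return spec

def generate_with_self_audit(spec, llm, run_script, max_rounds=3):
    """Step 2: emit a script, execute it, and feed errors back for repair."""
    script = llm(f"Write a Strategus R script for: {json.dumps(spec)}")
    for _ in range(max_rounds):
        ok, error_log = run_script(script)
        if ok:
            return script, True
        # Self-audit: hand the execution error back to the model for a fix.
        script = llm(f"Fix this R script.\nError:\n{error_log}\n"
                     f"Script:\n{script}")
    return script, False
```

The key design point mirrored here is that executability is checked mechanically, so the model only sees concrete runtime errors rather than being asked to re-review its own output in the abstract.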

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.1%): 23.1%
2. BMC Medical Research Methodology (43 papers in training set; Top 0.1%): 14.7%
3. PLOS ONE (4510 papers in training set; Top 35%): 4.1%
4. Nature Communications (4913 papers in training set; Top 38%): 3.7%
5. JAMIA Open (37 papers in training set; Top 0.4%): 3.7%
6. Journal of Biomedical Informatics (45 papers in training set; Top 0.4%): 3.7%

-- 50% of probability mass above --

7. Journal of Medical Internet Research (85 papers in training set; Top 2%): 3.0%
8. The Lancet Digital Health (25 papers in training set; Top 0.2%): 2.8%
9. BMJ Open (554 papers in training set; Top 7%): 2.5%
10. European Journal of Epidemiology (40 papers in training set; Top 0.2%): 2.1%
11. BMC Medicine (163 papers in training set; Top 3%): 1.7%
12. Journal of Clinical Epidemiology (28 papers in training set; Top 0.3%): 1.7%
13. Trials (25 papers in training set; Top 0.8%): 1.7%
14. npj Digital Medicine (97 papers in training set; Top 2%): 1.7%
15. BMJ (49 papers in training set; Top 0.7%): 1.4%
16. JMIR Medical Informatics (17 papers in training set; Top 0.9%): 1.4%
17. Research Synthesis Methods (20 papers in training set; Top 0.1%): 1.4%
18. Scientific Reports (3102 papers in training set; Top 65%): 1.3%
19. JAMA Network Open (127 papers in training set; Top 3%): 1.3%
20. Bioinformatics (1061 papers in training set; Top 8%): 1.0%
21. BMJ Health & Care Informatics (13 papers in training set; Top 0.7%): 1.0%
22. International Journal of Medical Informatics (25 papers in training set; Top 1%): 0.9%
23. JMIR Public Health and Surveillance (45 papers in training set; Top 3%): 0.9%
24. BMC Medical Informatics and Decision Making (39 papers in training set; Top 2%): 0.8%
25. Wellcome Open Research (57 papers in training set; Top 2%): 0.8%
26. International Journal of Epidemiology (74 papers in training set; Top 2%): 0.8%
27. Philosophical Transactions of the Royal Society B (51 papers in training set; Top 5%): 0.8%
28. BMC Bioinformatics (383 papers in training set; Top 7%): 0.7%
29. Annals of Internal Medicine (27 papers in training set; Top 1%): 0.7%
30. PLOS Digital Health (91 papers in training set; Top 3%): 0.7%
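The "top 6 journals account for 50% of the predicted probability mass" claim above can be verified by cumulating the listed percentages; a minimal check, with the values copied from the list:

```python
# Top-10 predicted probabilities (%) from the journal list above.
probs = [23.1, 14.7, 4.1, 3.7, 3.7, 3.7, 3.0, 2.8, 2.5, 2.1]

# Walk down the ranking until the running total crosses 50%.
cumulative = 0.0
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        break

print(rank, round(cumulative, 1))  # top 6 reach 53.0% (top 5 stop at 49.3%)
```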