
From Protocol to Analysis Plan: Development and Validation of a Large Language Model Pipeline for Statistical Analysis Plan Generation using Artificial Intelligence (SAPAI)

Jafari, H.; Chu, P.; Lange, M.; Maher, F.; Glen, C.; Pearson, O. J.; Burges, C.; Martyn, M.; Cross, S.; Carter, B.; Emsley, R.; Forbes, G.

2026-03-19 health systems and quality improvement
10.64898/2026.03.19.26348626 medRxiv

Background: Statistical Analysis Plans (SAPs) are essential for trial transparency and credibility but are resource-intensive to produce. While Large Language Models (LLMs) have shown promise in drafting protocols, their ability to generate high-quality, protocol-compliant SAPs remains untested against current content guidance. This study developed and validated an LLM-based pipeline for drafting SAPs from clinical trial protocols.

Methods: We developed a structured, section-by-section prompting pipeline aligned with standard SAP guidance. We applied this pipeline to nine clinical trial protocols using three leading LLMs: OpenAI GPT-5, Anthropic Claude Sonnet 4, and Google Gemini 2.5 Pro. The resulting 27 SAPs were evaluated against a 46-item quality checklist derived from the published SAP guidelines. Items were double-scored by independent trial statisticians on a 0 to 3 scale for accuracy. We compared performance across LLMs and between item types (descriptive vs. statistical reasoning) using mixed-effects logistic regression.

Results: Across the nine trials, the models produced SAP drafts with high overall accuracy (77% to 78%). Performance did not differ between the three LLMs (p = 0.79) but varied by content type (p < 0.001). All models performed well on descriptive items (e.g., administrative details, trial design) and were less accurate on items requiring statistical reasoning (e.g., modelling strategies, sensitivity analyses). Accuracy for statistical items ranged from 67% to 72%, whereas descriptive items achieved 81% to 83% accuracy. Qualitatively, models were prone to specific failure modes in complex sections, such as omitting necessary details for secondary outcome models or hallucinating sensitivity analyses.

Discussion: Current LLMs can effectively draft portions of SAPs, offering the potential for substantial time savings in trial documentation. However, a human-in-the-loop approach remains mandatory: while models demonstrate strong capability in producing descriptive content, their independent application to complex statistical methodology design still requires further methodological development and training. Future work should explore advanced prompt engineering, such as retrieval-augmented generation or agentic workflows, to improve reasoning capabilities.
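The section-by-section prompting idea described in the Methods can be illustrated with a minimal sketch. This is not the authors' pipeline: the section names, the prompt wording, and the functions `build_prompt`, `draft_sap`, and `echo_model` are all hypothetical, and the model call is replaced by a stand-in so the example runs without any LLM API.

```python
from typing import Callable, Dict, List

# Hypothetical section list; the abstract does not enumerate the actual
# SAP guidance sections used by the authors.
SAP_SECTIONS: List[str] = [
    "Administrative information",
    "Trial design",
    "Statistical principles",
    "Analysis methods",
]


def build_prompt(section: str, protocol_text: str) -> str:
    """Assemble one prompt per SAP section (wording is an assumption)."""
    return (
        f"Draft the '{section}' section of a Statistical Analysis Plan.\n"
        "Use only information stated in the protocol below.\n\n"
        f"{protocol_text}"
    )


def draft_sap(
    protocol_text: str,
    generate: Callable[[str], str],
    sections: List[str] = SAP_SECTIONS,
) -> Dict[str, str]:
    """Call the supplied model once per section and collect the drafts."""
    return {s: generate(build_prompt(s, protocol_text)) for s in sections}


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call: echoes the first line of the prompt."""
    return "[draft] " + prompt.splitlines()[0]


drafts = draft_sap("Example protocol text.", echo_model)
for name, text in drafts.items():
    print(name, "->", text)
```

Passing the model as a callable keeps the loop independent of any particular vendor, which is one way the same protocol could be run through several LLMs for comparison; each draft would still need review by a statistician.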

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. PLOS ONE | 4510 papers in training set | Top 14% | 14.1%
2. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.3% | 10.3%
3. Journal of Clinical Epidemiology | 28 papers in training set | Top 0.1% | 8.3%
4. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.5% | 6.3%
5. Medical Decision Making | 10 papers in training set | Top 0.1% | 6.2%
6. Research Synthesis Methods | 20 papers in training set | Top 0.1% | 6.2%
50% of probability mass above
7. Trials | 25 papers in training set | Top 0.3% | 4.2%
8. BMJ Open | 554 papers in training set | Top 6% | 3.5%
9. PLOS Digital Health | 91 papers in training set | Top 0.8% | 3.5%
10. Journal of Biomedical Informatics | 45 papers in training set | Top 0.5% | 3.5%
11. npj Digital Medicine | 97 papers in training set | Top 2% | 2.6%
12. JAMA Network Open | 127 papers in training set | Top 2% | 2.0%
13. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 0.3% | 1.9%
14. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 0.9% | 1.8%
15. Scientific Data | 174 papers in training set | Top 1.0% | 1.8%
16. Epilepsy Research | 12 papers in training set | Top 0.2% | 1.8%
17. F1000Research | 79 papers in training set | Top 2% | 1.3%
18. Healthcare | 16 papers in training set | Top 1.0% | 1.3%
19. Frontiers in Digital Health | 20 papers in training set | Top 0.8% | 1.3%
20. BMJ Health & Care Informatics | 13 papers in training set | Top 0.6% | 1.2%
21. BMC Medical Research Methodology | 43 papers in training set | Top 1% | 0.9%
22. PLOS Biology | 408 papers in training set | Top 17% | 0.9%
23. BMJ Open Quality | 15 papers in training set | Top 0.8% | 0.8%
24. JMIRx Med | 31 papers in training set | Top 2% | 0.8%
25. European Heart Journal - Digital Health | 15 papers in training set | Top 0.6% | 0.7%
26. Journal of Medical Internet Research | 85 papers in training set | Top 5% | 0.7%
27. Journal of Clinical and Translational Science | 11 papers in training set | Top 0.5% | 0.7%
28. Scientific Reports | 3102 papers in training set | Top 77% | 0.7%
29. JMIR Research Protocols | 18 papers in training set | Top 2% | 0.6%
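
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below uses the probabilities of the ten highest-ranked journals as listed above; the function name `journals_to_reach` is illustrative and not part of the listing site.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1 to 10, copied from the list above.
probabilities = [14.1, 10.3, 8.3, 6.3, 6.2, 6.2, 4.2, 3.5, 3.5, 3.5]


def journals_to_reach(mass_percent, probs):
    """Return how many top-ranked entries are needed for the running
    total to reach mass_percent, or None if it is never reached."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= mass_percent:
            return count
    return None


print(journals_to_reach(50.0, probabilities))  # prints 6
```

The running total is about 45.2% after five journals and about 51.4% after six, so the 50% threshold is crossed at rank 6.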