
Prompt Engineering Enables Open-Source LLMs to Match Proprietary Models in Diagnostic Accuracy for Annotation of Radiology Reports

Petersen, L. A.; Beck, M. S.; Xu, J. J.; Andersen, M. B.; Bruun, F. J.

2025-10-20 · radiology and imaging
medRxiv preprint · doi: 10.1101/2025.10.13.25337785
Abstract

Aim: This study tested whether open-source large language models (LLMs) can match the diagnostic accuracy of proprietary models in annotating Danish trauma radiology reports for three clinical findings.

Materials and Methods: This retrospective study included 2,939 radiology reports of trauma radiographs collected from three Danish emergency departments. The data were split into 600 cases for prompt engineering and 2,339 for model evaluation. Eight LLMs (GPT-4o and GPT-4o-mini from OpenAI, and six Llama3 variants from Meta) were prompted to annotate the reports for fractures, effusions, and luxations, with human annotations as the reference standard. Diagnostic performance was assessed using accuracy, sensitivity, specificity, PPV, and NPV, with 95% confidence intervals.

Results: Prompt engineering improved the match score for Llama3-8b from 77.8% (95% CI: 74.4%-81.1%) to 94.3% (95% CI: 92.5%-96.2%). GPT-4o achieved the highest overall diagnostic accuracy at 97.9% (95% CI: 97.3%-98.5%), followed by Llama3.1-405b (97.1%; 95% CI: 96.4%-97.8%), GPT-4o-mini (96.9%; 95% CI: 96.2%-97.6%), Llama3-8b (96.9%; 95% CI: 95.9%-97.3%), and Llama3.1-70b (96.0%; 95% CI: 95.2%-96.8%). Across the three findings, all models performed best for fractures, whereas effusions and luxations were more error-prone. Semantic confusion was the most frequent error type, accounting for 53.2% to 59.4% of misclassifications.

Conclusion: Small, open-source LLMs can accurately annotate Danish trauma radiology reports when supported by effective prompt engineering, achieving accuracy that rivals proprietary competitors. They offer a viable, privacy-conscious alternative for clinical use, even in a low-resource language setting.
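The reported metrics (accuracy, sensitivity, specificity, PPV, NPV, each with a 95% CI) can all be derived from a per-finding confusion matrix. A minimal sketch in Python, using the Wilson score interval for the confidence bounds; the abstract does not state which CI method the authors used, and the counts below are illustrative, not taken from the paper:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% CI for a proportion k/n (assumed CI method)."""
    if n == 0:
        return (float("nan"), float("nan"))
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

def diagnostic_metrics(tp, fp, tn, fn):
    """Point estimate and 95% CI for each standard diagnostic metric."""
    n = tp + fp + tn + fn
    return {
        "accuracy":    ((tp + tn) / n,  wilson_ci(tp + tn, n)),
        "sensitivity": (tp / (tp + fn), wilson_ci(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_ci(tn, tn + fp)),
        "ppv":         (tp / (tp + fp), wilson_ci(tp, tp + fp)),
        "npv":         (tn / (tn + fn), wilson_ci(tn, tn + fn)),
    }

# Hypothetical counts summing to the 2,339 evaluation cases:
m = diagnostic_metrics(tp=540, fp=20, tn=1750, fn=29)
```

The Wilson interval is a common choice for binomial proportions because, unlike the normal approximation, it behaves sensibly when the proportion is near 0% or 100%, as with the high accuracies reported here.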

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | The Lancet Digital Health | 25 | Top 0.1% | 18.9% |
| 2 | European Radiology | 14 | Top 0.1% | 15.0% |
| 3 | JCO Clinical Cancer Informatics | 18 | Top 0.1% | 6.4% |
| 4 | PLOS ONE | 4510 | Top 27% | 6.4% |
| 5 | Scientific Reports | 3102 | Top 27% | 4.4% |
| 6 | GigaScience | 172 | Top 0.3% | 4.4% |
| 7 | npj Digital Medicine | 97 | Top 1.0% | 4.4% |
| 8 | Nature Communications | 4913 | Top 37% | 4.0% |
| 9 | PLOS Digital Health | 91 | Top 0.8% | 3.3% |
| 10 | iScience | 1063 | Top 11% | 1.9% |
| 11 | Diagnostics | 48 | Top 0.8% | 1.9% |
| 12 | JMIR Medical Informatics | 17 | Top 0.7% | 1.7% |
| 13 | JAMA Network Open | 127 | Top 2% | 1.7% |
| 14 | eBioMedicine | 130 | Top 2% | 1.5% |
| 15 | Annals of Translational Medicine | 17 | Top 0.9% | 1.2% |
| 16 | IEEE Access | 31 | Top 0.6% | 1.1% |
| 17 | Medical Physics | 14 | Top 0.5% | 0.8% |
| 18 | Archives of Clinical and Biomedical Research | 28 | Top 2% | 0.8% |
| 19 | BMC Medicine | 163 | Top 7% | 0.8% |
| 20 | International Journal of Medical Informatics | 25 | Top 2% | 0.7% |
| 21 | Computer Methods and Programs in Biomedicine | 27 | Top 1% | 0.7% |
| 22 | Frontiers in Neuroinformatics | 38 | Top 0.9% | 0.7% |
| 23 | Nature Medicine | 117 | Top 6% | 0.7% |
| 24 | BMJ Open | 554 | Top 13% | 0.7% |
| 25 | JMIRx Med | 31 | Top 3% | 0.5% |