
A Systematic Performance Evaluation of Three Large Language Models in Answering Questions on Moderate Hyperthermia

Dennstaedt, F.; Cihoric, N.; Bachmann, N.; Filchenko, I.; Berclaz, L.; Crezee, H.; Curto, S.; Ghadjar, P.; Huebenthal, B.; Hurwitz, M. D.; Kok, P.; Lindner, L. H.; Marder, D.; Molitoris, J.; Notter, M.; Rahman, S.; Riesterer, O.; Spalek, M.; Trefna, H.; Zilli, T.; Rodrigues, D.; Fuerstner, M.; Stutz, E.

2026-03-26 | oncology
medRxiv | DOI: 10.64898/2026.03.25.26349254

Background: Large Language Models (LLMs) have demonstrated expert-level performance across many medical domains, suggesting potential utility in clinical practice. However, their reliability in the highly specialized domain of moderate hyperthermia (HT) remains unknown. We therefore evaluated the performance of three modern LLMs in answering HT-related questions.

Methods: We conducted an evaluation study by posing 40 open-ended questions (22 clinical and 18 physics-related) to three modern LLMs (DeepSeek-V3, Llama-3.3-70B-Instruct, and GPT-4o). Responses were blinded, randomized, and evaluated by 19 international experts with either a clinical or physics background for quality (5-point Likert scale: 1 = very bad, 2 = bad, 3 = acceptable, 4 = good, 5 = very good) and for potential harmfulness in clinical decision-making.

Results: A total of 1144 quality evaluations were collected. Mean quality scores were similar across models, with DeepSeek scoring 3.26, Llama 3.18, and GPT-4o 3.07, corresponding to an "acceptable" rating. Across expert evaluations, responses were considered potentially harmful in 17.8% of cases for DeepSeek, 19.3% for Llama, and 15.3% for GPT-4o. Notably, despite "acceptable" mean scores, approximately 25% of responses were rated "bad" or "very bad", and potentially harmful answers occurred in roughly 15-19% of evaluations, indicating a non-trivial risk if used without domain expertise.

Conclusion: Our findings indicate that the performance of the LLM versions available at the time of investigation on HT questions is only partially satisfactory. The proportion of poor-quality responses is too high and may lead non-domain experts to misinterpret the available clinical evidence and draw inappropriate clinical conclusions.
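The reported quality and harmfulness figures are simple aggregates over per-response expert ratings. A minimal sketch of that aggregation, using made-up ratings purely for illustration (the study's actual data are not reproduced here):

```python
from statistics import mean

# Hypothetical expert ratings (5-point Likert) for a handful of
# responses per model; values are illustrative, not the study's data.
ratings = {
    "DeepSeek-V3": [4, 3, 3, 2, 4, 3, 5, 2],
    "Llama-3.3-70B-Instruct": [3, 3, 4, 2, 3, 4, 3, 3],
    "GPT-4o": [3, 2, 4, 3, 3, 3, 2, 4],
}
# Hypothetical per-evaluation "potentially harmful" flags for one model.
harm_flags = [False, True, False, False, False, True, False, False]

# Mean quality per model, on the same Likert scale as the study.
mean_quality = {model: mean(scores) for model, scores in ratings.items()}

# Share of evaluations flagged as potentially harmful.
harm_rate = sum(harm_flags) / len(harm_flags)

print(mean_quality)
print(f"potentially harmful: {harm_rate:.1%}")
```

With real data, each model's mean would be computed over all blinded expert evaluations of its responses, as in the 1144 evaluations reported above.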

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Training-set percentile | Predicted probability
1 | International Journal of Radiation Oncology*Biology*Physics | 21 | Top 0.1% | 8.6%
2 | JCO Clinical Cancer Informatics | 18 | Top 0.1% | 8.6%
3 | PLOS ONE | 4510 | Top 25% | 6.9%
4 | Radiotherapy and Oncology | 18 | Top 0.1% | 6.9%
5 | Frontiers in Oncology | 95 | Top 0.5% | 6.5%
6 | Scientific Reports | 3102 | Top 22% | 4.9%
7 | Computers in Biology and Medicine | 120 | Top 0.6% | 4.0%
8 | BMJ Open | 554 | Top 5% | 4.0%
(50% of predicted probability mass above this line)
9 | Interface Focus | 14 | Top 0.1% | 3.7%
10 | JAMA Network Open | 127 | Top 0.9% | 3.7%
11 | Medical Physics | 14 | Top 0.2% | 3.7%
12 | Scientific Data | 174 | Top 0.6% | 2.8%
13 | Annals of Biomedical Engineering | 34 | Top 0.4% | 2.6%
14 | Biology Methods and Protocols | 53 | Top 0.9% | 1.7%
15 | Artificial Intelligence in Medicine | 15 | Top 0.3% | 1.7%
16 | JCO Precision Oncology | 14 | Top 0.2% | 1.7%
17 | BMC Bioinformatics | 383 | Top 5% | 1.4%
18 | Journal of Clinical Epidemiology | 28 | Top 0.4% | 1.2%
19 | npj Digital Medicine | 97 | Top 3% | 1.2%
20 | JMIR Medical Informatics | 17 | Top 1% | 1.2%
21 | BMJ Health & Care Informatics | 13 | Top 0.6% | 1.2%
22 | Journal of Medical Imaging | 11 | Top 0.2% | 1.2%
23 | PeerJ | 261 | Top 11% | 1.1%
24 | iScience | 1063 | Top 26% | 0.9%
25 | Journal of Translational Medicine | 46 | Top 2% | 0.9%
26 | EClinicalMedicine | 21 | Top 1% | 0.7%
27 | Clinical Cancer Research | 58 | Top 2% | 0.7%
28 | Cancers | 200 | Top 5% | 0.7%
29 | Computer Methods and Programs in Biomedicine | 27 | Top 1% | 0.7%
30 | Database | 51 | Top 1% | 0.7%
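The "top 8 journals account for 50%" cutoff is found by walking down the ranked probabilities until their cumulative sum reaches half the mass. A small sketch, using the top-10 probabilities from the table above:

```python
def journals_for_mass(probs, target=0.5):
    """Smallest number of top-ranked journals whose predicted
    probabilities sum to at least `target` of the total mass.
    `probs` must be sorted in descending order."""
    cumulative = 0.0
    for rank, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= target:
            return rank
    return len(probs)

# Top-10 predicted probabilities from the table, as fractions.
top_probs = [0.086, 0.086, 0.069, 0.069, 0.065,
             0.049, 0.040, 0.040, 0.037, 0.037]
print(journals_for_mass(top_probs))  # → 8
```

The first eight probabilities sum to 50.4%, while the first seven reach only 46.4%, which is why the cutoff line sits after rank 8.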