A Systematic Performance Evaluation of Three Large Language Models in Answering Questions on Moderate Hyperthermia
Dennstaedt, F.; Cihoric, N.; Bachmann, N.; Filchenko, I.; Berclaz, L.; Crezee, H.; Curto, S.; Ghadjar, P.; Huebenthal, B.; Hurwitz, M. D.; Kok, P.; Lindner, L. H.; Marder, D.; Molitoris, J.; Notter, M.; Rahman, S.; Riesterer, O.; Spalek, M.; Trefna, H.; Zilli, T.; Rodrigues, D.; Fuerstner, M.; Stutz, E.
Background: Large Language Models (LLMs) have demonstrated expert-level performance across many medical domains, suggesting potential utility in clinical practice. However, their reliability in the highly specialized domain of moderate hyperthermia (HT) remains unknown. We therefore evaluated the performance of three modern LLMs in answering HT-related questions.

Methods: We conducted an evaluation study by posing 40 open-ended questions (22 clinical and 18 physics-related) to three modern LLMs (DeepSeek-V3, Llama-3.3-70B-Instruct, and GPT-4o). Responses were blinded, randomized, and evaluated by 19 international experts with either a clinical or physics background for quality (5-point Likert scale: 1=very bad, 2=bad, 3=acceptable, 4=good, 5=very good) and for potential harmfulness in clinical decision-making.

Results: A total of 1144 quality evaluation responses were collected. Overall mean quality scores were similar across models, with DeepSeek scoring 3.26, Llama 3.18, and GPT-4o 3.07, corresponding to an "acceptable" rating. Across expert evaluations, responses were considered potentially harmful in 17.8% of cases for DeepSeek, 19.3% for Llama, and 15.3% for GPT-4o. Notably, despite "acceptable" mean scores, approximately 25% of responses were rated "bad" to "very bad," and potentially harmful answers occurred in ~15-19% of evaluations, indicating a non-trivial risk if used without domain expertise.

Conclusion: Our findings indicate that the performance of LLMs in HT, in the versions available at the time of investigation, is only partially satisfactory. The proportion of poor-quality responses is too high and may lead non-domain experts to misinterpret the available clinical evidence and draw inappropriate clinical conclusions.