Comparing prognostic performance and reasoning between large language models and physicians
Gjertsen, M.; Yoon, W.; Afshar, M.; Temte, B.; Leding, B.; Halliday, S.; Bradley, K.; Kim, J.; Mitchell, J.; Sanders, A. K.; Croxford, E. L.; Caskey, J.; Churpek, M. M.; Mayampurath, A.; Gao, Y.; Miller, T.; Kruser, J. M.
Show abstract
ImportancePhysicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illnesses. Yet, these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), show promise in supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood. ObjectiveTo characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness. DesignEmbedded mixed methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians. SettingThe publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset. ParticipantsWe randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (Open AI GPT-4o, o3- and o4-mini, and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician). Main outcomes and measuresFor each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive, qualitative content analysis to characterize reasonings. ResultsMean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) accuracy was 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%). The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in LLM and physician expressed reasoning, except that physicians regularly and explicitly reported uncertainty while LLMs did not. Conclusion and RelevanceIn this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest LLMs could be used to support prognostication in clinical practice but also raise safety concerns due to the lack of LLM uncertainty expression. KEY POINTSO_ST_ABSQuestionC_ST_ABSHow does large language model (LLM) prognostic accuracy and reasoning compare to physicians when predicting 6-month mortality for adult survivors of critical illness? FindingsIn this embedded mixed methods study, physicians and large language models had comparable, moderate prognostic accuracy with similar expressed reasoning patterns except that LLMs did not explicitly express uncertainty. MeaningLarge language models may be able to support physician prognostication, although the inability of LLMs to express uncertainty poses an important safety consideration.
Matching journals
The top 4 journals account for 50% of the predicted probability mass.