
Comparing prognostic performance and reasoning between large language models and physicians

Gjertsen, M.; Yoon, W.; Afshar, M.; Temte, B.; Leding, B.; Halliday, S.; Bradley, K.; Kim, J.; Mitchell, J.; Sanders, A. K.; Croxford, E. L.; Caskey, J.; Churpek, M. M.; Mayampurath, A.; Gao, Y.; Miller, T.; Kruser, J. M.

2026-04-25 intensive care and critical care medicine
10.64898/2026.04.17.26350898 medRxiv

Importance: Physicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illnesses. Yet these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), shows promise in supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood.

Objective: To characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness.

Design: Embedded mixed methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians.

Setting: The publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset.

Participants: We randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (OpenAI GPT-4o, o3- and o4-mini, and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician).

Main outcomes and measures: For each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive, qualitative content analysis to characterize the reasoning.

Results: Mean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) had an accuracy of 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%). The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in LLM and physician expressed reasoning, except that physicians regularly and explicitly reported uncertainty while LLMs did not.

Conclusion and Relevance: In this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest LLMs could be used to support prognostication in clinical practice but also raise safety concerns due to the lack of LLM uncertainty expression.

Key Points

Question: How do large language model (LLM) prognostic accuracy and reasoning compare with those of physicians when predicting 6-month mortality for adult survivors of critical illness?

Findings: In this embedded mixed methods study, physicians and large language models had comparable, moderate prognostic accuracy with similar expressed reasoning patterns, except that LLMs did not explicitly express uncertainty.

Meaning: Large language models may be able to support physician prognostication, although the inability of LLMs to express uncertainty poses an important safety consideration.
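The accuracy, sensitivity, and specificity reported in the abstract are standard ratios over a confusion matrix of predicted versus observed 6-month mortality. The sketch below shows how the three are computed. The abstract does not report a confusion matrix or the number of deaths among the 100 cases, so the counts used here are illustrative assumptions, chosen only because they reproduce the reported o4-mini point estimates; the class and function names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of predictions versus observed 6-month mortality."""
    tp: int  # predicted death, patient died
    fn: int  # predicted survival, patient died
    tn: int  # predicted survival, patient survived
    fp: int  # predicted death, patient survived


def accuracy(c: Confusion) -> float:
    """Share of all cases predicted correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def sensitivity(c: Confusion) -> float:
    """Share of patients who died that were predicted to die."""
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    """Share of patients who survived that were predicted to survive."""
    return c.tn / (c.tn + c.fp)


# ASSUMPTION: hypothetical counts for 100 cases with 50 deaths and 50
# survivors. These are not taken from the study.
example = Confusion(tp=40, fn=10, tn=38, fp=12)

print(f"accuracy    = {accuracy(example):.1%}")
print(f"sensitivity = {sensitivity(example):.1%}")
print(f"specificity = {specificity(example):.1%}")
```

Under these assumed counts, a model can trade specificity for sensitivity while overall accuracy stays in the same range, which is why the abstract reports all three measures rather than accuracy alone.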

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank. Journal: papers in training set, percentile, predicted probability

1. PLOS ONE: 4510 papers in training set, Top 7%, 21.8%
2. BMJ Open: 554 papers in training set, Top 1%, 13.9%
3. Critical Care Explorations: 15 papers in training set, Top 0.1%, 8.1%
4. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.4%, 6.6%

50% of probability mass above

5. Biology Methods and Protocols: 53 papers in training set, Top 0.1%, 6.1%
6. Journal of General Internal Medicine: 20 papers in training set, Top 0.1%, 6.1%
7. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 0.9%, 3.5%
8. Journal of Biomedical Informatics: 45 papers in training set, Top 0.5%, 3.5%
9. Scientific Reports: 3102 papers in training set, Top 46%, 2.5%
10. F1000Research: 79 papers in training set, Top 2%, 1.7%
11. JAMIA Open: 37 papers in training set, Top 0.9%, 1.6%
12. BMJ: 49 papers in training set, Top 0.7%, 1.4%
13. PLOS Digital Health: 91 papers in training set, Top 2%, 1.4%
14. Wellcome Open Research: 57 papers in training set, Top 1%, 1.4%
15. EClinicalMedicine: 21 papers in training set, Top 0.4%, 1.3%
16. JMIR Medical Informatics: 17 papers in training set, Top 1.0%, 1.3%
17. Healthcare: 16 papers in training set, Top 1%, 1.2%
18. JMIR Public Health and Surveillance: 45 papers in training set, Top 3%, 1.2%
19. npj Digital Medicine: 97 papers in training set, Top 3%, 1.2%
20. Emergency Medicine Journal: 20 papers in training set, Top 0.5%, 0.9%
21. JAMA Network Open: 127 papers in training set, Top 4%, 0.9%
22. International Journal of Environmental Research and Public Health: 124 papers in training set, Top 6%, 0.9%
23. BMC Public Health: 147 papers in training set, Top 6%, 0.7%
24. Journal of Medical Internet Research: 85 papers in training set, Top 5%, 0.7%
25. Archives of Clinical and Biomedical Research: 28 papers in training set, Top 3%, 0.6%
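
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The following is a minimal sketch: the probabilities are copied from the list above, while the function name journals_to_reach is hypothetical and how the site itself computes the cutoff is not stated.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1 to 25, copied from the list above.
probs = [
    21.8, 13.9, 8.1, 6.6, 6.1, 6.1, 3.5, 3.5, 2.5, 1.7,
    1.6, 1.4, 1.4, 1.4, 1.3, 1.3, 1.2, 1.2, 1.2, 0.9,
    0.9, 0.9, 0.7, 0.7, 0.6,
]


def journals_to_reach(probabilities, threshold):
    """Return the smallest rank whose cumulative probability meets the
    threshold, or None if the listed journals never reach it."""
    for rank, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return rank
    return None


print(journals_to_reach(probs, 50.0))  # prints 4
```

The 25 listed journals do not sum to 100%, so a higher threshold may never be reached from this list alone; the function returns None in that case.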