Comparing prognostic performance and reasoning between large language models and physicians

Gjertsen, M.; Yoon, W.; Afshar, M.; Temte, B.; Leding, B.; Halliday, S.; Bradley, K.; Kim, J.; Mitchell, J.; Sanders, A. K.; Croxford, E. L.; Caskey, J.; Churpek, M. M.; Mayampurath, A.; Gao, Y.; Miller, T.; Kruser, J. M.

2026-04-25 · Intensive care and critical care medicine
medRxiv · DOI: 10.64898/2026.04.17.26350898
Abstract

Importance: Physicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illness. Yet these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), shows promise for supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood.

Objective: To characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness.

Design: Embedded mixed-methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians.

Setting: The publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset.

Participants: We randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (OpenAI GPT-4o, o3-mini, and o4-mini; and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician).

Main Outcomes and Measures: For each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and to provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive qualitative content analysis to characterize the expressed reasoning.

Results: Mean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) achieved accuracy of 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%). The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in the reasoning expressed by LLMs and physicians, except that physicians regularly and explicitly reported uncertainty while LLMs did not.

Conclusions and Relevance: In this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest that LLMs could be used to support prognostication in clinical practice, but they also raise safety concerns given the lack of uncertainty expression by LLMs.
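The performance measures above (accuracy, sensitivity, and specificity, each with a 95% CI) follow their standard definitions over paired binary labels and predictions. Below is a minimal sketch of how such estimates might be computed, assuming case-level 0/1 outcomes and a percentile bootstrap for the intervals; the abstract does not specify the CI method, and all names here are illustrative, not taken from the paper:

```python
import random

def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on deaths, label 1), and
    specificity (recall on survivors, label 0) from 0/1 sequences."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity

def bootstrap_ci(y_true, y_pred, metric=0, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) CI for one metric, resampling
    cases with replacement. metric: 0=accuracy, 1=sensitivity, 2=specificity."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        s = confusion_metrics([y_true[i] for i in idx],
                              [y_pred[i] for i in idx])[metric]
        if s == s:  # drop resamples where the metric is undefined (NaN)
            stats.append(s)
    stats.sort()
    k = len(stats)
    return stats[int(alpha / 2 * k)], stats[int((1 - alpha / 2) * k) - 1]
```

With 100 cases per rater, `bootstrap_ci(labels, predictions)` would yield intervals of roughly the widths quoted above; whether the study used a bootstrap or an analytic method is not stated in the abstract.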

Matching journals

The top 5 journals account for just over 50% of the predicted probability mass (51.4% cumulative; a quick check follows the table).

Rank  Probability  Papers in training set  Percentile  Journal
   1  18.2%        4510                    Top 10%     PLOS ONE
   2  14.1%        15                      Top 0.1%    Critical Care Explorations
   3  6.7%         39                      Top 0.4%    BMC Medical Informatics and Decision Making
   4  6.2%         554                     Top 4%      BMJ Open
   5  6.2%         53                      Top 0.1%    Biology Methods and Protocols
      ----  50% of probability mass above this line  ----
   6  4.7%         61                      Top 0.6%    Journal of the American Medical Informatics Association
   7  4.7%         3102                    Top 25%     Scientific Reports
   8  4.7%         20                      Top 0.1%    Journal of General Internal Medicine
   9  3.5%         45                      Top 0.5%    Journal of Biomedical Informatics
  10  1.8%         37                      Top 0.7%    JAMIA Open
  11  1.8%         17                      Top 0.6%    JMIR Medical Informatics
  12  1.7%         91                      Top 1%      PLOS Digital Health
  13  1.6%         127                     Top 2%      JAMA Network Open
  14  1.5%         97                      Top 2%      npj Digital Medicine
  15  1.3%         49                      Top 0.8%    BMJ
  16  1.3%         21                      Top 0.4%    EClinicalMedicine
  17  1.2%         16                      Top 1%      Healthcare
  18  1.2%         45                      Top 3%      JMIR Public Health and Surveillance
  19  1.1%         57                      Top 1%      Wellcome Open Research
  20  0.9%         20                      Top 0.4%    Emergency Medicine Journal
  21  0.9%         124                     Top 6%      International Journal of Environmental Research and Public Health
  22  0.8%         85                      Top 4%      Journal of Medical Internet Research
  23  0.7%         79                      Top 5%      F1000Research
  24  0.7%         14                      Top 0.8%    British Journal of Anaesthesia
  25  0.6%         12                      Top 0.5%    Physiological Measurement
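The 50% cutoff marked in the table is simple cumulative arithmetic: 18.2 + 14.1 + 6.7 + 6.2 + 6.2 = 51.4, so the running total first reaches 50% at rank 5. A short sketch of that computation (the function and variable names are illustrative):

```python
def rank_reaching_mass(probs, threshold=0.50):
    """Return the rank at which the cumulative probability of a
    ranked list first reaches the threshold."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank
    return len(probs)

# Top 5 predicted probabilities from the table above.
top_probs = [0.182, 0.141, 0.067, 0.062, 0.062]
print(rank_reaching_mass(top_probs))  # -> 5 (cumulative 0.514, first value >= 0.50)
```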