Safety and Utility of an Agentic Large Language Model-Based Hospital Course Summarizer: A Prospective Real-World Pilot Study

Grolleau, F.; Liang, A. S.; Keyes, T.; Ma, S. P.; Lew, T.; Huynh, T. R.; Steele, N.; Chung, P.; Qin, P.; Chandra, G.; Wang, S. F.; Mullen, E.; Carpenter, L.; Hoppenfeld, M.; Morrin, M.; Kyerematen, B. A.; Ambers, N.; Kotecha, N.; Alsentzer, E.; Hom, J.; Shah, N. H.; Schulman, K.; Chen, J. H.

2026-02-06 health informatics
10.64898/2026.02.05.26345607 medRxiv

Importance: High-quality discharge summaries are essential for safe care transitions but contribute substantially to clinician documentation burden and burnout. While retrospective studies suggest large language models (LLMs) can generate clinical summaries of comparable quality to physicians, prospective data on their safety, utility, and impact on clinician well-being in real-world environments are lacking.

Objective: To evaluate the safety, utilization, and impact on clinician burden of MedAgentBrief, an LLM-based agentic workflow for generating hospital course summaries, during prospective clinical deployment.

Design, Setting, and Participants: Single-arm prospective pilot study encompassing 384 hospital discharges at one academic inpatient medicine unit from August 1 to October 11, 2025, with baseline comparisons drawn from April 9 to July 31, 2025.

Intervention: MedAgentBrief, a custom agentic AI workflow utilizing Gemini 2.5 Pro, generated draft hospital course summaries nightly using the patient's history and physical and daily progress notes. Drafts were securely emailed to physicians daily for review and optional use.

Main Outcomes and Measures: The primary outcome was physician-reported potential for and severity of harm from unedited summaries (AHRQ Common Format Harm Scale). Secondary outcomes included utilization rate, error types (omissions, inaccuracies, hallucinations), time spent in discharge summaries (EHR logs), and changes in cognitive burden (NASA Task Load Index [NASA-TLX]) and burnout (Stanford Professional Fulfillment Index [PFI] Work Exhaustion Scale).

Results: The system generated 1274 summaries. Of 384 discharges, physicians utilized AI content in 219 (57%) cases. Feedback on 100 summaries (40.2%) noted omissions (25%) and inaccuracies (20%) but rare hallucinations (2%). Physicians rated 88% of unedited summaries as having no harm potential and 1% as likely to cause moderate harm; no severe harm was reported. Physician burnout scores decreased significantly (1.75 vs 1.20; P = .03). Time savings were heterogeneous: 71% of physicians saw reductions in median documentation time (up to 2.9 minutes).

Conclusions and Relevance: An LLM-based agentic workflow produced hospital course summaries that were frequently utilized with mild to minimal risk of harm identified. The intervention was associated with a significant reduction in physician burnout, supporting the viability of AI summarization to mitigate documentation burden.

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association: 61 papers in training set; Top 0.1%; 39.9%
2. npj Digital Medicine: 97 papers in training set; Top 0.3%; 14.6%

50% of probability mass above

3. Journal of Medical Internet Research: 85 papers in training set; Top 0.7%; 6.4%
4. JAMIA Open: 37 papers in training set; Top 0.2%; 4.9%
5. Journal of Biomedical Informatics: 45 papers in training set; Top 0.4%; 3.6%
6. Journal of General Internal Medicine: 20 papers in training set; Top 0.2%; 3.3%
7. JMIR Medical Informatics: 17 papers in training set; Top 0.6%; 1.9%
8. Frontiers in Digital Health: 20 papers in training set; Top 0.5%; 1.8%
9. JAMA Network Open: 127 papers in training set; Top 2%; 1.8%
10. BMJ Health & Care Informatics: 13 papers in training set; Top 0.4%; 1.7%
11. BMC Medical Informatics and Decision Making: 39 papers in training set; Top 1%; 1.7%
12. Healthcare: 16 papers in training set; Top 1%; 1.0%
13. JMIR Formative Research: 32 papers in training set; Top 1%; 0.9%
14. PLOS ONE: 4510 papers in training set; Top 64%; 0.9%
15. JCO Clinical Cancer Informatics: 18 papers in training set; Top 0.7%; 0.9%
16. iScience: 1063 papers in training set; Top 29%; 0.8%
17. Scientific Reports: 3102 papers in training set; Top 72%; 0.8%
18. JAMA Pediatrics: 10 papers in training set; Top 0.2%; 0.8%
19. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set; Top 2%; 0.7%
20. International Journal of Medical Informatics: 25 papers in training set; Top 2%; 0.7%
21. Annals of Internal Medicine: 27 papers in training set; Top 1%; 0.7%
22. Inflammatory Bowel Diseases: 15 papers in training set; Top 0.3%; 0.7%
23. Critical Care Explorations: 15 papers in training set; Top 0.5%; 0.7%
24. JMIR Public Health and Surveillance: 45 papers in training set; Top 5%; 0.5%
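
The 50% cutoff in the ranked list above can be checked by accumulating the listed probabilities in rank order. The following is a minimal Python sketch, assuming the percentages are exactly as displayed on the page (they are rounded to one decimal place); the function name `journals_to_reach` is hypothetical and is not part of the site that produced the ranking.

```python
# Predicted probabilities (percent), copied in rank order from the list above.
probs = [39.9, 14.6, 6.4, 4.9, 3.6, 3.3, 1.9, 1.8, 1.8, 1.7, 1.7, 1.0,
         0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.5]


def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if the list never gets there."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count
    return None


# Hypothetical usage: how many journals cover half of the probability mass?
print(journals_to_reach(probs, 50.0))  # 2
```

With these values the first journal alone stays below 50% and the first two together exceed it, which agrees with the statement that the top 2 journals account for 50% of the predicted probability mass.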