AI Agents in Clinical Medicine: A Systematic Review

Gorenshtein, A.; Omar, M.; Glicksberg, B. S.; Nadkarni, G.; Klang, E.

2025-08-26 health informatics
10.1101/2025.08.22.25334232 medRxiv

Background: AI agents built on large language models (LLMs) can plan tasks, use external tools, and coordinate with other agents. Unlike standard LLMs, agents can execute multi-step processes, access real-time clinical information, and integrate multiple data sources. There has been interest in using such agents for clinical and administrative tasks. However, little is known about their performance or about whether multi-agent systems function better than a single agent for healthcare tasks.

Purpose: To evaluate the performance of AI agents in healthcare, compare AI agent systems with standard LLMs, and catalog the tools used for task completion.

Data Sources: PubMed, Web of Science, and Scopus from October 1, 2022, through August 5, 2025.

Study Selection: Peer-reviewed studies implementing AI agents for clinical tasks with quantitative performance comparisons.

Data Extraction: Two reviewers (A.G., M.O.) independently extracted data on architectures, performance metrics, and clinical applications. Discrepancies were resolved by discussion, with a third reviewer (E.K.) consulted when consensus could not be reached.

Data Synthesis: Twenty studies met inclusion criteria. Across studies, all agent systems outperformed their baseline LLMs in accuracy. Improvements ranged from small gains to increases of over 60 percentage points, with a median improvement of 53 percentage points in single-agent tool-calling studies. These systems were particularly effective for discrete tasks such as medication dosing and evidence retrieval. Multi-agent systems showed optimal performance with up to 5 agents, and their effectiveness was particularly pronounced for highly complex tasks. The largest performance boost occurred when the complexity of the AI agent framework aligned with that of the task.

Limitations: Heterogeneous outcomes precluded quantitative meta-analysis. Several studies relied on synthetic data, limiting generalizability.

Conclusions: AI agents consistently improve the clinical task performance of base LLMs when architecture matches task complexity. Our analysis indicates a step-change over base LLMs, with AI agents opening previously inaccessible domains. Future efforts should be based on prospective, multi-center trials using real-world data to determine safety, task-matched performance, and cost-effectiveness.

Primary Funding Source: This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Registration: PROSPERO CRD420251120318

Matching journals

The top 2 journals together account for more than 50% of the predicted probability mass (44.5% + 19.9% = 64.4%).
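
The 50% cutoff can be checked with a running sum over the ranked probabilities. The following is a minimal Python sketch, not the page's own method: the function name `journals_to_reach` is hypothetical, and only the first five entries of the list below are included for brevity.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the first five ranked journals,
# copied from the list on this page. The remaining entries are omitted.
ranked = [
    ("Journal of the American Medical Informatics Association", 44.5),
    ("npj Digital Medicine", 19.9),
    ("BMJ Health & Care Informatics", 2.2),
    ("Journal of Biomedical Informatics", 2.0),
    ("JAMIA Open", 1.8),
]


def journals_to_reach(threshold, entries):
    """Return (count, cumulative %) for the smallest top-k set reaching threshold.

    Hypothetical helper: returns None if the threshold is never reached.
    """
    running_totals = accumulate(prob for _, prob in entries)
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count, total
    return None


count, total = journals_to_reach(50.0, ranked)
print(f"{count} journals reach the threshold with {total:.1f}% cumulative mass")
```

The first journal alone (44.5%) falls short of 50%, so the cutoff lands after the second entry.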

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 44.5% |
| 2 | npj Digital Medicine | 97 | Top 0.2% | 19.9% |
| | (50% of probability mass above) | | | |
| 3 | BMJ Health & Care Informatics | 13 | Top 0.3% | 2.2% |
| 4 | Journal of Biomedical Informatics | 45 | Top 0.7% | 2.0% |
| 5 | JAMIA Open | 37 | Top 0.7% | 1.8% |
| 6 | JCO Clinical Cancer Informatics | 18 | Top 0.4% | 1.8% |
| 7 | Journal of Medical Internet Research | 85 | Top 2% | 1.8% |
| 8 | Scientific Reports | 3102 | Top 55% | 1.8% |
| 9 | PLOS Digital Health | 91 | Top 1% | 1.8% |
| 10 | International Journal of Medical Informatics | 25 | Top 0.9% | 1.6% |
| 11 | JAMA Network Open | 127 | Top 3% | 1.4% |
| 12 | BMC Medicine | 163 | Top 4% | 1.3% |
| 13 | PLOS ONE | 4510 | Top 59% | 1.3% |
| 14 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.2% |
| 15 | Artificial Intelligence in Medicine | 15 | Top 0.5% | 1.0% |
| 16 | Healthcare | 16 | Top 1% | 1.0% |
| 17 | JMIR Medical Informatics | 17 | Top 1% | 1.0% |
| 18 | Frontiers in Digital Health | 20 | Top 1% | 1.0% |
| 19 | BMC Medical Research Methodology | 43 | Top 1% | 0.8% |
| 20 | Bioinformatics | 1061 | Top 9% | 0.8% |
| 21 | BMJ Open | 554 | Top 12% | 0.8% |
| 22 | The Lancet Digital Health | 25 | Top 1% | 0.8% |
| 23 | iScience | 1063 | Top 36% | 0.7% |
| 24 | Computers in Biology and Medicine | 120 | Top 5% | 0.7% |
| 25 | Journal of Clinical Epidemiology | 28 | Top 0.7% | 0.5% |
| 26 | Inflammatory Bowel Diseases | 15 | Top 0.3% | 0.5% |