Artificial intelligence-driven virtual tumor board enhances precision care in myelodysplastic syndromes

Swoboda, D. M.; DeZern, A. E.; England, J. T.; Venugopal, S.; Kehoe, T.; Aubrey, B. J.; Raddi, M. G.; Consagra, A.; Wang, J.; Andreadakis, J.; Rivero, G.; Stahl, M.; Zeidan, A. M.; Haferlach, T.; Brunner, A. M.; Buckstein, R.; Santini, V.; Della Porta, M. G.; Sekeres, M. A.; Nazha, A.

2026-03-27 hematology
10.64898/2026.03.26.26349088 medRxiv
Background: Large language models (LLMs) perform well on standardized medical exam questions, but their reliability for complex hematology decision making is uncertain. We compared four general-purpose LLMs (GPT-4o, GPT-o3, Claude Sonnet 4, and DeepSeek-V3) with a Virtual MDS Panel (VMP), a coordinated multi-agent AI system in which domain-specialized, rule-bound software agents (WHO/ICC guidelines; IPSS-R/IPSS-M; NCCN) collaborate to generate tumor-board-level recommendations.

Methods: Each model generated diagnostic, prognostic, and treatment recommendations for 30 myelodysplastic syndrome cases. Nine international MDS experts from five institutions, blinded to model identity, completed 3,000 structured ratings using 5-point Likert scales for diagnosis, prognosis, and therapy, and classified errors by severity.

Results: General-purpose LLMs achieved modest expert ratings (overall mean scores: 3.7 for GPT-o3, 3.2 for GPT-4o, 3.1 for DeepSeek, and 3.0 for Claude) and contained major factual errors in at least 24% of responses. The VMP increased the proportion of outputs rated 4 or higher to 87% (vs. 34-66% for general-purpose models), improved mean scores to 4.3 overall (4.3 for diagnosis, 4.4 for prognosis, and 4.1 for therapy), and reduced major errors to 8%.

Conclusions: In this blinded evaluation of 30 complex MDS cases, general-purpose LLMs produced clinically important errors at rates that raise safety concerns for autonomous hematology decision making. The VMP, a rule-bound, multi-agent architecture, approached expert-level accuracy, supporting its potential future role as a decision-support tool for MDS.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | British Journal of Haematology | 15 papers in training set | Top 0.1% | 23.2%
2 | npj Precision Oncology | 48 papers in training set | Top 0.1% | 15.2%
3 | Blood Advances | 54 papers in training set | Top 0.2% | 6.6%
4 | npj Digital Medicine | 97 papers in training set | Top 0.7% | 6.6%
(50% of probability mass above)
5 | JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.1% | 6.6%
6 | Leukemia | 39 papers in training set | Top 0.2% | 5.0%
7 | PLOS ONE | 4510 papers in training set | Top 38% | 3.7%
8 | PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.7%
9 | Blood Cancer Journal | 11 papers in training set | Top 0.1% | 1.7%
10 | Blood | 67 papers in training set | Top 0.7% | 1.7%
11 | Scientific Reports | 3102 papers in training set | Top 63% | 1.4%
12 | Transplantation | 13 papers in training set | Top 0.3% | 1.4%
13 | Clinical Infectious Diseases | 231 papers in training set | Top 4% | 1.1%
14 | Modern Pathology | 21 papers in training set | Top 0.3% | 0.9%
15 | JCO Precision Oncology | 14 papers in training set | Top 0.3% | 0.9%
16 | Cancer Research Communications | 46 papers in training set | Top 0.9% | 0.9%
17 | JCI Insight | 241 papers in training set | Top 6% | 0.9%
18 | Clinical Pharmacology & Therapeutics | 25 papers in training set | Top 0.6% | 0.9%
19 | Frontiers in Digital Health | 20 papers in training set | Top 1% | 0.9%
20 | European Journal of Cancer | 10 papers in training set | Top 0.4% | 0.8%
21 | Nucleic Acids Research | 1128 papers in training set | Top 16% | 0.8%
22 | Informatics in Medicine Unlocked | 21 papers in training set | Top 1.0% | 0.8%
23 | PLOS Digital Health | 91 papers in training set | Top 3% | 0.7%
24 | Journal of The Royal Society Interface | 189 papers in training set | Top 5% | 0.7%
25 | Bioinformatics Advances | 184 papers in training set | Top 5% | 0.5%
26 | Frontiers in Medicine | 113 papers in training set | Top 8% | 0.5%
27 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 2% | 0.5%
28 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 3% | 0.5%