A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Skobelev, K.; Fithian, E.; Baranovski, Y.; Cook, J.; Angara, S.; Otto, S.; Yi, Z.-F.; Zhu, J.; Donoho, D. A.; Han, X. Y.; Mainkar, N.; Masson-Forsythe, M.

2026-03-28 · surgery
medRxiv. DOI: 10.64898/2026.03.26.26349455
Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks (including multimodal data integration, human interaction, and physical effects), generally capable AI models could be particularly attractive as collaborative tools if their performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether, and to what extent, modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short on the seemingly simple task of tool detection in neurosurgery. Additionally, we present scaling experiments indicating that increasing model size and training time yields only diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot simply be "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | PLOS Computational Biology | 1633 | Top 0.5% | 23.1% |
| 2 | npj Digital Medicine | 97 | Top 0.2% | 19.9% |
| 3 | Biology Methods and Protocols | 53 | Top 0.1% | 8.6% |
| 4 | Heliyon | 146 | Top 0.1% | 5.0% |
| 5 | PLOS ONE | 4510 | Top 35% | 4.1% |
| 6 | Nature Medicine | 117 | Top 1% | 3.1% |
| 7 | Scientific Reports | 3102 | Top 42% | 2.9% |
| 8 | Proceedings of the National Academy of Sciences | 2130 | Top 27% | 2.1% |
| 9 | Nature Machine Intelligence | 61 | Top 2% | 1.9% |
| 10 | eLife | 5422 | Top 41% | 1.7% |
| 11 | Human Brain Mapping | 295 | Top 3% | 1.4% |
| 12 | Nature | 575 | Top 12% | 1.4% |
| 13 | Brain Communications | 147 | Top 2% | 1.4% |
| 14 | iScience | 1063 | Top 21% | 1.3% |
| 15 | Nature Human Behaviour | 85 | Top 3% | 1.3% |
| 16 | Frontiers in Computational Neuroscience | 53 | Top 2% | 0.9% |
| 17 | Expert Systems with Applications | 11 | Top 0.3% | 0.9% |
| 18 | Communications Psychology | 20 | Top 0.3% | 0.8% |
| 19 | Bioinformatics | 1061 | Top 9% | 0.8% |
| 20 | JAMIA Open | 37 | Top 2% | 0.7% |
| 21 | International Journal of Medical Informatics | 25 | Top 2% | 0.7% |
| 22 | Frontiers in Neuroinformatics | 38 | Top 0.8% | 0.7% |
| 23 | Nature Communications | 4913 | Top 63% | 0.7% |
| 24 | BMC Medical Informatics and Decision Making | 39 | Top 3% | 0.7% |
| 25 | Nature Methods | 336 | Top 6% | 0.7% |
| 26 | Cancers | 200 | Top 6% | 0.5% |
| 27 | Patterns | 70 | Top 3% | 0.5% |
| 28 | GigaScience | 172 | Top 4% | 0.5% |
| 29 | Frontiers in Medicine | 113 | Top 9% | 0.5% |
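As a sanity check on the "top 3 journals account for 50% of the predicted probability mass" statement, a minimal sketch that walks the listed probabilities in rank order and finds where the cumulative mass first reaches 50% (assuming the percentages shown above are the relevant portion of the model's output):

```python
# Probabilities (in percent) for the 29 journals listed above, in rank order.
probs = [23.1, 19.9, 8.6, 5.0, 4.1, 3.1, 2.9, 2.1, 1.9, 1.7,
         1.4, 1.4, 1.4, 1.3, 1.3, 0.9, 0.9, 0.8, 0.8, 0.7,
         0.7, 0.7, 0.7, 0.7, 0.7, 0.5, 0.5, 0.5, 0.5]

# Accumulate probability mass down the ranking until it crosses 50%.
cumulative = 0.0
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        print(f"Top {rank} journals cover {cumulative:.1f}% of the probability mass")
        break
```

Running this confirms the cutoff: the first two journals cover 43.0%, and adding the third (8.6%) pushes the cumulative mass to 51.6%, past the 50% threshold.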