Calibrating trust in AI-assisted pituitary surgery

Hudson, G. R.; Khan, D. Z.; Fayez, F.; Bhatia, S.; Bano, S.; Costanza, E.; Blandford, A.; Stoyanov, D.; McCulloch, P.; Marcus, H. J.; University College London Collaborators

2026-06-04 surgery
10.64898/2026.06.02.26354735 medRxiv

Background: Endoscopic endonasal transsphenoidal surgery (EETS) requires navigation around neurocritical anatomy. Artificial intelligence clinical decision support systems (AI-CDSSs) can now help orientate surgeons, but clinician trust in AI remains unclear, limiting safe deployment. This study evaluates how modifiable design features affect trust and performance in a real-world pituitary surgery AI-CDSS.
Method: In an online study, 70 clinicians with pituitary surgery experience were randomised evenly to a Basic or an Enhanced AI-CDSS, each of which outlines the sella on EETS operative video. The Enhanced group additionally received an explanation of the model and previous publications, alongside confidence labels depicting outline reliability. Both groups annotated the sella on six video clips, first alone and then with the optional AI-CDSS. Clips were ordered by declining AI performance, except for the final clip. Self-reported trust was measured on a 1-7 scale after each annotation, and performance was measured as the Dice overlap between user annotations and the ground truth. Comparisons used Mann-Whitney U tests and permutation analysis.
Results: Sixty-four participants (91%) finished the exercise (31 Basic, 33 Enhanced). When the AI performed best, median trust was 5.00 in both arms (U=559, p=.521). However, when the AI performed worst, trust was significantly lower in the Enhanced group (3.00 vs 3.67, U=668, p=.035), a difference sustained in the final clip (3.67 vs 4.33, U=687, p=.019). User performance improved with the AI-CDSS, with no significant difference between the groups on the clips where the AI performed best or worst. Nevertheless, when the AI performed best, senior clinicians had higher median performance in the Enhanced group (0.95 vs 0.90, U=75, p=.066). There was also less dispersion in the Enhanced group when the AI was inaccurate (IQR: 0.07 vs 0.21, p=.004).
Conclusion: Interface design can improve trust calibration in a surgical AI-CDSS, and may improve performance among senior clinicians when the AI is accurate and consistency when the AI is inaccurate. In future, these features may form important safety checks during translation to the operating room.
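
The abstract measures performance as the Dice overlap between a user annotation and the ground truth. As a minimal sketch of that metric (not the authors' implementation), the function below computes Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks. The function name, the nested-list mask representation, and the handling of two empty masks are assumptions made for illustration.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # hypothetical representation: rows of 0/1 pixels


def dice_coefficient(pred: Mask, truth: Mask) -> float:
    """Dice overlap 2|A ∩ B| / (|A| + |B|) for two same-shaped binary masks."""
    same_rows = len(pred) == len(truth)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels marked in both masks
    total = 0         # |A| + |B|
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            total += p + t

    # Assumption: two empty masks are treated as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x3 masks: each marks 3 pixels, 2 of which coincide.
user_mask = [[0, 1, 1],
             [0, 1, 0]]
truth_mask = [[0, 1, 0],
              [0, 1, 1]]
print(round(dice_coefficient(user_mask, truth_mask), 3))  # 0.667
```

A score of 1 means the two outlines coincide exactly and 0 means they share no pixels, which is the scale on which values such as 0.95 and 0.90 in the Results are reported.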

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 97 papers in training set, Top 0.1%, 34.9%
2. PLOS ONE: 4510 papers in training set, Top 23%, 7.2%
3. Biology Methods and Protocols: 53 papers in training set, Top 0.1%, 6.7%
4. PLOS Computational Biology: 1633 papers in training set, Top 7%, 4.6%
50% of probability mass above
5. Scientific Reports: 3102 papers in training set, Top 28%, 4.2%
6. JMIR Formative Research: 32 papers in training set, Top 0.4%, 3.2%
7. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 1.0%, 2.7%
8. Annals of Biomedical Engineering: 34 papers in training set, Top 0.6%, 1.9%
9. Healthcare: 16 papers in training set, Top 0.5%, 1.8%
10. Frontiers in Public Health: 140 papers in training set, Top 4%, 1.8%
11. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.3%, 1.8%
12. Cancer Medicine: 24 papers in training set, Top 0.7%, 1.8%
13. Frontiers in Medicine: 113 papers in training set, Top 4%, 1.6%
14. BMC Neurology: 12 papers in training set, Top 0.4%, 1.6%
15. British Journal of Anaesthesia: 14 papers in training set, Top 0.5%, 1.4%
16. PLOS Biology: 408 papers in training set, Top 14%, 1.2%
17. JMIR Research Protocols: 18 papers in training set, Top 1%, 0.9%
18. Trials: 25 papers in training set, Top 1%, 0.8%
19. Bioengineering: 24 papers in training set, Top 1%, 0.8%
20. BMJ Open: 554 papers in training set, Top 12%, 0.8%
21. Journal of Medical Internet Research: 85 papers in training set, Top 5%, 0.7%
22. iScience: 1063 papers in training set, Top 36%, 0.7%
23. GigaScience: 172 papers in training set, Top 4%, 0.5%
24. Heliyon: 146 papers in training set, Top 8%, 0.5%
25. Journal of Neuroscience Methods: 106 papers in training set, Top 2%, 0.5%
26. Journal of Clinical Medicine: 91 papers in training set, Top 8%, 0.5%