
Developing and Testing an Engineering Framework for Curiosity-Driven and Humble AI in Clinical Decision Support

Arslan, J.; Benke, K.; Cajas, S.; Castro, R.; Celi, L. A.; Cruz Suarez, G. A.; Delos Reyes, R.; Engelmann, J.; Ercole, A.; Hilel, A.; Kalla, M.; Kinyera, L.; Lange, M.; Lunde, T. M.; Meni, M. J.; Ocampo Osorio, F.; Premo, A.; Sedlakova, J.; Vig, P.

2026-02-07 · health informatics
DOI: 10.64898/2026.02.06.26345664 · medRxiv
Abstract

Background: We present BODHI (Balanced, Open-minded, Diagnostic, Humble, and Inquisitive), an engineering framework for curiosity-driven and humble clinical decision support AI. Despite growing capabilities, large language models (LLMs) often express inappropriate confidence, conflating statistical pattern recognition with genuine medical understanding. BODHI addresses this through a dual-reflective architecture that (1) decomposes epistemic uncertainty into task-specific dimensions and (2) constrains model responses using virtue-based stance rules derived from a Virtue Activation Matrix.

Methods: We validate the framework through controlled evaluation on 200 clinical vignettes from HealthBench Hard, assessing GPT-4o-mini and GPT-4.1-mini across 5 random seeds (1,800 total observations). Statistical analysis included bootstrap resampling, paired t-tests, and effect size computation (Supplementary Materials S3).

Findings: BODHI significantly improved overall clinical response quality (GPT-4.1-mini: +17.3 pp, p < 0.0001, Cohen's d = 0.50; GPT-4o-mini: +7.4 pp, p < 0.0001, Cohen's d = 0.22) while achieving very large effect sizes on curiosity (context-seeking rate: Cohen's d = 16.38 and 19.54) and humility (hedging: d = 5.80 for GPT-4.1-mini) metrics. Crucially, 97.3% of GPT-4.1-mini responses and 73.5% of GPT-4o-mini responses included appropriate clarifying questions, compared to 7.8% and 0.0% at baseline, demonstrating the framework's effectiveness in eliciting information-gathering behavior.

Interpretation: These findings suggest that LLMs can be reliably constrained to operate within epistemic boundaries when provided with structured uncertainty decomposition and virtue-aligned response rules, offering a pathway toward safer clinical AI deployment.
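The statistical comparison described in the abstract (paired t-tests with Cohen's d effect sizes) can be sketched as follows. The scores below are synthetic illustrations, not the study's data, and the effect size computed here is the paired-differences variant (often written d_z), one of several Cohen's d conventions; the paper does not specify which variant it uses.

```python
# Sketch of a paired comparison: paired t statistic and Cohen's d (d_z)
# on hypothetical per-vignette quality scores (baseline vs. BODHI prompt).
import math

baseline = [0.52, 0.48, 0.61, 0.55, 0.50, 0.58, 0.47, 0.53]
bodhi    = [0.66, 0.63, 0.70, 0.69, 0.64, 0.71, 0.60, 0.68]

diffs = [b - a for a, b in zip(baseline, bodhi)]
n = len(diffs)
mean_d = sum(diffs) / n
# Sample standard deviation of the differences (n - 1 in the denominator).
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))

t_stat = mean_d / (sd_d / math.sqrt(n))  # paired t statistic, df = n - 1
cohens_dz = mean_d / sd_d                # effect size on difference scores

print(f"t = {t_stat:.2f}, d_z = {cohens_dz:.2f}")
```

In practice one would use `scipy.stats.ttest_rel` for the t-test and bootstrap resampling (as the paper reports) for confidence intervals; the sketch above only shows where the two headline quantities come from.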

Matching journals

The top-ranked journal alone accounts for more than 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
-----|---------|------------------------|------------|------------
1 | npj Digital Medicine | 97 | Top 0.1% | 63.7%
2 | Journal of the American Medical Informatics Association | 61 | Top 0.5% | 5.1%
3 | Scientific Reports | 3102 | Top 48% | 2.2%
4 | Nature Machine Intelligence | 61 | Top 2% | 1.7%
5 | Nature Communications | 4913 | Top 50% | 1.7%
6 | JCO Clinical Cancer Informatics | 18 | Top 0.5% | 1.4%
7 | iScience | 1063 | Top 19% | 1.4%
8 | PLOS Digital Health | 91 | Top 2% | 1.4%
9 | Patterns | 70 | Top 1% | 1.3%
10 | The Lancet Digital Health | 25 | Top 0.6% | 1.3%
11 | Nature Medicine | 117 | Top 3% | 1.3%
12 | Frontiers in Digital Health | 20 | Top 1% | 1.0%
13 | Journal of Biomedical Informatics | 45 | Top 1% | 0.9%
14 | Computers in Biology and Medicine | 120 | Top 4% | 0.9%
15 | Bioinformatics | 1061 | Top 9% | 0.8%
16 | Nature Human Behaviour | 85 | Top 4% | 0.8%
17 | Nature Biomedical Engineering | 42 | Top 2% | 0.8%
18 | BMJ Health & Care Informatics | 13 | Top 0.9% | 0.7%
19 | International Journal of Medical Informatics | 25 | Top 2% | 0.7%
20 | Med | 38 | Top 1% | 0.7%
21 | PLOS ONE | 4510 | Top 70% | 0.7%
22 | Advanced Science | 249 | Top 23% | 0.5%