Developing and Testing an Engineering Framework for Curiosity-Driven and Humble AI in Clinical Decision Support
Arslan, J.; Benke, K.; Cajas, S.; Castro, R.; Celi, L. A.; Cruz Suarez, G. A.; Delos Reyes, R.; Engelmann, J.; Ercole, A.; Hilel, A.; Kalla, M.; Kinyera, L.; Lange, M.; Lunde, T. M.; Meni, M. J.; Ocampo Osorio, F.; Premo, A.; Sedlakova, J.; Vig, P.
Background: We present BODHI (Balanced, Open-minded, Diagnostic, Humble, and Inquisitive), an engineering framework for curiosity-driven and humble clinical decision support AI. Despite growing capabilities, large language models (LLMs) often express inappropriate confidence, conflating statistical pattern recognition with genuine medical understanding. BODHI addresses this through a dual-reflective architecture that: (1) decomposes epistemic uncertainty into task-specific dimensions, and (2) constrains model responses using virtue-based stance rules derived from a Virtue Activation Matrix.

Methods: We validate the framework through controlled evaluation on 200 clinical vignettes from HealthBench Hard, assessing GPT-4o-mini and GPT-4.1-mini across 5 random seeds (1,800 total observations). Statistical analysis included bootstrap resampling, paired t-tests, and effect size computation (Supplementary Materials S3).

Findings: BODHI significantly improved overall clinical response quality (GPT-4.1-mini: +17.3pp, p < 0.0001, Cohen's d = 0.50; GPT-4o-mini: +7.4pp, p < 0.0001, Cohen's d = 0.22) while achieving very large effect sizes on curiosity (context-seeking rate: Cohen's d = 16.38 and 19.54) and humility (hedging: d = 5.80 for GPT-4.1-mini) metrics. Crucially, 97.3% of GPT-4.1-mini responses and 73.5% of GPT-4o-mini responses included appropriate clarifying questions, compared to 7.8% and 0.0% at baseline, demonstrating the framework's effectiveness in eliciting information-gathering behavior.

Interpretation: These findings suggest LLMs can be reliably constrained to operate within epistemic boundaries when provided with structured uncertainty decomposition and virtue-aligned response rules, offering a pathway toward safer clinical AI deployment.
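The statistical analysis named in the Methods (paired t-tests, Cohen's d, bootstrap resampling) can be sketched in plain Python. This is a minimal illustration of those standard techniques on hypothetical per-vignette scores, not the paper's actual pipeline, which is detailed in its Supplementary Materials S3; all function and variable names here are assumptions.

```python
import math
import random
import statistics

def paired_effects(baseline, treated):
    """Paired t statistic and paired-samples Cohen's d (d_z).

    `baseline` and `treated` are per-vignette quality scores for the
    same items under the two conditions (hypothetical data).
    """
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample SD of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff             # effect size on the differences
    return t_stat, cohens_d

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

For example, uniformly higher treated scores yield a positive t statistic and a positive Cohen's d, with a bootstrap interval that excludes zero when the improvement is consistent across vignettes.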