Comparing Physicians' Assessments of Context-specific AI-powered clinical reasoning assistant with General-Purpose AI agent: A Prospective Multi-Site Physician Evaluation of VITA versus ChatGPT in India and Bangladesh
Mandke, C.; Agrawal, H. K.; Bharti, B.; Chansoria, M.; Gupta, G.; Rawat, S. K.; Sarkar, N. K.; Singh, A.; PS, S.; Walia, S.; VALID (Validation of AI in Low-resource and Indian Domains) Consortium
Background: Healthcare providers in low- and middle-income countries (LMICs) are increasingly relying on Artificial Intelligence (AI) tools, yet most available AI assistants are general-purpose systems not designed for the specific clinical, epidemiological, and resource contexts of these settings. There is no evidence from physicians' assessments on whether clinical reasoning support from purpose-built, context-specific, retrieval-augmented AI tools can outperform general-purpose AI agents.
Methods: We conducted a prospective multi-site validation study enrolling 37 physicians across India and Bangladesh. Each physician evaluated two AI tools: (a) VITA (Validated Intelligence for Treatment and Assessment), a purpose-built (context-specific and retrieval-augmented) clinical reasoning AI assistant trained on India-specific guidelines, antimicrobial resistance patterns, and formulary constraints, and (b) ChatGPT Plus (version 5.2), a leading general-purpose AI assistant. Both tools were evaluated on six hypothetical clinical case vignettes (three predefined, three physician-selected). Evaluations were scored across six dimensions (differential diagnosis, clinical workup, treatment recommendation, dosing, clinical decision-making, and evidence quality) on a 1-5 Likert scale, yielding 444 observations. Analyses included paired t-tests, Wilcoxon signed-rank tests, and multivariate regressions with robust standard errors.
Results: VITA scored significantly higher than ChatGPT across all six evaluation dimensions. The mean composite score (sum of all dimensions, maximum = 30) was 25.4 for VITA versus 22.3 for ChatGPT (difference = +3.1 points, t = 8.31, p < 0.001). The largest advantage was in evidence quality (VITA: 4.46 vs. ChatGPT: 3.14, a 42% relative gap). VITA's advantage was consistent across both predefined and physician-selected hypothetical cases and was robust to controls for physician demographics, case type, and evaluation order in multivariate regression (coefficient = +3.08, p < 0.001).
Conclusions: In this first systematic head-to-head physician evaluation of a purpose-built clinical reasoning AI assistant versus general-purpose AI in an LMIC setting, physicians consistently rated the context-specific tool as superior. These findings suggest that contextual relevance (including local guidelines, formulary constraints, and resistance patterns) matters for clinical AI adoption and quality in resource-limited settings.
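The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only: it uses just the numbers quoted above, the variable names are hypothetical, and it assumes that one "observation" means one physician rating one tool on one case vignette, which the abstract implies but does not state explicitly.

```python
# Figures quoted in the abstract; variable names are illustrative only.
n_physicians = 37
n_cases = 6        # three predefined + three physician-selected vignettes
n_tools = 2        # VITA and ChatGPT

# Assumption: one observation = one physician rating one tool on one case.
n_observations = n_physicians * n_cases * n_tools

# Relative gap in evidence quality, taking ChatGPT's mean score as the baseline.
vita_evidence = 4.46
chatgpt_evidence = 3.14
relative_gap = (vita_evidence - chatgpt_evidence) / chatgpt_evidence

# Difference in mean composite score (maximum = 30).
vita_composite = 25.4
chatgpt_composite = 22.3
composite_diff = vita_composite - chatgpt_composite

print(n_observations)             # 444
print(round(relative_gap * 100))  # 42
print(round(composite_diff, 1))   # 3.1
```

Under that assumption, the computed values agree with the reported 444 observations, the 42% relative gap, and the +3.1 point composite difference.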
Matching journals
The top 7 journals account for 50% of the predicted probability mass.