
Large language models for self-administered conversational vignette assessment of provider competencies: A pilot and validation study in Vietnam with automated LLM-powered transcript classification

Daniels, B.; Zhang, W.; Nguyen, H.; Duong, D.

2026-03-04 health economics
10.64898/2026.03.02.26347479 medRxiv

We developed and validated a self-administered clinical vignette platform powered by a large language model (LLM), deployed through a SurveyCTO web survey, to measure primary health care provider competencies in Vietnam. In a pilot focus group, nine physicians rated LLM-simulated patient interactions as realistic (mean 3.78/5) and user-friendly. In the validation phase, 22 providers completed 132 vignette interactions across ten clinical scenarios in Vietnamese. Essential diagnostic checklist scores (human-coded from translated transcripts) correlated with expert clinician evaluations (Pearson's ρ = 0.55-0.60). LLM-automated coding of checklist items from translated English transcripts correlated reasonably with human coding (ρ = 0.53), and coding directly from Vietnamese transcripts performed comparably (ρ = 0.51), suggesting that a separate translation step may not be necessary. The total cost of 132 chatbot interactions was under USD 2. LLM-driven conversational vignettes represent a low-cost, scalable method for assessing provider competencies in respondents' local language, eliminating the need for large enumeration staff while preserving the open-ended format critical to vignette validity, and additionally enabling flexible feature extraction from transcripts using grading rubrics. The platform is open source and designed for replication in other health system contexts.

Author summary
Measuring the clinical skills of healthcare providers is essential for improving the quality of care, but current survey methods are expensive and require trained enumerators to travel to health facilities in person. We developed a new approach that uses large language models (LLMs), the technology behind tools like ChatGPT and Claude, to simulate patients in realistic clinical conversations that healthcare providers can complete on their phones or laptops over the Internet in their own language. In Vietnam, we tested this tool with 31 physicians across ten clinical scenarios. Providers found the simulated patient conversations realistic and easy to use. We also tested whether LLMs could automatically score the conversations: automated scoring showed reasonable agreement with human scoring and performed nearly as well when applied directly to Vietnamese transcripts, without a separate translation step. When we compared these results against holistic expert physician ratings of the same conversations, the scores agreed well, suggesting that automatic transcript grading based on rubrics produces meaningful measures of clinical skill. The tool cost less than two US dollars for over a hundred consultations and required no in-person surveyors, making it potentially transformative for routine, large-scale monitoring of healthcare quality in resource-limited settings. The platform and code are openly available for adaptation.
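The rubric-based transcript coding described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the checklist item names are invented for the example, and the model reply is stubbed as a JSON string rather than produced by a real LLM API call.

```python
import json

# Hypothetical checklist for one vignette (illustrative items only,
# not the study's actual essential-diagnostic rubric).
CHECKLIST = [
    "asked_about_symptom_duration",
    "asked_about_fever",
    "ordered_blood_glucose_test",
]

def build_coding_prompt(transcript: str, checklist: list[str]) -> str:
    """Assemble a rubric-based coding prompt asking the model to return
    a JSON object with one true/false flag per checklist item."""
    items = "\n".join(f"- {item}" for item in checklist)
    return (
        "You are coding a clinical vignette transcript.\n"
        "For each checklist item below, answer true if the provider did it:\n"
        f"{items}\n"
        "Respond with a JSON object mapping each item name to true or false.\n\n"
        f"Transcript:\n{transcript}"
    )

def score_response(raw_json: str, checklist: list[str]) -> float:
    """Parse the model's JSON reply and compute the share of checklist
    items completed (items missing from the reply count as not done)."""
    coded = json.loads(raw_json)
    done = sum(bool(coded.get(item, False)) for item in checklist)
    return done / len(checklist)

# Example with a stubbed model reply (no API call is made here):
reply = (
    '{"asked_about_symptom_duration": true, '
    '"asked_about_fever": false, '
    '"ordered_blood_glucose_test": true}'
)
print(score_response(reply, CHECKLIST))  # 2 of 3 items completed
```

In a real deployment, the prompt would be sent to an LLM endpoint and the returned JSON parsed the same way; checklist scores like this one are what would then be correlated against expert ratings.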

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Journal of Medical Internet Research | 85 | Top 0.1% | 34.7%
2 | BMC Health Services Research | 42 | Top 0.2% | 7.2%
3 | PLOS ONE | 4510 | Top 24% | 7.2%
4 | BMJ Open | 554 | Top 3% | 6.7%
(50% of probability mass above this line)
5 | Journal of the American Medical Informatics Association | 61 | Top 0.9% | 2.9%
6 | PLOS Global Public Health | 293 | Top 3% | 2.2%
7 | Frontiers in Digital Health | 20 | Top 0.5% | 2.0%
8 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 1.8%
9 | Scientific Reports | 3102 | Top 55% | 1.8%
10 | npj Digital Medicine | 97 | Top 2% | 1.8%
11 | Frontiers in Public Health | 140 | Top 4% | 1.8%
12 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.8%
13 | JAMIA Open | 37 | Top 1.0% | 1.4%
14 | BJGP Open | 12 | Top 0.4% | 1.4%
15 | Journal of Biomedical Informatics | 45 | Top 0.9% | 1.4%
16 | Journal of General Internal Medicine | 20 | Top 0.6% | 1.4%
17 | Healthcare | 16 | Top 1% | 1.2%
18 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.9%
19 | BMJ Health & Care Informatics | 13 | Top 0.7% | 0.9%
20 | Proceedings of the National Academy of Sciences | 2130 | Top 40% | 0.9%
21 | BMC Medical Research Methodology | 43 | Top 1% | 0.9%
22 | Cancer Medicine | 24 | Top 1% | 0.8%
23 | Medical Decision Making | 10 | Top 0.2% | 0.8%
24 | Nature Medicine | 117 | Top 4% | 0.8%
25 | Heliyon | 146 | Top 5% | 0.8%
26 | PLOS Digital Health | 91 | Top 3% | 0.8%
27 | Psychiatry Research | 35 | Top 1% | 0.8%
28 | European Radiology | 14 | Top 0.7% | 0.8%
29 | JMIR Formative Research | 32 | Top 2% | 0.7%
30 | BJPsych Open | 25 | Top 0.9% | 0.5%