
Large language models for self-administered conversational vignette assessment of provider competencies: A pilot and validation study in Vietnam with automated LLM-powered transcript classification

Daniels, B.; Zhang, W.; Nguyen, H.; Duong, D.

2026-03-04 health economics
10.64898/2026.03.02.26347479 medRxiv

We developed and validated a self-administered clinical vignette platform powered by a large language model (LLM), deployed through a SurveyCTO web survey, to measure primary health care provider competencies in Vietnam. In a pilot focus group, nine physicians rated LLM-simulated patient interactions as realistic (mean 3.78/5) and user-friendly. In the validation phase, 22 providers completed 132 vignette interactions across ten clinical scenarios in Vietnamese. Essential diagnostic checklist scores (human-coded from translated transcripts) correlated with expert clinician evaluations (Pearson's ρ = 0.55-0.60). LLM-automated coding of checklist items from translated English transcripts correlated reasonably with human coding (ρ = 0.53), and coding directly from Vietnamese transcripts performed comparably (ρ = 0.51), suggesting that a separate translation step may not be necessary. The total cost of 132 chatbot interactions was under USD 2. LLM-driven conversational vignettes represent a low-cost, scalable method for assessing provider competencies in respondents' local language, eliminating the need for large enumeration staff while preserving the open-ended format critical to vignette validity, and additionally enabling flexible feature extraction from transcripts using grading rubrics. The platform is open source and designed for replication in other health system contexts.

Author summary
Measuring the clinical skills of healthcare providers is essential for improving the quality of care, but current survey methods are expensive and require trained enumerators to travel to health facilities in person. We developed a new approach that uses large language models (LLMs), the technology behind tools like ChatGPT and Claude, to simulate patients in realistic clinical conversations that healthcare providers can complete on their phones or laptops over the Internet in their own language. In Vietnam, we tested this tool with 31 physicians across ten clinical scenarios. Providers found the simulated patient conversations realistic and easy to use. We also tested whether LLMs could automatically score the conversations: automated scoring showed reasonable agreement with human scoring and performed nearly as well when applied directly to Vietnamese transcripts, without a separate translation step. When we compared these results against holistic expert physician ratings of the same conversations, the scores agreed well, suggesting that automatic transcript grading based on rubrics produces meaningful measures of clinical skill. The tool cost less than two US dollars for over a hundred consultations and required no in-person surveyors, making it potentially transformative for routine, large-scale monitoring of healthcare quality in resource-limited settings. The platform and code are openly available for adaptation.
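The rubric-based transcript coding described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the checklist item names are invented for the example, and the model reply is stubbed as a JSON string rather than produced by a real LLM API call.

```python
import json

# Hypothetical checklist for one vignette (illustrative items only,
# not the study's actual essential-diagnostic rubric).
CHECKLIST = [
    "asked_about_symptom_duration",
    "asked_about_fever",
    "ordered_blood_glucose_test",
]

def build_coding_prompt(transcript: str, checklist: list[str]) -> str:
    """Assemble a rubric-based coding prompt asking the model to return
    a JSON object with one true/false flag per checklist item."""
    items = "\n".join(f"- {item}" for item in checklist)
    return (
        "You are coding a clinical vignette transcript.\n"
        "For each checklist item below, answer true if the provider did it:\n"
        f"{items}\n"
        "Respond with a JSON object mapping each item name to true or false.\n\n"
        f"Transcript:\n{transcript}"
    )

def score_response(raw_json: str, checklist: list[str]) -> float:
    """Parse the model's JSON reply and compute the share of checklist
    items completed (items missing from the reply count as not done)."""
    coded = json.loads(raw_json)
    done = sum(bool(coded.get(item, False)) for item in checklist)
    return done / len(checklist)

# Example with a stubbed model reply (no API call is made here):
reply = (
    '{"asked_about_symptom_duration": true, '
    '"asked_about_fever": false, '
    '"ordered_blood_glucose_test": true}'
)
print(score_response(reply, CHECKLIST))  # 2 of 3 items completed
```

In a real deployment, the prompt would be sent to an LLM endpoint and the returned JSON parsed the same way; checklist scores like this one are what would then be correlated against expert ratings.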

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Journal of Medical Internet Research | 85 | Top 0.1% | 34.7%
2 | BMC Health Services Research | 42 | Top 0.2% | 7.2%
3 | PLOS ONE | 4510 | Top 24% | 7.2%
4 | BMJ Open | 554 | Top 3% | 6.7%
(50% of probability mass above this line)
5 | Journal of the American Medical Informatics Association | 61 | Top 0.9% | 2.9%
6 | PLOS Global Public Health | 293 | Top 3% | 2.2%
7 | Frontiers in Digital Health | 20 | Top 0.5% | 2.0%
8 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 1.8%
9 | Scientific Reports | 3102 | Top 55% | 1.8%
10 | npj Digital Medicine | 97 | Top 2% | 1.8%
11 | Frontiers in Public Health | 140 | Top 4% | 1.8%
12 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.8%
13 | JAMIA Open | 37 | Top 1.0% | 1.4%
14 | BJGP Open | 12 | Top 0.4% | 1.4%
15 | Journal of Biomedical Informatics | 45 | Top 0.9% | 1.4%
16 | Journal of General Internal Medicine | 20 | Top 0.6% | 1.4%
17 | Healthcare | 16 | Top 1% | 1.2%
18 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.9%
19 | BMJ Health & Care Informatics | 13 | Top 0.7% | 0.9%
20 | Proceedings of the National Academy of Sciences | 2130 | Top 40% | 0.9%
21 | BMC Medical Research Methodology | 43 | Top 1% | 0.9%
22 | Cancer Medicine | 24 | Top 1% | 0.8%
23 | Medical Decision Making | 10 | Top 0.2% | 0.8%
24 | Nature Medicine | 117 | Top 4% | 0.8%
25 | Heliyon | 146 | Top 5% | 0.8%
26 | PLOS Digital Health | 91 | Top 3% | 0.8%
27 | Psychiatry Research | 35 | Top 1% | 0.8%
28 | European Radiology | 14 | Top 0.7% | 0.8%
29 | JMIR Formative Research | 32 | Top 2% | 0.7%
30 | BJPsych Open | 25 | Top 0.9% | 0.5%