Retrospective Quality Analysis of a Clinical RAG Chatbot: Observable Signals and Lessons Learned

Khashei, I.; Presciani, D.; Martinelli, L. P.; Grosjean, S.

2026-01-27 emergency medicine
10.64898/2026.01.26.26344757 medRxiv
Retrieval-augmented generation (RAG) is increasingly adopted to ground clinical conversational agents in external knowledge sources, yet many deployed prototypes lack the observability required for standard RAG evaluation. In particular, retrieved documents and grounding context are often not logged, preventing direct assessment of retrieval quality and faithfulness. We report a post-hoc evaluation of EMSy, a clinical RAG-based chatbot prototype, based on 2,660 multi-turn conversations collected between January and September 2025. Rather than benchmarking performance, we adopt an evaluation strategy based exclusively on observable signals. The analysis combines an exploratory intent analysis conducted on a random subset of heterogeneous interactions, automated quality scores available at the message and conversation level, and explicit user feedback, with 96.0% of rated conversations receiving positive feedback. Results indicate that message-level minimum scores capture localized low-quality responses that are not reflected by average conversation-level metrics, while user feedback reflects aggregate interaction impressions. This case study illustrates how diagnostic insights can be obtained under limited observability and identifies implications for the design and evaluation of future clinical RAG systems.
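
The contrast between message-level minimum scores and conversation-level averages can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 0 to 1 score scale, the two thresholds, the function names, and the toy data. The sketch only shows why a single poor response can be hidden by a mean but exposed by a minimum.

```python
from statistics import mean


def summarize_conversation(message_scores):
    """Return (mean, min) of the per-message quality scores of one conversation."""
    if not message_scores:
        raise ValueError("conversation has no scored messages")
    return mean(message_scores), min(message_scores)


def flag_localized_failures(conversations, mean_ok=0.7, min_bad=0.4):
    """List conversations whose average looks acceptable but whose worst message does not.

    mean_ok and min_bad are hypothetical thresholds on an assumed 0-1 scale.
    """
    flagged = []
    for conv_id, scores in conversations.items():
        avg, worst = summarize_conversation(scores)
        if avg >= mean_ok and worst < min_bad:
            flagged.append(conv_id)
    return flagged


# Toy data (invented): one conversation with a single weak answer,
# one uniformly good conversation, one uniformly weak conversation.
conversations = {
    "conv-a": [0.9, 0.95, 0.2, 0.9],
    "conv-b": [0.8, 0.85, 0.8],
    "conv-c": [0.3, 0.35, 0.4],
}

if __name__ == "__main__":
    for conv_id, scores in conversations.items():
        avg, worst = summarize_conversation(scores)
        print(f"{conv_id}: mean={avg:.2f} min={worst:.2f}")
    print("localized failures:", flag_localized_failures(conversations))
```

In this toy setup only the first conversation is flagged: its average passes the assumed acceptance threshold while its minimum does not, which is the kind of localized low-quality response that an average alone would not surface.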

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Scientific Reports: 3102 papers in training set, Top 1%, 17.9%
2. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.1%, 10.7%
3. PLOS ONE: 4510 papers in training set, Top 21%, 8.6%
4. Journal of Medical Internet Research: 85 papers in training set, Top 0.5%, 8.6%
5. npj Digital Medicine: 97 papers in training set, Top 0.9%, 4.4%

50% of probability mass above

6. PLOS Digital Health: 91 papers in training set, Top 0.7%, 3.7%
7. Frontiers in Digital Health: 20 papers in training set, Top 0.3%, 3.0%
8. Frontiers in Public Health: 140 papers in training set, Top 3%, 2.4%
9. PLOS Biology: 408 papers in training set, Top 6%, 2.1%
10. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 27%, 2.1%
11. Nature Medicine: 117 papers in training set, Top 1%, 2.1%
12. Journal of the American Medical Informatics Association: 61 papers in training set, Top 1%, 1.8%
13. iScience: 1063 papers in training set, Top 14%, 1.7%
14. Computers in Biology and Medicine: 120 papers in training set, Top 3%, 1.4%
15. International Journal of Medical Informatics: 25 papers in training set, Top 1%, 1.4%
16. Philosophical Transactions of the Royal Society B: 51 papers in training set, Top 4%, 1.3%
17. Bioinformatics: 1061 papers in training set, Top 8%, 1.0%
18. Nucleic Acids Research: 1128 papers in training set, Top 15%, 0.9%
19. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 2%, 0.9%
20. Cureus: 67 papers in training set, Top 4%, 0.8%
21. Nature Human Behaviour: 85 papers in training set, Top 4%, 0.8%
22. Journal of Biomedical Informatics: 45 papers in training set, Top 1%, 0.8%
23. Healthcare: 16 papers in training set, Top 2%, 0.8%
24. Nature Communications: 4913 papers in training set, Top 62%, 0.8%
25. Computer Methods and Programs in Biomedicine: 27 papers in training set, Top 0.9%, 0.8%
26. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 2%, 0.8%
27. Emergency Medicine Journal: 20 papers in training set, Top 0.6%, 0.8%
28. Cell Reports Methods: 141 papers in training set, Top 5%, 0.7%
29. PLOS Computational Biology: 1633 papers in training set, Top 28%, 0.5%
30. Scientific Data: 174 papers in training set, Top 3%, 0.5%