Retrospective Quality Analysis of a Clinical RAG Chatbot: Observable Signals and Lessons Learned
Khashei, I.; Presciani, D.; Martinelli, L. P.; Grosjean, S.
Retrieval-augmented generation (RAG) is increasingly adopted to ground clinical conversational agents in external knowledge sources, yet many deployed prototypes lack the observability required for standard RAG evaluation. In particular, retrieved documents and grounding context are often not logged, preventing direct assessment of retrieval quality and faithfulness. We report a post-hoc evaluation of EMSy, a clinical RAG-based chatbot prototype, drawing on 2,660 multi-turn conversations collected between January and September 2025. Rather than benchmarking performance, we adopt an evaluation strategy based exclusively on observable signals. The analysis combines an exploratory intent analysis conducted on a random subset of heterogeneous interactions, automated quality scores available at the message and conversation level, and explicit user feedback, with 96.0% of rated conversations receiving positive feedback. Results indicate that message-level minimum scores capture localized low-quality responses that are not reflected by average conversation-level metrics, whereas user feedback reflects aggregate interaction impressions. This case study illustrates how diagnostic insights can be obtained under limited observability and identifies implications for the design and evaluation of future clinical RAG systems.
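The contrast the abstract draws between message-level minimum scores and conversation-level averages can be illustrated with a minimal sketch (not the paper's code; the function name and score values are hypothetical): a single poor response in an otherwise good conversation barely moves the mean but dominates the minimum.

```python
# Minimal sketch, assuming per-message quality scores in [0, 1].
# Illustrates why a per-message minimum can flag a localized
# low-quality response that the conversation-level mean hides.

def conversation_metrics(message_scores):
    """Return (mean, min) quality aggregates for one conversation."""
    mean_score = sum(message_scores) / len(message_scores)
    min_score = min(message_scores)
    return mean_score, min_score

# Hypothetical conversation with one poor reply among good ones:
scores = [0.9, 0.85, 0.2, 0.95, 0.9]
mean_s, min_s = conversation_metrics(scores)
# mean ≈ 0.76 looks acceptable, while min = 0.2 exposes the bad turn.
```

Under this reading, tracking the minimum alongside the mean is what surfaces the localized failures that aggregate user feedback, which reflects overall impressions, would miss.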