
Representation Before Retrieval: Structured Patient Artifacts Reduce Hallucination in Clinical AI Systems

Scanlin, J.; Cuesta, A.; Varsavsky, M.

2026-02-16 health informatics
10.64898/2026.02.13.26346256 medRxiv

Background: Large language models show promise for clinical decision support, yet their propensity for hallucination (generating plausible but unsupported claims) poses substantial patient safety risks. Retrieval-augmented generation (RAG) is widely assumed to mitigate this problem by grounding outputs in retrieved documents, but this assumption remains inadequately tested in clinical contexts where information density, temporal complexity, and safety stakes are uniquely high.

Methods: We developed a system that compiles heterogeneous patient data (electronic health records, wearables, genomics, imaging reports) into structured, machine-readable artifacts with explicit provenance tracking across seven clinical domains. We evaluated four conditions: baseline LLM (C0), RAG over raw clinical text (C1), artifact-augmented single-pass generation (C2), and artifact-augmented multi-step agent workflow with verification (C3). Using 100 synthetic patient vignettes evaluated across 3 random seeds (N = 300 per condition, 1,200 total), we measured unsupported claim rates, factual accuracy, temporal consistency, contraindication detection, and clinical safety metrics using GPT-4o-mini with physician-adjudicated safety review.

Results: RAG substantially increased hallucination: unsupported claim rates rose from 5.0% (95% CI: 3.8-6.4%) at baseline to 43.6% (95% CI: 40.1-47.2%) with retrieval, an 8.7-fold increase (p < 0.001, Cohen's d = 2.31). Structured artifacts reduced unsupported claims to 8.4% (95% CI: 6.7-10.3%) in single-pass generation, a 40% relative reduction versus baseline (p = 0.02, d = 0.48). The agent workflow achieved 21.1% unsupported claims with the lowest contraindication miss rate (0.04) and highest clinician utility scores. Ablation analysis revealed that citation requirements and constraint checking contributed most to safety improvements.

Conclusions: Contrary to prevailing assumptions, RAG increases rather than decreases hallucination in clinical text generation. Structured representation with explicit provenance offers a more effective approach to grounding LLM outputs in verifiable patient data. We propose an information-theoretic framework explaining why representation quality determines the ceiling on factual reliability, while agentic verification affects uncertainty handling and safety constraint enforcement.
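To make the idea of a provenance-tracked artifact and an unsupported claim rate concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Fact` and `Claim` classes, the key format, the source paths, and the rule that a claim counts as unsupported when it cites nothing or cites a key missing from the artifact are all assumptions made for illustration, and the patient data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One artifact entry; all field names are hypothetical."""
    domain: str    # clinical domain, e.g. "medications"
    key: str       # identifier a generated claim can cite
    value: str
    source: str    # provenance: where the fact came from
    recorded: str  # ISO date of the source record


@dataclass(frozen=True)
class Claim:
    text: str
    cited_keys: tuple  # artifact keys the claim cites as support


def unsupported_claim_rate(claims, artifact):
    """Fraction of claims citing nothing, or citing a key absent from the artifact."""
    if not claims:
        return 0.0
    known = {fact.key for fact in artifact}
    unsupported = [
        c for c in claims
        if not c.cited_keys or any(k not in known for k in c.cited_keys)
    ]
    return len(unsupported) / len(claims)


# Invented example data, for illustration only.
artifact = [
    Fact("medications", "med:warfarin", "5 mg daily", "ehr/med_list", "2025-11-02"),
    Fact("labs", "lab:inr", "2.4", "ehr/lab_results", "2025-11-20"),
]
claims = [
    Claim("Patient takes warfarin 5 mg daily.", ("med:warfarin",)),
    Claim("Most recent INR was 2.4.", ("lab:inr",)),
    Claim("Patient has a penicillin allergy.", ()),   # cites nothing
    Claim("eGFR is reduced.", ("lab:egfr",)),         # cites a missing key
]

rate = unsupported_claim_rate(claims, artifact)
print(rate)  # 0.5
```

Under these assumptions, two of the four claims lack support in the artifact, so the rate is 0.5. A real evaluation would also need to check that the cited fact actually entails the claim, which this sketch does not attempt.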

Matching journals

The top 2 journals together account for more than 50% of the predicted probability mass (41.5% + 19.5% = 61.0%).

Rank  Journal                                                  Papers in training set  Percentile  Probability
1     npj Digital Medicine                                     97                      Top 0.1%    41.5%
2     Journal of the American Medical Informatics Association  61                      Top 0.1%    19.5%
(50% of probability mass above this line)
3     Scientific Reports                                       3102                    Top 33%     3.8%
4     JCO Clinical Cancer Informatics                          18                      Top 0.3%    2.9%
5     PLOS ONE                                                 4510                    Top 44%     2.7%
6     Philosophical Transactions of the Royal Society B        51                      Top 3%      1.9%
7     The Lancet Digital Health                                25                      Top 0.4%    1.7%
8     Journal of Biomedical Informatics                        45                      Top 0.9%    1.6%
9     BMJ Health & Care Informatics                            13                      Top 0.5%    1.6%
10    Nature Medicine                                          117                     Top 3%      1.4%
11    Nature Communications                                    4913                    Top 54%     1.4%
12    Frontiers in Digital Health                              20                      Top 0.9%    1.3%
13    Bioinformatics                                           1061                    Top 9%      0.9%
14    Annals of Internal Medicine                              27                      Top 0.8%    0.8%
15    PLOS Digital Health                                      91                      Top 2%      0.8%
16    iScience                                                 1063                    Top 28%     0.8%
17    JAMIA Open                                               37                      Top 1%      0.8%
18    Computers in Biology and Medicine                        120                     Top 6%      0.5%
19    GENETICS                                                 189                     Top 2%      0.5%
20    BMJ Open                                                 554                     Top 14%     0.5%
21    Journal of Personalized Medicine                         28                      Top 2%      0.5%
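
The 50% cutoff in the ranking above can be reproduced by accumulating the listed probabilities in rank order until the running total reaches the threshold. The sketch below assumes that is how the cutoff is defined; the function name `journals_to_reach` is hypothetical, and only the first five listed probabilities are used.

```python
def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative_total) for the first rank at which the
    running sum of probabilities reaches the threshold, or None if never."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None


# Predicted probabilities (in percent) for ranks 1-5, as listed above.
top_probs = [41.5, 19.5, 3.8, 2.9, 2.7]

result = journals_to_reach(top_probs, 50.0)
print(result)  # (2, 61.0)
```

The first journal alone (41.5%) falls short of 50%, so the cutoff lands after the second, where the cumulative total is 61.0%.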