
VaaS is a Multi-Layer Hallucination Reduction Pipeline for AI-Assisted Science: Production Validation and Prospective Benchmarking

Sabharwal, A.; Patel, M. S.; Carrano, A.; Rotman, M.; Wierson, W.; Ekker, S. C.

2026-03-30 health informatics
10.64898/2026.03.24.26348935 medRxiv

The deployment of large language models (LLMs) for science carries an intrinsic risk: hallucination of citations, fabricated drug approvals or clinical trials, and unsupported experimental outcomes. Here we describe the testing and deployment of a novel, systematic, multi-layer approach called the Validation as a System (VaaS) pipeline, developed iteratively during the construction of an open-source, living Rare Disease Database (RDD). We report lessons learned and production results from 225 carefully annotated rare disease gene curations and a prospective 100-gene collection (99 net new), together representing over 3,000 verified citations. After three iterations of directed refinement, the net functional hallucination rate approached zero. We validated the pipeline using three complementary benchmarks: (1) VaaS-RIKER2, a 640-run prospective ablation study (4 conditions x 4 temperatures x 40 genes) plus 117 open-weight model runs on dedicated GPU hardware, in which unguided LLM output produced 95.9% Type II hallucination (wrong-topic citations: real papers cited in a correct claim context that nonetheless do not support the cited claim), the full VaaS protocol achieved 0.0% Type I and 6.5% Type II (a >14-fold reduction), and live PMID verification alone (C3) eliminated both error types entirely (0.0%/0.0%); (2) an independent L3 citation audit of Wave 3 (179 PMIDs, 99.4% valid, 0 Type I errors); and (3) the MedHallu clinical hallucination benchmark, on which the VaaS protocol achieved F1 = 0.9853 on the hard tier (cases where all benchmark ensemble models were fooled), compared to the published GPT-4o baseline of F1 = 0.811 (Pandit et al., 2025). Three independent open-weight models (llama3.2, qwen2.5:14b, mistral:7b) showed 81-87% Type II rates under unguided conditions, confirming that wrong-topic citation hallucination is structural and model-agnostic. In contrast, the corresponding VaaS rate was zero under the same conditions (n = 508 verified citations; 160 runs, C4 full protocol). Human validation of ≥50 entries confirmed zero Type I errors and less than 0.5% Type II errors in the manual curation test. The VaaS pipeline operated at under ~$1 per comprehensive gene review, demonstrating that citation-integrity standards in AI-assisted biomedical synthesis are achievable at production scale. The VaaS approach represents, to the authors' knowledge, the system with the lowest measured hallucination rate for science to date and is set to further accelerate the use of AI and AI agents for advancing research.
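
As a quick consistency check on the figures quoted in the abstract, the sketch below recomputes the size of the VaaS-RIKER2 ablation grid and the fold reduction in Type II hallucination. The function and variable names are illustrative assumptions and are not part of the VaaS pipeline; only the numeric inputs come from the abstract.

```python
# Figures quoted in the abstract for the VaaS-RIKER2 ablation study.
CONDITIONS = 4        # ablation conditions; the abstract names C3 and C4
TEMPERATURES = 4      # sampling temperatures
GENES = 40            # genes per condition/temperature cell

UNGUIDED_TYPE_II_PCT = 95.9   # unguided LLM output, Type II rate (%)
VAAS_TYPE_II_PCT = 6.5        # full VaaS protocol, Type II rate (%)


def ablation_runs(conditions: int, temperatures: int, genes: int) -> int:
    """Number of runs in a full factorial ablation grid."""
    return conditions * temperatures * genes


def fold_reduction(baseline_pct: float, treated_pct: float) -> float:
    """Ratio of baseline rate to treated rate (undefined for a zero rate)."""
    if treated_pct <= 0:
        raise ValueError("fold reduction is undefined when the treated rate is zero")
    return baseline_pct / treated_pct


total_runs = ablation_runs(CONDITIONS, TEMPERATURES, GENES)
runs_per_condition = total_runs // CONDITIONS
reduction = fold_reduction(UNGUIDED_TYPE_II_PCT, VAAS_TYPE_II_PCT)

print(f"total runs: {total_runs}")
print(f"runs per condition: {runs_per_condition}")
print(f"Type II fold reduction: {reduction:.2f}")
```

Under these inputs the grid yields the 640 runs stated in the abstract, with 160 runs per condition (matching the run count quoted for C4), and the ratio of the two Type II rates is consistent with the reported ">14-fold" reduction.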

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Nature Methods | 336 papers in training set | Top 0.4% | 18.4%
2. Nature | 575 papers in training set | Top 3% | 8.3%
3. Nature Communications | 4913 papers in training set | Top 24% | 8.1%
4. Nature Machine Intelligence | 61 papers in training set | Top 0.6% | 4.8%
5. Nature Biotechnology | 147 papers in training set | Top 2% | 4.8%
6. Bioinformatics | 1061 papers in training set | Top 5% | 4.3%
7. Nature Genetics | 240 papers in training set | Top 2% | 4.1%
50% of probability mass above
8. Patterns | 70 papers in training set | Top 0.2% | 3.5%
9. Nature Computational Science | 50 papers in training set | Top 0.2% | 3.2%
10. Science | 429 papers in training set | Top 10% | 3.2%
11. Cell Systems | 167 papers in training set | Top 5% | 2.3%
12. Molecular Systems Biology | 142 papers in training set | Top 0.4% | 2.1%
13. Genome Biology | 555 papers in training set | Top 4% | 1.7%
14. Scientific Reports | 3102 papers in training set | Top 60% | 1.6%
15. Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.6%
16. eLife | 5422 papers in training set | Top 47% | 1.3%
17. Nature Medicine | 117 papers in training set | Top 3% | 1.3%
18. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 1.2%
19. GENETICS | 189 papers in training set | Top 1.0% | 1.1%
20. Nature Plants | 84 papers in training set | Top 1% | 0.9%
21. Cell Genomics | 162 papers in training set | Top 5% | 0.9%
22. Med | 38 papers in training set | Top 0.5% | 0.9%
23. Molecular Cell | 308 papers in training set | Top 9% | 0.9%
24. Cell Reports Medicine | 140 papers in training set | Top 7% | 0.9%
25. Nature Biomedical Engineering | 42 papers in training set | Top 2% | 0.7%
26. Cell | 370 papers in training set | Top 17% | 0.7%
27. GigaScience | 172 papers in training set | Top 3% | 0.7%
28. Computers in Biology and Medicine | 120 papers in training set | Top 5% | 0.7%
29. PLOS ONE | 4510 papers in training set | Top 72% | 0.6%
30. npj Digital Medicine | 97 papers in training set | Top 4% | 0.6%