Measurement of retrieved chunk quality from real-world knowledge in retrieval-augmented generation: A Phase 1 foundational study

Fukataki, Y.; Hayashi, W.; Kitayama, M.; Ito, Y. M.

2026-01-02 · health informatics
medRxiv · doi:10.64898/2026.01.01.26343326
Abstract

Retrieval-augmented generation (RAG) holds promise for supporting high-stakes medical decision-making. However, most research has focused on downstream optimization of parameters and algorithms. This Phase 1 foundational study quantitatively evaluated the upstream quality of knowledge documents and their impact on retrieval performance, using Japanese clinical research protocol manuals for Institutional Review Board pre-screening support as a case study. We established a three-tier evaluation framework: Level 1 assessed knowledge document quality through independent expert review across Structure, Granularity, and Noise dimensions; Level 2a evaluated the structural quality of retrieved chunks across five metrics using a large language model as judge (LLM-as-a-Judge); and Level 2b conducted a proof-of-concept content appropriateness evaluation against a Gold Standard derived from international guidelines. Using Google Cloud Vertex AI Search, we analyzed 594 chunks from baseline knowledge (A-line: four institutional manuals as-is) and six chunks from optimized knowledge (B-line: proof of concept). Level 2a evaluations employed deterministic settings with five independent trials, achieving excellent reliability (intraclass correlation coefficient of 0.936). The results revealed substantial quality limitations in the A-line chunks: median scores were 2.0 or below across all five metrics, with fewer than 20% of chunks reaching the practical utility threshold (score of 4 or higher). Even among the top-ranked results, fewer than half met the practical utility criteria, except for Faithfulness. Inter-rater agreement in the Level 1 evaluation was fair (Fleiss' kappa of 0.269), indicating the need for framework refinement. Retrieved chunk lengths significantly exceeded the configured settings (median of 3,861 characters versus a configured 500 tokens), potentially indicating information dilution. The B-line optimization achieved perfect scores across all metrics, demonstrating the potential for improvement. These findings demonstrate that upstream document quality constrains retrieval performance, challenging assumptions regarding plug-and-play RAG deployment.

Author Summary

Artificial intelligence systems that retrieve information from documents and generate responses are increasingly being used to support medical decision-making. We questioned the assumption that uploading existing documents as-is is sufficient and that only the system's algorithms need tuning, by investigating whether document quality itself is a limiting factor. We studied Japanese clinical research manuals used for research ethics review, assessing how effectively an AI system retrieved information from these documents. We evaluated nearly 600 text segments retrieved by the system and found that fewer than one in five segments met our quality standards, even among the highest-ranked results. The system frequently retrieved excessively long passages that obscured key information. However, when we restructured one document section using clearer organization and formatting, the system achieved perfect performance scores. This improvement suggests that document preparation, not only algorithm optimization, is crucial for system effectiveness. Our findings challenge the "plug-and-play" assumption common in AI deployment. For high-stakes medical applications, organizations cannot simply expect reliable results from uploading existing documents; instead, they must invest in preparing well-structured knowledge documents. This foundational work establishes measurement methods to guide such preparation, which is essential before these systems can safely support healthcare decision-making.
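The Level 1 agreement figure above is a Fleiss' kappa across independent expert raters. As a minimal sketch of how that statistic is computed (illustrative rating counts only, not the study's data):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories table of rating counts.

    counts[i][j] = number of raters who assigned subject i (e.g. a document
    or chunk) to category j; every subject must have the same rater total.
    """
    N = len(counts)            # number of rated subjects
    n = sum(counts[0])         # raters per subject (assumed constant)
    k = len(counts[0])         # number of rating categories
    # Chance agreement: sum of squared marginal category proportions.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    # Observed agreement: per-subject pairwise agreement, averaged.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    return (P_bar - P_e) / (1 - P_e)

# Example: three raters, two categories, four hypothetical subjects.
kappa = fleiss_kappa([[2, 1], [1, 2], [3, 0], [2, 1]])
```

For the reported value of 0.269, the common Landis-Koch convention labels 0.21-0.40 as "fair" agreement, matching the authors' characterization.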

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Journal of the American Medical Informatics Association | 61 | Top 0.1% | 43.5%
2 | npj Digital Medicine | 97 | Top 0.3% | 15.4%
(50% of probability mass above this line)
3 | PLOS Digital Health | 91 | Top 0.2% | 10.6%
4 | BMJ Health & Care Informatics | 13 | Top 0.2% | 3.0%
5 | Scientific Reports | 3102 | Top 44% | 2.7%
6 | JAMIA Open | 37 | Top 0.7% | 1.9%
7 | Journal of Medical Internet Research | 85 | Top 2% | 1.8%
8 | PLOS ONE | 4510 | Top 57% | 1.4%
9 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.4%
10 | Artificial Intelligence in Medicine | 15 | Top 0.4% | 1.3%
11 | Frontiers in Digital Health | 20 | Top 1.0% | 1.0%
12 | Computers in Biology and Medicine | 120 | Top 4% | 0.9%
13 | Journal of Biomedical Informatics | 45 | Top 1% | 0.9%
14 | Biology Methods and Protocols | 53 | Top 2% | 0.8%
15 | Philosophical Transactions of the Royal Society B | 51 | Top 5% | 0.8%
16 | International Journal of Medical Informatics | 25 | Top 2% | 0.8%
17 | Journal of NeuroEngineering and Rehabilitation | 28 | Top 0.9% | 0.8%
18 | JCO Clinical Cancer Informatics | 18 | Top 0.8% | 0.8%
19 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.7%
20 | JMIR Medical Informatics | 17 | Top 2% | 0.5%
21 | Computer Methods and Programs in Biomedicine | 27 | Top 1% | 0.5%
22 | Healthcare | 16 | Top 2% | 0.5%