Measurement of retrieved chunk quality from real-world knowledge in retrieval-augmented generation: A Phase 1 foundational study
Fukataki, Y.; Hayashi, W.; Kitayama, M.; Ito, Y. M.
Abstract
Retrieval-augmented generation (RAG) holds promise for supporting high-stakes medical decision-making. However, most research has focused on downstream optimization of parameters and algorithms. This Phase 1 foundational study quantitatively evaluated the upstream quality of knowledge documents and their impact on retrieval performance, using Japanese clinical research protocol manuals for Institutional Review Board pre-screening support as a case study. We established a three-tier evaluation framework: Level 1 assessed knowledge document quality through independent expert review across Structure, Granularity, and Noise dimensions; Level 2a evaluated the structural quality of retrieved chunks using a large language model as a judge (LLM-as-a-Judge) across five metrics; and Level 2b conducted a proof-of-concept content appropriateness evaluation against a Gold Standard derived from international guidelines. Using Google Cloud Vertex AI Search, we analyzed 594 chunks from baseline knowledge (A-line: four institutional manuals as-is) and six chunks from optimized knowledge (B-line: proof of concept). Level 2a evaluations employed deterministic settings with five independent trials, achieving excellent reliability (intraclass correlation coefficient of 0.936). The results revealed substantial quality limitations in the A-line chunks: median scores were 2.0 or below across all five metrics, with fewer than 20% of the chunks reaching practical utility thresholds (score of 4 or higher). Even among the top-ranked results, fewer than half met the practical utility criteria, except for Faithfulness. Inter-rater agreement in the Level 1 evaluation was fair (Fleiss' kappa of 0.269), indicating the need for framework refinement. The retrieved chunk lengths significantly exceeded the configured chunk size (median of 3,861 characters versus a 500-token setting), potentially indicating information dilution. The B-line optimization achieved perfect scores across all metrics, demonstrating the potential for improvement. These findings demonstrate that upstream document quality constrains retrieval performance, challenging assumptions regarding plug-and-play RAG deployment.

Author Summary
Artificial intelligence systems that retrieve information from documents and generate responses are increasingly being used to support medical decision-making. Rather than focusing on fine-tuning these systems' algorithms, we questioned the assumption that uploading existing documents is sufficient and investigated whether document quality is a limiting factor. We studied Japanese clinical research manuals used for research ethics review, assessing how effectively an AI system retrieved information from them. We evaluated nearly 600 text segments retrieved by the system and found that fewer than one in five met our quality standards, even among the highest-ranked results. The system frequently retrieved excessively long passages that obscured key information. However, when we restructured one document section with clearer organization and formatting, the system achieved perfect performance scores. This improvement suggests that document preparation, not just algorithm optimization, is crucial for system effectiveness. Our findings challenge the "plug-and-play" assumption common in AI deployment. For high-stakes medical applications, organizations cannot expect reliable results simply by uploading existing documents; they must invest in preparing well-structured knowledge documents.
This foundational work establishes measurement methods to guide such preparation, which is essential before these systems can safely support healthcare decision-making.
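The abstract reports two reliability statistics: Fleiss' kappa (0.269) for agreement among the Level 1 expert raters, and an intraclass correlation coefficient (0.936) across the five deterministic LLM-as-a-Judge trials in Level 2a. As a rough illustration of the former, the following Python sketch computes Fleiss' kappa from a small made-up rating matrix; the function and example data are assumptions for illustration only, not the authors' implementation or data.

```python
# Minimal sketch of Fleiss' kappa for categorical ratings of documents.
# The rating matrix below is hypothetical; the study's Level 1 review used
# independent expert raters scoring Structure, Granularity, and Noise.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix of rating counts,
    assuming every item was rated by the same number of raters."""
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]                    # raters per item
    p_j = counts.sum(axis=0) / (n_items * n_raters)     # category proportions
    # Per-item observed agreement, then overall observed vs. chance agreement.
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 4 documents, 3 raters, 3 quality categories.
ratings = np.array([
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 1],
    [0, 2, 1],
])
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")
```

A kappa around 0.2 to 0.4, as reported for the Level 1 review, is conventionally read as "fair" agreement, which is why the authors flag the evaluation framework for refinement.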