Relation Extraction for Diet, Non-Communicable Disease and Biomarker Associations (RECoDe): A CoDiet study
Choi, D.; Gu, Y.; Zong, K.; Lain, A. D.; Zaikis, D.; Rowlands, T.; Rei, M.; CoDiet Consortium; Beck, T.; Posma, J. M.
Diet plays a critical role in human health, with growing evidence linking dietary habits to disease outcomes. However, extracting structured dietary knowledge from biomedical literature remains challenging due to the lack of dedicated relation extraction (RE) datasets. To address this gap, we introduce RECoDe, a novel RE dataset designed specifically for diet, disease, and related biomedical entities. RECoDe captures a diverse set of relation types, including a broad spectrum of positive association patterns and explicit negative examples, with over 5,000 human-annotated instances validated by up to five independent annotators. Furthermore, we benchmark a range of natural language processing (NLP) RE models, including BERT-based architectures and locally deployed large language models (LLMs) with enhanced prompting techniques intended to improve classification performance on underrepresented relation types. The best-performing model, gpt-oss-20B (a locally deployed LLM), achieved an F1-score of 64% for multi-class classification and 92% for binary classification using a hierarchical prompting strategy with a separate, built-in reflection step. To demonstrate the practical utility of RECoDe, we introduce the Contextual Co-occurrence Summarisation (Co-CoS) framework, which aggregates sentence-level relation extractions into document-level summaries and further integrates evidence across multiple documents. Co-CoS produces effect estimates consistent with established dietary knowledge, supporting its validity as a general framework for systematic evidence synthesis.
Availability: The code, models, and data will be made freely available upon acceptance.