
Relation Extraction for Diet, Non-Communicable Disease and Biomarker Associations (RECoDe): A CoDiet study

Choi, D.; Gu, Y.; Zong, K.; Lain, A. D.; Zaikis, D.; Rowlands, T.; Rei, M.; CoDiet Consortium; Beck, T.; Posma, J. M.

2026-03-05 bioinformatics
10.64898/2026.03.03.709244 bioRxiv

Diet plays a critical role in human health, with growing evidence linking dietary habits to disease outcomes. However, extracting structured dietary knowledge from biomedical literature remains challenging due to the lack of dedicated relation extraction datasets. To address this gap, we introduce RECoDe, a novel relation extraction (RE) dataset designed specifically for diet, disease, and related biomedical entities. RECoDe captures a diverse set of relation types, including a broad spectrum of positive association patterns and explicit negative examples, with over 5,000 human-annotated instances validated by up to five independent annotators. Furthermore, we benchmark various natural language processing (NLP) RE models, including BERT-based architectures and enhanced prompting techniques with locally deployed large language models (LLMs), to improve classification performance on underrepresented relation types. The best-performing model, gpt-oss-20B, a local LLM, achieved an F1-score of 64% for multi-class classification and 92% for binary classification using a hierarchical prompting strategy with a separate built-in reflection step. To demonstrate the practical utility of RECoDe, we introduce the Contextual Co-occurrence Summarisation (Co-CoS) framework, which aggregates sentence-level relation extractions into document-level summaries and further integrates evidence across multiple documents. Co-CoS produces effect estimates consistent with established dietary knowledge, demonstrating its validity as a general framework for systematic evidence synthesis.

Availability: The code, models, and data will be made freely available upon acceptance.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | Database | 51 papers in training set | Top 0.1% | 18.9%
2 | Bioinformatics | 1061 papers in training set | Top 3% | 10.3%
3 | Nucleic Acids Research | 1128 papers in training set | Top 3% | 6.5%
4 | Nature Communications | 4913 papers in training set | Top 28% | 6.4%
5 | Genome Medicine | 154 papers in training set | Top 1% | 4.9%
6 | Journal of Biomedical Informatics | 45 papers in training set | Top 0.3% | 4.9%
50% of probability mass above
7 | Bioinformatics Advances | 184 papers in training set | Top 1.0% | 4.0%
8 | BMC Bioinformatics | 383 papers in training set | Top 3% | 2.4%
9 | Scientific Reports | 3102 papers in training set | Top 49% | 2.1%
10 | Genome Biology | 555 papers in training set | Top 3% | 2.1%
11 | BioData Mining | 15 papers in training set | Top 0.2% | 1.9%
12 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 1.9%
13 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 29% | 1.9%
14 | PLOS ONE | 4510 papers in training set | Top 54% | 1.7%
15 | Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.5%
16 | Advanced Science | 249 papers in training set | Top 13% | 1.4%
17 | npj Digital Medicine | 97 papers in training set | Top 3% | 1.2%
18 | GigaScience | 172 papers in training set | Top 2% | 1.2%
19 | Scientific Data | 174 papers in training set | Top 1% | 1.2%
20 | Nature | 575 papers in training set | Top 13% | 1.1%
21 | PLOS Computational Biology | 1633 papers in training set | Top 21% | 1.0%
22 | The Lancet Digital Health | 25 papers in training set | Top 0.9% | 0.8%
23 | Science | 429 papers in training set | Top 19% | 0.8%
24 | Nature Medicine | 117 papers in training set | Top 4% | 0.8%
25 | eLife | 5422 papers in training set | Top 57% | 0.8%
26 | Computers in Biology and Medicine | 120 papers in training set | Top 4% | 0.8%
27 | Briefings in Bioinformatics | 326 papers in training set | Top 6% | 0.8%
28 | Frontiers in Genetics | 197 papers in training set | Top 10% | 0.7%
29 | Molecular Systems Biology | 142 papers in training set | Top 2% | 0.7%
30 | Med | 38 papers in training set | Top 1% | 0.7%
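
The 50% cutoff in the ranking can be checked by accumulating the listed probabilities in rank order. The sketch below is only an illustration of that arithmetic, not the site's own method; the function name `journals_to_reach` is hypothetical, and only the first ten probabilities from the list are used.

```python
# Probabilities (in percent) for ranks 1-10, copied from the ranked list above.
probs = [18.9, 10.3, 6.5, 6.4, 4.9, 4.9, 4.0, 2.4, 2.1, 2.1]


def journals_to_reach(probs, threshold):
    """Return (count, cumulative_mass) for the first rank at which the
    running sum of probabilities reaches `threshold`.

    Hypothetical helper for illustration; returns (None, total) if the
    threshold is never reached.
    """
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total


count, mass = journals_to_reach(probs, 50.0)
print(count, round(mass, 1))  # prints: 6 51.9
```

The running sum is 47.0% after five journals and 51.9% after six, which agrees with the statement that the top 6 journals account for 50% of the predicted probability mass.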