Input design for unsupervised cross-national branded food database alignment using large language models

Nakagawa, S.; Yamamoto, A.

2026-05-25 nutrition
10.64898/2026.05.23.26353945 medRxiv

Cross-national alignment of branded food databases is essential for international nutritional epidemiology but lacks standardized methods. Existing approaches, including food ontologies, domain-specific fine-tuned language models, and manual expert mapping, either require substantial infrastructure or do not scale to thousands of items. We propose an unsupervised evaluation framework for large language model (LLM)-based food database alignment that requires no ground-truth labels. Using the Japan Branded Food Database (JBFD; 9,519 items, 71 mid-level categories) and USDA FoodData Central (448 categories) as a case study, we introduce two complementary metrics: weighted centroid distance (nutritional proximity between matched category pairs) and dominant category share (structural consistency of category-level assignments). We then conducted a systematic ablation study across eight input conditions (A-H), varying combinations of product name, nutrient profile, and semantic category label. Results showed that nutrient-only inputs yielded poor structural consistency despite low centroid distances, while semantic category labels achieved the highest dominant category share (89.3%) but introduced circularity due to their LLM-derived origin. Among circularity-free conditions, product name combined with minimal nutrient information (energy, protein, salt; condition E) achieved the best balance of centroid distance (0.471) and dominant category share (65.8%). Model comparison across Claude Haiku, Sonnet, and Opus confirmed that NO_MATCH rates were consistent across model sizes (12-14%), suggesting that prompt design contributes more to alignment quality than model scale. These findings provide practical guidance for input design in LLM-based food database alignment without ground-truth annotation.

Sonnet 4.6
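
The abstract names the two metrics but gives no formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that dominant category share is the item-weighted fraction of items landing in the most frequent target category of their source category, and that weighted centroid distance is an item-weighted mean of Euclidean distances between nutrient centroids of matched category pairs. The function names, the input shapes, and the toy values are all hypothetical, and NO_MATCH handling is left out.

```python
from collections import Counter, defaultdict
from math import dist


def dominant_category_share(assignments):
    """Item-weighted share of the most frequent target per source category.

    assignments: iterable of (source_category, target_category), one per item.
    Assumed definition; the abstract does not state the formula.
    """
    by_source = defaultdict(Counter)
    for source, target in assignments:
        by_source[source][target] += 1
    total = sum(sum(counts.values()) for counts in by_source.values())
    if total == 0:
        raise ValueError("no assignments given")
    # For each source category, count the items in its single most common target.
    dominant = sum(counts.most_common(1)[0][1] for counts in by_source.values())
    return dominant / total


def weighted_centroid_distance(pairs):
    """Item-weighted mean Euclidean distance between matched centroids.

    pairs: iterable of (source_centroid, target_centroid, n_items), where each
    centroid is a sequence of nutrient values on a common (assumed) scale.
    """
    pairs = list(pairs)
    total = sum(n for _, _, n in pairs)
    if total == 0:
        raise ValueError("no matched pairs given")
    return sum(dist(a, b) * n for a, b, n in pairs) / total


# Toy data for illustration only; these are not values from the paper.
toy_assignments = (
    [("tea", "Beverages")] * 3 + [("tea", "Snacks")] + [("bread", "Baked")] * 2
)
toy_pairs = [((0.0, 0.0), (3.0, 4.0), 1), ((1.0, 1.0), (1.0, 1.0), 3)]

share = dominant_category_share(toy_assignments)
distance = weighted_centroid_distance(toy_pairs)
```

Under this reading the two metrics pull in different directions, as the abstract reports for nutrient-only inputs: items can sit close to their matched centroids while still being scattered across many target categories.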

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. BMC Medical Informatics and Decision Making; 39 papers in training set; Top 0.1%; 19.0%
2. Bioinformatics Advances; 184 papers in training set; Top 0.1%; 10.3%
3. PLOS ONE; 4510 papers in training set; Top 18%; 10.3%
4. Bioinformatics; 1061 papers in training set; Top 4%; 5.0%
5. Scientific Reports; 3102 papers in training set; Top 22%; 5.0%
6. BMC Bioinformatics; 383 papers in training set; Top 2%; 5.0%
50% of probability mass above
7. Nature Communications; 4913 papers in training set; Top 39%; 3.7%
8. GigaScience; 172 papers in training set; Top 0.7%; 2.8%
9. Database; 51 papers in training set; Top 0.2%; 2.8%
10. Scientific Data; 174 papers in training set; Top 0.7%; 2.4%
11. iScience; 1063 papers in training set; Top 11%; 1.9%
12. Methods in Ecology and Evolution; 160 papers in training set; Top 1%; 1.7%
13. Frontiers in Plant Science; 240 papers in training set; Top 3%; 1.7%
14. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 5%; 1.5%
15. Briefings in Bioinformatics; 326 papers in training set; Top 5%; 1.3%
16. Science Advances; 1098 papers in training set; Top 25%; 1.0%
17. Public Health Nutrition; 14 papers in training set; Top 0.5%; 1.0%
18. Journal of Biomedical Informatics; 45 papers in training set; Top 1%; 0.9%
19. eLife; 5422 papers in training set; Top 53%; 0.9%
20. Current Developments in Nutrition; 15 papers in training set; Top 0.7%; 0.9%
21. Epidemics; 104 papers in training set; Top 1%; 0.9%
22. PLOS Computational Biology; 1633 papers in training set; Top 23%; 0.8%
23. Ecological Informatics; 29 papers in training set; Top 0.6%; 0.8%
24. Frontiers in Cell and Developmental Biology; 218 papers in training set; Top 9%; 0.8%
25. Frontiers in Physiology; 93 papers in training set; Top 6%; 0.8%
26. Computers in Biology and Medicine; 120 papers in training set; Top 5%; 0.7%
27. Journal of Hazardous Materials; 19 papers in training set; Top 0.9%; 0.7%
28. Journal of the American Society for Mass Spectrometry; 33 papers in training set; Top 0.5%; 0.7%
29. Healthcare; 16 papers in training set; Top 2%; 0.7%
30. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 7%; 0.5%