TogoMCP: Natural Language Querying of Life-Science Knowledge Graphs via Schema-Guided LLMs and the Model Context Protocol

Kinjo, A. R.; Yamamoto, Y.; Bustamante-Larriet, S.; Labra-Gayo, J. E.; Fujisawa, T.

bioRxiv preprint (bioinformatics), 2026-03-23. doi:10.64898/2026.03.19.713030

Querying the RDF Portal knowledge graph maintained by DBCLS, which aggregates more than 70 life-science databases, requires proficiency in both SPARQL and database-specific RDF schemas, placing this resource beyond the reach of most researchers. Large Language Models (LLMs) can, in principle, translate natural-language questions into executable SPARQL, but without schema-level context they frequently fabricate non-existent predicates or fail to resolve entity names to database-specific identifiers. We present TogoMCP, a system that recasts the LLM as a protocol-driven inference engine orchestrating specialized tools via the Model Context Protocol (MCP). Two mechanisms are essential to its design: (i) the MIE (Metadata-Interoperability-Exchange) file, a concise YAML document that dynamically supplies the LLM with each target database's structural and semantic context at query time; and (ii) a two-stage workflow separating entity resolution via external REST APIs from schema-guided SPARQL generation. On a benchmark of 50 biologically grounded questions spanning five types and 23 databases, TogoMCP achieved a large improvement over an unaided baseline (Cohen's d = 0.92, Wilcoxon p < 10^-6), with win rates exceeding 80% for question types with precise, verifiable answers. An ablation study identified MIE files as the single indispensable component: removing them reduced the effect to a non-significant level (d = 0.08), while a one-line instruction to load the relevant MIE file recovered the full benefit of an elaborate behavioral protocol. These results suggest a general design principle: concise, dynamically delivered schema context is more valuable than complex orchestration logic.

Database URL: https://togomcp.rdfportal.org/
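The two-stage workflow described in the abstract can be sketched in a few lines. This is a minimal, hedged illustration, not the paper's implementation: the entity index is a stub standing in for the external REST APIs, the dict stands in for a parsed MIE YAML file, and all field names and the example UniProt entity are assumptions for illustration only.

```python
# Illustrative sketch of a two-stage natural-language-to-SPARQL workflow.
# All names below (ENTITY_INDEX, MIE_CONTEXT, field layout) are assumed,
# not taken from TogoMCP or the actual MIE format.

# Stage 1: entity resolution. TogoMCP delegates this to external REST
# APIs; here it is stubbed with an in-memory lookup table.
ENTITY_INDEX = {"hemoglobin subunit alpha": "uniprot:P69905"}

def resolve_entity(name: str) -> str:
    """Map a natural-language entity name to a database-specific ID."""
    return ENTITY_INDEX[name.lower()]

# Stage 2: schema-guided SPARQL generation. The MIE file supplies the
# target database's schema context; we model its parsed contents as a
# plain dict so the sketch stays self-contained.
MIE_CONTEXT = {
    "prefixes": {
        "up": "http://purl.uniprot.org/core/",
        "uniprot": "http://purl.uniprot.org/uniprot/",
    },
    "predicates": ["up:organism", "up:sequence"],
}

def build_sparql(entity_id: str, predicate: str, mie: dict) -> str:
    """Emit a SPARQL query restricted to predicates the schema declares;
    rejecting undeclared predicates is how schema context guards against
    fabricated ones."""
    if predicate not in mie["predicates"]:
        raise ValueError(f"{predicate!r} is not in the MIE schema context")
    prefixes = "\n".join(
        f"PREFIX {p}: <{uri}>" for p, uri in mie["prefixes"].items()
    )
    return f"{prefixes}\nSELECT ?o WHERE {{ {entity_id} {predicate} ?o }}"

query = build_sparql(
    resolve_entity("Hemoglobin subunit alpha"), "up:organism", MIE_CONTEXT
)
```

The key design point mirrored here is the separation of concerns: identifier lookup happens before query generation, so the generation step only ever works with resolved IDs and schema-sanctioned predicates.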

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Bioinformatics | 1061 | Top 2% | 14.7%
2 | Nature Methods | 336 | Top 0.7% | 12.7%
3 | Nature Biotechnology | 147 | Top 0.6% | 10.1%
4 | Cell Systems | 167 | Top 2% | 7.2%
5 | Bioinformatics Advances | 184 | Top 0.7% | 4.9%
6 | Nature Communications | 4913 | Top 36% | 4.2%
(50% of probability mass above this line)
7 | GigaScience | 172 | Top 0.4% | 4.2%
8 | Genome Biology | 555 | Top 2% | 3.6%
9 | Nature | 575 | Top 7% | 3.6%
10 | Nucleic Acids Research | 1128 | Top 7% | 2.9%
11 | PLOS ONE | 4510 | Top 44% | 2.7%
12 | Genome Research | 409 | Top 2% | 2.4%
13 | Proceedings of the National Academy of Sciences | 2130 | Top 29% | 1.9%
14 | eLife | 5422 | Top 45% | 1.5%
15 | iScience | 1063 | Top 22% | 1.2%
16 | Journal of Molecular Biology | 217 | Top 3% | 1.1%
17 | Scientific Data | 174 | Top 2% | 1.0%
18 | Scientific Reports | 3102 | Top 71% | 0.9%
19 | Nature Genetics | 240 | Top 6% | 0.9%
20 | Patterns | 70 | Top 2% | 0.9%
21 | Briefings in Bioinformatics | 326 | Top 6% | 0.9%
22 | PLOS Computational Biology | 1633 | Top 23% | 0.8%
23 | Computational and Structural Biotechnology Journal | 216 | Top 10% | 0.7%