
Extraction of Human Phenotype Ontology (HPO) Concepts from Clinical Notes Utilizing Large Language Models (LLM) with Model Context Protocol (MCP)

Larsen, M. E.; Campbell, I. M.; Orlando, L. A.; Robinson, P.; Walton, N. A.

2026-05-25 health informatics
10.64898/2026.05.23.26353963 medRxiv

Background: Accurate extraction of Human Phenotype Ontology (HPO) terms from clinical notes is essential for variant prioritization and genetic diagnosis. Large language models (LLMs) often struggle to balance precision, hallucination avoidance, and ontology mapping accuracy, and prior work has shown that retrieval-based grounding can improve performance for individual models. We hypothesized that real-time ontology grounding through external tools would improve these metrics across heterogeneous LLMs, and we evaluated the Model Context Protocol (MCP), a standardized open framework for integrating external tools, as a vendor-agnostic mechanism for delivering such grounding.

Methods: Five LLMs (Claude Sonnet 4.5, GPT-5.1, Gemini 2.5 Pro, Grok 4.1, and Qwen3 30B) extracted HPO terms from four synthetic clinical genetics notes under two conditions: baseline ("No Tools," internal knowledge only) and tool-augmented ("With Tools"), with real-time HPO retrieval delivered through MCP for models with native support and through functionally equivalent native tool-calling interfaces otherwise. Each model performed ≥50 runs per note per condition (>2,000 total runs). Performance was evaluated using Precision, Recall, and F1-score. Outputs were manually adjudicated to classify mapping errors and hallucinations. Results were benchmarked against a commercial EHR-based HPO extraction tool.

Results: Tool augmentation significantly improved performance across all models. Mean aggregate F1-score increased from 0.46 (SD 0.22) in the baseline condition to 0.72 (SD 0.15) with tools (p < 0.001). Mapping Error Rate decreased from 40.9% to 7.8% (p < 0.001), and Precision increased from 56% to 90%. Performance gains were observed across all model families, including the open-weight Qwen3 model (F1 0.11 → 0.50). For inferred phenotypes, F1 improved from 0.20 to 0.34 (p < 0.001) without a significant increase in hallucination rate (p = 0.08). Compared with the commercial benchmark, tool-augmented LLMs achieved higher F1-scores and substantially greater recall for inferred phenotypes.

Conclusions: Real-time ontology grounding substantially improves HPO extraction across diverse LLMs by reducing mapping errors and enhancing phenotype inference. The Model Context Protocol provides a standardized, interoperable mechanism for delivering such grounding, supporting reproducible, vendor-agnostic deployment of clinical LLM pipelines in genomic medicine.
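
The Precision, Recall, and F1-score named in the Methods can be illustrated with a minimal sketch. It assumes set-based scoring with exact matching on HPO term IDs for a single note; the abstract does not state the matching rule the authors used, and the function name `prf1` and the example term sets below are hypothetical, chosen only for illustration.

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 for one note.

    Assumption: a predicted term counts as correct only when its HPO ID
    exactly matches an ID in the gold annotation.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted term sets for one synthetic note.
gold = {"HP:0001250", "HP:0001263", "HP:0000252"}
predicted = {"HP:0001250", "HP:0001263", "HP:0004322"}

p, r, f = prf1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this scoring, a term mapped to the wrong HPO ID lowers precision and recall at the same time, which is one way a drop in mapping errors could show up as a gain in F1.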

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.1% | 23.6%
2. Journal of Biomedical Informatics | 45 papers in training set | Top 0.2% | 7.1%
3. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.1% | 6.7%
4. Bioinformatics | 1061 papers in training set | Top 4% | 6.7%
5. JAMIA Open | 37 papers in training set | Top 0.2% | 6.7%
50% of probability mass above
6. npj Digital Medicine | 97 papers in training set | Top 0.9% | 5.1%
7. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.9% | 3.2%
8. BMC Bioinformatics | 383 papers in training set | Top 3% | 2.7%
9. Scientific Reports | 3102 papers in training set | Top 43% | 2.7%
10. Med | 38 papers in training set | Top 0.1% | 2.2%
11. BMJ Health & Care Informatics | 13 papers in training set | Top 0.3% | 2.0%
12. The Lancet Digital Health | 25 papers in training set | Top 0.4% | 1.8%
13. Genetics in Medicine | 69 papers in training set | Top 0.6% | 1.8%
14. GENETICS | 189 papers in training set | Top 0.6% | 1.8%
15. eBioMedicine | 130 papers in training set | Top 1% | 1.8%
16. Frontiers in Digital Health | 20 papers in training set | Top 0.7% | 1.6%
17. International Journal of Medical Informatics | 25 papers in training set | Top 1.0% | 1.4%
18. Wellcome Open Research | 57 papers in training set | Top 1% | 1.3%
19. PLOS ONE | 4510 papers in training set | Top 62% | 0.9%
20. JMIR Medical Informatics | 17 papers in training set | Top 1% | 0.9%
21. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.9%
22. iScience | 1063 papers in training set | Top 28% | 0.8%
23. Genome Medicine | 154 papers in training set | Top 7% | 0.8%
24. The American Journal of Human Genetics | 206 papers in training set | Top 4% | 0.8%
25. GigaScience | 172 papers in training set | Top 3% | 0.8%
26. Nature Medicine | 117 papers in training set | Top 5% | 0.8%
27. Journal of Medical Internet Research | 85 papers in training set | Top 5% | 0.5%
28. Nature Communications | 4913 papers in training set | Top 67% | 0.5%
29. JMIR Public Health and Surveillance | 45 papers in training set | Top 4% | 0.5%
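
The 50% cutoff marked in the list can be checked by accumulating the listed probabilities in rank order. The sketch below is illustrative only: it uses the rounded percentages shown for the first seven journals, and the helper name `journals_to_reach` is hypothetical rather than part of the prediction tool.

```python
# Rounded predicted probabilities (percent) for the first seven ranked journals.
probs = [23.6, 7.1, 6.7, 6.7, 6.7, 5.1, 3.2]


def journals_to_reach(probs, target):
    """Return (count, cumulative mass) for the shortest ranked prefix
    whose summed probability reaches the target, or None if it never does."""
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= target:
            return count, total
    return None


count, mass = journals_to_reach(probs, 50.0)
print(f"top {count} journals cover about {mass:.1f}% of the probability mass")
```

With these rounded values the first four journals sum to roughly 44.1%, so the fifth is the one that carries the cumulative mass past 50%.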