GEN-KnowRD: Reframing AI for Rare Disease Recognition

Yan, C.; Su, W.-C.; Xin, Y.; Grabowska, M. E.; Kerchberger, V. E.; Borza, V. A.; Wang, J.; Wang, L.; Li, R.; Lynn, J.; Dickson, A. L.; Shyr, C.; Feng, Q.; Stein, C. M.; Wang, K.; Embi, P.; Malin, B. A.; Liu, H.; Wei, W.-Q.

2026-03-03 health informatics
10.64898/2026.03.02.26347469 medRxiv

Rare diseases affect over 300 million people worldwide, yet patients often endure years-long diagnostic delays that limit timely intervention and trial opportunities. Computational rare disease recognition (RDR) remains constrained by knowledge resources that are often incomplete, heterogeneous, and dependent on extensive multi-disciplinary expert curation that cannot scale. Large language models (LLMs) applied directly for end-to-end diagnosis or disease discrimination face similar knowledge bottlenecks while also raising concerns around cost, reproducibility, and data governance. Here, we introduce GEN-KnowRD, a knowledge-layer-first framework that leverages LLMs to generate schema-guided rare disease profiles, systematically assesses their quality, and constructs a computable knowledge base (PheMAP-RD) for local deployment. GEN-KnowRD integrates this knowledge into lightweight inference pipelines for both general-purpose disease screening and specialized early discrimination from longitudinal electronic health records. Across six public benchmarks for general-purpose screening (9,290 patients spanning 798 rare diseases), GEN-KnowRD significantly improves disease ranking compared to a state-of-the-art, HPO-centered diagnostic framework (up to 345.8% improvement in top-1 success), advanced end-to-end LLM reasoning (up to 129.1% improvement), and a variant of GEN-KnowRD instantiated with expert-curated knowledge rather than LLM-generated profiles. As a use case, in two real-world cohorts for early diagnosis of idiopathic pulmonary fibrosis (511 patients), GEN-KnowRD also demonstrates robust gains in discrimination performance, supporting effective RDR during the pre-diagnostic window.
These findings demonstrate that repositioning LLMs from diagnostic reasoning to the knowledge layer, and thereby decoupling knowledge construction from patient-level inference, yields stronger RDR while providing scalable, continuously updatable, and reusable infrastructure for diagnosis, screening, and clinical research across the rare disease landscape.
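
The abstract reports gains in "top-1 success" as percentages but does not define how either quantity is computed. As a rough illustration only, the sketch below assumes that top-k success is the fraction of patients whose true disease appears among the first k ranked candidates, and that the percentages are relative improvements over a baseline. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def top_k_success(ranked_lists, true_labels, k=1):
    """Fraction of patients whose true disease is among the first k ranked candidates.

    Assumed definition; the abstract does not spell out the metric.
    """
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, true_labels) if truth in ranked[:k]
    )
    return hits / len(true_labels)


def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent (assumed convention)."""
    return (new - baseline) / baseline * 100.0


# Hypothetical toy data: four patients, three candidate diseases each.
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "A", "B"], ["A", "C", "B"]]
truth = ["A", "A", "B", "A"]

top1 = top_k_success(ranked, truth, k=1)  # 2 of 4 patients ranked correctly first
top2 = top_k_success(ranked, truth, k=2)  # 3 of 4 patients within the first two
print(top1, top2)
print(relative_improvement(top1, 0.25))   # improvement over a made-up 0.25 baseline
```

Under this assumed convention, a 345.8% improvement would mean a top-1 rate about 4.458 times that of the baseline; whether the authors use this convention is not stated in the abstract.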

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine: 97 papers in training set; Top 0.5%; 10.1%
2. Nature Machine Intelligence: 61 papers in training set; Top 0.2%; 10.1%
3. Nature Communications: 4913 papers in training set; Top 28%; 6.4%
4. Advanced Science: 249 papers in training set; Top 3%; 6.3%
5. Nature Biomedical Engineering: 42 papers in training set; Top 0.2%; 4.3%
6. Med: 38 papers in training set; Top 0.1%; 4.0%
7. Patterns: 70 papers in training set; Top 0.2%; 4.0%
8. Nature Medicine: 117 papers in training set; Top 0.6%; 4.0%
9. Bioinformatics: 1061 papers in training set; Top 5%; 3.6%
50% of probability mass above
10. Nature Computational Science: 50 papers in training set; Top 0.1%; 3.6%
11. European Respiratory Journal: 54 papers in training set; Top 0.5%; 3.1%
12. Scientific Reports: 3102 papers in training set; Top 41%; 3.1%
13. Science Translational Medicine: 111 papers in training set; Top 2%; 2.1%
14. Cell Reports Medicine: 140 papers in training set; Top 3%; 2.1%
15. Nature Methods: 336 papers in training set; Top 4%; 1.7%
16. Journal of the American Medical Informatics Association: 61 papers in training set; Top 1%; 1.5%
17. iScience: 1063 papers in training set; Top 18%; 1.5%
18. PLOS ONE: 4510 papers in training set; Top 57%; 1.5%
19. NAR Genomics and Bioinformatics: 214 papers in training set; Top 3%; 1.2%
20. eBioMedicine: 130 papers in training set; Top 2%; 1.2%
21. Cell Systems: 167 papers in training set; Top 9%; 1.2%
22. PLOS Digital Health: 91 papers in training set; Top 2%; 1.2%
23. Nature Biotechnology: 147 papers in training set; Top 6%; 1.1%
24. Genome Medicine: 154 papers in training set; Top 6%; 0.9%
25. Science Advances: 1098 papers in training set; Top 26%; 0.9%
26. Communications Medicine: 85 papers in training set; Top 0.7%; 0.9%
27. Communications Biology: 886 papers in training set; Top 21%; 0.8%
28. eLife: 5422 papers in training set; Top 55%; 0.8%
29. Cell Genomics: 162 papers in training set; Top 6%; 0.7%
30. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set; Top 2%; 0.7%
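
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked against the displayed percentages. The sketch below is a minimal check, assuming the final figure in each entry is that journal's predicted probability; the values are copied from the 30 ranked entries and are rounded as displayed, and `journals_to_reach` is a hypothetical helper name.

```python
from itertools import accumulate

# Displayed probabilities (percent) for ranks 1 to 30, in rank order.
# These are rounded values as shown on the page, not the model's raw outputs.
probs = [10.1, 10.1, 6.4, 6.3, 4.3, 4.0, 4.0, 4.0, 3.6, 3.6,
         3.1, 3.1, 2.1, 2.1, 1.7, 1.5, 1.5, 1.5, 1.2, 1.2,
         1.2, 1.2, 1.1, 0.9, 0.9, 0.9, 0.8, 0.8, 0.7, 0.7]


def journals_to_reach(probabilities, threshold):
    """Smallest k whose top-k probabilities sum to at least `threshold`, else None."""
    for k, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return k
    return None


print(journals_to_reach(probs, 50.0))  # 9
print(round(sum(probs), 1))            # 84.6
```

The first eight entries sum to about 49.2%, so the ninth is the one that crosses the 50% mark. The 30 listed journals together cover about 84.6% of the mass, so the remainder presumably falls on journals not shown.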