
Early Detection of Rare Disease Using Hierarchical Set-to-Sequence Modeling of Structured Electronic Health Records

Ma, Y.; Chinthala, L.; Mohammed, A.; Davis, R. L.; Colonna, V.

2026-05-06 health informatics
10.64898/2026.05.04.26352393 medRxiv

Rare diseases are characterized by heterogeneous, weak, and sparse phenotypic signals that emerge gradually across longitudinal clinical visits, making early detection a persistent challenge. In this study, we propose a hierarchical set-to-sequence (HSS) framework for prospective rare disease detection using structured EHR data. HSS decomposes the problem into two levels: (1) intra-visit encoding via Multi-Query Attention (MQA), which treats the heterogeneous clinical events within a single visit as an unordered set to generate a unified visit-level representation, and (2) inter-visit temporal modeling with transformer encoders conditioned on patient age at each visit and inter-visit time gaps, to capture disease progression and the irregular intervals between clinical visits. We construct a real-world cohort of 40,223 patients comprising 708,422 visits from a single academic medical center (2005-2025), with 3,032 rare disease cases identified by curated rule-based phenotyping, including severe neurodevelopmental, congenital, or genetic conditions. We formulate the task as multi-horizon prospective binary classification with five prediction horizons of 7, 30, 90, 180, and 365 days prior to first diagnosis. Experimental results show that the proposed HSS model consistently outperforms linear logistic regression, tree-based XGBoost, and Transformer-based baselines at every prediction horizon, with performance ranging from AUROC = 0.893 and AUPRC = 0.601 at 7 days (5.17% prevalence) to AUROC = 0.829 and AUPRC = 0.228 at 365 days (3.98% prevalence). Notably, the performance gap between HSS and the strongest competing baseline is largest at the 365-day horizon, indicating stronger advantages for long-horizon prediction, where phenotypic signals for rare diseases are weak and sparse. Additional analyses further clarify the contribution of the hierarchical components and confirm the importance of hierarchical modeling.
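The multi-horizon setup can be illustrated with a minimal sketch. It assumes that, for a horizon of h days, only visits dated on or before (first diagnosis date minus h days) are visible to the model. The abstract does not spell out the exact censoring rule or how an index date is assigned to non-cases, so `visits_before_cutoff`, `build_examples`, and the example dates are hypothetical.

```python
from datetime import date, timedelta

# Prediction horizons (days before first diagnosis) listed in the abstract.
HORIZONS_DAYS = (7, 30, 90, 180, 365)


def visits_before_cutoff(visit_dates, index_date, horizon_days):
    """Keep visits dated on or before index_date - horizon_days (assumed rule)."""
    cutoff = index_date - timedelta(days=horizon_days)
    return [d for d in sorted(visit_dates) if d <= cutoff]


def build_examples(visit_dates, index_date, is_case):
    """Return {horizon: (observable_visits, label)}.

    Horizons with no observable visit are skipped, so a patient can
    contribute to short horizons without contributing to long ones.
    """
    examples = {}
    for horizon in HORIZONS_DAYS:
        observed = visits_before_cutoff(visit_dates, index_date, horizon)
        if observed:
            examples[horizon] = (observed, int(is_case))
    return examples


# Hypothetical patient: four visits, first diagnosis on 2020-06-15.
visits = [date(2019, 1, 10), date(2019, 11, 20), date(2020, 5, 1), date(2020, 6, 10)]
first_diagnosis = date(2020, 6, 15)
examples = build_examples(visits, first_diagnosis, is_case=True)
counts = {horizon: len(observed) for horizon, (observed, _) in examples.items()}
print(counts)  # {7: 3, 30: 3, 90: 2, 180: 2, 365: 1}
```

Under this assumed rule the visible history shrinks as the horizon grows, which is consistent with the abstract's observation that long-horizon signals are weaker and sparser.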
This work contributes to the ongoing development of AI methodologies tailored to rare diseases by introducing a hierarchical framework for early detection using structured longitudinal clinical data.
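As a rough illustration of the two-level idea, the sketch below pools an unordered set of event vectors into one visit vector using a few query vectors, then attaches age and time-gap features to each visit. This is one possible reading of the abstract, not the authors' implementation: the queries are fixed rather than learned, the inter-visit transformer is omitted, and `attention_pool`, `encode_patient`, and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(event_vectors, queries):
    """Pool a set of event vectors into one visit vector.

    Each query attends over all events with scaled dot-product attention;
    the per-query summaries are concatenated. The result does not depend
    on the order of event_vectors (up to floating-point rounding).
    """
    dim = len(event_vectors[0])
    scale = math.sqrt(dim)
    visit_vector = []
    for query in queries:
        weights = softmax([dot(query, e) / scale for e in event_vectors])
        pooled = [sum(w * e[i] for w, e in zip(weights, event_vectors))
                  for i in range(dim)]
        visit_vector.extend(pooled)
    return visit_vector


def encode_patient(visits, queries):
    """Turn chronological (age_years, gap_days, events) tuples into a sequence.

    Age and gap are simply appended as extra features here; a sequence
    model would consume the resulting list of visit vectors.
    """
    return [attention_pool(events, queries) + [age, gap]
            for age, gap, events in visits]


# Hypothetical 2-d event embeddings and two fixed queries.
events = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 1.0]]

pooled = attention_pool(events, queries)
pooled_shuffled = attention_pool(list(reversed(events)), queries)
order_invariant = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
                      for a, b in zip(pooled, pooled_shuffled))
print(order_invariant)  # True

sequence = encode_patient([(3.0, 0.0, events), (3.5, 182.0, events[:2])], queries)
```

The permutation check is the point of the sketch: events within a visit are treated as a set, while order matters only across visits.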

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 0.1%, 18.6%
2. Journal of Biomedical Informatics: 45 papers in training set, Top 0.1%, 18.6%
3. npj Digital Medicine: 97 papers in training set, Top 0.3%, 14.7%

50% of probability mass above

4. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.5%, 6.3%
5. Scientific Reports: 3102 papers in training set, Top 36%, 3.6%
6. Advanced Science: 249 papers in training set, Top 8%, 2.6%
7. Nature Communications: 4913 papers in training set, Top 49%, 1.9%
8. Patterns: 70 papers in training set, Top 0.7%, 1.9%
9. Journal of Personalized Medicine: 28 papers in training set, Top 0.3%, 1.7%
10. Nature Machine Intelligence: 61 papers in training set, Top 2%, 1.7%
11. Communications Medicine: 85 papers in training set, Top 0.3%, 1.7%
12. PLOS Digital Health: 91 papers in training set, Top 2%, 1.7%
13. iScience: 1063 papers in training set, Top 16%, 1.7%
14. Science Advances: 1098 papers in training set, Top 20%, 1.5%
15. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 2%, 1.3%
16. JMIR Medical Informatics: 17 papers in training set, Top 0.9%, 1.3%
17. JAMIA Open: 37 papers in training set, Top 1%, 0.9%
18. Journal of Medical Internet Research: 85 papers in training set, Top 4%, 0.9%
19. PLOS ONE: 4510 papers in training set, Top 64%, 0.9%
20. Nature Computational Science: 50 papers in training set, Top 2%, 0.8%
21. PNAS Nexus: 147 papers in training set, Top 2%, 0.7%
22. Nature Biomedical Engineering: 42 papers in training set, Top 3%, 0.6%
23. Communications Biology: 886 papers in training set, Top 29%, 0.6%
24. Bioinformatics: 1061 papers in training set, Top 10%, 0.6%