Synonym Augmentation for Rare Disease Identification in Unstructured Data

Valinejad, J.; Moon, S.; Xu, Y.; Zhu, Q.

2026-05-13 health informatics
10.64898/2026.05.11.26352910 medRxiv

Rare diseases pose significant challenges in the medical and research domains, including the scarcity of information, which is often confined to unstructured formats. Although existing approaches provide valuable insights, effective methods are still needed to identify information pertinent to rare diseases and advance rare disease research. We identified mentions of rare diseases in relevant texts and assessed their relevance using derived scores: the confidence score and the semantic similarity from a fine-tuned BioMedBERT encoder. This encoder was fine-tuned using rare disease-related text from Online Mendelian Inheritance in Man (OMIM), Orphanet, a manually validated dataset, and STS benchmark datasets. The process of identifying meaningful rare disease mentions is presented through two case studies that retrieved relevant NIH-funded projects, using a knowledge graph generated in Neo4j to host data on 2,067 GARD diseases with over 320,000 NIH-funded projects. Through these case studies with NIH-funded projects related to rare diseases, we demonstrate the effectiveness of our approach in systematically providing rare disease-related data to enhance the understanding of rare diseases in future investigations.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Journal of Biomedical Informatics: 45 papers in training set, Top 0.1%, 12.5%
2. Bioinformatics: 1061 papers in training set, Top 3%, 10.6%
3. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.3%, 8.5%
4. Scientific Reports: 3102 papers in training set, Top 9%, 8.5%
5. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 0.3%, 4.4%
6. Nature Communications: 4913 papers in training set, Top 35%, 4.4%
7. PLOS ONE: 4510 papers in training set, Top 42%, 3.1%

50% of probability mass above

8. GENETICS: 189 papers in training set, Top 0.4%, 2.4%
9. Nature Computational Science: 50 papers in training set, Top 0.4%, 2.1%
10. JAMIA Open: 37 papers in training set, Top 0.6%, 2.1%
11. Advanced Science: 249 papers in training set, Top 9%, 1.9%
12. Journal of Personalized Medicine: 28 papers in training set, Top 0.3%, 1.8%
13. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 1%, 1.8%
14. Patterns: 70 papers in training set, Top 0.8%, 1.7%
15. International Journal of Medical Informatics: 25 papers in training set, Top 0.9%, 1.7%
16. iScience: 1063 papers in training set, Top 15%, 1.7%
17. NAR Genomics and Bioinformatics: 214 papers in training set, Top 2%, 1.3%
18. BMC Bioinformatics: 383 papers in training set, Top 5%, 1.3%
19. Communications Biology: 886 papers in training set, Top 12%, 1.3%
20. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 5%, 1.1%
21. Med: 38 papers in training set, Top 0.5%, 1.0%
22. Nature Machine Intelligence: 61 papers in training set, Top 3%, 1.0%
23. npj Digital Medicine: 97 papers in training set, Top 3%, 1.0%
24. Data in Brief: 13 papers in training set, Top 0.3%, 0.9%
25. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.6%, 0.8%
26. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.8%, 0.8%
27. Database: 51 papers in training set, Top 0.9%, 0.8%
28. PLOS Genetics: 756 papers in training set, Top 15%, 0.7%
29. Computers in Biology and Medicine: 120 papers in training set, Top 5%, 0.7%
30. GigaScience: 172 papers in training set, Top 4%, 0.7%
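
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a running sum over the percentages listed above. The sketch below is illustrative only: the function name `journals_to_reach` is hypothetical, and it assumes the final figure on each row is that journal's predicted probability in percent.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the ten highest-ranked journals,
# copied from the list above. Assumption: the last figure on each row is
# the journal's predicted probability.
probs = [12.5, 10.6, 8.5, 8.5, 4.4, 4.4, 3.1, 2.4, 2.1, 2.1]


def journals_to_reach(probabilities, threshold=50.0):
    """Return the smallest number of top-ranked entries whose cumulative
    probability reaches the threshold, or None if it is never reached."""
    for count, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return count
    return None


print(journals_to_reach(probs))
```

The first six journals sum to roughly 48.9%, so the seventh is the one that carries the cumulative total past the 50% mark.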