aiDIVA - Diagnostics of Rare Genetic Diseases Using Large Language Models

Boceck, D.; Laugwitz, L.; Sturm, M.; Bezdan, D.; Gschwind, A.; Haack, T. B.; Ossowski, S.

2025-09-07 genetic and genomic medicine
10.1101/2025.09.04.25335099 medRxiv

Genome sequencing (GS) enables the accurate identification of genetic variants in most genomic regions and is rapidly transforming routine diagnostics for rare diseases (RD). While streamlined data generation is scalable, efficient prioritization and correct clinical interpretation of detected alterations remain a challenge, often requiring manual classification by experts with years of training. Hence, there is a need for AI-driven clinical decision support systems that assist clinical experts in identifying causal variants or, in the case of large-scale re-analysis of unsolved cases, fully automate the process. To this end, many tools have been developed to estimate the impact of variants on protein function. However, only a small number of tools combine genomic data, variant annotations, and phenotypic data to diagnose cases. Here we introduce aiDIVA, an ensemble AI featuring a hierarchically organized set of statistical and machine learning models trained on genomic and phenotypic data to identify the causal variant(s) among the tens of thousands of genetic variants of a patient. aiDIVA generates pathogenicity classifications for each variant using a random forest model and an evidence-based score for dominant and recessive diseases. It combines these predictions with additional clinical metadata to prioritize and rank the most likely causal variants. aiDIVA uses large language models (LLMs) to further improve and explain the results. Finally, the aiDIVA-meta model combines all scores to generate a ranked list of variants. In a benchmark analysis on more than 3,000 diagnostically solved RD patients, the causal variant was among the top-3 candidate variants reported by aiDIVA-meta in 97% of cases. Unlike comparable methods, aiDIVA provides interpretable explanations for the best candidates.
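
The benchmark figure in the abstract is a top-k hit rate: the share of solved cases in which the known causal variant appears among the first k ranked candidates. The following is a minimal sketch of how such a metric could be computed. The function name, the data layout, and the toy case and variant identifiers are all hypothetical and are not taken from aiDIVA or its benchmark.

```python
from typing import Dict, List


def top_k_hit_rate(
    ranked: Dict[str, List[str]],
    causal: Dict[str, str],
    k: int = 3,
) -> float:
    """Fraction of cases whose causal variant is among the first k ranked candidates."""
    if not ranked:
        raise ValueError("no cases supplied")
    hits = sum(
        1
        for case_id, variants in ranked.items()
        if causal.get(case_id) in variants[:k]
    )
    return hits / len(ranked)


# Toy data (hypothetical): ranked candidate variants per case, best first.
ranked_variants = {
    "case_1": ["var_A", "var_B", "var_C", "var_D"],
    "case_2": ["var_E", "var_F", "var_G", "var_H"],
    "case_3": ["var_I", "var_J", "var_K", "var_L"],
    "case_4": ["var_M", "var_N", "var_O", "var_P"],
}
# Known causal variant for each diagnostically solved case (hypothetical).
causal_variant = {
    "case_1": "var_A",
    "case_2": "var_G",
    "case_3": "var_L",
    "case_4": "var_N",
}

rate = top_k_hit_rate(ranked_variants, causal_variant, k=3)
print(f"top-3 hit rate: {rate:.2f}")
```

In this toy set, three of the four cases have their causal variant within the first three candidates, so the top-3 hit rate is 0.75.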

Matching journals

The top 3 journals together account for just over 50% of the predicted probability mass (54.3%).

Rank  Probability  Percentile  Papers in training set  Journal
1     40.7%        Top 0.1%    154                     Genome Medicine
2     7.0%         Top 4%      1061                    Bioinformatics
3     6.6%         Top 27%     4913                    Nature Communications
(50% of probability mass above)
4     3.7%         Top 2%      555                     Genome Biology
5     3.0%         Top 7%      1128                    Nucleic Acids Research
6     2.8%         Top 0.2%    29                      Human Mutation
7     2.5%         Top 3%      240                     Nature Genetics
8     2.2%         Top 2%      162                     Cell Genomics
9     2.1%         Top 2%      206                     The American Journal of Human Genetics
10    2.0%         Top 0.6%    69                      Genetics in Medicine
11    2.0%         Top 11%     1063                    iScience
12    1.8%         Top 3%      326                     Briefings in Bioinformatics
13    1.8%         Top 2%      117                     Nature Medicine
14    1.3%         Top 0.4%    38                      Med
15    1.3%         Top 65%     3102                    Scientific Reports
16    1.3%         Top 4%      184                     Bioinformatics Advances
17    1.0%         Top 10%     167                     Cell Systems
18    0.9%         Top 0.3%    25                      Human Genetics
19    0.8%         Top 4%      409                     Genome Research
20    0.8%         Top 1%      49                      European Journal of Human Genetics
21    0.7%         Top 5%      100                     Frontiers in Molecular Biosciences
22    0.7%         Top 2%      142                     Molecular Systems Biology
23    0.7%         Top 17%     575                     Nature
24    0.5%         Top 18%     756                     PLOS Genetics
25    0.5%         Top 12%     197                     Frontiers in Genetics
26    0.5%         Top 48%     2130                    Proceedings of the National Academy of Sciences
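
The 50% cut-off in the ranking can be checked by accumulating the listed probabilities in rank order until the running total reaches the threshold. The sketch below does this for the first five listed values. The function name and the threshold parameter are illustrative assumptions, not part of the prediction service.

```python
from itertools import accumulate
from typing import List, Optional, Tuple

# Predicted probabilities (in percent) for ranks 1-5, copied from the list above.
probabilities = [40.7, 7.0, 6.6, 3.7, 3.0]


def journals_needed(
    probs: List[float], threshold: float = 50.0
) -> Optional[Tuple[int, float]]:
    """Return (count, cumulative mass) for the smallest prefix reaching the threshold."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count, total
    return None  # threshold never reached with the supplied values


result = journals_needed(probabilities)
print(result)
```

With these values the running totals are 40.7, 47.7, and 54.3, so the threshold is first crossed at the third journal.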