
A systematic review on machine learning approaches in the diagnosis of rare genetic diseases

Roman-Naranjo, P.; Parra-Perez, A. M.; Lopez-Escamez, J. A.

2023-01-31 health informatics
10.1101/2023.01.30.23285203 medRxiv

Background: The diagnosis of rare genetic diseases is often challenging due to the complexity of the genetic underpinnings of these conditions and the limited availability of diagnostic tools. Machine learning (ML) algorithms have the potential to improve the accuracy and speed of diagnosis by analyzing large amounts of genomic data and identifying complex multiallelic patterns that may be associated with specific diseases. In this systematic review, we aimed to identify the methodological trends and the ML application areas in rare genetic diseases.

Methods: We performed a systematic review of the literature following the PRISMA guidelines to search for studies that used ML approaches to enhance the diagnosis of rare genetic diseases. Studies that used DNA-based sequencing data and a variety of ML algorithms were included, summarized, and analyzed using bibliometric methods, visualization tools, and a feature co-occurrence analysis.

Findings: Our search identified 22 studies that met the inclusion criteria. We found that exome sequencing was the most frequently used sequencing technology (59%), and rare neoplastic diseases were the most prevalent disease scenario (59%). In rare neoplasms, the most frequent applications of ML models were the differential diagnosis or stratification of patients (38.5%) and the identification of somatic mutations (30.8%). In other rare diseases, the most frequent goals were the prioritization of rare variants or genes (55.5%) and the identification of biallelic or digenic inheritance (33.3%). The most frequently employed method was the random forest algorithm (54.5%). In addition, the dataset features needed to train these algorithms differed depending on the goal pursued: for example, the mutational load in each gene for the differential diagnosis of patients, or the combination of genotype features and sequence-derived features (such as GC-content) for the identification of somatic mutations.

Conclusions: ML algorithms based on sequencing data are mainly used for the diagnosis of rare neoplastic diseases, with random forest being the most common approach. We identified key features in the datasets used for training these ML models according to the objective pursued. These features can support the development of future ML models in the diagnosis of rare genetic diseases.

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. Orphanet Journal of Rare Diseases, 18 papers in training set, Top 0.1%, 14.5%
2. Cancer Medicine, 24 papers in training set, Top 0.1%, 7.2%
3. Scientific Reports, 3102 papers in training set, Top 14%, 6.9%
4. JCO Clinical Cancer Informatics, 18 papers in training set, Top 0.2%, 4.3%
5. Genetics in Medicine, 69 papers in training set, Top 0.4%, 4.2%
6. PLOS ONE, 4510 papers in training set, Top 39%, 3.6%
7. BMC Bioinformatics, 383 papers in training set, Top 3%, 3.6%
8. BMJ Health & Care Informatics, 13 papers in training set, Top 0.2%, 3.6%
9. International Journal of Medical Informatics, 25 papers in training set, Top 0.5%, 2.6%

50% of probability mass above

10. BMC Medical Genomics, 36 papers in training set, Top 0.2%, 2.5%
11. Human Mutation, 29 papers in training set, Top 0.3%, 2.4%
12. Biology Methods and Protocols, 53 papers in training set, Top 0.6%, 2.1%
13. BMC Medical Informatics and Decision Making, 39 papers in training set, Top 1%, 1.9%
14. Journal of the American Medical Informatics Association, 61 papers in training set, Top 1%, 1.9%
15. BMC Medical Research Methodology, 43 papers in training set, Top 0.5%, 1.8%
16. Journal of Medical Internet Research, 85 papers in training set, Top 3%, 1.7%
17. Informatics in Medicine Unlocked, 21 papers in training set, Top 0.5%, 1.7%
18. Journal of Personalized Medicine, 28 papers in training set, Top 0.3%, 1.7%
19. Diagnostics, 48 papers in training set, Top 1%, 1.7%
20. JAMIA Open, 37 papers in training set, Top 1.0%, 1.3%
21. Frontiers in Digital Health, 20 papers in training set, Top 0.8%, 1.3%
22. Clinical Chemistry, 22 papers in training set, Top 0.6%, 1.1%
23. Acta Neuropsychiatrica, 12 papers in training set, Top 0.7%, 1.0%
24. Cancers, 200 papers in training set, Top 4%, 0.8%
25. JMIR Medical Informatics, 17 papers in training set, Top 1%, 0.8%
26. PLOS Computational Biology, 1633 papers in training set, Top 24%, 0.8%
27. Frontiers in Oncology, 95 papers in training set, Top 4%, 0.7%
28. American Journal of Medical Genetics Part A, 17 papers in training set, Top 0.3%, 0.7%
29. Database, 51 papers in training set, Top 0.5%, 0.7%
30. The Journal of Molecular Diagnostics, 36 papers in training set, Top 0.5%, 0.7%
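
The 50% cutoff stated for the top 9 journals can be checked by accumulating the listed values. The sketch below is only an illustration, not the page's own method: it assumes that the last percentage in each entry is that journal's predicted probability, and the function name `rank_reaching` is hypothetical.

```python
# Values for ranks 1-10, copied from the list above.
# Assumption: the final percentage of each entry is the predicted probability.
probabilities = [14.5, 7.2, 6.9, 4.3, 4.2, 3.6, 3.6, 3.6, 2.6, 2.5]


def rank_reaching(threshold, values):
    """Return (rank, running_total) where the running total first reaches threshold.

    Rank is 1-based. Returns (None, total) if the threshold is never reached.
    """
    total = 0.0
    for rank, value in enumerate(values, start=1):
        total += value
        if total >= threshold:
            return rank, total
    return None, total


rank, total = rank_reaching(50.0, probabilities)
print(rank, round(total, 1))
```

Under that assumption the running total is 47.9% after rank 8 and 50.5% after rank 9, which is consistent with the statement that the top 9 journals account for 50% of the predicted probability mass.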