Comparison of missing data handling methods for variant pathogenicity predictors

Särkkä, M. I.; Myöhänen, S.; Marinov, K.; Saarinen, I.; Lahti, L.; Fortino, V.; Paananen, J.

2022-06-18 bioinformatics
10.1101/2022.06.17.496578 bioRxiv
Background: Modern clinical genetic tests use next-generation sequencing (NGS) approaches to comprehensively analyze genetic variants from patients. Out of these millions of variants, the clinically relevant variants that match the patient's phenotype need to be identified accurately and quickly enough to support clinical action. Because manual evaluation of variants cannot meet the speed and volume requirements of clinical genetic testing, automated solutions are needed. Various machine learning (ML), artificial intelligence (AI), and in silico variant pathogenicity predictors have been developed to solve this challenge. These solutions rely on the comprehensiveness of the available data and struggle with the sparse nature of genetic variant data. Careful treatment of missing data is therefore necessary, and the selected methods may have a large impact on accuracy, reliability, speed, and associated computational costs.

Results: We present an open-source framework called AMISS that can be used to evaluate the performance of different methods for handling missing genetic variant data in the context of variant pathogenicity prediction. Using AMISS, we evaluated 14 methods for handling missing values. The performance of these methods varied substantially in terms of precision, computational costs, and other attributes. Overall, simpler imputation methods, and specifically mean imputation, performed best.

Conclusions: Selection of the missing data handling method is crucial for AI/ML-based classification of genetic variants. We show that sophisticated imputation methods are not worth their cost in the context of genetic variant pathogenicity classification.
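To make the mean imputation mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the AMISS implementation: the function names `fit_column_means` and `impute`, the toy feature values, and the choice to learn the means on a training set and reuse them on a test set are all assumptions made for illustration.

```python
from statistics import mean


def fit_column_means(rows):
    """Compute the mean of each column over its observed (non-None) values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: fall back to 0.0 if a column has no observed values at all.
        means.append(mean(observed) if observed else 0.0)
    return means


def impute(rows, means):
    """Replace every missing value (None) with the stored mean of its column."""
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Hypothetical variant annotation features; None marks a missing value.
train = [
    [0.75, None, 12.0],
    [0.25, 0.5, None],
    [None, 1.0, 30.0],
]
test = [[None, 0.5, None]]

means = fit_column_means(train)       # learned on the training data only
train_imputed = impute(train, means)
test_imputed = impute(test, means)    # test data reuses the training means
```

Reusing the training means on the test set keeps information from the test data out of the imputation step, which is a common precaution when imputation is followed by classifier evaluation.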

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics; 383 papers in training set; Top 0.5%; 14.4%
2. BioData Mining; 15 papers in training set; Top 0.1%; 12.3%
3. Journal of the American Medical Informatics Association; 61 papers in training set; Top 0.5%; 6.4%
4. Human Mutation; 29 papers in training set; Top 0.1%; 4.9%
5. Bioinformatics; 1061 papers in training set; Top 4%; 4.9%
6. Scientific Reports; 3102 papers in training set; Top 28%; 4.3%
7. PLOS ONE; 4510 papers in training set; Top 34%; 4.3%
50% of probability mass above
8. BMC Medical Genomics; 36 papers in training set; Top 0.2%; 3.6%
9. PLOS Computational Biology; 1633 papers in training set; Top 10%; 3.6%
10. Genome Medicine; 154 papers in training set; Top 2%; 3.6%
11. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 2%; 3.1%
12. Bioinformatics Advances; 184 papers in training set; Top 2%; 2.7%
13. Biology Methods and Protocols; 53 papers in training set; Top 0.8%; 1.8%
14. BMC Genomics; 328 papers in training set; Top 2%; 1.7%
15. Genetics in Medicine; 69 papers in training set; Top 0.6%; 1.7%
16. European Journal of Human Genetics; 49 papers in training set; Top 0.6%; 1.7%
17. BMC Medical Informatics and Decision Making; 39 papers in training set; Top 2%; 1.5%
18. GigaScience; 172 papers in training set; Top 2%; 1.3%
19. Human Genetics; 25 papers in training set; Top 0.3%; 1.2%
20. Clinical Chemistry; 22 papers in training set; Top 0.6%; 1.1%
21. BMC Medical Research Methodology; 43 papers in training set; Top 1%; 0.9%
22. JCO Clinical Cancer Informatics; 18 papers in training set; Top 0.8%; 0.8%
23. Frontiers in Genetics; 197 papers in training set; Top 9%; 0.8%
24. F1000Research; 79 papers in training set; Top 5%; 0.7%
25. Briefings in Bioinformatics; 326 papers in training set; Top 7%; 0.7%
26. PeerJ; 261 papers in training set; Top 16%; 0.7%
27. IEEE/ACM Transactions on Computational Biology and Bioinformatics; 32 papers in training set; Top 0.6%; 0.7%
28. npj Genomic Medicine; 33 papers in training set; Top 1%; 0.7%
29. Nucleic Acids Research; 1128 papers in training set; Top 20%; 0.6%
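
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below does this for the first ten entries; the helper name `journals_to_reach` is hypothetical, and the inputs are the rounded percentages shown in the list, so the running total is only approximate.

```python
def journals_to_reach(probabilities, threshold):
    """Return (count, running_total) once the cumulative sum reaches threshold."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None  # threshold not reached with the values supplied


# Rounded predicted probabilities (in percent) for ranks 1 to 10, as listed.
top_probabilities = [14.4, 12.3, 6.4, 4.9, 4.9, 4.3, 4.3, 3.6, 3.6, 3.6]

count, total = journals_to_reach(top_probabilities, 50.0)
print(f"{count} journals reach {total:.1f}% of the probability mass")
```

With these rounded values the first six journals sum to about 47.2% and the seventh brings the total to about 51.5%, which agrees with the cutoff marked in the list.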