A systematic assessment of machine learning for structural variant filtering

Kalra, A.; Paulin, L.; Sedlazeck, F.

2026-01-30 bioinformatics
10.64898/2026.01.27.702059 bioRxiv
Background: Accurate discrimination of true structural variants (SVs) from artifacts in long-read sequencing data remains a critical bottleneck. Numerous machine learning solutions have been proposed, ranging from classical models using engineered features to advanced deep learning and foundation model interpretability methods. However, a systematic comparison of their performance, efficiency, and practical utility is lacking.

Results: We conducted a comprehensive benchmark of five machine learning paradigms for SV filtering using standardized Genome in a Bottle (GIAB) data for samples HG002 and HG005. We evaluated classical Random Forest classifiers on 15 genomic features, computer vision models (ResNet/VICReg), diffusion-based anomaly detection, sparse autoencoders (SAEs) on the Evo2-7B foundation model, and multimodal ensembles. A simple Random Forest on interpretable features achieved a peak F1-score of 95.7%, effectively matching all more complex models (ResNet50: 95.9%, Diffusion: 95.8%). This study represents the first application of diffusion-based anomaly detection and sparse autoencoders to structural variant analysis; while diffusion models learned highly discriminative, disentangled representations and SAEs uncovered biologically interpretable features (including atoms that were specific for ALU deletions, chromosome X variants and insertion events), they did not significantly surpass this classification ceiling. Ensemble methods offered no performance benefit but may have future potential given the orthogonality of vision-based and linear features.

Conclusions: Our findings demonstrate that for the established task of germline SV filtering, simpler, interpretable models provide an optimal balance of accuracy, speed, and transparency. This benchmark establishes a pragmatic framework for method selection and argues that increased model complexity must be justified by clear, unmet biological needs rather than marginal predictive gains.
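The abstract compares models by F1-score. As a reminder of what that metric measures, here is a minimal sketch of the standard F1 calculation from confusion counts. The abstract does not say how true and false calls were matched against the GIAB truth set, so the function name `f1_score` and the counts in the example are hypothetical and chosen only for illustration; they are not results from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from confusion counts.

    tp: calls kept by the filter that are true SVs
    fp: calls kept by the filter that are artifacts
    fn: true SVs that the filter discarded
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, NOT taken from the paper.
example_f1 = f1_score(tp=90, fp=10, fn=10)
print(f"F1 = {example_f1:.3f}")
```

Because F1 combines precision and recall into one number, two filters with nearly identical F1 (as reported for Random Forest, ResNet50, and Diffusion) can still differ in which kind of error they make, so the underlying counts are worth inspecting alongside the summary score.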

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set, Top 0.3%, 18.8%
2. Genome Medicine: 154 papers in training set, Top 0.7%, 7.3%
3. Bioinformatics: 1061 papers in training set, Top 4%, 6.4%
4. BMC Genomics: 328 papers in training set, Top 0.4%, 4.9%
5. Bioinformatics Advances: 184 papers in training set, Top 0.6%, 4.9%
6. PLOS Computational Biology: 1633 papers in training set, Top 8%, 4.4%
7. GigaScience: 172 papers in training set, Top 0.3%, 4.2%
50% of probability mass above
8. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 1%, 4.0%
9. BioData Mining: 15 papers in training set, Top 0.1%, 4.0%
10. Frontiers in Genetics: 197 papers in training set, Top 2%, 3.7%
11. Scientific Reports: 3102 papers in training set, Top 35%, 3.6%
12. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.9%, 3.3%
13. PLOS ONE: 4510 papers in training set, Top 48%, 2.1%
14. The American Journal of Human Genetics: 206 papers in training set, Top 2%, 1.9%
15. Briefings in Bioinformatics: 326 papers in training set, Top 3%, 1.9%
16. Biology Methods and Protocols: 53 papers in training set, Top 0.7%, 1.9%
17. Nature Communications: 4913 papers in training set, Top 54%, 1.3%
18. Frontiers in Bioinformatics: 45 papers in training set, Top 0.4%, 1.2%
19. European Journal of Human Genetics: 49 papers in training set, Top 0.9%, 1.0%
20. BMC Medical Genomics: 36 papers in training set, Top 0.9%, 1.0%
21. Cell Genomics: 162 papers in training set, Top 5%, 1.0%
22. npj Genomic Medicine: 33 papers in training set, Top 0.7%, 0.9%
23. PeerJ: 261 papers in training set, Top 13%, 0.8%
24. Communications Biology: 886 papers in training set, Top 21%, 0.8%
25. Human Mutation: 29 papers in training set, Top 0.6%, 0.8%
26. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.8%, 0.8%
27. Journal of the American Medical Informatics Association: 61 papers in training set, Top 2%, 0.7%
28. Genetic Epidemiology: 46 papers in training set, Top 0.9%, 0.7%
29. Human Genetics: 25 papers in training set, Top 0.5%, 0.7%
30. npj Breast Cancer: 18 papers in training set, Top 0.3%, 0.5%
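
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below uses the first ten percentages from the list; the helper name `journals_to_reach` is hypothetical and is not part of the site that produced these predictions.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the list above.
probs = [18.8, 7.3, 6.4, 4.9, 4.9, 4.4, 4.2, 4.0, 4.0, 3.7]


def journals_to_reach(probabilities, threshold):
    """Return the smallest number of top-ranked entries whose cumulative
    probability reaches `threshold`, or None if it is never reached."""
    for rank, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return rank
    return None


cutoff = journals_to_reach(probs, 50.0)
print(cutoff)
```

The cumulative sum after rank 6 is about 46.7% and after rank 7 about 50.9%, which is why the 50% marker falls between GigaScience and Computational and Structural Biotechnology Journal.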