TF-IDF k-mer-based Classical and Hybrid Machine Learning Models for SARS-CoV-2 Variant Classification under Imbalanced Genomic Data

Haque, N.; Mazed, A.; Ankhi, J. N.; Uddin, M. J.

2026-04-02 bioinformatics

10.64898/2026.04.02.716024 bioRxiv

Show abstract

Accurate classification of SARS-CoV-2 genomic variants is essential for effective genomic surveillance, yet it is challenged by extreme class imbalance, limited representation of rare variants, and distribution shifts in real-world sequencing data. In this study, we employed hybrid RF-SVM framework designed for robust detection of rare SARS-CoV-2 variants. It integrates a random forest and a polynomial-kernel based support vector machine to enhance sensitivity to minority classes while maintaining overall predictive stability. We systematically compared classical machine learning models, deep learning approaches, and hybrid strategies under both standard and distribution-shifted evaluation settings. Our results show that classical models using TF-IDF-based k-mer features outperform deep learning methods on macro-averaged performance metrics. The Random Forest classifier using TF-IDF Feature achieved the best overall performance, with a macro-averaged F1-score of 0.8894 and an accuracy of 96.3%. The model also demonstrated strong generalization ability, as evidenced by stable cross-validation performance (CV accuracy = 0.9637). Hybrid RF-SVM model further improves rare variant detection under severe class imbalance. Calibration analysis indicates reliable probability estimates for common variants, although challenges persist for minority classes. Overall, this study highlights the limitations of deep learning in highly imbalanced genomic settings and demonstrates that carefully designed hybrid machine learning approaches provide an effective and interpretable solution for rare SARS-CoV-2 variant detection.

TF-IDF k-mer-based Classical and Hybrid Machine Learning Models for SARS-CoV-2 Variant Classification under Imbalanced Genomic Data

Matching journals