
Machine learning-based framework for predicting human infection potential of coronavirus associated with tri-amino acid motifs, KIQ and LEP in spike protein

Chanraeng, N.; Guo, J.; Srisongkram, T.; Hinwan, Y.; Fransson, P.; Sjödin, H.; Matsuura, Y.; Overgaard, H. J.; Panthong, W.; Ekalaksananan, T.; Pientong, C.; Phanthanawiboon, S.

2026-02-16 bioinformatics
10.64898/2026.02.02.703238 bioRxiv

Assessing the human infection potential of emerging coronaviruses remains a critical challenge for global health preparedness. In this study, we developed a machine learning-based framework to predict the human infection potential of coronaviruses and to identify associated sequence motifs using spike (S) protein sequences. A total of 3,904 complete S protein sequences were collected, annotated as human-infecting or non-human-infecting, and encoded using trimer-based k-mer features. Model benchmarking was conducted across 27 machine learning algorithms, followed by hyperparameter optimization of the selected model. Robustness and generalizability were evaluated using k-fold cross-validation and independent external validation. Feature interpretability was further assessed using SHAP analysis to identify sequence determinants associated with infection potential. The Random Forest classifier achieved the best performance, with accuracy, sensitivity, and specificity of 97.8%, 99%, and 97.4%, respectively, and demonstrated stable predictive performance across validation datasets. Notably, the KIQ and LEP motifs were strongly associated with human-infecting coronaviruses and mapped to the HR1 and N-terminal domain regions of the S protein. Overall, this framework provides a practical approach for risk assessment and surveillance of emerging coronaviruses.

Author summary

Emerging coronaviruses continue to threaten global public health, but rapidly identifying viruses with the potential to infect humans remains challenging. Traditional experimental approaches are time-consuming and resource-intensive, limiting their use for large-scale surveillance. In this study, we developed a machine learning-based workflow to assess the human infection potential of coronaviruses using spike protein sequences. By analyzing sequence patterns across a diverse set of coronaviruses, our framework enables rapid screening of coronaviruses from multiple host species. Unlike previous studies focused on limited coronavirus genera, our approach integrates all four genera and systematically evaluates multiple learning strategies. Importantly, our analysis identifies conserved sequence motifs linked to human infection potential, bridging predictive performance with biological interpretability. Our findings demonstrate that computational approaches can support early warning systems for identifying high-risk coronaviruses, helping to prioritize viruses for experimental validation, guide surveillance efforts, and strengthen global pandemic preparedness under a One Health perspective.
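
The trimer-based k-mer encoding mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the choice of the 20 standard amino acids as the alphabet, the frequency normalization, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted;
# trimers containing other symbols (e.g. X, gaps) are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def trimer_counts(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers (trimers by default) in a protein sequence."""
    seq = sequence.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in windows if all(c in AMINO_ACIDS for c in w))


def trimer_vector(sequence: str, k: int = 3) -> list[float]:
    """Map a sequence to a fixed-length vector of k-mer frequencies.

    The vector has 20**k entries (8,000 for trimers), ordered
    lexicographically over AMINO_ACIDS.
    """
    counts = trimer_counts(sequence, k)
    total = sum(counts.values())
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[t] / total if total else 0.0 for t in vocabulary]


# Hypothetical toy fragment, not a real spike protein sequence.
toy_sequence = "MKIQLEPKIQ"
counts = trimer_counts(toy_sequence)
vector = trimer_vector(toy_sequence)
```

A vector of this kind, one per sequence, is the sort of input a classifier such as a Random Forest could be trained on; individual positions in the vector correspond to trimers such as KIQ or LEP, which is what makes feature-attribution methods applicable to motif discovery.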

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Computers in Biology and Medicine | 120 | Top 0.1% | 10.1%
2 | PLOS Computational Biology | 1633 | Top 3% | 10.1%
3 | PLOS ONE | 4510 | Top 23% | 8.2%
4 | Briefings in Bioinformatics | 326 | Top 0.6% | 7.2%
5 | Viruses | 318 | Top 0.8% | 6.4%
6 | Scientific Reports | 3102 | Top 24% | 4.8%
7 | Bioinformatics Advances | 184 | Top 1% | 3.6%
50% of probability mass above
8 | Computational and Structural Biotechnology Journal | 216 | Top 2% | 3.1%
9 | Patterns | 70 | Top 0.3% | 2.9%
10 | BMC Bioinformatics | 383 | Top 3% | 2.7%
11 | Journal of Chemical Information and Modeling | 207 | Top 2% | 2.6%
12 | GigaScience | 172 | Top 1.0% | 2.1%
13 | Frontiers in Bioinformatics | 45 | Top 0.2% | 1.8%
14 | Frontiers in Immunology | 586 | Top 4% | 1.7%
15 | Bioinformatics | 1061 | Top 7% | 1.7%
16 | mSphere | 281 | Top 4% | 1.3%
17 | NAR Genomics and Bioinformatics | 214 | Top 3% | 0.9%
18 | BMC Genomics | 328 | Top 4% | 0.9%
19 | Virus Evolution | 140 | Top 1% | 0.9%
20 | Journal of Proteome Research | 215 | Top 2% | 0.9%
21 | Advanced Science | 249 | Top 18% | 0.8%
22 | Biology | 43 | Top 2% | 0.8%
23 | Communications Biology | 886 | Top 24% | 0.7%
24 | PLOS Biology | 408 | Top 20% | 0.7%
25 | iScience | 1063 | Top 35% | 0.7%
26 | Tropical Medicine & International Health | 15 | Top 0.9% | 0.6%
27 | Journal of Biomedical Informatics | 45 | Top 2% | 0.6%
28 | Genome Medicine | 154 | Top 9% | 0.6%
29 | Genomics, Proteomics & Bioinformatics | 171 | Top 7% | 0.6%
30 | Influenza and Other Respiratory Viruses | 44 | Top 0.6% | 0.6%
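
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below does this for the first ten entries; the function name is hypothetical, and because the displayed values are rounded to one decimal place the result is only approximate.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the listing above.
probabilities = [10.1, 10.1, 8.2, 7.2, 6.4, 4.8, 3.6, 3.1, 2.9, 2.7]


def journals_to_reach(probs, threshold=50.0):
    """Return (rank, cumulative %) at which the running total first
    meets the threshold, or None if it is never reached."""
    for rank, cumulative in enumerate(accumulate(probs), start=1):
        if cumulative >= threshold - 1e-9:  # tolerate float rounding
            return rank, cumulative
    return None


result = journals_to_reach(probabilities)
```

With these values the running total first reaches 50% at rank 7, which agrees with the position of the "50% of probability mass above" marker in the list.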