Machine learning-based framework for predicting human infection potential of coronavirus associated with tri-amino acid motifs, KIQ and LEP in spike protein
Chanraeng, N.; Guo, J.; Srisongkram, T.; Hinwan, Y.; Fransson, P.; Sjödin, H.; Matsuura, Y.; Overgaard, H. J.; Panthong, W.; Ekalaksananan, T.; Pientong, C.; Phanthanawiboon, S.
Assessing the human infection potential of emerging coronaviruses remains a critical challenge for global health preparedness. In this study, we developed a machine learning-based framework to predict the human infection potential of coronaviruses and to identify associated sequence motifs using spike (S) protein sequences. A total of 3,904 complete S protein sequences were collected, annotated as human-infecting or non-human-infecting, and encoded using trimer-based k-mer features. Model benchmarking was conducted across 27 machine learning algorithms, followed by hyperparameter optimization of the selected model. Robustness and generalizability were evaluated using k-fold cross-validation and independent external validation. Feature interpretability was further assessed using SHAP analysis to identify sequence determinants associated with infection potential. The Random Forest classifier achieved the best performance, with accuracy, sensitivity, and specificity of 97.8%, 99%, and 97.4%, respectively, and demonstrated stable predictive performance across validation datasets. Notably, the KIQ and LEP motifs were strongly associated with human-infecting coronaviruses and mapped to the HR1 and N-terminal domain regions of the S protein. Overall, this framework provides a practical approach for risk assessment and surveillance of emerging coronaviruses.
Author summary
Emerging coronaviruses continue to threaten global public health, but rapidly identifying viruses with the potential to infect humans remains challenging. Traditional experimental approaches are time-consuming and resource-intensive, limiting their use for large-scale surveillance. In this study, we developed a machine learning-based workflow to assess the human infection potential of coronaviruses using spike protein sequences. By analyzing sequence patterns across a diverse set of coronaviruses, our framework enables rapid screening of coronaviruses from multiple host species.
Unlike previous studies focused on limited coronavirus genera, our approach integrates all four genera and systematically evaluates multiple learning strategies. Importantly, our analysis identifies conserved sequence motifs linked to human infection potential, bridging predictive performance with biological interpretability. Our findings demonstrate that computational approaches can support early warning systems for identifying high-risk coronaviruses, helping to prioritize viruses for experimental validation, guide surveillance efforts, and strengthen global pandemic preparedness under a One Health perspective.
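The trimer-based k-mer encoding mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `trimer_counts` and `encode`, the frequency normalization, and the toy sequence are assumptions made for illustration, and the abstract does not state how the features were scaled or how non-standard residues were handled.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; 20**3 = 8000 possible trimers.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCABULARY = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def trimer_counts(sequence, k=3):
    """Count overlapping k-mers (trimers by default) in a protein sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def encode(sequence, vocabulary, k=3):
    """Turn a sequence into a fixed-length vector of k-mer frequencies.

    Assumption: counts are divided by the total number of k-mers in the
    sequence. K-mers absent from the vocabulary (for example, those
    containing an ambiguous residue such as X) are simply not represented.
    """
    counts = trimer_counts(sequence, k)
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[kmer] / total for kmer in vocabulary]


# Toy sequence (hypothetical, not a real spike protein) containing both motifs.
toy = "MKIQLEPKIQ"
counts = trimer_counts(toy)
vector = encode(toy, VOCABULARY)

print(counts["KIQ"], counts["LEP"], len(vector))
```

A vector of this kind, one per spike sequence, could then be passed to a classifier such as a random forest, with a feature-attribution method used to rank trimers like KIQ and LEP by their contribution to the prediction.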
Matching journals
The top 7 journals account for 50% of the predicted probability mass.