VarDCL: A Multimodal PLM-Enhanced Framework for Missense Variant Effect Prediction via Self-distilled Contrastive Learning

Zhang, H.; Zheng, G.; Xu, Z.; Zhao, H.; Cai, S.; Huang, Y.; Zhou, Z.; Wei, Y.

2026-03-17 bioinformatics
10.64898/2026.03.13.711612 bioRxiv
Missense variants are a common type of genetic mutation that can alter the structure and function of proteins, thereby affecting the normal physiological processes of organisms. Accurately distinguishing damaging missense variants from benign ones is important for clinical genetic diagnosis, treatment strategy development, and protein engineering. Here, we propose VarDCL, a method that integrates multimodal protein language model embeddings with self-distilled contrastive learning to identify subtle sequence and structural differences before and after a mutation and thereby predict pathogenic missense variants. First, using sequence and structural information before and after the mutation, VarDCL generates sequence-structure multimodal features with different language models. It incorporates both global and local views of the feature embeddings, giving the model dynamic, multimodal, and multi-view input data. Second, a Self-distilled Contrastive Learning (SDCL) module is introduced to enable more effective information integration and feature learning, enhancing the model's ability to detect sequence and structural changes induced by mutations. Within this module, a multi-level contrastive learning framework captures information differences before and after mutation within the same modality, while a feature self-distillation mechanism uses high-level fused features to guide the learning of low-level differential features, facilitating information interaction across modalities. Together, these components allow the model to learn pre- and post-mutation changes and improve cross-modal interaction between sequence and structure, which boosts its performance in distinguishing pathogenic mutations from benign ones.
To validate the effectiveness of VarDCL, extensive experiments were conducted. The ablation study shows that all key components of VarDCL contribute significantly. On an independent test set containing 18,731 clinical variants, VarDCL achieved an AUC of 0.917, an AUPR of 0.876, an MCC of 0.690, and an F1-score of 0.789, outperforming 21 existing state-of-the-art methods. Benchmark analysis shows that VarDCL can serve as an accurate tool for predicting missense variant effects.
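
For readers less familiar with two of the threshold-based metrics reported above, the following is a minimal sketch of how F1 and MCC are computed from binary confusion-matrix counts. The function name `f1_and_mcc` and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper and do not reproduce its results.

```python
import math


def f1_and_mcc(tp, fp, tn, fn):
    """Return (F1, MCC) from binary confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives
    and false negatives, with "pathogenic" treated as the positive class.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # MCC uses all four cells, so it stays informative on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Hypothetical counts, for illustration only.
f1, mcc = f1_and_mcc(tp=40, fp=10, tn=40, fn=10)
print(round(f1, 3), round(mcc, 3))
```

AUC and AUPR, by contrast, are computed from the ranked prediction scores rather than from a single confusion matrix, so they are not covered by this sketch.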

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Briefings in Bioinformatics; 326 papers in training set; Top 0.1%; 14.8%
2. Advanced Science; 249 papers in training set; Top 0.6%; 14.5%
3. Nature Machine Intelligence; 61 papers in training set; Top 0.1%; 10.5%
4. IEEE Journal of Biomedical and Health Informatics; 34 papers in training set; Top 0.3%; 4.9%
5. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 2%; 4.0%
6. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 2%; 3.6%

50% of probability mass above

7. Bioinformatics; 1061 papers in training set; Top 5%; 3.6%
8. IEEE Transactions on Computational Biology and Bioinformatics; 17 papers in training set; Top 0.1%; 2.9%
9. Nucleic Acids Research; 1128 papers in training set; Top 7%; 2.8%
10. Nature Communications; 4913 papers in training set; Top 44%; 2.6%
11. Journal of Chemical Information and Modeling; 207 papers in training set; Top 2%; 2.6%
12. National Science Review; 22 papers in training set; Top 0.5%; 2.6%
13. Science Bulletin; 22 papers in training set; Top 0.3%; 1.7%
14. PLOS Computational Biology; 1633 papers in training set; Top 16%; 1.7%
15. Computers in Biology and Medicine; 120 papers in training set; Top 3%; 1.3%
16. Quantitative Biology; 11 papers in training set; Top 0.5%; 1.0%
17. Communications Biology; 886 papers in training set; Top 21%; 0.8%
18. Scientific Reports; 3102 papers in training set; Top 73%; 0.8%
19. PLOS ONE; 4510 papers in training set; Top 68%; 0.8%
20. Frontiers in Molecular Biosciences; 100 papers in training set; Top 5%; 0.8%
21. NAR Genomics and Bioinformatics; 214 papers in training set; Top 4%; 0.7%
22. Journal of Genetics and Genomics; 36 papers in training set; Top 2%; 0.7%
23. Genome Medicine; 154 papers in training set; Top 8%; 0.7%
24. International Journal of Molecular Sciences; 453 papers in training set; Top 16%; 0.7%
25. Communications Chemistry; 39 papers in training set; Top 1%; 0.7%
26. Frontiers in Genetics; 197 papers in training set; Top 10%; 0.7%
27. Cell Systems; 167 papers in training set; Top 13%; 0.6%
28. Journal of Molecular Biology; 217 papers in training set; Top 4%; 0.6%
29. Medical Image Analysis; 33 papers in training set; Top 1%; 0.5%
30. Expert Systems with Applications; 11 papers in training set; Top 0.7%; 0.5%
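
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running total. The sketch below assumes that the final percentage in each entry is that journal's predicted probability; the helper name `journals_to_reach` is hypothetical, and only the first ten values from the list above are included.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the list above.
probs = [14.8, 14.5, 10.5, 4.9, 4.0, 3.6, 3.6, 2.9, 2.8, 2.6]


def journals_to_reach(probs, threshold):
    """Return how many top-ranked entries are needed for the
    cumulative probability to reach `threshold`, or None if it never does."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


print(journals_to_reach(probs, 50.0))  # prints 6
```

The running total is about 48.7% after five journals and about 52.3% after six, which is consistent with the 50% marker placed between ranks 6 and 7.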