Improving Variant Effect Prediction by Steering Sparse Mechanistic Features in Protein Language Models

Wang, M.; Yuan, M.; Vasilakos, A. V.; He, Y.; Ren, Z.

2026-05-15 bioinformatics
10.64898/2026.05.12.724472 bioRxiv
Protein language models (PLMs) like the ESM series encapsulate immense evolutionary knowledge within their high-dimensional continuous embeddings. However, these latent representations are densely entangled, obscuring the fine-grained biophysical constraints necessary for precise functional resolution. To unlock the full expressive power of these embeddings, we propose PLM-SAE, a mechanistic framework that employs Sparse Autoencoders (SAEs) to disentangle PLM representations into discrete, biologically interpretable activations. By isolating and directly intervening on critical functional features, we fundamentally enhance the structural and mutational awareness of the underlying embeddings. We rigorously validate this embedding enhancement on variant effect prediction (VEP). In the unsupervised zero-shot setting, our sparse modulation elevates the state-of-the-art ESM-3 model, yielding performance improvements across 114 deep mutational scanning datasets and delivering an 80.8% relative improvement on challenging targets like the human E3 ubiquitin ligase HECD1. Furthermore, our target-specific differentiable gating mechanism achieves consistent performance gains in over 80% of evaluated datasets with an average Spearman's ρ increase of +0.138. Finally, extending this approach to a cross-fitness multitask architecture establishes new state-of-the-art results on 17 VenusMutHub datasets, highlighted by a 169.0% performance surge in small-molecule binding predictions. Our work demonstrates that refining the highly entangled latent manifold via sparse modulation provides a robust and generalizable foundation for enhancing downstream PLM capabilities.
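
The abstract describes encoding PLM embeddings into sparse features, intervening on selected features, and decoding back into embedding space. The following is a minimal sketch of that general idea, not the authors' PLM-SAE implementation. Everything in it is an assumption made for illustration: the toy dimensions, the randomly initialised weights standing in for a trained SAE, the ReLU encoder, and the hypothetical helper names `sae_encode`, `sae_decode`, and `steer`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; real PLM embeddings and SAE dictionaries are far larger.
d_model, d_sae = 8, 32

# Random weights stand in for a trained sparse autoencoder (assumption).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)


def sae_encode(h):
    """Map dense embeddings to non-negative feature activations via ReLU."""
    return np.maximum(0.0, (h - b_dec) @ W_enc + b_enc)


def sae_decode(z):
    """Map feature activations back to embedding space."""
    return z @ W_dec + b_dec


def steer(h, feature_idx, scale):
    """Rescale one SAE feature and return the edited embedding.

    The SAE reconstruction error is added back, so scale=1.0
    returns the original embedding unchanged.
    """
    z = sae_encode(h)
    residual = h - sae_decode(z)
    z_steered = z.copy()
    z_steered[..., feature_idx] *= scale
    return sae_decode(z_steered) + residual


# Stand-in for per-residue embeddings of a 5-residue protein.
h = rng.normal(size=(5, d_model))
h_same = steer(h, feature_idx=3, scale=1.0)  # no-op intervention
h_amp = steer(h, feature_idx=3, scale=2.0)   # amplify feature 3
```

Adding the residual back is one common design choice for this kind of intervention: the edit then changes the embedding only along the decoder direction of the chosen feature, rather than also injecting the autoencoder's reconstruction error.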

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Percentile  Probability
1     Cell Systems                                     167                     Top 0.1%    26.5%
2     Nature Communications                            4913                    Top 17%     10.4%
3     Nature Biotechnology                             147                     Top 0.9%    8.6%
4     Science                                          429                     Top 5%      6.5%
      (50% of probability mass above)
5     Proceedings of the National Academy of Sciences  2130                    Top 13%     5.0%
6     Nature Machine Intelligence                      61                      Top 0.5%    5.0%
7     Nature Methods                                   336                     Top 2%      4.4%
8     Advanced Science                                 249                     Top 5%      3.7%
9     Nature                                           575                     Top 8%      3.1%
10    Nucleic Acids Research                           1128                    Top 9%      1.9%
11    Cell Genomics                                    162                     Top 4%      1.5%
12    Bioinformatics                                   1061                    Top 8%      1.4%
13    The American Journal of Human Genetics           206                     Top 3%      1.4%
14    Communications Biology                           886                     Top 14%     1.3%
15    Nature Computational Science                     50                      Top 1%      1.0%
16    Briefings in Bioinformatics                      326                     Top 5%      1.0%
17    Nature Genetics                                  240                     Top 6%      0.9%
18    Nature Cell Biology                              99                      Top 4%      0.9%
19    Patterns                                         70                      Top 2%      0.8%
20    Science Advances                                 1098                    Top 29%     0.8%
21    Nature Chemical Biology                          104                     Top 4%      0.7%
22    Biology                                          43                      Top 3%      0.7%
23    Genome Biology                                   555                     Top 8%      0.7%
24    PLOS Computational Biology                       1633                    Top 27%     0.7%
25    Physical Review Research                         46                      Top 1.0%    0.7%
26    Cell Reports                                     1338                    Top 36%     0.5%
27    Nano Letters                                     63                      Top 3%      0.5%
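
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The snippet below is a small sketch of that arithmetic using the first five percentages from the listing above; the function name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1-5, copied from the listing.
top_probs = [26.5, 10.4, 8.6, 6.5, 5.0]


def journals_to_reach(probs, threshold):
    """Return (rank, cumulative mass) at the first rank whose running total
    meets the threshold, or None if the threshold is never reached."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank, total
    return None


rank, total = journals_to_reach(top_probs, 50.0)
print(f"{rank} journals cover {total:.1f}% of the probability mass")
```

The running total first crosses 50% at rank 4, where the cumulative mass is about 52%, which is consistent with the page's summary line.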