
AINN-P1: A Compact Sequence-Only Protein Language Model Achieves Competitive Fitness Prediction on ProteinGym

Wang, R.; Jin, K.; Pan, L.

2026-03-30 · bioRxiv (bioinformatics)
doi: 10.64898/2026.03.26.714619

Protein language models (PLMs) are increasingly central to protein engineering and drug discovery. Many high-performing systems, however, rely on large parameter counts, multiple sequence alignments (MSAs), explicit structural inputs, or computationally intensive attention mechanisms, limiting their accessibility and throughput. Here we present AINN-P1, a 167M-parameter protein language model trained exclusively on raw UniRef amino-acid sequences using an autoregressive next-token prediction objective. AINN-P1 employs a multiplicative LSTM (mLSTM) architecture: an attention-free, recurrent design that scales linearly with sequence length and avoids growing key-value caches during inference. We evaluate AINN-P1 on ProteinGym fitness prediction tasks spanning activity, binding, expression, and stability using a frozen-encoder protocol with lightweight few-shot regression heads. Under this protocol, AINN-P1 achieves an average Spearman ρ of 0.441 across four task categories and a Spearman ρ of 0.625 on stability, the highest among sequence-only models in our comparison set. Because our evaluation uses few-shot supervised regression rather than the zero-shot scoring employed by most ProteinGym leaderboard baselines, direct numerical comparison requires caution; we discuss this methodological distinction throughout. Beyond benchmark performance, AINN-P1 emphasizes practical deployability: its recurrent architecture avoids quadratic memory scaling, supports fixed-state inference on long sequences, and enables rapid adaptation through frozen embeddings rather than costly end-to-end fine-tuning. We discuss when sequence-only models are sufficient, when structural information remains beneficial, and how compact foundation models can serve as efficient front-end filters in drug discovery workflows.
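As context for the architecture claim, the sketch below is a minimal NumPy implementation of a multiplicative LSTM cell in the Krause et al. (2016) formulation that the mLSTM name conventionally refers to. The abstract does not give AINN-P1's gating details, dimensions, or initialization, so every name and size here is an illustrative assumption; the point the sketch makes concrete is the deployability claim, namely that the recurrent state is two fixed-size vectors, so per-step inference memory is constant in sequence length with no growing key-value cache.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLSTMCell:
    """Multiplicative LSTM cell (Krause et al., 2016): illustrative only.

    The intermediate state m_t = (W_mx x_t) * (W_mh h_{t-1}) gives each
    input token its own effective recurrent transition; the LSTM gates
    then read m_t instead of h_{t-1}. The state is two fixed-size
    vectors (h, c), so inference memory does not grow with length.
    """

    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *s: rng.normal(0.0, 0.02, s)
        self.W_mx, self.W_mh = init(d_hid, d_in), init(d_hid, d_hid)
        # Input/forget/output gates and candidate, each reading (x_t, m_t).
        self.W_x = init(4 * d_hid, d_in)
        self.W_m = init(4 * d_hid, d_hid)
        self.b = np.zeros(4 * d_hid)
        self.d_hid = d_hid

    def step(self, x, h, c):
        m = (self.W_mx @ x) * (self.W_mh @ h)        # multiplicative state
        z = self.W_x @ x + self.W_m @ m + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell update
        h = sigmoid(o) * np.tanh(c)                   # hidden update
        return h, c

    def encode(self, xs):
        """Run over a sequence; memory stays O(d_hid), time O(len(xs))."""
        h = np.zeros(self.d_hid)
        c = np.zeros(self.d_hid)
        for x in xs:                                  # linear in length
            h, c = self.step(x, h, c)
        return h

# Toy usage: 7 "residues" as random feature vectors of dimension 20.
cell = MLSTMCell(d_in=20, d_hid=64)
seq = np.random.default_rng(1).normal(size=(7, 20))
print(cell.encode(seq).shape)  # (64,)
```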
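The frozen-encoder evaluation protocol is likewise easy to state in code. A minimal sketch follows, assuming a hypothetical embed() that mean-pools hidden states from the frozen model (stubbed here so the example runs) and using ridge regression as the lightweight few-shot head; the paper's actual pooling scheme and regression head are not specified in the abstract.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

def embed(sequence: str) -> np.ndarray:
    """Hypothetical frozen-encoder embedding: in the real protocol this
    would run the pretrained model with weights frozen and mean-pool the
    per-residue hidden states. Stubbed with a deterministic random
    vector so the sketch is self-contained and runnable."""
    rng = np.random.default_rng(sum(map(ord, sequence)))
    return rng.normal(size=64)

def few_shot_eval(train_pairs, test_pairs, alpha=1.0):
    """Fit a ridge head on a handful of (sequence, fitness) pairs and
    report Spearman rho on held-out variants (ProteinGym-style)."""
    X_tr = np.stack([embed(s) for s, _ in train_pairs])
    y_tr = np.array([y for _, y in train_pairs])
    X_te = np.stack([embed(s) for s, _ in test_pairs])
    y_te = np.array([y for _, y in test_pairs])
    head = Ridge(alpha=alpha).fit(X_tr, y_tr)    # only the head is trained
    rho, _ = spearmanr(head.predict(X_te), y_te)  # rank correlation
    return rho

# Toy usage with synthetic variants of a short sequence.
train = [("MKTAYIAK", 0.1), ("MKTAYIAR", 0.4),
         ("MKTAYIAQ", 0.2), ("MKTAYIAL", 0.9)]
test = [("MKTAYIAE", 0.3), ("MKTAYIAD", 0.7), ("MKTAYIAG", 0.5)]
print(round(few_shot_eval(train, test), 3))
```

Since only the head is fit, adapting to a new assay costs one small least-squares solve rather than end-to-end fine-tuning; this supervised few-shot setup is also why the abstract cautions against direct numerical comparison with zero-shot leaderboard scores.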

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | Cell Systems | 167 | Top 0.7% | 12.8% |
| 2 | Bioinformatics | 1061 | Top 3% | 10.2% |
| 3 | Nature Machine Intelligence | 61 | Top 0.2% | 8.5% |
| 4 | Nature Methods | 336 | Top 1% | 8.3% |
| 5 | Proceedings of the National Academy of Sciences | 2130 | Top 9% | 7.2% |
| 6 | Nature Communications | 4913 | Top 32% | 4.9% |
| 7 | Nature Biotechnology | 147 | Top 2% | 4.3% |
| 8 | Science | 429 | Top 7% | 4.3% |
| 9 | Bioinformatics Advances | 184 | Top 1% | 4.0% |
| 10 | Scientific Reports | 3102 | Top 53% | 1.9% |
| 11 | PLOS Computational Biology | 1633 | Top 14% | 1.9% |
| 12 | Nature | 575 | Top 10% | 1.8% |
| 13 | PLOS ONE | 4510 | Top 55% | 1.7% |
| 14 | Protein Science | 221 | Top 1% | 1.5% |
| 15 | Briefings in Bioinformatics | 326 | Top 4% | 1.5% |
| 16 | BMC Bioinformatics | 383 | Top 5% | 1.2% |
| 17 | Nature Computational Science | 50 | Top 1% | 1.1% |
| 18 | Computational and Structural Biotechnology Journal | 216 | Top 7% | 1.0% |
| 19 | Patterns | 70 | Top 2% | 0.9% |
| 20 | Journal of Molecular Biology | 217 | Top 3% | 0.9% |
| 21 | Genome Research | 409 | Top 4% | 0.9% |
| 22 | Journal of Chemical Information and Modeling | 207 | Top 3% | 0.9% |
| 23 | Genome Biology | 555 | Top 7% | 0.9% |
| 24 | Genome Medicine | 154 | Top 7% | 0.8% |
| 25 | eLife | 5422 | Top 57% | 0.8% |
| 26 | Journal of Cheminformatics | 25 | Top 0.5% | 0.8% |
| 27 | Nucleic Acids Research | 1128 | Top 17% | 0.8% |
| 28 | Biophysical Journal | 545 | Top 5% | 0.8% |
| 29 | ACS Synthetic Biology | 256 | Top 3% | 0.7% |
| 30 | The American Journal of Human Genetics | 206 | Top 4% | 0.7% |