
CDR-aware masked language models for paired antibodies enable state-of-the-art binding prediction

Talaei, M.; Walker, K. C.; Hao, B.; Jolley, E.; Jin, Y.; Kozakov, D.; Misasi, J.; Vajda, S.; Paschalidis, I. C.; Joseph-McCarthy, D.

2025-10-31 bioinformatics
10.1101/2025.10.31.685149 bioRxiv

Antibodies are a leading class of biologics, yet their architecture with conserved framework regions and hypervariable complementarity-determining regions (CDRs) poses unique challenges for computational modeling. We present a region-aware pretraining strategy for paired heavy (VH) and light (VL) sequences in variable domains using ESM2-3B and ESM C (600M) protein language models. We compare three masking strategies: whole-chain, CDR-focused, and a hybrid approach. Through evaluation on binding affinity datasets spanning single-mutant panels and combinatorial mutants, we demonstrate that CDR-focused training produces superior embeddings for functional prediction. Notably, training only on VH-VL pairs proves sufficient, eliminating the need for massive unpaired pretraining that provides no measurable downstream benefit. Our compact 600M ESM C model achieves state-of-the-art performance, matching or exceeding larger antibody-specific baselines. These findings establish a principled framework for antibody language models: prioritize paired sequences with CDR-aware supervision over scale and complex training curricula to achieve both computational efficiency and predictive accuracy.
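The abstract names three masking strategies (whole-chain, CDR-focused, hybrid) without giving their details. The Python sketch below is one possible reading, not the authors' implementation: it only chooses which residue indices to mask. The function name `choose_mask_positions`, the parameters `mask_rate` and `cdr_share`, and the definition of "hybrid" as a fixed split between CDR and framework positions are all assumptions made for illustration.

```python
import random


def choose_mask_positions(seq_len, cdr_spans, strategy,
                          mask_rate=0.15, cdr_share=0.5, seed=None):
    """Pick residue indices to mask under one of three illustrative strategies.

    cdr_spans holds half-open (start, end) index ranges marking CDR residues.
    mask_rate and cdr_share are hypothetical knobs, not values from the paper.
    """
    rng = random.Random(seed)
    cdr = sorted({i for start, end in cdr_spans for i in range(start, end)})
    cdr_set = set(cdr)
    framework = [i for i in range(seq_len) if i not in cdr_set]
    n_mask = max(1, round(mask_rate * seq_len))

    if strategy == "whole_chain":
        # Every residue is equally likely to be masked.
        picked = rng.sample(range(seq_len), min(n_mask, seq_len))
    elif strategy == "cdr_focused":
        # Masks fall only inside the CDR spans.
        picked = rng.sample(cdr, min(n_mask, len(cdr)))
    elif strategy == "hybrid":
        # Assumed mix: a fixed share of masks in CDRs, the rest in framework.
        n_cdr = min(round(cdr_share * n_mask), len(cdr))
        n_fw = min(n_mask - n_cdr, len(framework))
        picked = rng.sample(cdr, n_cdr) + rng.sample(framework, n_fw)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return sorted(picked)


# Toy CDR ranges on a 20-residue sequence; not real antibody numbering.
spans = [(3, 6), (10, 14)]
for name in ("whole_chain", "cdr_focused", "hybrid"):
    print(name, choose_mask_positions(20, spans, name, seed=0))
```

For a paired input, the indices could refer to positions in a concatenated VH-VL sequence, with the CDR spans of both chains listed together; that layout is likewise an assumption.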

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. mAbs: 28 papers in training set; Top 0.1%; 18.7%
2. Cell Systems: 167 papers in training set; Top 1.0%; 10.2%
3. Nature Machine Intelligence: 61 papers in training set; Top 0.3%; 6.9%
4. Bioinformatics: 1061 papers in training set; Top 4%; 4.9%
5. Nature Communications: 4913 papers in training set; Top 32%; 4.9%
6. Frontiers in Immunology: 586 papers in training set; Top 2%; 4.0%
7. PLOS Computational Biology: 1633 papers in training set; Top 10%; 3.6%

50% of probability mass above

8. Advanced Science: 249 papers in training set; Top 5%; 3.6%
9. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 2%; 3.3%
10. Communications Biology: 886 papers in training set; Top 4%; 2.6%
11. Nature Methods: 336 papers in training set; Top 4%; 2.1%
12. PLOS ONE: 4510 papers in training set; Top 53%; 1.7%
13. Journal of Chemical Information and Modeling: 207 papers in training set; Top 2%; 1.7%
14. Scientific Reports: 3102 papers in training set; Top 58%; 1.7%
15. Patterns: 70 papers in training set; Top 0.8%; 1.7%
16. Briefings in Bioinformatics: 326 papers in training set; Top 4%; 1.7%
17. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 33%; 1.7%
18. Nucleic Acids Research: 1128 papers in training set; Top 11%; 1.7%
19. Genome Medicine: 154 papers in training set; Top 5%; 1.3%
20. Antibody Therapeutics: 16 papers in training set; Top 0.3%; 1.2%
21. Science Advances: 1098 papers in training set; Top 25%; 1.0%
22. eLife: 5422 papers in training set; Top 51%; 1.0%
23. Structure: 175 papers in training set; Top 3%; 1.0%
24. Cell Genomics: 162 papers in training set; Top 5%; 0.9%
25. ACS Synthetic Biology: 256 papers in training set; Top 3%; 0.8%
26. Protein Science: 221 papers in training set; Top 2%; 0.8%
27. iScience: 1063 papers in training set; Top 32%; 0.8%
28. Nature Biotechnology: 147 papers in training set; Top 7%; 0.8%
29. Science: 429 papers in training set; Top 19%; 0.8%
30. Journal of Cheminformatics: 25 papers in training set; Top 0.6%; 0.7%
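
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The short Python sketch below does that arithmetic; the helper name `smallest_top_k` is hypothetical, and the values are the first ten percentages from the ranking, which are rounded as displayed.

```python
def smallest_top_k(probabilities, threshold=50.0):
    """Return (k, cumulative mass) for the first k entries reaching threshold."""
    total = 0.0
    for k, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return k, total
    return None  # threshold never reached


# Rounded percentages for ranks 1 to 10, copied from the ranking.
probs = [18.7, 10.2, 6.9, 4.9, 4.9, 4.0, 3.6, 3.6, 3.3, 2.6]
k, mass = smallest_top_k(probs)
print(k, round(mass, 1))
```

With these rounded values the first six journals sum to 49.6% and the seventh brings the total to 53.2%, which agrees with the stated cutoff after rank 7.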