Explicit representation of germline and non-germline residues improves antibody language modeling

Kim, J.; Blalock, N.; Kulkarni, A.; Nakamura, K.; Romero, P. A.

2026-05-11 immunology
10.64898/2026.05.06.723387 bioRxiv
Antibodies originate from germline templates and are diversified by somatic hypermutation, producing sequences in which conserved germline residues scaffold structure while rare non-germline (NGL) substitutions refine antigen binding. Current antibody language models (ALMs) treat all residues equivalently and inherit a germline bias that systematically down-weights functionally critical NGL mutations as statistical noise. We introduce PRISM, a germline-aware ALM that explicitly represents germline (GL) and non-germline residues as distinct token types over a factorized 53-token vocabulary. PRISM achieves state-of-the-art pseudo-perplexity in hypervariable CDRs and is the only compared model that correlates positively with experimental binding affinity across three deep mutational scanning landscapes; all other compared ALMs anti-correlate on them. The dual vocabulary further enables property-specific controllable generation previously unattainable with entangled ALMs: NGL-directed sampling improves physics-based binding scores, while GL-directed sampling preserves stability and solubility. These results establish disentangled germline/non-germline representation as a substantive advance in antibody language modeling.
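
For readers unfamiliar with the pseudo-perplexity metric mentioned in the abstract, the following is a minimal sketch of the standard definition for masked language models: each position is masked in turn, the model's log-probability of the true residue is recorded, and the metric is the exponential of the negative mean of those values. The function name and the example values are hypothetical and are not taken from the paper; how PRISM computes or restricts the metric to CDR positions is not specified here.

```python
import math
from typing import Sequence


def pseudo_perplexity(masked_log_probs: Sequence[float]) -> float:
    """Pseudo-perplexity from per-position log-probabilities.

    Each entry is assumed to be log p(true residue | all other residues),
    obtained by masking that single position (natural log).
    """
    if not masked_log_probs:
        raise ValueError("need at least one position")
    mean_log_prob = sum(masked_log_probs) / len(masked_log_probs)
    return math.exp(-mean_log_prob)


# Hypothetical values: a model that assigns probability 0.5 to the true
# residue at each of four positions.
example_log_probs = [math.log(0.5)] * 4
example_ppl = pseudo_perplexity(example_log_probs)
```

Lower values mean the model assigns higher probability to the observed residues; a value of 1 would correspond to perfect prediction at every scored position.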

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers in training set  Percentile  Probability
1     Cell Systems                                     167                     Top 0.4%    18.4%
2     Science                                          429                     Top 3%      10.0%
3     Nature Communications                            4913                    Top 27%     6.7%
4     Nature Computational Science                     50                      Top 0.1%    6.3%
5     Nature                                           575                     Top 5%      6.2%
6     Proceedings of the National Academy of Sciences  2130                    Top 14%     4.8%
(50% of probability mass above)
7     Nature Methods                                   336                     Top 3%      3.5%
8     Cell Reports                                     1338                    Top 15%     3.5%
9     Nature Biotechnology                             147                     Top 4%      2.1%
10    eLife                                            5422                    Top 38%     1.9%
11    Advanced Science                                 249                     Top 11%     1.7%
12    PLOS Computational Biology                       1633                    Top 16%     1.7%
13    Science Advances                                 1098                    Top 18%     1.7%
14    Immunity                                         58                      Top 3%      1.6%
15    Frontiers in Immunology                          586                     Top 4%      1.6%
16    ACS Nano                                         99                      Top 2%      1.6%
17    Cell                                             370                     Top 13%     1.5%
18    Nature Immunology                                71                      Top 1%      1.5%
19    Nature Medicine                                  117                     Top 3%      1.5%
20    mAbs                                             28                      Top 0.3%    0.9%
21    Communications Biology                           886                     Top 22%     0.8%
22    Nucleic Acids Research                           1128                    Top 17%     0.8%
23    Science Translational Medicine                   111                     Top 6%      0.8%
24    Bioinformatics                                   1061                    Top 10%     0.7%
25    iScience                                         1063                    Top 33%     0.7%
26    Nature Machine Intelligence                      61                      Top 4%      0.7%
27    PLOS ONE                                         4510                    Top 69%     0.7%
28    Scientific Reports                               3102                    Top 78%     0.6%
29    Nature Cancer                                    35                      Top 2%      0.6%
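
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running sum. The sketch below uses the six highest probabilities from the ranking; the function name and the 50% default threshold are illustrative choices, not part of the page.

```python
from itertools import accumulate
from typing import Optional, Sequence

# Predicted probabilities (in percent) for ranks 1-6, copied from the ranking.
top_probs = [18.4, 10.0, 6.7, 6.3, 6.2, 4.8]


def journals_needed(probs: Sequence[float], threshold: float = 50.0) -> Optional[int]:
    """Return how many top-ranked entries are needed for the running
    total to reach the threshold, or None if it is never reached."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


needed = journals_needed(top_probs)
```

The running totals are 18.4, 28.4, 35.1, 41.4, 47.6 and 52.4, so the threshold is first crossed at the sixth journal.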