
Predicting Antibody Self-Association with Sequence Structure Fusion Models: The Central Role of CSI-BLI in Early Developability Screening

Ahmed, S.; Devalle, F.; Leisen, L.; Pham, T.; Amofah, B.; Lee, A.; Hutchinson, M.; Chakiath, C.; DiChiara, J.; Farzandh, S.; Kreitz, M.; Hinton, A.; Mody, N.; Dippel, A.; Kaplan, G.; Pouryahya, M.

2026-04-15 bioinformatics
10.64898/2026.04.13.718222 bioRxiv

Antibody-based biologics are expanding rapidly, yet development challenges arising from self-association, high viscosity, aggregation, and unfavorable clearance underscore the need for accurate in silico screening. Clone self-interaction biolayer interferometry (CSI-BLI) is a plate-based, low-material assay of weak, reversible self-association that serves as an early proxy for high-concentration viscosity and a complementary predictor of in vivo clearance. In a 246-mAb panel, CSI-BLI correlates moderately with viscosity; in hFcRn Tg32 mice (41 antibodies), it associates strongly with clearance. Here, we present an end-to-end framework that distinguishes high- from low-self-interacting clones (CSI-BLI class) by coupling a fine-tuned protein language model (ESM-2) with residue-aligned 3D context from AlphaFold-predicted structures encoded as residue graphs. Disentangled multi-stream attention fuses sequence content, chain-aware positional information, and structural signals to capture interactions between residues that are spatially close but distant in sequence. Edit-distance-controlled splits across 1499 IgGs and 988 VHHs assess generalization. The structure-aware model achieves the highest hold-out performance (VHH-Fc F1 = 0.76; IgG F1 = 0.57), while a sequence-only disentangled variant outperforms a standard PLM baseline without structural inputs. Complementary biophysical feature-based models, built from AlphaFold structures and sequence/structure-derived physicochemical descriptors with cluster-aware selection, deliver robust, interpretable performance (VHH F1 = 0.72; IgG F1 = 0.57), with SHAP analyses highlighting charge/dipole, hydrophobicity, and aggregation-propensity drivers across CDRs and frameworks.
This interaction-aware sequence-structure framework, supported by interpretable feature models, is extensible to other developability endpoints and broader protein classification tasks where joint modeling of language-derived representations and residue-level geometry is advantageous.
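
The abstract does not describe how the edit-distance-controlled splits were built. As an illustration only, the sketch below shows one common way to get such a split: cluster sequences by Levenshtein distance with single linkage, then assign whole clusters to either the training or the hold-out set. The function names, the threshold `max_dist`, the `test_fraction` parameter, and the toy sequences are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_by_edit_distance(seqs, max_dist):
    """Single-linkage clusters: sequences within max_dist edits are linked."""
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(seqs)), 2):
        if levenshtein(seqs[i], seqs[j]) <= max_dist:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_by_cluster(seqs, max_dist, test_fraction):
    """Fill the hold-out set with whole clusters (smallest first), rest to train."""
    clusters = sorted(cluster_by_edit_distance(seqs, max_dist), key=len)
    target = test_fraction * len(seqs)
    train, test = [], []
    for members in clusters:
        (test if len(test) < target else train).extend(members)
    return sorted(train), sorted(test)


# Hypothetical toy fragments, not real antibody data.
toy = ["QVQLVESGG", "QVQLVESGA", "EVQLLESGG", "DIQMTQSPS", "DIQMTQSPA", "AAAAAAAAA"]
train_idx, test_idx = split_by_cluster(toy, max_dist=1, test_fraction=0.3)
```

Because clusters are never divided, every hold-out sequence is more than `max_dist` edits away from every training sequence, which is the property such a split is meant to enforce. The all-pairs comparison is quadratic in the number of sequences, so a dataset of the size quoted above would likely need a faster clustering tool in practice.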

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set; Top 7%; 18.2%
2. Cell Systems: 167 papers in training set; Top 2%; 8.2%
3. Nature Biotechnology: 147 papers in training set; Top 1%; 6.7%
4. Nature Machine Intelligence: 61 papers in training set; Top 0.5%; 6.2%
5. mAbs: 28 papers in training set; Top 0.1%; 4.8%
6. Advanced Science: 249 papers in training set; Top 4%; 4.8%
7. Bioinformatics: 1061 papers in training set; Top 5%; 3.9%
50% of probability mass above
8. Nucleic Acids Research: 1128 papers in training set; Top 5%; 3.9%
9. Briefings in Bioinformatics: 326 papers in training set; Top 2%; 3.5%
10. Nature Methods: 336 papers in training set; Top 3%; 3.5%
11. Cell Reports Methods: 141 papers in training set; Top 1.0%; 3.5%
12. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 27%; 2.3%
13. Patterns: 70 papers in training set; Top 0.6%; 2.0%
14. Nature Chemical Biology: 104 papers in training set; Top 2%; 1.7%
15. Communications Biology: 886 papers in training set; Top 10%; 1.6%
16. Frontiers in Immunology: 586 papers in training set; Top 5%; 1.3%
17. eLife: 5422 papers in training set; Top 48%; 1.3%
18. Cell Reports Medicine: 140 papers in training set; Top 5%; 1.3%
19. PLOS Computational Biology: 1633 papers in training set; Top 20%; 1.2%
20. Genome Medicine: 154 papers in training set; Top 6%; 1.2%
21. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 7%; 0.9%
22. Protein Science: 221 papers in training set; Top 1%; 0.9%
23. ACS Synthetic Biology: 256 papers in training set; Top 3%; 0.8%
24. Structure: 175 papers in training set; Top 3%; 0.7%
25. Cell Genomics: 162 papers in training set; Top 7%; 0.7%
26. Science: 429 papers in training set; Top 20%; 0.7%
27. Nature Computational Science: 50 papers in training set; Top 2%; 0.7%
28. Science Advances: 1098 papers in training set; Top 34%; 0.6%
29. Scientific Reports: 3102 papers in training set; Top 78%; 0.6%
30. Molecular Therapy: 71 papers in training set; Top 3%; 0.6%
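
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below assumes that the last percentage in each entry is that journal's predicted probability (the page does not label it explicitly), and it uses only the first eight values from the list above; the variable names are illustrative.

```python
from itertools import accumulate

# Final percentage of ranks 1-8, copied from the list above
# (assumed to be the predicted probability for each journal).
probabilities = [18.2, 8.2, 6.7, 6.2, 4.8, 4.8, 3.9, 3.9]

# Running total of probability mass by rank.
cumulative = list(accumulate(probabilities))

# Smallest rank whose cumulative mass reaches 50%.
cutoff_rank = next(rank for rank, total in enumerate(cumulative, start=1) if total >= 50.0)

print(cutoff_rank)
```

Under that assumption the first six journals sum to just under 50% and the seventh pushes the total past it, which is consistent with where the "50% of probability mass above" marker sits in the list.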