
Benchmarking Generative Large Language Models for de novo Antibody Design and Agentic Evaluation

Hossain, D.; Abir, F. A.; Zhang, S.; Chen, J. Y.

2026-04-21 bioinformatics
10.64898/2026.04.18.716776 bioRxiv

Despite major advances in computational antibody engineering, no systematic comparison of modern open-source LLM backbone families for antibody sequence generation exists, nor is it known whether architectural differences matter at compact model scales. In this study, five compact transformer variants inspired by prominent open-source LLM families (Llama-4, Gemma-3, DeepSeek-V3, Mistral 7B, and NVIDIA Nemotron-3) were customized for de novo VH single-domain antibody (sdAb) design. All five models were pretrained from scratch on 15 million sequences from the Observed Antibody Space (OAS) database. Pretraining yielded uniformly high generative fidelity across architectures: sequence diversity 0.507-0.516 (CV=0.8%), uniqueness approaching 1.0, and novelty 0.925-0.977 (CV=2.2%). The models were subsequently fine-tuned on disease-stratified repertoires spanning SARS-CoV-2 (n=4,688), HIV (n=430), HER2 (n=22,778), and Ebola virus (n=2,868). Structural assessment of the top-ranked candidates from these case studies via AlphaFold-2, Boltz-2, RoseTTAFold-2, and ESMFold produced mean pLDDT scores of 92.88±1.54 to 93.77±2.16, with no statistically significant inter-model differences (Kruskal-Wallis H=2.06, p>0.05; N=100) in a single-seed experiment. This suggests that, at this compact parameter regime, generative capacity is determined primarily by training data and model scale rather than by family-specific design elements.
Computational docking yielded predicted binding free energies of -36.34 to -65.60 kcal/mol. Independent validation of biological rigor through IMGT-defined CDR-H3 extraction, BLASTp novelty assessment, and NetMHCIIpan 4.3 MHC-II immunogenicity profiling collectively confirmed antigen-binding loop novelty (CDR-H3 identity 0-29% to closest database hits), germline-consistent humanness (77-90% VH germline content), and immunogenically silent antigen-binding surfaces, with no strong MHC-II binders detected across CDR regions in any candidate. We further introduce a proof-of-concept agentic evaluation pipeline leveraging the Model Context Protocol (MCP) with Claude Sonnet 4.6, enabling automated structural profiling and candidate prioritization across disease targets.
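
The abstract reports uniqueness, novelty, and a coefficient of variation (CV) across architectures without giving their formulas. The following is a minimal sketch under common definitions, which are assumptions here and not confirmed by the source: uniqueness as the fraction of distinct generated sequences, novelty as the fraction of distinct generated sequences absent from the training set, and CV as the population standard deviation divided by the mean. The function names and the toy sequences are hypothetical.

```python
from statistics import mean, pstdev


def uniqueness(generated):
    """Fraction of generated sequences that are distinct (assumed definition)."""
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated sequences not seen in training (assumed definition)."""
    unique = set(generated)
    return len(unique - set(training)) / len(unique)


def coefficient_of_variation(values):
    """Population standard deviation over the mean, as a percentage (assumed convention)."""
    return 100.0 * pstdev(values) / mean(values)


# Toy, made-up sequence fragments used only to exercise the functions.
training = ["QVQLVESGGG", "EVQLVESGGG"]
generated = ["QVQLVESGGG", "EVQLLESGGG", "EVQLLESGGG", "DVQLVESGGS"]

u = uniqueness(generated)          # 3 distinct out of 4 generated
n = novelty(generated, training)   # 2 of the 3 distinct sequences are new
cv = coefficient_of_variation([2.0, 4.0])  # mean 3, population std 1
```

A low CV across the five models, such as the 0.8% reported for diversity, means the per-model values cluster tightly around their mean; whether the authors used the population or sample standard deviation is not stated.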

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | mAbs | 28 | Top 0.1% | 14.6%
2 | Nature Communications | 4913 | Top 18% | 10.0%
3 | Cell Systems | 167 | Top 2% | 7.1%
4 | Frontiers in Immunology | 586 | Top 1% | 6.3%
5 | Advanced Science | 249 | Top 3% | 6.2%
6 | Computational and Structural Biotechnology Journal | 216 | Top 1% | 4.3%
7 | Nature Machine Intelligence | 61 | Top 1.0% | 3.6%
50% of probability mass above
8 | Briefings in Bioinformatics | 326 | Top 2% | 3.6%
9 | eLife | 5422 | Top 31% | 2.7%
10 | Journal of Chemical Information and Modeling | 207 | Top 2% | 2.1%
11 | Cell Genomics | 162 | Top 3% | 1.9%
12 | Bioinformatics | 1061 | Top 7% | 1.9%
13 | Nature Biotechnology | 147 | Top 4% | 1.8%
14 | PLOS Computational Biology | 1633 | Top 16% | 1.7%
15 | Nucleic Acids Research | 1128 | Top 11% | 1.7%
16 | Patterns | 70 | Top 0.9% | 1.7%
17 | Communications Biology | 886 | Top 9% | 1.7%
18 | Nature Methods | 336 | Top 4% | 1.7%
19 | Cell Reports Medicine | 140 | Top 4% | 1.5%
20 | Structure | 175 | Top 2% | 1.3%
21 | Antibody Therapeutics | 16 | Top 0.3% | 1.3%
22 | Genome Medicine | 154 | Top 6% | 1.3%
23 | Bioinformatics Advances | 184 | Top 4% | 1.2%
24 | Scientific Reports | 3102 | Top 69% | 0.9%
25 | Protein Science | 221 | Top 1% | 0.9%
26 | Proceedings of the National Academy of Sciences | 2130 | Top 43% | 0.8%
27 | Science | 429 | Top 20% | 0.7%
28 | Science Advances | 1098 | Top 30% | 0.7%
29 | Chemical Science | 71 | Top 2% | 0.7%
30 | PLOS ONE | 4510 | Top 70% | 0.7%
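
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked from the ranking above with a running sum. This is a small illustrative sketch; the function name `ranks_to_reach` is hypothetical, and because the listed percentages are rounded, the cumulative total is only approximate.

```python
def ranks_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at which the running sum first reaches the threshold.

    Returns (None, total) if the threshold is never reached.
    """
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None, total


# Probabilities (in percent) for ranks 1 to 8, copied from the ranking above.
top = [14.6, 10.0, 7.1, 6.3, 6.2, 4.3, 3.6, 3.6]

rank, mass = ranks_to_reach(top)
print(rank, round(mass, 1))  # 7 52.1
```

The first six entries sum to 48.5%, so the seventh is the one that crosses the 50% mark.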