Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Guo, J.

2026-05-04 bioinformatics
10.64898/2026.04.29.721568 bioRxiv

The rapid growth of molecular foundation models and large language models has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint-and-split entries spanning random, Murcko scaffold and structure-separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM-based SAR families. Across 156 fold-mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) wins 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win 3. ML dominates random-split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM-based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family-level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low-data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
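The abstract mentions paired bootstrap analyses without describing the procedure. As a rough illustration only, the sketch below shows one common form of a paired bootstrap on per-fold scores. It assumes the resampling unit is the fold and that higher scores are better; the function name `paired_bootstrap`, its parameters, and the example scores are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap on per-fold score differences (A minus B).

    Returns (observed mean difference,
             (lower, upper) 95% percentile interval,
             fraction of resamples in which A beats B on average).
    Assumption: scores_a[i] and scores_b[i] come from the same fold.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and paired one-to-one")

    rng = random.Random(seed)
    # Pairing: work with per-fold differences so fold difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    boot_means = []
    for _ in range(n_resamples):
        # Resample folds with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(mean(sample))

    boot_means.sort()
    lower = boot_means[int(0.025 * n_resamples)]
    upper = boot_means[int(0.975 * n_resamples) - 1]
    win_rate = sum(m > 0 for m in boot_means) / n_resamples
    return mean(diffs), (lower, upper), win_rate


if __name__ == "__main__":
    # Hypothetical 5-fold scores for two models on one endpoint and split.
    model_a = [0.80, 0.82, 0.79, 0.81, 0.83]
    model_b = [0.75, 0.78, 0.77, 0.76, 0.80]
    diff, interval, win_rate = paired_bootstrap(model_a, model_b)
    print(f"mean difference: {diff:.3f}")
    print(f"95% interval: ({interval[0]:.3f}, {interval[1]:.3f})")
    print(f"fraction of resamples where A > B: {win_rate:.3f}")
```

With only five folds the interval is coarse, which is consistent with the abstract's remark that such analyses support family-level trends more strongly than individual model rankings.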

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. Journal of Chemical Information and Modeling: 207 papers in training set; Top 0.3%; 15.1%
2. Nature Communications: 4913 papers in training set; Top 26%; 7.0%
3. Cell Systems: 167 papers in training set; Top 2%; 6.5%
4. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 13%; 5.0%
5. Nature: 575 papers in training set; Top 5%; 5.0%
6. Science: 429 papers in training set; Top 8%; 4.1%
7. PLOS ONE: 4510 papers in training set; Top 37%; 3.8%
8. Scientific Reports: 3102 papers in training set; Top 34%; 3.7%
9. eLife: 5422 papers in training set; Top 29%; 3.1%

50% of probability mass above

10. PLOS Computational Biology: 1633 papers in training set; Top 12%; 2.8%
11. Nature Methods: 336 papers in training set; Top 3%; 2.8%
12. Journal of Cheminformatics: 25 papers in training set; Top 0.2%; 2.1%
13. Nature Biotechnology: 147 papers in training set; Top 4%; 2.1%
14. Bioinformatics: 1061 papers in training set; Top 6%; 2.1%
15. Nature Genetics: 240 papers in training set; Top 3%; 2.1%
16. Chemical Science: 71 papers in training set; Top 0.6%; 2.1%
17. Artificial Intelligence in the Life Sciences: 11 papers in training set; Top 0.1%; 1.9%
18. Nature Machine Intelligence: 61 papers in training set; Top 2%; 1.9%
19. Briefings in Bioinformatics: 326 papers in training set; Top 4%; 1.5%
20. BMC Bioinformatics: 383 papers in training set; Top 5%; 1.3%
21. Bioinformatics Advances: 184 papers in training set; Top 4%; 1.0%
22. Cell Genomics: 162 papers in training set; Top 5%; 0.9%
23. Patterns: 70 papers in training set; Top 2%; 0.8%
24. Nature Medicine: 117 papers in training set; Top 5%; 0.8%
25. Genome Medicine: 154 papers in training set; Top 8%; 0.8%
26. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 9%; 0.8%
27. Structure: 175 papers in training set; Top 3%; 0.7%
28. Communications Biology: 886 papers in training set; Top 25%; 0.7%
29. Nature Protocols: 30 papers in training set; Top 0.3%; 0.7%
30. The American Journal of Human Genetics: 206 papers in training set; Top 4%; 0.7%