Assessing the Generalizability of Machine Learning and Physics Methods for DNA-Encoded Libraries

Dolorfino, M. D.; Santos Perez, D.; Fu, Y.; Lin, S.-H.; McCarty, S.; O'Meara, M. J.; Sztain, T.

2026-04-19 biophysics
10.64898/2026.04.18.719394 bioRxiv
DNA-encoded libraries (DELs) enable ultra-large screening of billions of molecules simultaneously. However, various limitations of DELs have prompted interest in training machine learning (ML) models on these large datasets to extrapolate predictions to non-DEL compounds. A recent NeurIPS competition revealed that even top-performing ML models trained on DEL data failed to generalize to out-of-distribution (OOD) chemical space. We investigated whether integrating structural modeling could bridge this generalization gap. We systematically assessed state-of-the-art ML, docking, and co-folding methods on three biologically diverse protein targets screened against libraries containing multiple DEL synthesis formats, and show that while ML excels in-distribution, the optimal approach for OOD hit discrimination is both target- and ligand-dependent. We conclude that, regardless of performance reported in aggregated benchmarks, rigorous, system-dependent pilot testing is critical for reliable virtual screening predictions. We provide these workflows and analysis tools in an open-source package: DEL-iver.

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1
Journal of Chemical Information and Modeling
207 papers in training set
Top 0.1%
33.1%
2
Chemical Science
71 papers in training set
Top 0.1%
12.6%
3
Journal of the American Chemical Society
199 papers in training set
Top 1%
6.3%
50% of probability mass above
4
eLife
5422 papers in training set
Top 20%
4.3%
5
Nature Methods
336 papers in training set
Top 2%
4.2%
6
Protein Science
221 papers in training set
Top 0.5%
2.9%
7
PLOS Computational Biology
1633 papers in training set
Top 12%
2.5%
8
Proceedings of the National Academy of Sciences
2130 papers in training set
Top 26%
2.5%
9
Journal of Chemical Theory and Computation
126 papers in training set
Top 0.4%
1.9%
10
Structure
175 papers in training set
Top 2%
1.7%
11
PLOS ONE
4510 papers in training set
Top 56%
1.5%
12
Frontiers in Molecular Biosciences
100 papers in training set
Top 2%
1.3%
13
IUCrJ
29 papers in training set
Top 0.2%
1.2%
14
Nucleic Acids Research
1128 papers in training set
Top 15%
1.0%
15
iScience
1063 papers in training set
Top 24%
1.0%
16
Bioinformatics Advances
184 papers in training set
Top 4%
1.0%
17
Cell Systems
167 papers in training set
Top 10%
1.0%
18
ACS Central Science
66 papers in training set
Top 2%
0.9%
19
Nature
575 papers in training set
Top 15%
0.8%
20
Journal of Medicinal Chemistry
68 papers in training set
Top 1%
0.8%
21
Nature Communications
4913 papers in training set
Top 61%
0.8%
22
ACS Chemical Biology
150 papers in training set
Top 2%
0.7%
23
Patterns
70 papers in training set
Top 2%
0.7%
24
Biophysical Journal
545 papers in training set
Top 5%
0.7%
25
Angewandte Chemie International Edition
81 papers in training set
Top 4%
0.7%
26
Computational and Structural Biotechnology Journal
216 papers in training set
Top 10%
0.7%
27
Nature Biotechnology
147 papers in training set
Top 8%
0.7%
28
The Journal of Physical Chemistry Letters
58 papers in training set
Top 2%
0.6%
29
Molecules
37 papers in training set
Top 2%
0.6%
30
Cell Chemical Biology
81 papers in training set
Top 4%
0.6%
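
The 50% cutoff stated at the top of the list can be checked with a running sum over the listed probabilities. The sketch below is only an illustration: the function name `journals_to_reach` is hypothetical, and the values are copied from the first five entries above rather than taken from the underlying model.

```python
# Predicted probabilities (percent) for the five top-ranked journals,
# copied from the list above. Illustrative only.
probs = [33.1, 12.6, 6.3, 4.3, 4.2]


def journals_to_reach(probs, threshold=50.0):
    """Return (count, cumulative mass) for the shortest prefix reaching threshold.

    Hypothetical helper: walks the ranked list and stops at the first
    entry where the running total meets or exceeds the threshold.
    """
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, round(total, 1)
    # Threshold never reached within the supplied entries.
    return None, round(total, 1)


count, mass = journals_to_reach(probs)
print(count, mass)
```

With these values the first two journals sum to 45.7%, below the cutoff, and adding the third brings the total to 52.0%, which is consistent with the statement that the top 3 journals cover 50% of the probability mass.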