
Small molecule bioactivity benchmarks are often well-predicted by counting cells

Seal, S.; Dee, W.; Shah, A.; Zhang, A.; Titterton, K.; Cabrera, A. A.; Boiko, D.; Beatson, A.; Puigvert, J. C.; Singh, S.; Spjuth, O.; Bender, A.; Carpenter, A. E.

bioRxiv (bioinformatics), 2025-04-30. DOI: 10.1101/2025.04.27.650853

Phenotypic profiling methods, such as Cell Painting and gene expression, have been widely used to predict compound bioactivity, often showing improvement over predictive models based on chemical structures alone. We discovered that a large subset of assays in widely used benchmark datasets either directly relate to cell health and cytotoxicity, or are assays intended to capture a more specific phenotype but whose active compounds impact cell count while inactives do not. As a result, counting cells can achieve predictive performance similar to that of Cell Painting or gene expression data. Filtering benchmarks to include only assays relating to protein targets reveals that Cell Painting can capture information that cannot be predicted by mere cell counting. We re-evaluated three benchmark datasets used with Cell Painting data and observed that, in many cases, cell count models produced an AUC comparable to models using the full Cell Painting profiles. However, in protein-target-specific benchmarks across 17 distinct protein targets, Cell Painting features demonstrated unique predictive power, outperforming cell count models in mean balanced accuracy by a relative 19.6%. We propose five practical recommendations for benchmarking machine learning models for predicting bioactivity, including using cell count as a baseline feature. Although multi-class classification applications (such as matching samples based on their morphological profile) are less likely to be predictable by cell count than bioactivity benchmarks, these recommendations are broadly applicable to machine learning for drug discovery.
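The paper's key recommendation is to report a cell-count-only baseline alongside any full-profile model. A minimal sketch of that comparison, using synthetic data (the confound strengths, feature counts, and variable names here are illustrative assumptions, not values from the paper):

```python
# Sketch: compare a cell-count-only baseline against a full-profile model
# on a synthetic assay where actives reduce cell count (a cytotoxicity
# confound of the kind the paper describes). All data here are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Binary activity labels; actives (y=1) shift cell count downward.
y = rng.integers(0, 2, n)
cell_count = rng.normal(loc=1000 - 300 * y, scale=120, size=n)

# Synthetic morphology profile: one weakly informative feature plus noise.
profile = rng.normal(size=(n, 50))
profile[:, 0] += 0.8 * y
full = np.column_stack([cell_count, profile])

Xc_tr, Xc_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
    cell_count.reshape(-1, 1), full, y, test_size=0.3, random_state=0
)

auc_count = roc_auc_score(
    y_te,
    LogisticRegression(max_iter=1000).fit(Xc_tr, y_tr).predict_proba(Xc_te)[:, 1],
)
auc_full = roc_auc_score(
    y_te,
    LogisticRegression(max_iter=1000).fit(Xf_tr, y_tr).predict_proba(Xf_te)[:, 1],
)
print(f"cell-count baseline AUC: {auc_count:.2f}")
print(f"full-profile AUC:        {auc_full:.2f}")
```

On data like this the cell-count baseline alone reaches a high AUC, so a full-profile model that merely matches it has added nothing; the paper's protein-target-filtered benchmarks are where the gap between the two becomes meaningful.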

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Journal of Chemical Information and Modeling · 207 papers in training set · Top 0.1% · 28.5%
2. Journal of Cheminformatics · 25 papers in training set · Top 0.1% · 12.7%
3. Scientific Reports · 3102 papers in training set · Top 22% · 5.0%
4. PLOS Computational Biology · 1633 papers in training set · Top 8% · 4.1%
(50% of probability mass above this line)
5. Artificial Intelligence in the Life Sciences · 11 papers in training set · Top 0.1% · 4.1%
6. PLOS ONE · 4510 papers in training set · Top 38% · 3.7%
7. Proceedings of the National Academy of Sciences · 2130 papers in training set · Top 19% · 3.7%
8. Briefings in Bioinformatics · 326 papers in training set · Top 3% · 2.1%
9. Bioinformatics · 1061 papers in training set · Top 7% · 1.9%
10. eLife · 5422 papers in training set · Top 37% · 1.9%
11. Nature Communications · 4913 papers in training set · Top 48% · 1.9%
12. BMC Bioinformatics · 383 papers in training set · Top 4% · 1.8%
13. SLAS Discovery · 25 papers in training set · Top 0.1% · 1.7%
14. Communications Chemistry · 39 papers in training set · Top 0.4% · 1.4%
15. Cell Systems · 167 papers in training set · Top 8% · 1.4%
16. BMC Genomics · 328 papers in training set · Top 3% · 1.3%
17. Frontiers in Molecular Biosciences · 100 papers in training set · Top 4% · 0.9%
18. Computational and Structural Biotechnology Journal · 216 papers in training set · Top 7% · 0.9%
19. International Journal of Molecular Sciences · 453 papers in training set · Top 12% · 0.9%
20. Journal of Medicinal Chemistry · 68 papers in training set · Top 1% · 0.8%
21. Molecules · 37 papers in training set · Top 2% · 0.8%
22. Frontiers in Pharmacology · 100 papers in training set · Top 4% · 0.8%
23. Patterns · 70 papers in training set · Top 2% · 0.8%
24. Bioinformatics Advances · 184 papers in training set · Top 5% · 0.7%
25. NAR Genomics and Bioinformatics · 214 papers in training set · Top 4% · 0.7%
26. Nature Machine Intelligence · 61 papers in training set · Top 4% · 0.7%
27. Chemical Science · 71 papers in training set · Top 2% · 0.7%
28. Communications Biology · 886 papers in training set · Top 28% · 0.7%
29. Clinical Pharmacology & Therapeutics · 25 papers in training set · Top 1% · 0.5%
30. Frontiers in Bioinformatics · 45 papers in training set · Top 1% · 0.5%