
TwinSAR: An Adaptive Kernel-based Algorithm with logit-transformed Z-score Filtering for Chemical Twin Detection in Large-scale Virtual Screening

Kulosmanovic, H.; Uguz, C.; Durdagi, S.

2026-05-15 bioinformatics
10.64898/2026.05.12.724687 bioRxiv

Molecular similarity searching is a workhorse of cheminformatics, but the dominant Tanimoto/topological-fingerprint paradigm has well-known blind spots: it is highly sensitive to molecular size, suffers from steep activity cliffs, and frequently fails to retrieve scaffold-hopping bioisosteres. A complementary descriptor that has received comparatively little attention is global elemental composition. Despite the conceptual simplicity of comparing molecules by their elemental ratios, no widely deployed method exists for the statistically rigorous identification of "chemical twins" defined by stoichiometric proximity. We address this gap with TwinSAR (Stoichiometric Analysis and Retrieval), an adaptive kernel-based algorithm that combines three methodological innovations: (i) binary fingerprint blocking, which partitions molecules by element-presence patterns and reduces the cost of all-pairs comparison from O(NM) to O(Σ_i n_i m_i), enabling million- to billion-scale searches; (ii) a per-block adaptive radial basis function (RBF) kernel whose precision parameter is calibrated independently for each fingerprint block via the median heuristic, providing fair similarity comparison across chemical sub-spaces of vastly different density; and (iii) a logit-transformed Z-score filter that maps bounded RBF scores onto an unbounded scale, allowing high-similarity pairs to be prioritized relative to the empirical score distribution of their own fingerprint block. TwinSAR is offered in two operating modes: (i) a deterministic BULK mode for exact reproducibility; and (ii) a stochastic FAST mode that achieved a 3.29x wall-clock speed-up in the present benchmark while preserving similar unique-query and unique-target coverage.
Statistical validation showed that detected twin pairs are 12.7x more similar in absolute ratio space than block-matched random pairs (p < 0.001), while a column-permutation negative control returned a median of zero spurious twins across three independent permutations. A controlled benchmark further established that an 8-element representation (single-element heavy-atom ratios) is sensitivity-equivalent to a comprehensive 254-element representation while running 3.55x faster. As a case study, TwinSAR was deployed in an end-to-end virtual screening pipeline against the BCL-2 target protein, where it reduced a 327,071-compound commercial library to a focused panel of 390 candidates. The chemical interpretability of the retrieved twins is illustrated by their structural diversity around conserved heavy-atom skeletons. TwinSAR therefore provides a fast, conformation-free, and statistically principled prefilter that is fully orthogonal to topological fingerprints.
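The three steps named in the abstract (element-presence blocking, a per-block RBF kernel set by the median heuristic, and a logit-transformed Z-score filter) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The four-element ratio vectors, the function names, the kernel form exp(-gamma * d^2) with gamma = 1 / (2 * median d^2), the clipping constant `eps`, and the cut-off `z_cut` are all assumptions made for illustration only.

```python
import numpy as np
from collections import defaultdict

def block_key(ratios):
    """Binary element-presence pattern used as the blocking key."""
    return tuple(bool(r > 0) for r in ratios)

def group_by_block(X):
    """Map each presence pattern to the row indices that share it."""
    blocks = defaultdict(list)
    for i, row in enumerate(X):
        blocks[block_key(row)].append(i)
    return blocks

def block_scores(Q, T):
    """RBF scores between the query rows Q and target rows T of one block.

    Assumption: gamma = 1 / (2 * median of the non-zero squared distances).
    """
    d2 = ((Q[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
    positive = d2[d2 > 0]
    med = np.median(positive) if positive.size else 1.0
    return np.exp(-d2 / (2.0 * med))

def logit_z(scores, eps=1e-9):
    """Logit-transform bounded scores, then standardize within the block."""
    s = np.clip(scores, eps, 1.0 - eps)  # keep the logit finite at s = 1
    logits = np.log(s / (1.0 - s))
    sd = logits.std()
    if sd == 0:
        return np.zeros_like(logits)
    return (logits - logits.mean()) / sd

def find_twins(queries, targets, z_cut=2.0):
    """Return (query index, target index, z) for pairs above the cut-off."""
    qb, tb = group_by_block(queries), group_by_block(targets)
    twins = []
    for key, qi in qb.items():
        ti = tb.get(key)
        if not ti:
            continue  # no target shares this presence pattern
        z = logit_z(block_scores(queries[qi], targets[ti]))
        for a, b in zip(*np.where(z > z_cut)):
            twins.append((qi[a], ti[b], float(z[a, b])))
    return twins

# Toy data: columns are hypothetical C, N, O, S atom fractions.
queries = np.array([[0.6, 0.2, 0.2, 0.0]])
targets = np.array([
    [0.6, 0.2, 0.2, 0.00],   # same composition as the query
    [0.8, 0.1, 0.1, 0.00],
    [0.4, 0.3, 0.3, 0.00],
    [0.5, 0.4, 0.1, 0.00],
    [0.7, 0.1, 0.2, 0.00],
    [0.3, 0.3, 0.4, 0.00],
    [0.6, 0.2, 0.19, 0.01],  # contains S, so it falls in another block
])
twins = find_twins(queries, targets)
```

On this toy input only the pair (query 0, target 0) exceeds the cut-off. The sulfur-containing target is never compared with the query, despite being numerically close, because blocking restricts comparisons to molecules with the same element-presence pattern. The Z-score of an exact match depends strongly on `eps`, so a real implementation would need a principled choice there.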

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Journal of Chemical Information and Modeling (207 papers in training set; Top 0.1%): 46.4%
2. Advanced Science (249 papers in training set; Top 3%): 5.1%
50% of probability mass above
3. Nature Communications (4913 papers in training set; Top 35%): 4.5%
4. Journal of Cheminformatics (25 papers in training set; Top 0.1%): 4.5%
5. Chemical Science (71 papers in training set; Top 0.3%): 3.7%
6. Briefings in Bioinformatics (326 papers in training set; Top 2%): 3.7%
7. Computational and Structural Biotechnology Journal (216 papers in training set; Top 2%): 2.9%
8. Communications Chemistry (39 papers in training set; Top 0.1%): 2.7%
9. Bioinformatics (1061 papers in training set; Top 6%): 2.7%
10. Journal of Medicinal Chemistry (68 papers in training set; Top 0.6%): 1.8%
11. PLOS ONE (4510 papers in training set; Top 57%): 1.4%
12. International Journal of Molecular Sciences (453 papers in training set; Top 9%): 1.4%
13. Molecules (37 papers in training set; Top 1%): 1.0%
14. Nature Methods (336 papers in training set; Top 6%): 0.9%
15. Scientific Reports (3102 papers in training set; Top 72%): 0.8%
16. Nucleic Acids Research (1128 papers in training set; Top 16%): 0.8%
17. Artificial Intelligence in the Life Sciences (11 papers in training set; Top 0.2%): 0.8%
18. PLOS Computational Biology (1633 papers in training set; Top 24%): 0.8%
19. Patterns (70 papers in training set; Top 2%): 0.8%
20. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 45%): 0.8%
21. Nature Biotechnology (147 papers in training set; Top 8%): 0.8%
22. Cell Systems (167 papers in training set; Top 15%): 0.5%
23. Bioinformatics Advances (184 papers in training set; Top 5%): 0.5%
24. Communications Biology (886 papers in training set; Top 31%): 0.5%