Benchmarking siRNA Prediction: The Role of Representation and Validation Strategies

Karmakar, A.; Merii, A.; Weir, A.; Kudla, G.; Basham, M.; Lubbock, A.

2026-05-14 bioinformatics
10.64898/2026.05.12.724560 bioRxiv
Small interfering RNAs (siRNAs) offer transformative potential for targeted therapeutics, yet the design of highly effective and non-toxic candidates is hindered by the risk of off-target effects and RNA instability. A critical flaw in in silico prediction models is pervasive data leakage in cross-validation protocols, which artificially inflates performance metrics and produces untrustworthy results. To address this, we developed a rigorous framework that eliminates data leakage through strict cross-validation, leverages z-curves (3D representations of RNA physico-chemical properties) for context-aware sequence encoding, and identifies the sequence regions critical for efficacy. Our model achieves an AUC of 0.845 on leakage-free validation, surpassing prior work while running 380x faster, demonstrating that superior representation trumps model complexity. Crucially, we demonstrate how experimental variability and cross-validation choices directly impact model reliability, establishing the first benchmarked methods for robust siRNA efficacy prediction. This work provides a foundation for trustworthy sequence design and validation in RNA therapeutics.
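
The abstract does not spell out how the z-curve encoding is computed, so the following is only a minimal sketch of the standard z-curve transform, adapted to RNA by letting U take the place of T. The function name `z_curve` and the choice to return cumulative coordinates are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Per-base steps of the standard z-curve transform (assumed here; U replaces T).
#   x: purine (A, G) = +1, pyrimidine (C, U) = -1
#   y: amino  (A, C) = +1, keto       (G, U) = -1
#   z: weak H-bond (A, U) = +1, strong H-bond (G, C) = -1
_STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "U": (-1, -1, 1),
}


def z_curve(seq: str) -> List[Tuple[int, int, int]]:
    """Return the cumulative (x, y, z) coordinates for each position of seq.

    Hypothetical helper: accepts RNA or DNA letters (T is treated as U)
    and raises ValueError on any other symbol.
    """
    x = y = z = 0
    points: List[Tuple[int, int, int]] = []
    for base in seq.upper().replace("T", "U"):
        if base not in _STEPS:
            raise ValueError(f"unsupported base: {base!r}")
        dx, dy, dz = _STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    # A 4-mer containing each base once ends back at the origin.
    print(z_curve("AUGC"))
```

Each position contributes one 3D point, so a 19- to 21-nt siRNA guide strand would map to a short trajectory that can be flattened into a fixed-length feature vector. Whether the authors use cumulative coordinates, per-position steps, or another variant is not stated in the abstract.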

Matching journals

The top 5 journals together account for just over 50% (52.2%) of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     Nucleic Acids Research                              1128                    Top 1.0%    14.2%
2     Cell Systems                                        167                     Top 0.7%    14.2%
3     Nature Communications                               4913                    Top 19%     10.0%
4     Nature Biotechnology                                147                     Top 0.9%    9.0%
5     PLOS Computational Biology                          1633                    Top 7%      4.8%
      (50% of probability mass above)
6     Bioinformatics                                      1061                    Top 6%      3.2%
7     Molecular Therapy Nucleic Acids                     32                      Top 0.2%    2.8%
8     PLOS ONE                                            4510                    Top 46%     2.3%
9     Advanced Science                                    249                     Top 9%      2.1%
10    Briefings in Bioinformatics                         326                     Top 3%      2.1%
11    Nature Machine Intelligence                         61                      Top 2%      1.9%
12    Journal of Chemical Information and Modeling        207                     Top 2%      1.9%
13    Scientific Reports                                  3102                    Top 59%     1.7%
14    NAR Genomics and Bioinformatics                     214                     Top 2%      1.7%
15    Cell Genomics                                       162                     Top 3%      1.7%
16    Computational and Structural Biotechnology Journal  216                     Top 5%      1.6%
17    Proceedings of the National Academy of Sciences     2130                    Top 35%     1.5%
18    Bioinformatics Advances                             184                     Top 3%      1.3%
19    The American Journal of Human Genetics              206                     Top 3%      1.2%
20    Communications Biology                              886                     Top 15%     1.2%
21    Genomics, Proteomics & Bioinformatics               171                     Top 5%      1.1%
22    Nature Methods                                      336                     Top 6%      0.9%
23    Nano Letters                                        63                      Top 3%      0.7%
24    eLife                                               5422                    Top 59%     0.7%
25    iScience                                            1063                    Top 33%     0.7%
26    Molecular Therapy - Nucleic Acids                   24                      Top 0.4%    0.7%
27    Cell Reports Methods                                141                     Top 6%      0.7%
28    BMC Bioinformatics                                  383                     Top 7%      0.7%
29    ACS Synthetic Biology                               256                     Top 4%      0.6%