Benchmarking siRNA Prediction: The Role of Representation and Validation Strategies
Karmakar, A.; Merii, A.; Weir, A.; Kudla, G.; Basham, M.; Lubbock, A.
Small interfering RNAs (siRNAs) offer transformative potential for targeted therapeutics, yet the design of highly effective and non-toxic candidates is hindered by the risk of off-target effects and RNA instability. A critical flaw of in silico prediction models is pervasive data leakage in cross-validation protocols, which artificially inflates performance metrics and produces untrustworthy results. To address this, we developed a rigorous framework that eliminates data leakage through strict cross-validation, leverages z-curves (3D representations of RNA physico-chemical properties) for context-aware sequence encoding, and identifies the sequence regions most critical for efficacy. Our model achieves an AUC of 0.845 on leakage-free validation, surpassing prior work while running 380x faster, demonstrating that superior representation trumps model complexity. Crucially, we show how experimental variability and cross-validation choices directly affect model reliability, establishing the first benchmarked methods for robust siRNA efficacy prediction. This work provides a foundation for trustworthy sequence design and validation in RNA therapeutics.
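For readers unfamiliar with z-curves, the following is a minimal sketch of the standard cumulative z-curve transform applied to an RNA sequence. It is not the authors' implementation: the function name `z_curve`, the table `Z_STEPS`, and the choice to map T to U are assumptions made for illustration, and the abstract does not state which z-curve variant or feature layout the model uses.

```python
from typing import List, Tuple

# Per-base step along the three z-curve axes (standard definition):
#   x: purine (A, G) = +1, pyrimidine (C, U) = -1
#   y: amino  (A, C) = +1, keto       (G, U) = -1
#   z: weak H-bond (A, U) = +1, strong H-bond (G, C) = -1
Z_STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "U": (-1, -1, 1),
}


def z_curve(seq: str) -> List[Tuple[int, int, int]]:
    """Return the cumulative (x, y, z) coordinates after each base.

    Hypothetical helper for illustration only. DNA-style input is
    accepted by treating T as U (an assumption, not from the paper).
    """
    x = y = z = 0
    points: List[Tuple[int, int, int]] = []
    for base in seq.upper().replace("T", "U"):
        if base not in Z_STEPS:
            raise ValueError(f"unsupported base: {base!r}")
        dx, dy, dz = Z_STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


# Each position contributes one 3D point, so a sequence of length n
# yields n points (3n values once flattened into a feature vector).
coords = z_curve("AUGC")
```

Because each point accumulates the contributions of all preceding bases, the coordinate at a given position depends on the sequence before it, which is one plausible reading of the "context-aware" encoding described above.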