Explainable machine learning reveals an RBP regulatory logic of exon skipping

Raghav, Y.; Paul, A.; Anderson, R.; Karthyk, S.; Iturralde, A.; Vyas, J.; Dy, J.; Jones, B. C.; Castaldi, P. J.; Platig, J.

2026-05-30 systems biology
10.64898/2026.05.29.728731 bioRxiv
RNA-binding proteins (RBPs) regulate the life cycle of an mRNA, often through RBP-RNA interactions. This life cycle includes splicing, in which the intronic sequences of a pre-mRNA are removed and the exons are joined together. However, the patterns of RBP binding that lead to different splicing outcomes remain incompletely understood. Here, we build machine learning models from RBP-RNA binding and knockdown RNA-seq data for over 168 RBPs in two cell lines (HepG2 and K562) to better understand the binding patterns that predict exon skipping, the predominant form of alternative splicing in humans. We show that models trained exclusively on RBP binding patterns are predictive and that a more sophisticated machine learning model (XGBoost) outperforms simpler linear models. We also extract a biologically interpretable logic embedded in these models. We show that SHAP, a machine learning explainability technique, captures position-specific activating and repressive behavior of RBP binding. In addition, we find that SHAP values are predictive of changes in unseen splicing events and that SHAP interactions between pairs of RBPs are predictive of protein-protein interactions. Our results demonstrate that combining machine learning with interpretability techniques can reveal a regulatory logic of RBP binding. By estimating the impact of an RBP binding site on a splicing event, the SHAP values also provide a directly testable scientific hypothesis. We anticipate that models designed around biological processes and focused on interpretability will yield actionable biological insights both in splicing and in genomics generally.
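
To make concrete what a per-feature SHAP attribution means, the following is a minimal sketch that computes exact Shapley values, the quantity SHAP estimates, for a toy exon-skipping score. It is not the authors' pipeline: the feature names, the coefficients in `toy_model`, and the single all-unbound baseline are all assumptions invented for illustration, and the real SHAP library averages over a background dataset rather than one reference point.

```python
from itertools import combinations
from math import factorial

# Hypothetical binary features: 1 if the RBP binds in that region, else 0.
FEATURES = ["RBP_A_upstream_intron", "RBP_B_exon", "RBP_C_downstream_intron"]


def toy_model(x):
    """Hypothetical skipping score; every coefficient here is invented."""
    a = x["RBP_A_upstream_intron"]
    b = x["RBP_B_exon"]
    c = x["RBP_C_downstream_intron"]
    # A and C push toward skipping (with an A-C interaction); B pushes against.
    return 0.2 + 0.3 * a - 0.25 * b + 0.1 * c + 0.15 * a * c


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                # Features in the coalition take their observed value,
                # all remaining features stay at the baseline value.
                without_f = {g: (x[g] if g in subset else baseline[g]) for g in FEATURES}
                with_f = dict(without_f)
                with_f[f] = x[f]
                total += weight * (model(with_f) - model(without_f))
        phi[f] = total
    return phi


x = {f: 1 for f in FEATURES}         # all three hypothetical sites bound
baseline = {f: 0 for f in FEATURES}  # assumed reference: nothing bound
phi = shapley_values(toy_model, x, baseline)

for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

In this toy setting a positive value marks a binding site that raises the skipping score relative to the baseline and a negative value marks one that lowers it, and the values sum to the difference between the model output at `x` and at `baseline`. The interaction term is split evenly between the two features involved.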

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology | 1633 papers in training set | Top 1% | 17.7%
2. Cell Systems | 167 papers in training set | Top 1% | 9.6%
3. Bioinformatics Advances | 184 papers in training set | Top 0.3% | 8.0%
4. Bioinformatics | 1061 papers in training set | Top 4% | 6.5%
5. BMC Bioinformatics | 383 papers in training set | Top 2% | 6.0%
6. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 16% | 4.6%

50% of probability mass above

7. Nature Communications | 4913 papers in training set | Top 35% | 4.4%
8. Genome Biology | 555 papers in training set | Top 3% | 3.4%
9. Scientific Reports | 3102 papers in training set | Top 39% | 3.4%
10. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 2% | 3.1%
11. Nucleic Acids Research | 1128 papers in training set | Top 7% | 2.9%
12. iScience | 1063 papers in training set | Top 8% | 2.5%
13. npj Genomic Medicine | 33 papers in training set | Top 0.2% | 2.3%
14. Frontiers in Genetics | 197 papers in training set | Top 5% | 1.6%
15. eLife | 5422 papers in training set | Top 44% | 1.6%
16. Communications Biology | 886 papers in training set | Top 10% | 1.6%
17. Genome Research | 409 papers in training set | Top 3% | 1.6%
18. Biophysical Journal | 545 papers in training set | Top 3% | 1.4%
19. The American Journal of Human Genetics | 206 papers in training set | Top 3% | 1.3%
20. PLOS ONE | 4510 papers in training set | Top 61% | 1.2%
21. PLOS Genetics | 756 papers in training set | Top 12% | 1.1%
22. Frontiers in Molecular Biosciences | 100 papers in training set | Top 5% | 0.8%
23. Cell Reports Methods | 141 papers in training set | Top 6% | 0.7%
24. Journal of Molecular Biology | 217 papers in training set | Top 4% | 0.7%
25. Cell Reports | 1338 papers in training set | Top 35% | 0.7%
26. Science Advances | 1098 papers in training set | Top 33% | 0.7%
27. Physical Biology | 43 papers in training set | Top 3% | 0.6%
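
The 50% cutoff marked in the list above can be checked with a short cumulative sum. This is only a sketch: the function name `journals_to_reach` is hypothetical, and it assumes the cutoff is placed at the first rank where the running total of predicted probabilities reaches the threshold, which the page does not state explicitly.

```python
# (journal, predicted probability in percent), copied from the top of the ranked list
ranked = [
    ("PLOS Computational Biology", 17.7),
    ("Cell Systems", 9.6),
    ("Bioinformatics Advances", 8.0),
    ("Bioinformatics", 6.5),
    ("BMC Bioinformatics", 6.0),
    ("Proceedings of the National Academy of Sciences", 4.6),
    ("Nature Communications", 4.4),
]


def journals_to_reach(ranked, threshold):
    """Return (count, cumulative mass) at the first rank whose running total meets threshold."""
    cumulative = 0.0
    for count, (_, prob) in enumerate(ranked, start=1):
        cumulative += prob
        if cumulative >= threshold:
            return count, cumulative
    raise ValueError("threshold not reached by the listed journals")


count, mass = journals_to_reach(ranked, 50.0)
print(f"{count} journals cover {mass:.1f}% of the predicted probability mass")
```

With the listed values, the running total is 47.8% after five journals and passes 50% at the sixth, consistent with where the cutoff line sits.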