
Uncertainty-aware benchmarking reveals ambiguous transcripts in mRNA-lncRNA classification

Garcia-Ruano, D.; Georges, M.; Mohanty, S. K.; Baaziz, R.; Makova, K. D.; Nikolski, M.; Chalopin, D.

2026-04-17 bioinformatics
10.64898/2026.04.14.718168 bioRxiv

Background: Long non-coding RNAs (lncRNAs) have gained significant attention in recent years, yet distinguishing them from protein-coding transcripts remains challenging. Indeed, many lncRNAs share mRNA-like processing and existing sequence-derived signals do not fully capture the coding/non-coding boundary. Recent GENCODE annotation efforts revealed tens of thousands of novel lncRNA sequences as well as the reclassification of some lncRNAs into the protein-coding class, highlighting the need to better characterize transcript features associated with classification uncertainty and errors.

Results: We performed uncertainty-aware benchmarking by retraining and evaluating eight transcript classifiers under a controlled protocol on a label-stable GENCODE v46-v47 subset. Beyond conventional model evaluation metrics, we quantified inter-tool agreement and entropy-based uncertainty to stratify transcripts into consensus, discordant, and consensus-error groups. To expand standard sequence and ORF-derived signals, we incorporated repeat-derived features from mature transcripts and non-B DNA motif features across gene bodies. Although aggregate performance was high, ~45% of transcripts showed inter-tool discordance, particularly among lncRNAs. Feature analyses linked low-uncertainty predictions to strong coding-like signals, whereas high-uncertainty profiles exhibited mixed signatures. Alongside classical predictors in global importance analyses, repeat-derived features appear as main contributors.

Conclusions: By combining controlled benchmarking with transcript-level agreement and uncertainty stratification, together with extended feature profiling, we identified patterns associated with classifier disagreement and misclassification. This novel framework provides practical guidance for interpreting predictions, motivating the development of more robust coding/non-coding classifiers, while also shedding light on the sequence properties that distinguish lncRNA sequences.
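The abstract does not define how agreement and entropy are computed, so the following is only a minimal sketch of one plausible reading. It assumes each tool casts a binary vote per transcript (1 = coding, 0 = non-coding), that uncertainty is the binary Shannon entropy of the vote fraction, and that "consensus" means all tools agree with the reference label. The function names `vote_entropy` and `stratify` are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def vote_entropy(votes: Sequence[int]) -> float:
    """Binary Shannon entropy (in bits) of the fraction of tools voting 'coding' (1).

    Assumption: one 0/1 vote per tool; the paper's exact definition may differ.
    """
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0  # all tools agree, so there is no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def stratify(votes: Sequence[int], true_label: int) -> str:
    """Assign a transcript to one of three groups (hypothetical rule).

    - 'discordant': the tools do not all give the same call
    - 'consensus': all tools agree and match the reference label
    - 'consensus-error': all tools agree but contradict the reference label
    """
    if len(set(votes)) > 1:
        return "discordant"
    return "consensus" if votes[0] == true_label else "consensus-error"


# Example: eight tools, four of which call the transcript coding.
split_votes = [1, 1, 1, 1, 0, 0, 0, 0]
print(vote_entropy(split_votes), stratify(split_votes, true_label=0))
```

Under this rule an even split among the eight tools gives the maximum entropy of 1 bit, while unanimous calls give 0 bits regardless of whether they are correct, which is why the consensus-error group needs the reference label to be identified.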

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics · 383 papers in training set · Top 0.1% · 22.9%
2. Bioinformatics · 1061 papers in training set · Top 2% · 12.7%
3. NAR Genomics and Bioinformatics · 214 papers in training set · Top 0.1% · 9.3%
4. Computational and Structural Biotechnology Journal · 216 papers in training set · Top 0.4% · 6.9%
50% of probability mass above
5. PLOS Computational Biology · 1633 papers in training set · Top 5% · 6.4%
6. Bioinformatics Advances · 184 papers in training set · Top 0.6% · 4.9%
7. Scientific Reports · 3102 papers in training set · Top 22% · 4.9%
8. GigaScience · 172 papers in training set · Top 0.4% · 4.0%
9. Nature Communications · 4913 papers in training set · Top 44% · 2.6%
10. PLOS ONE · 4510 papers in training set · Top 47% · 2.1%
11. Nucleic Acids Research · 1128 papers in training set · Top 8% · 2.1%
12. Genome Biology · 555 papers in training set · Top 4% · 1.9%
13. Briefings in Bioinformatics · 326 papers in training set · Top 4% · 1.7%
14. Database · 51 papers in training set · Top 0.5% · 1.5%
15. Nature Methods · 336 papers in training set · Top 5% · 1.5%
16. RNA Biology · 70 papers in training set · Top 0.3% · 1.2%
17. iScience · 1063 papers in training set · Top 26% · 0.9%
18. BioData Mining · 15 papers in training set · Top 0.7% · 0.8%
19. BMC Genomics · 328 papers in training set · Top 6% · 0.7%
20. Communications Biology · 886 papers in training set · Top 28% · 0.7%
21. Scientific Data · 174 papers in training set · Top 3% · 0.7%
22. BMC Biology · 248 papers in training set · Top 6% · 0.7%
23. PeerJ · 261 papers in training set · Top 19% · 0.5%
24. Computers in Biology and Medicine · 120 papers in training set · Top 6% · 0.5%
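
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below uses the first eight values from the list; the helper name `journals_to_reach` is hypothetical and is not part of the site that produced these predictions.

```python
from itertools import accumulate
from typing import Optional, Sequence

# Predicted probabilities (%) of the eight top-ranked journals, copied from the list.
probs = [22.9, 12.7, 9.3, 6.9, 6.4, 4.9, 4.9, 4.0]


def journals_to_reach(values: Sequence[float], threshold: float = 50.0) -> Optional[int]:
    """Return how many top-ranked entries are needed for the running sum to reach threshold.

    Returns None if the threshold is never reached.
    """
    for rank, total in enumerate(accumulate(values), start=1):
        if total >= threshold:
            return rank
    return None


print(journals_to_reach(probs))
```

The running sum after three journals is about 44.9%, and the fourth journal brings it to about 51.8%, so the 50% threshold is crossed at rank 4.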