Expanding the RNA Virus Universe by Scalable Structure-Guided Discovery
Luo, G.; Zang, Z.; Yuan, L.; Zhou, J.; Dong, A.; Huang, Y.; Li, S. Z.; Ju, F.
The discovery of RNA viruses from metatranscriptomic data remains challenging due to their extreme sequence divergence and frequent lack of conserved motifs. We present Rider, a lightweight two-stage framework that couples fast, structure-informed sequence screening with targeted structural validation. Stage 1 uses a compact 35M-parameter protein language model to prioritize RdRp-like fragments at whole-sample scale, achieving over 44x higher end-to-end screening throughput on commodity hardware. Stage 2 applies structure prediction and Foldseek-based alignment against a dedicated RdRp structure resource (~200k ESMFold-predicted structures), providing orthogonal evidence for remote homologs. Applied to >10,000 metatranscriptomes spanning marine, freshwater, soil and host-associated microbiomes, Rider matches or outperforms leading tools (e.g., LucaProt, PalmScan) and additionally recovers divergent and truncated sequences. Multiple orthogonal indicators, including structure consistency and low DNA read mapping to corresponding contigs, support genuine RNA origin. In a human IBD cohort, Rider agrees with state-of-the-art calls for clinically relevant RNA viruses while extending discovery to divergent lineages. Rider turns structure-guided homology search into a practical, scalable pipeline for RNA virome discovery.

Highlights
- A two-stage framework enables structure-guided RNA virus discovery at sample scale, achieving up to 44-fold higher throughput on standard computing hardware.
- The method matches or surpasses LucaProt and PalmScan across >10,000 metatranscriptomes from diverse environments, while recovering RdRp fragments missed by existing tools.
- Structural validation using ~200,000 ESMFold-predicted RdRp models and Foldseek alignment supports the detection of remote homologs with high confidence.
- Orthogonal evidence, including low DNA read mapping, strand-specific expression, and ORF metrics, confirms RNA origin and reduces false positives.
- Open-source code and an openly released RdRp structure database enable scalable, reproducible RNA virome discovery in environmental and clinical settings.
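
The two-stage design described in the abstract (a cheap screen over every sequence, followed by costly structural validation of the survivors only) can be illustrated with a minimal Python sketch. This is not Rider's code: the function name `two_stage_screen`, both scorer callables, and both thresholds are hypothetical placeholders. In a real pipeline the first scorer would presumably wrap the protein language model and the second would wrap structure prediction plus structural alignment.

```python
from typing import Callable, Iterable, List, Tuple


def two_stage_screen(
    sequences: Iterable[Tuple[str, str]],
    cheap_score: Callable[[str], float],
    structural_score: Callable[[str], float],
    cheap_threshold: float,
    structural_threshold: float,
) -> List[str]:
    """Return IDs of sequences that pass both stages.

    All names and thresholds here are illustrative assumptions,
    not part of any published tool.
    """
    hits: List[str] = []
    for seq_id, seq in sequences:
        # Stage 1: inexpensive score computed for every candidate.
        if cheap_score(seq) < cheap_threshold:
            continue
        # Stage 2: expensive score, computed only for stage-1 survivors.
        if structural_score(seq) >= structural_threshold:
            hits.append(seq_id)
    return hits
```

Because the expensive scorer only ever sees sequences that pass the first threshold, total cost is dominated by the cheap stage when most candidates are rejected early. The general cascade pattern gets its throughput from this.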
Matching journals
The top 2 journals account for 50% of the predicted probability mass.