PREMISE: A Quality-Aware Probabilistic Framework for Pathogen Resolution and Source Assignment in Viral mNGS

Vijendran, S.; Dorman, K.; Anderson, T. K.; Eulenstein, O.

2026-03-18 bioinformatics
10.64898/2026.03.15.711921 bioRxiv
The circulation of Influenza A viruses (IAVs) in wildlife and livestock presents a significant public health threat due to their zoonotic potential and rapid genomic diversification. Accurate classification of viral subtypes and characterization of within-host diversity are crucial for risk assessment and vaccine development. Although metagenomic sequencing facilitates early detection, prevalent memory-efficient k-mer-based pipelines often discard critical linkage information. This loss of information can result in missed or imprecise pathogen identification, potentially delaying clinical and public health responses. We introduce PREMISE (Pathogen Resolution via Expectation Maximization In Sequencing Experiments), a probabilistic, alignment-based framework implemented in Rust for high-resolution viral genome identification. By integrating advanced string data structures for efficient alignment with a quality-score-aware Expectation-Maximization algorithm, PREMISE accurately identifies source strains, estimates relative abundances, and performs precise read assignments. This framework provides superior source estimation with statistical confidence, enabling the identification of mixed infections, recombination, and IAV reassortment directly from raw data. Validated against simulated and empirical datasets, PREMISE outperforms state-of-the-art k-mer methods. Ultimately, this framework represents a significant advancement in viral identification, providing a foundation for novel approaches that can automatically flag reassorted viruses or recombination events in the future, thereby improving the detection of emerging pathogens with zoonotic potential. Availability: https://github.com/sriram98v/premise under an MIT license. Contact: sriramv@iastate.edu

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Bioinformatics Advances; 184 papers in training set; Top 0.2%; 9.8%
2. PLOS Computational Biology; 1633 papers in training set; Top 3%; 9.8%
3. Virus Evolution; 140 papers in training set; Top 0.1%; 8.9%
4. Bioinformatics; 1061 papers in training set; Top 3%; 8.1%
5. Nature Methods; 336 papers in training set; Top 2%; 6.1%
6. Cell Systems; 167 papers in training set; Top 2%; 6.1%
7. Nature Biotechnology; 147 papers in training set; Top 2%; 4.7%
50% of probability mass above
8. BMC Bioinformatics; 383 papers in training set; Top 3%; 3.5%
9. Genome Biology; 555 papers in training set; Top 3%; 3.5%
10. Nature Communications; 4913 papers in training set; Top 41%; 3.5%
11. Microbiome; 139 papers in training set; Top 1%; 3.5%
12. Briefings in Bioinformatics; 326 papers in training set; Top 3%; 2.0%
13. GigaScience; 172 papers in training set; Top 1%; 2.0%
14. Genome Research; 409 papers in training set; Top 2%; 1.7%
15. Cell Reports Methods; 141 papers in training set; Top 3%; 1.6%
16. PLOS ONE; 4510 papers in training set; Top 56%; 1.6%
17. Nucleic Acids Research; 1128 papers in training set; Top 13%; 1.3%
18. Genome Medicine; 154 papers in training set; Top 6%; 1.3%
19. Patterns; 70 papers in training set; Top 1%; 1.3%
20. mSystems; 361 papers in training set; Top 6%; 1.3%
21. mSphere; 281 papers in training set; Top 5%; 1.2%
22. Viruses; 318 papers in training set; Top 4%; 1.1%
23. NAR Genomics and Bioinformatics; 214 papers in training set; Top 3%; 0.9%
24. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 42%; 0.9%
25. Scientific Reports; 3102 papers in training set; Top 72%; 0.9%
26. Communications Biology; 886 papers in training set; Top 26%; 0.7%
27. iScience; 1063 papers in training set; Top 34%; 0.7%