Evaluating Reference-Independent Pipelines for the Detection of Spreading Organisms in Metagenomic Datasets

Popov, N. S.; Panova, V. V.; Molchanova, M.; Gurov, S.; Lukashev, A. N.; Manolov, A.; Ilina, E. N.

2026-05-06 bioinformatics
10.64898/2026.05.03.722517 bioRxiv
The emergence of unidentified pathogens, or "Disease X," poses a significant threat to global health, necessitating the development of proactive surveillance strategies for the wildlife and human virosphere. Since novel viruses often lack universal genetic markers or known homologs, this study evaluates four reference-independent computational pipelines (coverage-based, k-mer-based, nucleotide-clustering-based, and Large Language Model (LLM)-based), each designed to detect spreading organisms by comparing distinct metagenomic datasets. Using a real-world pandemic dataset of human nasopharyngeal RNA-seq runs and a semi-synthetic dataset enriched with divergent Egovirales sequences, we measured the sensitivity, selectivity, and computational efficiency of each approach. The coverage-based method proved most robust, consistently achieving 100% genome coverage of SARS-CoV-2 and maintaining high selectivity even at low viral concentrations, though it required extensive computational resources (20 days of CPU time for 2B reads). In contrast, the k-mer-based approach offered a tenfold reduction in execution time and high selectivity but was sensitive to data depletion, failing to detect targets at very low abundances. The clustering-based pipeline performed effectively at moderate concentrations but suffered from sequence fragmentation in sparse data, while the LLM-based method (using ViraLM), despite its efficiency, exhibited critically low selectivity due to current limitations in latent space partitioning. These results demonstrate that while k-mer-based and LLM-based tools provide rapid screening capabilities, the coverage-based approach remains the most reliable for sensitive pathogen discovery. Ultimately, these reference-independent workflows are essential for illuminating metagenomic "dark matter" and establishing early warning systems for emerging infectious diseases.
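The idea of comparing two metagenomic datasets at the k-mer level can be illustrated with a toy sketch. This is not the authors' pipeline: the function names (`canonical`, `count_kmers`, `emerging_kmers`), the small default `k`, the pseudocount, and the fold-change threshold are all assumptions chosen for illustration. The sketch counts canonical k-mers in a baseline read set and a later read set, then reports k-mers whose relative frequency has risen sharply in the later set.

```python
from collections import Counter

# Translation table for reverse-complementing DNA strings.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(reads, k):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def emerging_kmers(baseline_reads, later_reads, k=5, min_count=2, min_fold=4.0):
    """Return {kmer: fold_change} for k-mers enriched in later_reads.

    Hypothetical scoring: relative frequency in the later set divided by
    relative frequency in the baseline set (with a pseudocount of 1 so that
    k-mers absent from the baseline do not cause division by zero).
    """
    base = count_kmers(baseline_reads, k)
    later = count_kmers(later_reads, k)
    base_total = sum(base.values()) or 1
    later_total = sum(later.values()) or 1

    hits = {}
    for kmer, n in later.items():
        if n < min_count:
            continue  # too rare in the later set to be informative
        later_freq = n / later_total
        base_freq = (base.get(kmer, 0) + 1) / base_total
        fold = later_freq / base_freq
        if fold >= min_fold:
            hits[kmer] = fold
    return hits


if __name__ == "__main__":
    baseline = ["ACGTACGTACGT"] * 10
    later = baseline + ["GGATTCAGGATC"] * 10  # reads from a "new" organism
    for kmer, fold in sorted(emerging_kmers(baseline, later).items()):
        print(kmer, round(fold, 2))
```

The `min_count` cutoff in this sketch also hints at why a k-mer approach can miss very low-abundance targets: k-mers seen too few times cannot be separated from noise. Real analyses would likely use much longer k-mers and a dedicated k-mer counter rather than an in-memory `Counter`.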

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Briefings in Bioinformatics | 326 papers in training set | Top 0.1% | 18.4%
2. BMC Bioinformatics | 383 papers in training set | Top 2% | 6.3%
3. PLOS Computational Biology | 1633 papers in training set | Top 6% | 6.3%
4. Viruses | 318 papers in training set | Top 0.9% | 6.2%
5. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.9% | 4.8%
6. Scientific Reports | 3102 papers in training set | Top 28% | 4.3%
7. NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.5% | 4.1%

50% of probability mass above

8. Bioinformatics Advances | 184 papers in training set | Top 1% | 3.9%
9. GigaScience | 172 papers in training set | Top 0.5% | 3.6%
10. PLOS ONE | 4510 papers in training set | Top 45% | 2.6%
11. Frontiers in Bioinformatics | 45 papers in training set | Top 0.1% | 2.1%
12. Genome Medicine | 154 papers in training set | Top 4% | 1.8%
13. Nucleic Acids Research | 1128 papers in training set | Top 10% | 1.8%
14. Patterns | 70 papers in training set | Top 0.8% | 1.8%
15. PeerJ | 261 papers in training set | Top 6% | 1.8%
16. Microbiome | 139 papers in training set | Top 2% | 1.8%
17. Cell Reports Methods | 141 papers in training set | Top 2% | 1.7%
18. Bioinformatics | 1061 papers in training set | Top 7% | 1.7%
19. Communications Biology | 886 papers in training set | Top 10% | 1.6%
20. Nature Communications | 4913 papers in training set | Top 53% | 1.6%
21. Frontiers in Genetics | 197 papers in training set | Top 7% | 1.2%
22. Genome Biology | 555 papers in training set | Top 6% | 1.2%
23. iScience | 1063 papers in training set | Top 22% | 1.2%
24. Frontiers in Microbiology | 375 papers in training set | Top 7% | 0.9%
25. mSphere | 281 papers in training set | Top 5% | 0.9%
26. mSystems | 361 papers in training set | Top 6% | 0.9%
27. Virus Evolution | 140 papers in training set | Top 1% | 0.8%
28. Cell Systems | 167 papers in training set | Top 12% | 0.7%
29. BMC Genomics | 328 papers in training set | Top 6% | 0.7%
30. Advanced Science | 249 papers in training set | Top 20% | 0.7%