Machine learning-based prediction of memory requirements for metagenomic assembly in high-performance computing environments

Zierep, P. F.; Faack, S.; Beracochea, M.; Sanchez, S.; Batut, B.; Finn, R. D.; Gruening, B. A.

2026-05-13 microbiology
10.64898/2026.05.12.724571 bioRxiv
Metagenomic assembly can be a computationally intensive step in microbiome analysis, with memory requirements that vary widely depending on input data characteristics. In workflow systems like Galaxy and large-scale platforms like MGnify, which run thousands of heterogeneous jobs, inaccurate memory allocation drives job failures and costly retries when underestimated, and reduces throughput when overestimated. Current approaches rely primarily on heuristic rules based on input file size or sample metadata, which often fail to generalize across diverse datasets. In this study, we present a machine learning-based framework for predicting memory requirements of metagenomic assembly using metaSPAdes. We analyzed 300 assembly jobs from diverse biomes and evaluated 18 predictive models using combinations of input file size, biome classification, and sequence-derived k-mer features. K-mer profiles were computed from raw sequencing data and summarized into statistical descriptors capturing sequence complexity and diversity. Model performance was assessed using both conventional regression metrics and a production-oriented cost function that accounts for retry policies and resource waste in high-performance computing environments. Our results show that machine learning models can outperform commonly used heuristics. In particular, models incorporating biome information achieved the best performance and can be tuned to favor conservative predictions that reduce job failure rates. Simpler models based solely on input file size also performed competitively, offering a practical alternative for systems with limited feature availability. When evaluated under realistic workload distributions, predictive approaches reduced total memory waste by several million gigabyte-hours per 1,000 jobs compared to static allocation strategies. These findings demonstrate that data-driven resource prediction can substantially improve efficiency in metagenomic workflows. 
The proposed framework is adaptable to different computational environments and provides a foundation for integrating predictive resource allocation into large-scale bioinformatics platforms beyond Galaxy.
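
The production-oriented cost function mentioned in the abstract is not specified here, so the following is only a minimal sketch of how such a cost might be computed under a retry policy. The function name `wasted_gb_hours`, the multiplicative `retry_factor`, the `max_retries` limit, and the `fail_fraction` parameter are all illustrative assumptions and not the authors' definitions.

```python
def wasted_gb_hours(predicted_gb, actual_gb, runtime_h,
                    retry_factor=2.0, max_retries=3, fail_fraction=1.0):
    """Estimate memory waste (GB-hours) for one job under a retry policy.

    Assumptions (hypothetical, not taken from the paper):
      - a job fails when its allocation is below its true peak memory;
      - each retry multiplies the allocation by `retry_factor`;
      - a failed attempt wastes its whole allocation for
        `fail_fraction` of the job's runtime;
      - a successful attempt wastes only the unused headroom.

    Returns (waste, attempts_failed), where attempts_failed is None
    if the job never succeeded within `max_retries` retries.
    """
    alloc = float(predicted_gb)
    waste = 0.0
    for attempt in range(max_retries + 1):
        if alloc >= actual_gb:
            # Success: only the over-allocated headroom is wasted.
            waste += (alloc - actual_gb) * runtime_h
            return waste, attempt
        # Failure: the entire allocation was spent on a job that died.
        waste += alloc * runtime_h * fail_fraction
        alloc *= retry_factor
    return waste, None


# Over-prediction: 20 GB of headroom held for 10 hours.
over = wasted_gb_hours(100, 80, 10)
# Under-prediction: one failed attempt at 50 GB, then success at 100 GB.
under = wasted_gb_hours(50, 80, 10)
```

Under these assumptions an underestimate is penalized more heavily than a comparable overestimate, which is one way a model could be tuned toward the conservative predictions described above.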

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. PLOS Computational Biology | 1633 papers in training set | Top 0.9% | 21.9%
2. GigaScience | 172 papers in training set | Top 0.1% | 17.0%
3. Bioinformatics Advances | 184 papers in training set | Top 0.2% | 8.9%
4. Bioinformatics | 1061 papers in training set | Top 4% | 6.1%
50% of probability mass above
5. NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 2.7%
6. BMC Genomics | 328 papers in training set | Top 1% | 2.5%
7. Genome Biology | 555 papers in training set | Top 4% | 2.0%
8. Scientific Reports | 3102 papers in training set | Top 51% | 2.0%
9. BMC Bioinformatics | 383 papers in training set | Top 4% | 2.0%
10. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 3% | 2.0%
11. Microbial Genomics | 204 papers in training set | Top 1% | 1.7%
12. PeerJ | 261 papers in training set | Top 8% | 1.6%
13. Journal of Proteome Research | 215 papers in training set | Top 1% | 1.6%
14. iScience | 1063 papers in training set | Top 16% | 1.6%
15. Frontiers in Bioinformatics | 45 papers in training set | Top 0.3% | 1.6%
16. Genome Research | 409 papers in training set | Top 3% | 1.4%
17. PLOS ONE | 4510 papers in training set | Top 57% | 1.4%
18. Frontiers in Microbiology | 375 papers in training set | Top 6% | 1.3%
19. mSystems | 361 papers in training set | Top 6% | 1.3%
20. Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.3%
21. Nature Communications | 4913 papers in training set | Top 56% | 1.3%
22. Metabolites | 50 papers in training set | Top 0.7% | 1.2%
23. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 5% | 0.9%
24. Communications Biology | 886 papers in training set | Top 22% | 0.8%
25. Cell Systems | 167 papers in training set | Top 13% | 0.7%
26. eLife | 5422 papers in training set | Top 62% | 0.6%
27. Microbiome | 139 papers in training set | Top 4% | 0.6%
28. Advanced Science | 249 papers in training set | Top 23% | 0.6%