A Conditional Random Field approach for de novo reconstruction of bacterial haplotypes from a de Bruijn graph representation

Steyaert, A.; Van Hecke, M.; Marchal, K.; Fostier, J.

2026-05-12 bioinformatics
10.64898/2026.05.11.724222 bioRxiv
Background: Detecting distinct bacterial strains in a mixed sample is an important, yet less well-developed, aspect of metagenomic research. Several methods exist that successfully retrieve a de novo reconstruction of viral strains. However, the reconstruction of bacterial haplotypes poses its own distinct challenges, and methods that successfully reconstruct full genome-length bacterial strains de novo are scarce. Here, we develop HaploDetox, a method for de novo bacterial haplotype reconstruction from short reads. We use a de Bruijn graph representation of the reads in which nodes correspond to k-mers from the read set and arcs represent overlaps between the sequences of two nodes. Our aim is to accurately assign labels to each node and arc in the graph to reveal the presence or absence of their corresponding sequence in individual strains.

Results: Using a negative binomial mixture model, we model the relationship between the read coverage of nodes and arcs in the graph and their presence in a strain. We achieve improved labelling accuracy by including contextual information from neighbouring nodes and arcs with a Conditional Random Field. These labels are used to extract strain-specific de Bruijn graphs from the original graph. Additionally, we allow users to assess the number of strains present in the dataset based on model selection criteria. We evaluate our node/arc labelling accuracy on simulated datasets and in silico mixes of real datasets containing different numbers of strains, as well as on in vitro mixed real datasets. Existing de novo haplotype reconstruction methods present their reconstruction as strain-specific sets of SNPs. By aligning the unitigs from strain-specific graphs to a reference genome, we demonstrate that HaploDetox assigns strain-specific SNPs with higher recall than existing methods and with similar precision.

Conclusions: We achieve improved strain-specific SNP phasing accuracy compared to existing methods for de novo bacterial haplotype reconstruction. Additionally, HaploDetox is not limited to the determination of strain-specific SNPs, and other types of variant calls can be obtained through reference alignment. Finally, strain-specific de Bruijn graphs are an important first step towards full genome-length bacterial haplotype-aware assembly.
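
As a rough illustration of the graph representation the abstract describes (nodes as k-mers, arcs as overlaps between consecutive k-mers, each carrying a read coverage), the following is a minimal Python sketch. It is not the HaploDetox implementation: the function name `build_de_bruijn`, the toy reads, and the choice k=4 are all hypothetical, and the sketch ignores reverse complements, sequencing errors, and unitig compaction.

```python
from collections import Counter


def build_de_bruijn(reads, k):
    """Count k-mer (node) and k-mer-pair (arc) coverage for a toy de Bruijn graph.

    Hypothetical helper for illustration only; a real assembler would also
    handle reverse complements and compact linear paths into unitigs.
    """
    node_cov = Counter()
    arc_cov = Counter()
    for read in reads:
        # All k-mers of the read, in order of occurrence.
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        node_cov.update(kmers)
        # Consecutive k-mers overlap by k-1 characters and define an arc.
        arc_cov.update(zip(kmers, kmers[1:]))
    return node_cov, arc_cov


# Toy reads (assumed data): two reads agree, the third carries a variant.
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
node_cov, arc_cov = build_de_bruijn(reads, k=4)

for kmer, cov in sorted(node_cov.items()):
    print(kmer, cov)
```

In this toy input the node `ACGT` has two outgoing arcs, one per variant. Coverage counts of this kind are the sort of signal that, according to the abstract, the negative binomial mixture model relates to the presence of a node or arc in a strain.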

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

Rank. Journal | Papers in training set | Percentile | Predicted probability

1. Bioinformatics | 1061 | Top 0.6% | 33.2%
2. BMC Bioinformatics | 383 | Top 0.2% | 19.6%
(50% of probability mass above)
3. Microbial Genomics | 204 | Top 0.2% | 8.5%
4. Bioinformatics Advances | 184 | Top 1% | 3.6%
5. NAR Genomics and Bioinformatics | 214 | Top 0.7% | 3.6%
6. PLOS Computational Biology | 1633 | Top 9% | 3.6%
7. GigaScience | 172 | Top 0.7% | 2.8%
8. PLOS ONE | 4510 | Top 46% | 2.4%
9. Genome Biology | 555 | Top 4% | 1.9%
10. BMC Genomics | 328 | Top 2% | 1.7%
11. Scientific Reports | 3102 | Top 59% | 1.7%
12. Briefings in Bioinformatics | 326 | Top 4% | 1.7%
13. Nucleic Acids Research | 1128 | Top 13% | 1.3%
14. Frontiers in Microbiology | 375 | Top 7% | 1.2%
15. Computational and Structural Biotechnology Journal | 216 | Top 7% | 1.1%
16. Genome Research | 409 | Top 3% | 1.0%
17. Methods in Ecology and Evolution | 160 | Top 2% | 0.8%
18. Nature Communications | 4913 | Top 61% | 0.8%
19. Wellcome Open Research | 57 | Top 2% | 0.7%
20. PeerJ | 261 | Top 17% | 0.6%
21. iScience | 1063 | Top 37% | 0.6%
22. mSystems | 361 | Top 8% | 0.6%
23. Frontiers in Bioinformatics | 45 | Top 2% | 0.5%