Selecting genomes that matter: haplotype-based prioritization for iterative pangenome expansion

Marone, M. P.; Chen, E.; Himmelbach, A.; Haberer, G.; Spannagl, M.; Stein, N.; Mascher, M.

2026-05-18 genomics
10.64898/2026.05.13.724976 bioRxiv
Background: As pangenomes approach saturation, identifying additional genomes that contribute novel sequence information becomes increasingly difficult. Current sample-selection strategies often rely on global diversity metrics or variant counts and do not explicitly account for the composition of an existing pangenome, a limitation that becomes increasingly relevant as pangenomes mature. Here, we present SelHap, a haplotype-based pipeline that uses whole-genome sequencing (WGS) data to prioritize accessions based on their contribution of novel haplotypes relative to a defined background, enabling targeted and iterative pangenome expansion.

Results: We applied SelHap to the barley pangenome, using 76 assembled genomes as a background to select new accessions from a large WGS panel. Using this approach, we generated chromosome-scale genome assemblies from 19 accessions selected with SelHap and from 17 elite lines selected based on their relevance in historical barley breeding. Across multiple benchmarking scenarios, SelHap-based selection consistently resulted in a greater increase in non-redundant (single-copy) pangenome sequence, demonstrating that prioritizing haplotype novelty relative to an existing background maximizes unrepresented sequence content.

Conclusions: By transforming complex haplotype-clustering outputs into interpretable summaries and ranked candidate lists, SelHap provides a practical framework for targeted pangenome expansion. Beyond sample selection, SelHap can facilitate ancestry and germplasm comparisons across diverse panels. As WGS data become more accessible, SelHap offers a scalable and interpretable solution for extending mature pangenomes by explicitly targeting previously unrepresented sequence space.
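The abstract does not spell out how SelHap ranks accessions, so the following is only a minimal sketch of the general idea of prioritizing candidates by haplotypes absent from a defined background. It is not SelHap's implementation. The function name `rank_by_novel_haplotypes`, the representation of a haplotype as a `(window, cluster)` pair, and the greedy selection rule are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a haplotype is a (genomic window, cluster label) pair.
Haplotype = Tuple[str, int]


def rank_by_novel_haplotypes(
    background: Set[Haplotype],
    candidates: Dict[str, Set[Haplotype]],
    n_select: int,
) -> List[Tuple[str, int]]:
    """Greedily pick candidates that add the most haplotypes not yet covered.

    Illustrative assumption, not the published method: after each pick,
    the chosen accession's haplotypes are added to the covered set, so
    later picks are scored against the expanded background.
    """
    covered = set(background)
    remaining = dict(candidates)
    picks: List[Tuple[str, int]] = []

    while remaining and len(picks) < n_select:
        # Iterate names in sorted order so ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break  # no remaining candidate contributes anything new
        picks.append((best, gain))
        covered |= remaining.pop(best)

    return picks
```

Scoring each pick against the already expanded set, rather than only the original background, mirrors the iterative expansion the abstract describes: two candidates carrying the same novel haplotypes are not both ranked highly.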

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. BMC Genomics; 328 papers in training set; Top 0.1%; 12.1%
2. Methods in Ecology and Evolution; 160 papers in training set; Top 0.5%; 6.7%
3. The Plant Genome; 53 papers in training set; Top 0.1%; 6.2%
4. Bioinformatics Advances; 184 papers in training set; Top 0.5%; 6.2%
5. Genome Biology; 555 papers in training set; Top 1%; 6.2%
6. Nature Communications; 4913 papers in training set; Top 34%; 4.8%
7. Frontiers in Plant Science; 240 papers in training set; Top 2%; 4.8%
8. GigaScience; 172 papers in training set; Top 0.4%; 4.2%
50% of probability mass above
9. BMC Bioinformatics; 383 papers in training set; Top 2%; 3.9%
10. Bioinformatics; 1061 papers in training set; Top 5%; 3.6%
11. Scientific Reports; 3102 papers in training set; Top 46%; 2.6%
12. The Plant Journal; 197 papers in training set; Top 2%; 2.1%
13. PLOS ONE; 4510 papers in training set; Top 48%; 2.0%
14. Plant Biotechnology Journal; 56 papers in training set; Top 0.5%; 2.0%
15. G3: Genes, Genomes, Genetics; 222 papers in training set; Top 0.3%; 2.0%
16. Microbial Genomics; 204 papers in training set; Top 1%; 1.9%
17. NAR Genomics and Bioinformatics; 214 papers in training set; Top 2%; 1.7%
18. Nucleic Acids Research; 1128 papers in training set; Top 11%; 1.7%
19. Genome Research; 409 papers in training set; Top 2%; 1.7%
20. Cell Genomics; 162 papers in training set; Top 3%; 1.7%
21. Molecular Ecology Resources; 161 papers in training set; Top 0.7%; 1.5%
22. Genome Medicine; 154 papers in training set; Top 5%; 1.5%
23. PLOS Computational Biology; 1633 papers in training set; Top 19%; 1.3%
24. G3; 33 papers in training set; Top 0.3%; 1.3%
25. Nature Genetics; 240 papers in training set; Top 5%; 1.3%
26. Computational and Structural Biotechnology Journal; 216 papers in training set; Top 7%; 0.9%
27. G3 Genes|Genomes|Genetics; 351 papers in training set; Top 2%; 0.8%
28. Applications in Plant Sciences; 21 papers in training set; Top 0.3%; 0.7%
29. Communications Biology; 886 papers in training set; Top 27%; 0.7%