
Determinants of haplotype phasing accuracy in long-read human genome sequencing

Damaraju, N. E.; Frost, F. G.; Fu, J.; Donofrio, D.; Goffena, J.; Storz, S.; Anderson, Z.; Prall, T.; Galey, M.; Malicdan, M. C.; Adams, D.; Miller, D. E.

2026-05-08 genomics
10.64898/2026.05.04.722832 bioRxiv

Accurate haplotype phasing is critical for interpreting human genetic variation. Long-read whole-genome sequencing has emerged as a powerful approach for read-based phasing, particularly where parental DNA is absent, yet the determinants of phasing accuracy remain incompletely defined. Here, we evaluate haplotype phasing performance across sequencing technology, reference genome, read length, and coverage depth using Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) data from two Genome in a Bottle reference samples (HG002 and HG005). In clinically relevant genes, alignment to the T2T-CHM13 (T2T) reference genome improves phasing performance relative to GRCh38, reducing mean gene-level phasing error rates by 3-9-fold. T2T alignment increases phase set NG50 and yields 1.5-2-fold more phased variant pairs. At similar read N50 values, ONT has a higher phasing error rate than PacBio in certain genes. Downsampling demonstrates that phasing error rates plateau at ~20x coverage. Longer ONT read lengths reduce phasing error rates and extend phase set contiguity. Haplotype-resolved assemblies produce substantially higher phasing error rates than alignment-based phasing, demonstrating the advantage of an alignment-based approach. To enable per-variant-pair confidence assessment, we introduce PhaseQuality, a technology-specific stratification method that assigns confidence tiers to phased variants based solely on sequencing data. PhaseQuality accurately assigns 82-99% of known phasing errors to lower-confidence tiers, reducing error rates among high-confidence pairs to <0.5%. Together, these results demonstrate the primary technical determinants of long-read haplotype phasing accuracy and provide practical benchmarks for optimizing reference genome selection, coverage targets, and read length for long-read sequencing studies.
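
The abstract reports phase set NG50 as its contiguity metric without defining it. As a minimal sketch, assuming the conventional definition (the largest length L such that phase sets of length at least L together span at least half of a fixed genome size), it could be computed as follows. The function name, the inputs, and the choice to return 0 when half the genome is never reached are illustrative assumptions, not the authors' implementation.

```python
from typing import Iterable


def phase_set_ng50(block_lengths: Iterable[int], genome_size: int) -> int:
    """Return the phase set NG50 in base pairs (hypothetical helper).

    block_lengths: span of each phase set (phase block), in bp.
    genome_size:   fixed genome length used as the denominator, in bp.
    Returns 0 if the phase sets never cover half of genome_size
    (an assumption made for this sketch).
    """
    half = genome_size / 2
    covered = 0
    # Walk from the longest phase set downward, accumulating span.
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


# Toy example: five phase sets against a 200 bp "genome".
toy_blocks = [50, 40, 30, 20, 10]
toy_ng50 = phase_set_ng50(toy_blocks, 200)
```

Because NG50 uses a fixed genome size rather than the total phased span, values stay comparable across runs that phase different fractions of the genome, which is likely why it suits a comparison between reference genomes.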

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. The American Journal of Human Genetics: 206 papers in training set, Top 0.3%, 14.0%
2. Nature Communications: 4913 papers in training set, Top 15%, 12.2%
3. Nature Methods: 336 papers in training set, Top 0.9%, 12.0%
4. Nature Biotechnology: 147 papers in training set, Top 0.8%, 9.8%
5. Science: 429 papers in training set, Top 3%, 9.8%

50% of probability mass above

6. Genome Medicine: 154 papers in training set, Top 1%, 4.7%
7. Nature Genetics: 240 papers in training set, Top 2%, 4.7%
8. Genome Biology: 555 papers in training set, Top 2%, 4.2%
9. Nature: 575 papers in training set, Top 7%, 3.6%
10. Cell Genomics: 162 papers in training set, Top 2%, 3.5%
11. Genome Research: 409 papers in training set, Top 2%, 2.0%
12. Nature Computational Science: 50 papers in training set, Top 0.6%, 1.7%
13. Scientific Reports: 3102 papers in training set, Top 65%, 1.3%
14. Nucleic Acids Research: 1128 papers in training set, Top 14%, 1.2%
15. Cell: 370 papers in training set, Top 15%, 0.9%
16. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 42%, 0.9%
17. BMC Genomics: 328 papers in training set, Top 6%, 0.7%
18. NAR Genomics and Bioinformatics: 214 papers in training set, Top 4%, 0.7%
19. Cell Systems: 167 papers in training set, Top 13%, 0.7%
20. Nature Machine Intelligence: 61 papers in training set, Top 4%, 0.7%
21. PLOS ONE: 4510 papers in training set, Top 70%, 0.7%
22. Communications Biology: 886 papers in training set, Top 27%, 0.7%
23. eLife: 5422 papers in training set, Top 60%, 0.7%