Pansoma, a machine learning tool for identifying somatic variants using pangenome graphs

Shen, J.; Fu, Q.; Macias, J. F.; Human Pangenome Reference Consortium; Li, D.; Wang, T.

2026-05-29 genomics
10.64898/2026.05.27.726245 bioRxiv

Somatic variant calling, the identification of mutations acquired in non-germline cells over an individual's lifetime, is critical for studying diseases, including cancer, and for developing precision oncology strategies. Traditional somatic variant calling methods rely on linear reference genomes, which do not adequately capture human genetic diversity and therefore introduce reference bias, compromising the accuracy of somatic variant detection. The recently developed graph-based human pangenome reference represents diverse genetic variants across human populations and promises to drive advances in many genetics and genomics studies. In this study, we introduce Pansoma, a novel pangenome-native, machine learning-based tool designed specifically for somatic variant calling against a pangenome graph reference. Pansoma performs somatic variant detection from both short- and long-read sequencing data by learning tensor representations of alignments on graph nodes rather than on a linear reference. Pansoma outputs variant representations anchored to pangenome graph paths as well as conventional somatic variant calls remapped to the linear reference. Additionally, we provide a suite of bioinformatics tools tailored for graph-based genomic data management and for analysis of variant calling results. Benchmarking shows that Pansoma improves tumor-only somatic variant detection while preserving graph-specific variant representations that are not directly recoverable from linear-reference outputs.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Bioinformatics | 1061 | Top 2% | 12.3%
2 | Genome Medicine | 154 | Top 0.5% | 9.9%
3 | Genome Biology | 555 | Top 0.5% | 9.0%
4 | Genome Research | 409 | Top 0.2% | 8.1%
5 | Bioinformatics Advances | 184 | Top 0.5% | 6.2%
6 | Nature Biotechnology | 147 | Top 2% | 4.8%
50% of probability mass above
7 | Cell Genomics | 162 | Top 1% | 3.8%
8 | Nature Communications | 4913 | Top 41% | 3.5%
9 | Frontiers in Genetics | 197 | Top 2% | 3.0%
10 | Nucleic Acids Research | 1128 | Top 7% | 2.8%
11 | Nature Methods | 336 | Top 4% | 2.3%
12 | NAR Genomics and Bioinformatics | 214 | Top 1% | 2.1%
13 | Genomics, Proteomics & Bioinformatics | 171 | Top 3% | 2.0%
14 | Nature Machine Intelligence | 61 | Top 2% | 1.9%
15 | Briefings in Bioinformatics | 326 | Top 3% | 1.9%
16 | Nature Computational Science | 50 | Top 0.5% | 1.9%
17 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.2% | 1.9%
18 | PLOS Computational Biology | 1633 | Top 15% | 1.8%
19 | Computational and Structural Biotechnology Journal | 216 | Top 4% | 1.8%
20 | Nature | 575 | Top 12% | 1.5%
21 | BMC Bioinformatics | 383 | Top 5% | 1.3%
22 | The American Journal of Human Genetics | 206 | Top 3% | 0.9%
23 | Cell Systems | 167 | Top 11% | 0.9%
24 | BMC Medical Genomics | 36 | Top 1% | 0.8%
25 | GigaScience | 172 | Top 3% | 0.8%
26 | Scientific Reports | 3102 | Top 73% | 0.8%
27 | Cell | 370 | Top 18% | 0.7%
28 | Nature Genetics | 240 | Top 8% | 0.7%
29 | Journal of Genetics and Genomics | 36 | Top 3% | 0.6%
30 | Communications Biology | 886 | Top 30% | 0.6%
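
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running total over the listed probabilities. The snippet below is a minimal sketch for that arithmetic only; the helper name `rank_reaching` is hypothetical and is not part of the page or of Pansoma.

```python
from itertools import accumulate

# Predicted probabilities (percent) of the six top-ranked journals,
# copied from the list above.
top_probs = [12.3, 9.9, 9.0, 8.1, 6.2, 4.8]


def rank_reaching(probs, threshold):
    """Return the 1-based rank at which the running total first
    reaches `threshold`, or None if it never does."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


# Rank at which the cumulative probability first reaches 50%.
cutoff_rank = rank_reaching(top_probs, 50.0)
print(cutoff_rank)
```

The running total after five journals is 45.5%, and adding the sixth brings it to 50.3%, so the cutoff falls at rank 6, consistent with the marker in the list.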