From SNPs to Pathways: A genome-wide benchmark of annotation discrepancies and their impact on protein- and pathway-level inference

Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.

2026-03-24 bioinformatics
10.64898/2026.03.21.713397 bioRxiv
Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear.

Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more.

Conclusion: SNP annotation outcomes are influenced by both the tool and gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.
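
The idea of measuring each configuration against a "union reference" can be illustrated with a small sketch. This is not the authors' pipeline: the SNP identifiers and the per-configuration sets below are arbitrary toy values, and the names `annotations`, `union_reference`, and `coverage` are assumptions introduced only for illustration. It shows how coverage per tool and gene model could be computed once every configuration's annotated SNPs are pooled.

```python
# Hypothetical toy data: which SNPs each (tool, gene model) configuration annotated.
# The IDs and set memberships are made up and do not reflect the paper's results.
annotations = {
    ("ANNOVAR", "Ensembl"): {"s1", "s2"},
    ("ANNOVAR", "RefSeq"): {"s1", "s2", "s3"},
    ("SnpEff", "Ensembl"): {"s1", "s2", "s4"},
    ("SnpEff", "RefSeq"): {"s1", "s3", "s4", "s5"},
    ("VEP", "Ensembl"): {"s2", "s4"},
    ("VEP", "RefSeq"): {"s2", "s3", "s6"},
}

# The union reference pools every SNP annotated by at least one configuration.
union_reference = set().union(*annotations.values())

# Coverage of a configuration = fraction of the union reference it recovers.
coverage = {
    config: len(snps) / len(union_reference)
    for config, snps in annotations.items()
}

# SNPs a configuration would lose relative to the integrated (union) approach.
missed = {config: union_reference - snps for config, snps in annotations.items()}

for (tool, model), frac in sorted(coverage.items()):
    print(f"{tool:8s} {model:8s} coverage={frac:.2f} missed={sorted(missed[(tool, model)])}")
```

In this toy setup no single configuration reaches a coverage of 1.0, which is the situation the abstract describes for the real data; the integrated set recovers the full union by construction.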

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set, Top 0.1%, 23.7%
2. Bioinformatics: 1061 papers in training set, Top 1%, 20.5%
3. European Journal of Human Genetics: 49 papers in training set, Top 0.2%, 4.2%
4. BioData Mining: 15 papers in training set, Top 0.1%, 3.9%
50% of probability mass above
5. BMC Medical Genomics: 36 papers in training set, Top 0.2%, 3.4%
6. PLOS ONE: 4510 papers in training set, Top 43%, 2.9%
7. Scientific Reports: 3102 papers in training set, Top 44%, 2.7%
8. GigaScience: 172 papers in training set, Top 0.9%, 2.2%
9. PLOS Computational Biology: 1633 papers in training set, Top 15%, 1.9%
10. International Journal of Cancer: 42 papers in training set, Top 0.6%, 1.8%
11. PeerJ: 261 papers in training set, Top 6%, 1.8%
12. The American Journal of Human Genetics: 206 papers in training set, Top 2%, 1.8%
13. Human Mutation: 29 papers in training set, Top 0.4%, 1.8%
14. Nucleic Acids Research: 1128 papers in training set, Top 10%, 1.8%
15. Database: 51 papers in training set, Top 0.5%, 1.4%
16. NAR Genomics and Bioinformatics: 214 papers in training set, Top 3%, 1.3%
17. Frontiers in Genetics: 197 papers in training set, Top 6%, 1.3%
18. Journal of Translational Medicine: 46 papers in training set, Top 2%, 1.2%
19. Bioinformatics Advances: 184 papers in training set, Top 4%, 0.9%
20. Genome Medicine: 154 papers in training set, Top 7%, 0.9%
21. F1000Research: 79 papers in training set, Top 4%, 0.8%
22. Journal of the American Medical Informatics Association: 61 papers in training set, Top 2%, 0.8%
23. BMC Genomics: 328 papers in training set, Top 5%, 0.8%
24. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 8%, 0.8%
25. Genome Biology: 555 papers in training set, Top 7%, 0.8%
26. Journal of Biomedical Informatics: 45 papers in training set, Top 1%, 0.8%
27. Human Molecular Genetics: 130 papers in training set, Top 3%, 0.8%
28. Human Genetics and Genomics Advances: 70 papers in training set, Top 0.9%, 0.7%
29. Genetics in Medicine: 69 papers in training set, Top 1%, 0.7%
30. Trials: 25 papers in training set, Top 2%, 0.5%
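
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below copies the listed percentages for the ten highest-ranked journals; the function name `journals_to_reach` is a hypothetical helper, not part of any tool behind this page.

```python
# Predicted probabilities (%) for the ten highest-ranked journals, copied from the list above.
probs = [23.7, 20.5, 4.2, 3.9, 3.4, 2.9, 2.7, 2.2, 1.9, 1.8]


def journals_to_reach(probs, threshold):
    """Return (count, cumulative mass) once the running total first reaches threshold.

    Returns (None, total) if the threshold is never reached.
    """
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total


count, mass = journals_to_reach(probs, 50.0)
print(count, round(mass, 1))
```

With the listed values, the running total first passes 50% at the fourth journal (about 52.3%), which matches the position of the "50% of probability mass above" marker.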