
GAP-MS: Automated validation of gene predictions using integrated mass spectrometry evidence

Abbas, Q.; Wilhelm, M.; Kuster, B.; Frischman, D.

2026-03-19 bioinformatics
10.64898/2026.03.17.712294 bioRxiv

Accurate genome annotation is fundamental to modern biology, yet distinguishing authentic protein-coding sequences from prediction artifacts remains challenging, particularly in complex plant genomes where automated methods are error-prone and manual curation is rarely feasible due to prohibitive time and costs. Here, we present GAP-MS (Gene model Assessment using Peptides from Mass Spectrometry), an automated proteogenomic pipeline that leverages mass spectrometry evidence to systematically validate the protein-level accuracy of predicted gene models. Applied across 9 major crop species, GAP-MS consistently improved prediction precision for four widely used gene prediction tools. In addition to filtering erroneous models, the pipeline identified hundreds of previously missing gene models from current standard reference annotations. These peptide-supported loci were further verified by transcriptional evidence, well-supported functional annotations, and high coding-potential scores. Together, these results demonstrate that direct proteomic evidence provides a robust framework for resolving annotation ambiguities, defining high-confidence reference proteomes, and uncovering overlooked protein-coding genes, while facilitating the identification of sequences that may require further investigation. GAP-MS is freely available as a web interface at https://webclu.bio.wzw.tum.de/gapms/.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | Nature Methods | 336 papers in training set | Top 0.2% | 23.4%
2 | Nature Communications | 4913 papers in training set | Top 5% | 19.4%
3 | Cell Systems | 167 papers in training set | Top 2% | 6.6%
4 | Molecular & Cellular Proteomics | 158 papers in training set | Top 0.4% | 6.6%
50% of probability mass above
5 | Genome Biology | 555 papers in training set | Top 1% | 5.0%
6 | Nature Biotechnology | 147 papers in training set | Top 2% | 4.1%
7 | Bioinformatics | 1061 papers in training set | Top 5% | 3.8%
8 | Journal of Proteome Research | 215 papers in training set | Top 0.9% | 2.4%
9 | Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 3% | 2.0%
10 | Advanced Science | 249 papers in training set | Top 10% | 1.8%
11 | Nucleic Acids Research | 1128 papers in training set | Top 10% | 1.7%
12 | Plant Communications | 35 papers in training set | Top 0.9% | 1.5%
13 | Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.4%
14 | Nature Machine Intelligence | 61 papers in training set | Top 2% | 1.4%
15 | PLOS Computational Biology | 1633 papers in training set | Top 20% | 1.1%
16 | Molecular Systems Biology | 142 papers in training set | Top 1% | 0.9%
17 | Molecular Plant | 36 papers in training set | Top 1% | 0.8%
18 | Plant Biotechnology Journal | 56 papers in training set | Top 1% | 0.8%
19 | PLOS ONE | 4510 papers in training set | Top 65% | 0.8%
20 | Plant Physiology | 217 papers in training set | Top 3% | 0.8%
21 | eLife | 5422 papers in training set | Top 56% | 0.8%
22 | Cell Reports Methods | 141 papers in training set | Top 5% | 0.7%
23 | The Plant Cell | 141 papers in training set | Top 2% | 0.5%
24 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 48% | 0.5%
25 | The Plant Journal | 197 papers in training set | Top 4% | 0.5%
26 | Nature Plants | 84 papers in training set | Top 2% | 0.5%
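
The 50% cutoff marked in the list can be checked by adding up the listed probabilities in rank order until the running total reaches the threshold. The sketch below does this for the first ten entries. The function name `journals_to_reach` and the variable names are illustrative assumptions and are not taken from the page. The probabilities are copied from the list.

```python
from itertools import accumulate

# Probabilities (%) of the ten top-ranked journals, copied from the list above.
probabilities = [23.4, 19.4, 6.6, 6.6, 5.0, 4.1, 3.8, 2.4, 2.0, 1.8]


def journals_to_reach(threshold, probs):
    """Return the smallest number of top-ranked entries whose running
    total reaches `threshold`, or None if the entries never reach it.

    Hypothetical helper written for this illustration.
    """
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


print(journals_to_reach(50.0, probabilities))  # prints 4
```

The first three entries sum to 49.4%, just short of the threshold, and the fourth brings the total to 56.0%, so four journals is the smallest set that covers half of the predicted probability mass.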