Comprehensive top-down mass spectral repository enables pan-dataset analysis and top-down spectral prediction

Li, K.; Liu, K.; Fulcher, J. M.; Tang, H.; Liu, X.

2026-02-23 bioinformatics
10.64898/2026.02.20.707032 bioRxiv

Mass spectral libraries have become essential resources for training deep learning (DL) models for spectral prediction and de novo sequencing in bottom-up mass spectrometry (BU-MS). Compared with BU-MS, top-down MS (TD-MS) offers unique advantages for characterizing intact proteoforms because it analyzes them without enzymatic digestion. Despite these advantages, large-scale spectral libraries for TD-MS are currently lacking. Here we present TopRepo, the first comprehensive repository of TD-MS spectra, comprising more than 18 million spectra acquired from 12 species across eight types of mass spectrometers. Using TopRepo, we constructed a large-scale top-down spectral library containing over 5 million spectra with curated proteoform and fragment-ion annotations. We demonstrate that TopRepo enables pan-dataset analyses of N-terminal processing, mass shifts, and other proteoform characteristics identified by TD-MS. Furthermore, we show that the TopRepo spectral library substantially improves proteoform identification through spectral library searching and supports the training of DL models for high-accuracy top-down spectral prediction.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature Communications | 4913 papers in training set | Top 16% | 10.6%
2. Analytical Chemistry | 205 papers in training set | Top 0.3% | 10.2%
3. Nature Methods | 336 papers in training set | Top 1% | 9.3%
4. PLOS ONE | 4510 papers in training set | Top 25% | 6.9%
5. Journal of Proteome Research | 215 papers in training set | Top 0.4% | 6.9%
6. Journal of the American Society for Mass Spectrometry | 33 papers in training set | Top 0.1% | 6.5%
50% of probability mass above
7. Bioinformatics | 1061 papers in training set | Top 4% | 6.4%
8. Molecular & Cellular Proteomics | 158 papers in training set | Top 0.5% | 4.9%
9. Nature Machine Intelligence | 61 papers in training set | Top 1% | 2.8%
10. Nature Biotechnology | 147 papers in training set | Top 4% | 1.9%
11. Advanced Science | 249 papers in training set | Top 10% | 1.7%
12. Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.3%
13. Scientific Data | 174 papers in training set | Top 1% | 1.3%
14. Communications Biology | 886 papers in training set | Top 14% | 1.2%
15. Cell Systems | 167 papers in training set | Top 9% | 1.2%
16. Genome Biology | 555 papers in training set | Top 5% | 1.2%
17. PROTEOMICS | 35 papers in training set | Top 0.5% | 1.1%
18. Scientific Reports | 3102 papers in training set | Top 69% | 1.0%
19. Communications Chemistry | 39 papers in training set | Top 1% | 0.8%
20. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 9% | 0.8%
21. mSystems | 361 papers in training set | Top 7% | 0.8%
22. ACS Nano | 99 papers in training set | Top 4% | 0.7%
23. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 6% | 0.7%
24. Nano Letters | 63 papers in training set | Top 3% | 0.7%
25. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 47% | 0.7%
26. Metabolites | 50 papers in training set | Top 1% | 0.7%
27. Cell Reports Methods | 141 papers in training set | Top 6% | 0.7%
28. Nucleic Acids Research | 1128 papers in training set | Top 19% | 0.7%
29. iScience | 1063 papers in training set | Top 37% | 0.7%
30. PLOS Computational Biology | 1633 papers in training set | Top 29% | 0.5%
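
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The following is a minimal sketch, not the site's own method: the function name `journals_to_reach` is hypothetical, and the inputs are the rounded percentages shown for the ten highest-ranked journals, so the running total is only approximate.

```python
# Predicted probabilities (percent) for the ten highest-ranked journals,
# copied from the listing above. These are rounded display values.
probabilities = [10.6, 10.2, 9.3, 6.9, 6.9, 6.5, 6.4, 4.9, 2.8, 1.9]


def journals_to_reach(probs, threshold):
    """Return (count, total): how many top-ranked entries are needed
    for the running total to reach `threshold`, and that total.
    Returns (None, total) if the threshold is never reached."""
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total


count, total = journals_to_reach(probabilities, 50.0)
print(count, round(total, 1))  # prints: 6 50.4
```

With these values the running total first reaches 50% at the sixth journal, which matches the position of the "50% of probability mass above" marker.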