NativeReady: an open benchmark and sequence-based triage model for native mass spectrometry suitability

Znabu, B. F.; Atif, Z.

2026-05-06 bioinformatics
10.64898/2026.05.03.722506 bioRxiv
Native mass spectrometry is a central analytical method for characterizing intact proteins, antibody-drug conjugates, and non-covalent assemblies, and it is increasingly the deciding measurement in biotherapeutic development pipelines. A single screening attempt requires days of expression, purification, and buffer exchange into ammonium acetate, followed by 30 to 60 minutes of optimization on a Q-Exactive UHMR or comparable instrument. To our knowledge, no published sequence-based predictor currently estimates native MS suitability before experimental screening. We curated 634 unique proteins with documented native MS outcomes: a 232-protein hand-curated base set, 358 entries recovered from RCSB PDB by full-text searching for native MS terminology, and 44 evidence-based extractions from supplementary tables across 80 EuropePMC papers. We trained four model variants on this benchmark: a 36-feature BioPython physicochemical baseline, an ESM-2 linear probe, an ESM-2 PCA-256 random forest, and a combined model that concatenates ESM-2 PCA components with BioPython features. All variants were evaluated under cluster-aware 5-fold cross-validation (GroupKFold over ESM-2 embedding-similarity clusters, our defense against homology leakage) with isotonic calibration; standard stratified 5-fold cross-validation is reported as a sensitivity analysis. Under the cluster-aware splits, the combined model achieved an AUC of 0.869 ± 0.036, close to the original stratified-CV value (0.873) and above the BioPython baseline (0.852). The ESM-2-only variants showed AUC drops of 0.024 to 0.046 between stratified and cluster-aware splits, indicating that some of the apparent ESM-2 contribution under standard CV reflects homology leakage. Negative recall was 9.4% under cluster-aware splitting versus 26.0% under stratified splitting, confirming that the model's apparent failure-detection capability was substantially inflated by within-fold homology. We report both numbers and treat the cluster-aware values as the primary results. We release the curated dataset, the trained model, and an interactive web tool at nativeready.netlify.app. In its current form, NativeReady should be interpreted primarily as a positive-suitability triage tool; failure prediction remains limited by the scarcity of experimentally documented negative cases. We propose a user-contribution mechanism to accumulate real failure data over time. NativeReady is, to our knowledge, the first open benchmark and triage model specifically designed for this task.
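The cluster-aware splitting and the negative-recall metric described in the abstract can be illustrated with a small standard-library sketch. This is not the authors' pipeline: the function names `group_k_fold` and `negative_recall`, the greedy balancing rule, and the toy cluster labels are all assumptions made for illustration. The abstract states only that GroupKFold over ESM-2 embedding-similarity clusters was used.

```python
from collections import Counter


def group_k_fold(groups, n_splits=5):
    """Assign each sample to a fold so that no group spans two folds.

    Greedy balancing (an illustrative assumption): largest groups first,
    each placed in the fold that currently holds the fewest samples.
    """
    sizes = Counter(groups)
    if len(sizes) < n_splits:
        raise ValueError("need at least n_splits distinct groups")
    fold_sizes = [0] * n_splits
    group_to_fold = {}
    # Sort by descending size, then by label, so the result is deterministic.
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = fold_sizes.index(min(fold_sizes))
        group_to_fold[group] = target
        fold_sizes[target] += size
    return [group_to_fold[g] for g in groups]


def negative_recall(y_true, y_pred):
    """Fraction of true negatives (label 0) that were also predicted as 0."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not negatives:
        return float("nan")
    return sum(1 for p in negatives if p == 0) / len(negatives)


# Hypothetical toy data: cluster IDs stand in for embedding-similarity clusters.
clusters = ["c1", "c1", "c1", "c2", "c2", "c3", "c4", "c5", "c5", "c6"]
folds = group_k_fold(clusters, n_splits=5)

# Every cluster lands in exactly one fold, so similar proteins never sit
# on both sides of a train/test split.
for cluster in set(clusters):
    assert len({f for f, c in zip(folds, clusters) if c == cluster}) == 1
```

Because whole clusters are held out together, a model cannot score well on a test protein merely by having seen a close relative during training, which is the leakage the stratified splits permit.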

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Journal of Proteome Research | 215 | Top 0.3% | 12.5% |
| 2 | Nature Communications | 4913 | Top 15% | 12.2% |
| 3 | Bioinformatics | 1061 | Top 3% | 8.3% |
| 4 | Molecular & Cellular Proteomics | 158 | Top 0.3% | 8.3% |
| 5 | Nature Methods | 336 | Top 2% | 6.3% |
| 6 | Journal of the American Society for Mass Spectrometry | 33 | Top 0.1% | 6.2% |
| 7 | Chemical Science | 71 | Top 0.4% | 3.5% |
| 8 | Cell Systems | 167 | Top 5% | 2.8% |
| 9 | Nature Machine Intelligence | 61 | Top 1% | 2.6% |
| 10 | PLOS ONE | 4510 | Top 48% | 2.0% |
| 11 | Analytical Chemistry | 205 | Top 1% | 1.8% |
| 12 | Proceedings of the National Academy of Sciences | 2130 | Top 33% | 1.7% |
| 13 | Nature Biotechnology | 147 | Top 5% | 1.7% |
| 14 | Communications Chemistry | 39 | Top 0.3% | 1.7% |
| 15 | Computational and Structural Biotechnology Journal | 216 | Top 5% | 1.7% |
| 16 | Journal of Chemical Information and Modeling | 207 | Top 2% | 1.7% |
| 17 | Communications Biology | 886 | Top 9% | 1.7% |
| 18 | Briefings in Bioinformatics | 326 | Top 5% | 1.2% |
| 19 | Advanced Science | 249 | Top 16% | 0.9% |
| 20 | Protein Science | 221 | Top 1% | 0.9% |
| 21 | JACS Au | 35 | Top 0.9% | 0.9% |
| 22 | Nucleic Acids Research | 1128 | Top 16% | 0.9% |
| 23 | eLife | 5422 | Top 54% | 0.9% |
| 24 | International Journal of Molecular Sciences | 453 | Top 14% | 0.8% |
| 25 | Biophysical Journal | 545 | Top 5% | 0.7% |
| 26 | Metabolites | 50 | Top 1% | 0.7% |
| 27 | Genome Biology | 555 | Top 8% | 0.7% |
| 28 | ACS Chemical Biology | 150 | Top 2% | 0.7% |

The 50% probability-mass threshold falls between ranks 6 and 7.
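
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked from the listed percentages with a running sum. The sketch below is only an illustration: the helper name `journals_to_reach` is hypothetical, and the inputs are the rounded values shown in the listing, so the cumulative totals are approximate.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return how many top-ranked entries are needed for the running sum
    of probabilities (in percent) to reach the threshold, or None."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count
    return None


# Rounded predicted probabilities (percent) for ranks 1 to 8 in the listing.
top_probabilities = [12.5, 12.2, 8.3, 8.3, 6.3, 6.2, 3.5, 2.8]
n_needed = journals_to_reach(top_probabilities)
```

With these values the running sum is about 47.6% after five journals and about 53.8% after six, consistent with the threshold falling after rank 6.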