PROTEOMICS — Latest Matching Preprints

1

Community Resource: A Genome-Based Extension of Large-Scale Wheat Proteogenomics

Vincent, D.; Appels, R.

2026-07-08 plant biology 10.64898/2026.06.17.733048 medRxiv

Top 0.1%

19.1%

Bread wheat (Triticum aestivum L.) possesses a large and highly repetitive allohexaploid genome, and its annotation requires extensive protein-level validation. We developed a genome-based wheat proteogenomics workflow integrating large-scale MS/MS reanalysis, GFF3-based peptide coordinate reconstruction, thorough validation, and genome browser-compatible peptide deployment against the IWGSC RefSeq v2.1 reference genome. Public wheat proteomics datasets comprising 577 raw mass spectrometry files (~1.0 TB) from 32 tissues were reprocessed using FragPipe/MSFragger, generating 2,226,779 non-redundant peptides and 1,648,740 unique protein accessions. Peptide-to-genome projections using GFF3 annotation files produced 8,291,056 projected genomic peptide rows, of which 98.14% passed validation procedures. Overall, peptide evidence supported 103,095 high-confidence (HC) and 135,495 low-confidence (LC) wheat gene models, corresponding to 96.4% and 84.7% of all parsed HC and LC annotations, respectively. In total, 238,590 wheat gene models (89.4% of all parsed annotations) received protein-level support. Apollo/JBrowse-compatible BED tracks enabled exon-resolved visualisation of peptide evidence across wheat chromosomes. Together, this study establishes a scalable GFF3-based proteogenomics framework for complex polyploid plant genomes and provides an extensive community resource for wheat genome annotation refinement and visual exploration (https://bread-wheat-um.genome.edu.au/apollo/49826/jbrowse/index.html).
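As a rough illustration of what GFF3-based peptide coordinate reconstruction involves, the sketch below maps a peptide's residue range onto genomic blocks through the CDS segments of a gene model. It is not the authors' implementation: the function name `project_peptide`, the toy coordinates, and the assumption that the CDS is complete and starts in frame (phase 0) are all illustrative.

```python
def project_peptide(cds_segments, strand, pep_start_aa, pep_end_aa):
    """Map a peptide to genomic blocks.

    cds_segments: list of (start, end) CDS coordinates, 1-based inclusive,
                  as they would appear in GFF3 CDS rows (hypothetical input).
    strand:       "+" or "-".
    pep_start_aa, pep_end_aa: 1-based inclusive residue positions in the protein.
    Assumes a complete CDS that starts in frame (phase 0).
    """
    # Walk the CDS segments in translation order.
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    # Peptide span as 0-based, half-open offsets into the spliced CDS.
    nt_start = (pep_start_aa - 1) * 3
    nt_end = pep_end_aa * 3
    blocks = []
    consumed = 0  # CDS nucleotides covered by earlier segments
    for start, end in ordered:
        length = end - start + 1
        lo = max(nt_start, consumed)
        hi = min(nt_end, consumed + length)
        if lo < hi:  # the peptide overlaps this segment
            if strand == "+":
                blocks.append((start + (lo - consumed), start + (hi - consumed) - 1))
            else:
                blocks.append((end - (hi - consumed) + 1, end - (lo - consumed)))
        consumed += length
    return sorted(blocks)


# Toy gene model with two CDS segments (9 nt + 12 nt = 7 codons).
toy_cds = [(101, 109), (201, 212)]
print(project_peptide(toy_cds, "+", 3, 5))  # [(107, 109), (201, 206)]
print(project_peptide(toy_cds, "-", 1, 5))  # [(107, 109), (201, 212)]
```

A peptide that spans a splice junction yields more than one block, which is what makes exon-resolved BED tracks possible.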

2

AliceDB database and pipeline for identification of natural protein variants based on mass spectrometry measurement data

Thiel, M.; Rozycka, A.; Puchalski, M.; Oldziej, S.

2026-06-15 bioinformatics 10.64898/2026.06.11.731579 medRxiv

Top 0.1%

13.6%

The natural variation that distinguishes living organisms within a single species is currently being studied intensively, primarily at the genetic level. Unfortunately, studies of natural variants at the level of protein gene products are not very common, mainly due to the lack of appropriate databases and bioinformatics tools. The main research technique used to study proteomes/peptidomes is mass spectrometry (MS). A classic method for interpreting raw mass spectrometry data in proteomic/peptidomic studies involves the use of databases containing representative (canonical) sequences that define the proteome of the organism under study. In this paper, we present the AliceDB database, which contains information on over 7 million natural variants of protein sequences described in the scientific literature for Homo sapiens. The data contained in the AliceDB database can be utilized using widely available and commonly used software for interpreting proteomic data. Test results regarding the use of the AliceDB database for the interpretation of proteomic data indicate that accounting for the presence of natural variants increases both the number and quality of identified proteins. Furthermore, it is easy to identify protein sequence variants that may, for example, be of significance in medicine.
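To see why a variant-aware database can change identifications, consider that a single amino-acid substitution may create or destroy tryptic peptides that a canonical-only database cannot match. The sketch below is a minimal illustration under stated assumptions, not the AliceDB pipeline: the sequence and the variant are invented, the helper names are hypothetical, and the digestion rule is the simplified "cleave after K or R unless followed by P" with no missed cleavages.

```python
import re


def tryptic_digest(seq):
    """Simplified in-silico trypsin digest: cleave after K/R, not before P."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", seq) if pep]


def apply_substitution(seq, pos, ref, alt):
    """Apply a single-residue substitution at a 1-based position."""
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[: pos - 1] + alt + seq[pos:]


canonical = "MAGTKLSEQRPDVK"                          # invented toy sequence
variant = apply_substitution(canonical, 7, "S", "R")  # invented S7R variant

canonical_peptides = set(tryptic_digest(canonical))
variant_peptides = set(tryptic_digest(variant))

# Peptides that only a variant-containing database would allow a search engine to match.
novel = sorted(variant_peptides - canonical_peptides)
print(novel)  # ['EQRPDVK', 'LR']
```

Here the substitution introduces a new cleavage site, so the canonical peptide LSEQRPDVK is replaced by two shorter peptides.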

3

LAMPrEY: a Python-based automated quality control tool for large-scale proteomics datasets

Valdes-Tresanco, M. E.; Wacker, S.; Valdes-Tresanco, M. S.; Plakhotnyk, A.; Brodie, N. I.; Hepburn, M.; Ulke-Lemee, A.; Huttlin, E. L.; Lewis, I. A.

2026-05-11 bioinformatics 10.64898/2026.05.06.722826 medRxiv

Top 0.1%

13.2%

In recent years, proteomics has moved increasingly towards the analysis of large cohorts of biological specimens. This has been made possible by significant improvements in mass spectrometry technology, chromatographic separation methods, and data acquisition strategies. These technological advances now routinely enable experiments that yield vast datasets that substantially outstrip the capacity of existing proteomics data analysis approaches. Processing such large datasets requires purpose-built quality control tools designed to organize and analyze the data while recording all processing parameters for reproducibility. To address this need, we developed an open-source, Python-based software platform, Large-scale Automated Multi-level Proteomics Evaluation by Python (LAMPrEY), a comprehensive quality-control pipeline for quantitative proteomics analyses of large cohorts of samples. LAMPrEY features GUI-based file submission, automated processing with MaxQuant and RawTools, an interactive analytics dashboard, and an application programming interface (API) for programmatic usage that collectively enable rapid, reproducible analysis and interpretation of proteomics data. We demonstrate the longitudinal monitoring and analytical capabilities of LAMPrEY using TMT11 quantitative proteomics data generated from 910 Enterococcus faecium isolates collected from bloodstream infection patients. LAMPrEY is open-source software that can be accessed at www.lewisresearchgroup.org/software.

4

Hidden Structural Bias in Proteomics: Sonication-induced Selective Fragmentation of Intrinsically Disordered Regions

Narita, M.; Yamakawa, T.; Nishimura, R.; Iwasaki, M.

2026-07-15 cell biology 10.64898/2026.07.14.738389 medRxiv

Top 0.1%

12.6%

Sonication is a fundamental technique in proteome sample preparation, primarily used for protein solubilization and shearing of genomic DNA. Although the mechanical shearing of DNA is well-characterized, its unintended impact on protein structural integrity remains a significant "blind spot" in high-throughput analytical workflows. In this study, we systematically investigated sonication-induced protein fragmentation by combining gel-based fractionation (PEPPI-MS) with sequence-level compositional analysis and bioinformatic mapping. Our results demonstrate that sonication does not significantly alter overall proteome identification or the recovery of membrane proteins; however, it induces extensive and non-random protein fragmentation. Sonication caused an approximately three-fold increase in the abundance of >45 kDa protein-derived fragments migrating into the <40 kDa fraction, and 1,620 high-molecular-weight (MW) proteins were uniquely detected in the lower-MW fraction upon sonication, an eight-fold increase over non-sonicated controls. Peptide-level amino acid composition analysis revealed subtle but directional shifts in the sonication-derived fragments. This residue-level signature is reinforced by two orthogonal structural analyses (MobiDB peptide-level mapping and protein-level profiling using metapredict V3 software), which show that sonication-susceptible proteins harbor more than twice the disordered content of length-matched controls (median 40% vs. 18%). This study identifies a previously unrecognized "structural bias" whereby intrinsically disordered region (IDR)-rich proteins are selectively compromised during sample preparation. 
Because these fragments are indistinguishable from enzymatic digestion products in conventional bottom-up proteomics, the underlying structural damage is effectively masked in global quantitative datasets, potentially distorting biological interpretations related to protein size, isoforms, and stability, particularly for IDR-rich classes, such as transcription factors and signaling molecules. We propose that optimizing and standardizing sonication parameters is essential for ensuring the accuracy and reproducibility of quantitative proteomic analyses.

5

Systems-Informed prioritization of Exosomal Protein Candidates in TNBC Identifies an ECM Invasion Module and Nominates Agrin as a High-Priority Target

Nguyen, T. M.

2026-05-19 cancer biology 10.64898/2026.05.14.725271 medRxiv

Top 0.1%

12.2%

Background: Triple-negative breast cancer (TNBC) remains the most clinically challenging breast cancer subtype, in part due to the absence of validated molecular targets and the limited availability of non-invasive early detection strategies. Tumor-derived exosomes have emerged as promising liquid biopsy analytes, yet the functional organization of their protein cargo and the identification of biologically meaningful candidates remain incompletely characterized. Methods: We present a Composite Driver Score (CDS) framework that integrates differential expression magnitude with protein-protein interaction network topology and Analytic Hierarchy Process (AHP)-based multi-criteria weighting to prioritize exosomal protein candidates in a systems-informed manner. The framework was applied to publicly available label-free quantitative proteomic datasets comparing MDA-MB-231 (TNBC) and MCF-10A (non-tumorigenic) exosomal fractions, with cross-dataset validation performed on an independent proteomic dataset. Results: CDS prioritization demonstrated robustness to variations in proteome depth and parameter weighting, consistently recovering a functionally coherent set of extracellular matrix (ECM) and adhesion-associated proteins. Network and pathway analyses revealed coordinated co-enrichment of integrin receptors, cognate ECM ligands, and associated co-receptors, consistent with selective packaging of a functionally integrated invasion module. Agrin (AGRN), a heparan sulfate proteoglycan with limited prior characterization in TNBC exosome biology, emerged as a high-priority candidate through its network integration within this ECM program. Conclusions: These findings support a model in which TNBC-derived exosomes carry coordinated molecular programs capable of modulating extracellular matrix architecture. 
The CDS framework offers a transferable strategy for integrative exosomal biomarker prioritization and a systems-level foundation for targeted liquid biopsy panel development.

6

De-N-glycosylation of in vivo and in vitro adipogenic stem cell products unmasks differential expression of CD36 glycoprotein in human adipogenesis

Wongtrakul-Kish, K.; Herbert, B. R.; Haynes, P. A.; Packer, N. H.

2026-05-05 cell biology 10.64898/2026.05.01.722121 medRxiv

Top 0.1%

11.3%

Adipogenesis is the process by which adipose-derived stem cells (ADSCs) respond to extracellular signals from the stem cell niche and differentiate into adipocytes (fat cells). It may be studied in vitro using a cocktail of chemicals that promote adipogenic differentiation to produce differentiated ADSCs (dADSCs). The global membrane N- and O-glycosylation changes of this process have been previously analysed and compared to native adipocytes as a benchmark for a true adipocyte profile, and revealed that bisecting GlcNAc type N-glycans are characteristic of adipogenesis. As stem cell differentiation has been widely reported to result in cellular protein changes, the same cells (ADSCs, dADSCs and mature adipocytes) were characterised for their membrane proteome here using label-free quantitative shotgun proteomics analysis. The membrane proteome displayed more differences in protein numbers between the cell types than the previously reported N-glycome, which had shown highly similar glycomes between stem cells and in vitro dADSCs, suggesting that the proteome is more dynamic during in vitro adipogenesis. Following the global shotgun proteomics analysis, a more targeted approach of carrying out proteomic analysis of de-N-glycosylated peptides of gel-separated proteins unearthed new glycoproteins not detected in the shotgun proteomic analysis. This approach identified the adipogenic marker, CD36, to be under-represented in the shotgun proteome analysis, but as the dominant (glyco)protein in the adipocyte membrane proteome that was also up-regulated at the mRNA transcript level in both the in vitro differentiated ADSCs (7.1-fold increase) and mature adipocytes (102.9-fold increase). 
A comparison of CD36 sequence coverage in the global shotgun analysis with the de-N-glycosylated CD36 revealed a 41% increase when N-glycans were removed prior to trypsin digestion, explaining its observed increased abundance and highlighting the crucial need for de-N-glycosylation of proteins in proteomics experiments for increased identification of glycoproteins. The systems glycobiology approach, integrating previously reported glycomics data with the proteomics and transcriptomics analyses in this work, extended the investigation of membrane protein glycosylation changes in adipose-derived stem cell differentiation. The work provides a framework for future glycoproteomics-based investigations into the differentiation of stem cells into adipocytes, and will allow their related pathologies and potential therapeutic applications to be discovered.

7

Analysis of Confounding Factors in Reactive Cysteine Profiling Reveals Enhanced Chromatin-Protein Association via CDK7 Inhibition by THZ1

Yang, K.; Li, S.; Li, B.; Richards, D.; Dong, K.; Seneviratne, U.; Lee, W.; Iannetta, A.; Xu, H.; Gygi, S.; Yu, Q.

2026-05-07 cell biology 10.64898/2026.05.05.721470 medRxiv

Top 0.1%

10.0%

Recent advances in activity-based proteome profiling (ABPP) have enabled global mapping of cysteine ligandability, uncovering novel biological insights and opportunities for identifying disease vulnerabilities. While both live cell-based and native lysate-based ABPP have been applied, how cysteine ligandability differs between these systems and what factors influence these measurements remain unclear. Building on our previous development of a high-throughput TMT-ABPP workflow for native lysates, here we adapt the protocol for live cells and systematically compare cysteine ligandability across both platforms. Our analysis reveals three major contributors to the discrepancies: in-cellulo cysteine accessibility, protein abundance changes, and protein relocalization. Notably, we highlight that CDK7 inhibitor THZ1 induces substantial protein relocalization and promotes chromatin binding. Together, these results provide a practical framework for ABPP experimental design and data interpretation, supporting more accurate application of ABPP in functional proteomics and drug discovery.

8

Capillary-based Subcellular Sampling Uncovers the Stress Granule Proteome in Single Cells

Davison, C.; Locker, N.; Marques, M.; Kelly, S.; Relton, E.; Sharma, T.; Fraser, E.; Aragon Fernandez, P.; Schoof, E. M.; Petersen, M.; Pascoe, J.; Lilley, K. S.; Pinto, S. M.; Spick, M.; Bailey, M.

2026-05-13 cell biology 10.64898/2026.05.11.724230 medRxiv

Top 0.1%

9.1%

Many diseases arise from dysfunction within specific organelles or biomolecular condensates, highlighting the value of analysing proteins at subcellular resolution to uncover new biological mechanisms. We report a novel capillary-based subcellular sampling workflow coupled with liquid chromatography-mass spectrometry (LC-MS) for proteomic analysis of defined subcellular regions of individual cells. We applied this methodology to stress granules (SGs), membrane-less biomolecular condensates that form in response to cellular stress (including viral infection), and are implicated in infection, neuropathology and cancer. Comprehensive characterisation of SG protein composition remains limited by technical challenges associated with bulk purification, including loss of spatial context, dynamic behaviour and contamination from cytosolic material. Using our novel method, we identified a high-confidence set of 405 SG-associated proteins, including 46 established SG residents alongside numerous previously unreported candidates. Functional enrichment analysis revealed pathways consistent with known SG biology, while comparison with an independent cytosolic proteome dataset demonstrated minimal overlap, supporting the specificity of the sampling strategy. Selected novel SG protein candidates (AHNAK2, DDX39B, NUDT1 and FKBP2) were validated using immunofluorescence microscopy. These findings establish capillary-based subcellular sampling as a viable approach for proteomic analysis of SGs with preserved spatial context and provide a framework for analysing other subcellular compartments. Table of contents: We report an LC-MS-based capillary sampling workflow for proteomic analysis of subcellular structures within single cells. This methodology identified 405 high-confidence stress granule-associated proteins, including 46 previously established and numerous novel candidates. 
The approach demonstrated high specificity and preserved spatial context, expanding the capabilities of subcellular proteomics. Graphical abstract figure made in Biorender.com.

9

Confident Identification and Quantification of Mouse Brain Tissues Reveals Sirtuin 5-Dependent Regulation

Landgrave-Gomez, J.; Bons, J.; Vega-Hormazabal, G.; Riley, R.; Schilling, B.; Verdin, E.

2026-05-28 cell biology 10.64898/2026.05.26.726073 medRxiv

Top 0.1%

8.1%

Methylmalonylation is a non-enzymatic lysine post-translational modification derived from methylmalonyl-CoA, a reactive intermediate that accumulates during mitochondrial dysfunction and branched-chain amino acid catabolism. Although reported in models of methylmalonic acidemia, its broader distribution and functional relevance remain largely unexplored. Progress has been hindered by a key analytical challenge: methylmalonyl- and succinyl-lysine are isobaric (+100.0160 Da) and generate overlapping mass spectrometric fragmentation spectra, preventing confident identification in conventional proteomic workflows. Here, we establish a straightforward proteomic workflow that overcomes this barrier and enables confident identification and quantification of lysine methylmalonylation by combining antibody-based enrichment with data-independent acquisition mass spectrometry (DIA-MS). Anti-malonyl antibodies were used to enrich methylmalonylated peptides through cross-reactivity. Using synthetic peptide standards containing malonyl-, succinyl-, or methylmalonyl-lysine, we defined distinguishing analytical features including chromatographic retention time, ion mobility, and fragmentation patterns. Applying this approach to mouse brain tissues from Sirtuin-5 (SIRT5) knockout and wild-type mice, we identified 44 methylmalonylated peptides across 41 proteins, enriched in neuronal and myelin-associated proteins (NEFM, NEFL, MBP) and mitochondrial enzymes such as ADT1. Several sites were increased in SIRT5-deficient brains, consistent with regulation by this mitochondrial deacylase. Functional assays demonstrated that methylmalonylation of myelin basic protein (MBP) impairs lipid binding, linking this modification to myelin stability. Together, this workflow enables confident methylmalonylation identification and defines it as a widespread and regulated modification in the brain, providing a framework to study metabolically driven protein acylation in neurobiology and disease. 
Significance: Lysine methylmalonylation has remained largely unexplored due to its isobaric overlap with succinylation, which prevents confident identification using conventional proteomic workflows. Here, we establish an integrated strategy combining antibody-based enrichment, data-independent acquisition mass spectrometry, and orthogonal analytical features to resolve these modifications with high confidence. Applying this approach to mouse brain tissue reveals a SIRT5-regulated methylmalonylome enriched in mitochondrial and myelin-associated proteins, including myelin basic protein (MBP). Functional assays demonstrate that methylmalonylation impairs MBP lipid binding, linking this modification to myelin stability. Beyond this specific application, our workflow provides a generalizable framework to resolve isobaric post-translational modifications and expands the study of metabolically driven protein acylation in neurobiology and disease.
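The isobaric problem can be checked with a few lines of arithmetic. The sketch below computes monoisotopic mass shifts from elemental composition, assuming the standard net additions for lysine acylation (malonyl C3H2O3; succinyl and methylmalonyl both C4H4O3). The dictionary and function names are illustrative and are not taken from the preprint.

```python
# Monoisotopic masses of the most abundant isotopes (Da).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503207, "O": 15.99491461956}

# Net elemental composition added to lysine by each acyl group.
ACYL_GROUPS = {
    "malonyl": {"C": 3, "H": 2, "O": 3},
    "succinyl": {"C": 4, "H": 4, "O": 3},
    "methylmalonyl": {"C": 4, "H": 4, "O": 3},  # structural isomer of succinyl
}


def delta_mass(composition):
    """Monoisotopic mass shift (Da) for an elemental composition."""
    return sum(MONOISOTOPIC[element] * count for element, count in composition.items())


masses = {name: delta_mass(comp) for name, comp in ACYL_GROUPS.items()}
for name, mass in masses.items():
    print(f"{name:>14}: +{mass:.4f} Da")
```

Because succinyl and methylmalonyl share an elemental composition, precursor mass alone cannot separate them, which is why the abstract relies on retention time, ion mobility, and fragmentation patterns.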

10

Enhanced proteome relative quantification using refined quantotypic spectral libraries

Barnes, B. A.; Alharbi, H.; Unwin, R.

2026-07-10 bioinformatics 10.64898/2026.07.06.736793 medRxiv

Top 0.1%

7.9%

Plasma proteomics is used for a variety of applications including biomarker discovery, disease monitoring, and drug development. Data-independent acquisition (DIA) has vastly improved the breadth of proteins that are identified from samples; however, given challenges in reproducibility and translation, it is critical that the quantitative performance of these methods is reliable. Analysis of global proteomics data typically incorporates information from all detected peptides. However, some peptides do not reflect their parent protein amount, due to irreproducible digestion, modification, analytical interferences or instability. We hypothesise that including these peptides impacts protein relative quantification, and thus, a refined spectral library containing only quantitatively representative peptides provides superior protein quantification. By analysing a defined multi-species spike-in model, we show that refining a plasma spectral library by removing precursors that fail to meet quality control metrics (25.4% of all identified precursors) reduces noise and variability, improving precision, accuracy and differential abundance analysis by up to ~11%, with minimal identification losses and substantial reduction in computational demand. This demonstrates proof-of-concept that refining spectral libraries produces results that prioritize quantification quality over quantity. This approach could enable development of universal tissue-specific refined spectral libraries able to improve quantification quality with easy implementation and minimal processing time. Significance of the Study: As DIA mass spectrometry proteome depth increases, the quality of the associated protein quantifications must be considered alongside identification breadth, particularly in complex matrices such as plasma, which presents additional technical challenges. 
The spectral library used for protein identification and quantification is a critical determinant of DIA performance, and its composition requires careful consideration. This work illustrates an initial step toward improving protein quantification starting at the spectral library level by filtering precursors which are poor quantitative representatives of their parent proteins. In doing so, the resulting data are more reliable for downstream biological interpretation, with fewer false differential abundance assignments and reduced quantitative noise. As such, this work represents a broader shift away from the habitual focus of MS workflows on maximising the number of protein and differential abundance identifications and instead prioritises the quality of quantification over quantity. These initial findings lay the groundwork for further development of spectral library refinement strategies, with the potential to continue improving the accuracy and precision of protein quantification in DIA-based proteomics.
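The abstract does not state which quality control metrics were used to refine the library. Purely as an illustration of the general idea, the sketch below keeps only precursors whose intensities are reproducible across replicate injections, using the coefficient of variation (CV) as a stand-in metric. The 20% threshold, the precursor names, and the intensities are all hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def refine_library(precursor_intensities, max_cv=0.20):
    """Return the precursors whose replicate CV is at or below max_cv.

    precursor_intensities: dict mapping a precursor ID to a list of
    intensities from replicate injections (hypothetical input format).
    """
    return {
        precursor
        for precursor, intensities in precursor_intensities.items()
        if coefficient_of_variation(intensities) <= max_cv
    }


# Invented replicate intensities for two precursors.
replicates = {
    "PEPTIDEA/2": [100.0, 105.0, 98.0],  # reproducible, CV of about 4%
    "PEPTIDEB/2": [50.0, 120.0, 80.0],   # variable, CV of about 42%
}
kept = refine_library(replicates)
print(sorted(kept))
```

A real refinement would likely combine several criteria, but the shape is the same: score each precursor, then drop those that fail before the library is used for quantification.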

11

Trypsin exhibits exopeptidase-like activity toward N-terminal arginine that biases proteomic analyses

Ambrose, E. A.; Kandasamy, G.; Meulener, M. M.; Zhang, F.

2026-05-16 biochemistry 10.64898/2026.05.15.725550 medRxiv

Top 0.1%

7.8%

Many proteomics protocols rely on enzymatic digestion of complex protein mixtures to generate peptides with predictable cleavage patterns for the mass spectrometry analysis. One of the most utilized enzymes, trypsin, is classically defined as a serine endopeptidase with high specificity for cleaving peptide bonds on the C-terminal side of internal lysine and arginine residues. Accordingly, trypsin is not expected to remove the N-terminal arginine, which may arise through posttranslational modification such as arginylation or by proteolysis exposing internal residues as the new N-termini. N-terminal arginine plays important biological roles, including functioning as an N-degron and modulating protein interactions/signaling through its positive charge. Curiously, prior mass spectrometry-based studies utilizing trypsin to identify proteins bearing N-terminal arginine have frequently reported low and inconsistent yields, suggesting potential systematic bias in current proteomic approaches. Here, we explored whether trypsin would affect the integrity of the N-terminal arginine. By using antibodies specifically recognizing N-terminal arginine of different peptides, and by using mass spectrometry peptide analysis, we show that trypsin can remove N-terminal arginine residues in an exopeptidase-like manner. This effect occurs across a range of digestion conditions consistent with standard proteomic workflows, on peptides or whole proteins, and depends on trypsin concentration, incubation time, and catalytic activity. In addition, we show that the alternative arginine-cleavage enzyme Arg-C can also affect N-terminal arginine in a sequence-dependent context. In contrast, Lys-C and LysargiNase do not exhibit such effects, providing suitable alternative digestion strategies. 
Together, these findings reveal an unappreciated enzymatic behavior of arginine-cleaving proteases and suggest that their widespread use may systematically compromise the detection of N-terminal arginine in proteomic studies.

12

ProCAST: A Bioinformatics Suite for Mass Spectrometry-Based Protein Corona Proteomics Analysis

Mun, H.; Leamy, M.; Kaushik, A.; Kieslich, C.; Douglas-Green, S. A.

2026-05-12 bioinformatics 10.64898/2026.05.08.723620 medRxiv

Top 0.1%

7.6%

When nanoparticles are exposed to biological fluids, they spontaneously adsorb proteins, forming a protein corona that defines their biological identity and dictates cellular uptake, biodistribution, and toxicity. Characterizing protein coronas includes using proteomics approaches (e.g., LC-MS/MS) to identify proteins and generate vast lists of adsorbed proteins, often visualized via complex heatmaps. While heatmaps display data, they do not offer heuristic guidance, leaving the driving mechanisms of adsorption unknown. Moreover, interpretation of protein corona proteomics data remains limited by fragmented workflows, inconsistent preprocessing, and visual outputs that are often descriptive rather than readily interpretable. These conventional methods identify adsorbed proteins but fail to explain why specific proteins are selected or how they influence the particle's biological fate. Here, we developed ProCAST (Protein Corona Analysis and Statistical Tool), an R-based framework for protein corona proteomics that integrates proteomics data, nanoparticle metadata, protein annotations, and multi-level visualization within a single analytical workflow. ProCAST facilitates abundant protein clustering based on sample conditions, sequence descriptors, property or protein correlations, and gene ontology-based functional visualization. It also distinguishes abundant proteins from frequent proteins, providing distinct layers of information from the same dataset. ProCAST was used to re-analyze previously published PAMAM G4 dendrimer-FBS datasets, demonstrating that ProCAST reproduces descriptor-level visualizations and offers new insights through clearer comparisons of functional patterns and hypothesis generation from dominant corona proteins. By organizing results as complementary views of the same dataset, ProCAST facilitates the shift of protein corona analysis from descriptive outputs toward structured, comparative, and experimentally testable interpretations.

13

Stoichiometry-dependent specificity in biotin enrichment: a benchmarking framework for proximity labeling proteomics

Zala, C. A.; Trueba Sanchez, M. C.; van den Bor, J.; Willemsens, T.; Verweij, F. J.; Altelaar, M.; Stecker, K.

2026-05-11 molecular biology 10.64898/2026.05.07.723439 medRxiv

Top 0.1%

7.1%

Proximity labeling methods (including BioID, TurboID, and ultraID), along with surface proteomics and microdomain mapping, enable proteome-wide identification of spatially proximal proteins via MS-based analysis. These workflows require specific enrichment of biotinylated proteins using affinity purification, yet enrichment specificity can often be compromised by non-specifically bound proteins. As labeling strategies are increasingly applied to complex biological samples with low protein input or low biotin stoichiometry, accurately distinguishing true targets from background becomes a major analytical challenge. Despite its critical impact on data quality and interpretation, the influence of biotinylation level and protein input on enrichment performance remains poorly characterized, limiting the reliability of proximity labeling experiments. To address this, we establish a quantitative benchmarking framework that systematically evaluates biotin enrichment under controlled conditions, including scenarios of low biotin stoichiometry. Using this setup, we show that enrichment specificity strongly depends on biotin stoichiometry: higher levels of biotinylation in samples yield high specificity, whereas low biotinylation increases non-specific background. Reduced protein input further limits recovery of true targets, yet maintains enrichment specificity, highlighting sensitivity constraints of enrichment-based workflows. We apply this framework to biotinylated extracellular vesicle (EV) cargo uptake in recipient cells using ultraID-CD63 labeling. Detection of the most abundant EV cargo proteins under low biotinylation conditions indicates that current workflows approach the lower bounds of biotin enrichment sensitivity. Together, these standards provide a practical reference for evaluating and optimizing biotin enrichment workflows, supporting quantitative and reproducible proximity labeling in proteomics.

14

Predicting and Elucidating Peptide Retention Mechanisms with Graph Attention Networks

Kensert, A.; Hruzova, K.; Devreese, R.; Nameni, A.; Declercq, A.; Gabriels, R.; Martens, L.; Bouwmeester, R.; Urban, J.

2026-05-20 bioinformatics 10.64898/2026.05.18.725893 medRxiv

Top 0.1%

7.0%

Liquid chromatography (LC) is a key technology in bottom-up proteomics, separating proteolytic peptides to decrease sample complexity, enhance coverage, and increase the robustness of protein identification and quantification. Although high-resolution mass spectrometry has advanced significantly, comparable progress in LC has lagged, primarily due to a limited understanding of peptide-column interactions. To bridge this knowledge gap, we introduce a novel deep learning model (PeptideGNN) based on a Graph Neural Network (GNN) architecture to model and elucidate peptide behaviors across various separation conditions. After training the model to predict peptide retention times on ten diverse proteomic datasets, we applied a saliency mapping technique to interpret the underlying retention mechanisms. Our model consistently outperformed existing retention-time predictors across multiple datasets, while the saliency mapping revealed insights into peptide-stationary phase interactions, highlighting the effects of neighboring amino acids, post-translational modifications (PTMs), chromatographic columns, and mobile phase additives on peptide retention.

15

TomAP-MS: an improved tomato lectin affinity purification-based mass spectrometry workflow enabling ultra-deep plasma proteomics

Okuda, Y.; Mitsui, H.; Konno, R.; Nakajima, D.; Ueyama, N.; Ohara, O.; Kawashima, Y.

2026-04-30 biochemistry 10.64898/2026.04.28.721243 medRxiv

Top 0.1%

6.9%

Plasma proteomics is increasingly important for biomarker discovery and disease stratification; however, comprehensive and high-throughput analysis remains challenging because of the extreme dynamic range of plasma proteins. We previously established tomato lectin affinity purification-based mass spectrometry (TomAP-MS), a workflow that enhances plasma proteome coverage via tomato lectin-mediated enrichment. The initial workflow depended on a 4% sodium dodecyl sulfate (SDS) elution, followed by SP3-based purification and digestion, which increased complexity and restricted throughput. In this study, we developed an improved TomAP-MS workflow incorporating lauryl maltose neopentyl glycol (LMNG)-assisted acid elution (LAcE), in which proteins are eluted under acidic conditions in the presence of LMNG. This process is followed by pH adjustment and direct tryptic digestion without SP3 cleanup. Compared with conventional acid elution and the original SDS/SP3 workflow, LAcE increased protein identifications while simplifying sample preparation and improving throughput. Using the optimized workflow, we identified more than 7,500 proteins from human plasma and demonstrated broader applicability in extracellular vesicle enrichment and protein interaction analysis workflows. We demonstrated that ethylenediaminetetraacetic acid plasma was the preferred specimen type, enabling the identification of over 5,000 proteins from just 1 µL of plasma, with minimal impact on proteomic profiles after up to three freeze-thaw cycles. Additionally, the analysis of plasma from 200 healthy individuals reproducibly detected 4,117 proteins across all samples, including many proteins associated with inherited disorders. These findings establish TomAP-MS with LAcE as a practical platform for deep plasma proteomics, supporting its future application in proteomics-based screening and diagnostics.

16

PeptiDIA: A Machine Learning Framework for Enhanced Peptide Identification in Fast-Gradient Data-Independent Acquisition Proteomics

Ortona, J.; Leclercq, M.; Roux-Dalvai, F.; Routy, B.; Bonnet, S.; Droit, A.

2026-06-12 bioinformatics 10.64898/2026.06.10.731224 medRxiv

Top 0.1%

6.8%

Data-independent acquisition (DIA) mass spectrometry has become increasingly prevalent in proteomics as advances in instrumentation, chromatography, and computational analysis have enabled robust proteome identification across complex biological samples. However, the analytical depth achieved with fast chromatographic gradients remains lower than that obtained using long gradients, reflecting a throughput-depth trade-off. Here, we present PeptiDIA, a machine learning framework that enhances peptide identification in fast-gradient DIA data by leveraging paired fast- and long-gradient acquisitions from identical samples. PeptiDIA processes DIA-NN outputs generated at relaxed false discovery rate thresholds to obtain expanded candidate peptide pools and trains gradient-boosted decision tree models using long-gradient identifications as reference labels. The model integrates DIA-NN features with engineered peptide descriptors and applies isotonic regression to calibrate probabilities, enabling controlled peptide recovery relative to the long-gradient reference. Applied to human and murine datasets spanning six tissues acquired on an Orbitrap Exploris 480, PeptiDIA increased peptide identifications by 25-34% at a 1% target reference-discordance rate (RDR) and increased the number of protein groups containing at least one rescued peptide by 15-17%. Overall, PeptiDIA improves the identification depth of fast-gradient DIA-NN workflows without altering acquisition strategies. The framework is available as a web application and command-line tool at https://github.com/Jordano700/PeptiDIA.
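
The probability calibration step mentioned in this abstract can be illustrated with a minimal sketch of isotonic regression, written here as the pool-adjacent-violators algorithm in plain Python. This is not PeptiDIA's code: the function names, the toy scores, and the toy labels (1 = the candidate peptide was also identified in the long-gradient reference) are assumptions made purely for illustration.

```python
from bisect import bisect_right


def isotonic_fit(scores, labels):
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Returns the sorted scores and one calibrated probability per score.
    Assumes scores are distinct (ties are not pooled in this sketch).
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block holds [sum_of_labels, count]
    for _, label in pairs:
        blocks.append([float(label), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [score for score, _ in pairs], fitted


def calibrate(score, xs, fitted):
    """Look up the calibrated probability for a new raw model score."""
    i = bisect_right(xs, score) - 1
    return fitted[max(i, 0)]


# Hypothetical raw classifier scores and reference labels.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
in_reference = [0, 1, 0, 1, 1, 1]
xs, fitted = isotonic_fit(raw_scores, in_reference)
print(fitted)
print(calibrate(0.35, xs, fitted))
```

The fitted values never decrease as the raw score increases, which is what lets a threshold on the calibrated probability behave as a monotone cut-off on the raw score. How the framework converts calibrated probabilities into its reference-discordance rate is not described in the abstract and is not modelled here.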

17

Near-Zero Missed Cleavages with a High-Fidelity Recombinant Arg-C Zero for Mass Spectrometry-Based Proteomics

Hernandez-Rollan, C.; Elsborg, J. D.; Le Boiteux, E.; Lu, Y.; Patel, K.; Ahel, I.; Jensen, O. N.; Batth, T. S.; Olsen, J. V.

2026-05-28 biochemistry 10.64898/2026.05.28.728370 medRxiv

Top 0.1%

6.7%

Proteolytic digestion remains a critical step in bottom-up proteomics workflows, with enzyme specificity and efficiency directly impacting peptide identification and protein sequence coverage. Here, we present the comprehensive characterization of Arg-C Zero, a recombinant arginyl endopeptidase derived from Porphyromonas gingivalis that exhibits exceptional fidelity in cleaving specifically at the C-terminus of arginine residues. Unlike conventional serine proteases such as Trypsin, Arg-C Zero utilizes a histidine-cysteine catalytic dyad mechanism, achieving near-zero missed cleavage rates (>99% efficiency) under standard proteomics conditions. Through systematic evaluation using HeLa protein extracts, we demonstrate that Arg-C Zero maintains consistent performance across varying digestion times. The enzyme shows robust activity across a broad pH range and tolerates up to 4 M urea, making it ideally suited for a diverse range of proteomics sample preparation workflows. While Trypsin/LysC combinations remain superior for comprehensive proteome coverage, Arg-C Zero offers unique advantages for applications requiring high specificity and reproducible arginine-specific cleavage patterns, particularly for analysis of post-translational modifications (PTMs). Here, we demonstrate how Arg-C Zero aids comprehensive mapping of histone PTMs and, when used in low-pH workflows, helps preserve labile ADP-ribosylation sites, expanding the analytical capabilities of mass spectrometry for characterizing these challenging modifications. The enzyme's resistance to proline-adjacent cleavage sites and compatibility with standard mass spectrometry buffers position it as a valuable addition to the proteomics enzyme toolkit.
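
To make the notion of a missed cleavage concrete, the following sketch performs an in-silico digest that cuts after every arginine and then counts internal arginines in observed peptides. It is only an illustration: the example sequence is made up, the function names are hypothetical, and the rate definition used here (fraction of peptides with at least one internal arginine) is an assumption, since the abstract does not state how the authors compute their figure.

```python
def digest_after_arg(protein):
    """Split a protein sequence after every arginine (C-terminal cleavage)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue == "R":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without a final R
    return peptides


def missed_cleavages(peptide):
    """Count internal arginines; each one is a site the enzyme did not cut."""
    return peptide[:-1].count("R")


def missed_cleavage_rate(peptides):
    """Fraction of peptides carrying at least one missed cleavage (assumed definition)."""
    if not peptides:
        return 0.0
    missed = sum(1 for p in peptides if missed_cleavages(p) > 0)
    return missed / len(peptides)


# Made-up sequence; lysines (K) are left intact because only R is a cleavage site.
print(digest_after_arg("MKTAYRAKQRLLNEK"))
# Hypothetical list of identified peptides, one of which spans an uncut arginine.
print(missed_cleavage_rate(["MKTAYR", "AKQRLLNEK", "AKQR", "LLNEK"]))
```

Because lysine is not a cleavage site for an arginine-specific protease, the resulting peptides are on average longer than tryptic peptides, which is consistent with the abstract's note that Trypsin/LysC remains preferable for overall proteome coverage.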

18

Manchester Proteome Profiler: A User-Friendly Platform for Quantitative Proteomic Analysis

Cain, S. A.; Fatima, M.; Humphries, M.

2026-05-18 bioinformatics 10.64898/2026.05.14.725092 medRxiv

Top 0.1%

6.7%

Manchester Proteome Profiler (MPP) is an open-source R Shiny application that streamlines downstream analysis of quantitative proteomic data. Compatible with grouped protein intensity tables from MaxQuant, FragPipe, Proteome Discoverer and other custom layouts, MPP provides an integrated platform for filtering, normalisation, imputation, differential expression analysis and cluster analysis across user-chosen experimental conditions. MPP supports both single- and dual-dataset comparisons, incorporates SAINTexpress for affinity purification and proximity labelling experiments, and links clusters of significant proteins to downstream functional enrichment and interaction network analysis via Gene Ontology, BioGRID and STRING. Benchmarking with a KRAS proximity biotinylation dataset demonstrated the ability of MPP to identify reproducible clusters of differentially expressed proteins and reveal biologically meaningful patterns, including enrichment of solute carrier transporters and adhesion molecules. With interactive visualisations, customisable reports, and support for complex experimental designs, MPP offers a novel, versatile and user-friendly environment for proteomic data exploration and hypothesis generation.

19

Optimised haemoglobin depletion improves clinical proteomics from dried blood spots

Ging, H.; Maher, R. E.; Davies, E.; Brownridge, P.; Rao, A.; Salama, A. D.; Oni, L.; Eyers, C.; Chetwynd, A. J.

2026-06-13 biochemistry 10.64898/2026.06.13.731967 medRxiv

Top 0.1%

6.5%

Achieving equitable access to large sample cohorts for robust, high-throughput proteomic biomarker discovery remains a major barrier to widescale clinical implementation. Dried blood spots (DBS) offer a minimally invasive alternative to venous blood draws, enabling at-home microsampling (<50 µL) for centralised analysis, thus enhancing research participation. This approach is particularly relevant for under-represented groups, including children, the elderly, people from minority backgrounds and those with long-term health conditions such as chronic kidney disease (CKD), where disease fluctuations may occur outside the clinic, and vein preservation is critical. Proteomic analysis has demonstrated great utility in monitoring disease progression, and for biomarker/therapeutic target discovery. However, liquid chromatography-tandem mass spectrometry (LC-MS/MS) of whole blood is hindered by the wide dynamic range and the relatively high abundance of proteins such as haemoglobin, compromising biomarker discovery. Here, we establish an optimised workflow for protein extraction and haemoglobin depletion from microsamples obtained using DBS, enabling sensitive and high-throughput proteomic analysis. We demonstrate that haemoglobin depletion increases protein identifications by ~50%, mitigating ion suppression and dynamic range effects, enabling the identification of putative biomarkers from patients with stage 5 CKD on dialysis. We also evaluated a commercial cell-free DBS device, which yielded a sample more representative of plasma than traditional DBS and enabled greater depletion of haemoglobin than traditional DBS with haemoglobin depletion methods. Our findings offer a scalable approach for biomarker discovery, facilitating remote, longitudinal clinical studies.

20

Ground Truth-Based Evaluation of False Discovery Rate and Statistical Power in DIA Proteomics

Yarbro, J. M.; Huang, Y.; Pagala, V.; Fu, Y.; Wang, Z.; Wu, L.; Wang, X.; High, A. A.; Byrum, S.; Peng, J.; Yuan, Z.-F.

2026-06-02 bioinformatics 10.64898/2026.05.29.728747 medRxiv

Top 0.1%

6.2%

Data-independent acquisition (DIA) mass spectrometry enables rapid proteomic quantification, yet the reliability of statistical inference in DIA-based protein quantification remains incompletely understood. Here, we systematically evaluated missingness, false discovery rate (FDR), and statistical power, defined as true positive rate (i.e. sensitivity or recall), using technical replicates and a spike-in benchmark with known ground truth. Analysis of 18 HeLa replicates revealed persistent, abundance-dependent missingness. In the spike-in experiment with five replicates, human peptides were titrated against a stable yeast background, allowing fold changes (FCs) to be compared with expected values. Across comparisons with log2FCs ranging from 0.2 to 2.5, the nominal BH-FDR substantially underestimated the true FDR. For example, at a BH-FDR threshold of 0.05, the true FDR was ~0.2. Statistical power was ~40% for a log2FC of 0.2 and increased to nearly 100% for a log2FC of 2.5. Additional incorporation of FC thresholds improved the true FDR for large-FC comparisons, with slight loss of power, but markedly reduced sensitivity for small-FC comparisons. Together, these results indicate that nominal FDR does not necessarily reflect actual error rates in DIA proteomics and that DIA performance is influenced by protein abundance and expected fold changes. This study provides a framework for experimental design and data interpretation in DIA-based proteomic studies.
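
The comparison between nominal BH-FDR and true FDR described in this abstract can be sketched as follows, assuming a per-protein p-value and a ground-truth flag marking which proteins were truly changed (for example, the titrated species in a spike-in design). The p-values, flags, and function names below are invented for illustration and are not taken from the study.

```python
def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns indices of rejected tests."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])


def true_fdr_and_power(rejected, is_changed):
    """Empirical FDR and power from ground-truth labels (hypothetical input)."""
    true_pos = sum(1 for i in rejected if is_changed[i])
    false_pos = len(rejected) - true_pos
    n_changed = sum(is_changed)
    fdr = false_pos / len(rejected) if rejected else 0.0
    power = true_pos / n_changed if n_changed else 0.0
    return fdr, power


# Toy data: eight proteins, three of which are truly changed.
pvals = [0.001, 0.004, 0.012, 0.03, 0.2, 0.6, 0.8, 0.9]
changed = [True, True, False, True, False, False, False, False]
rejected = bh_reject(pvals, alpha=0.05)
fdr, power = true_fdr_and_power(rejected, changed)
print(sorted(rejected), round(fdr, 3), round(power, 3))
```

In this toy case three proteins pass the nominal 0.05 threshold, one of them is an unchanged background protein, and one truly changed protein is missed, so the empirical FDR (one in three) exceeds the nominal level while power is two in three. The numbers carry no relation to the study's results; they only show how the two quantities are computed once ground truth is available.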