An AI-driven pipeline for the discovery of hidden peptides in plant proteomes: the CLE family as a case study
Boschin, M.; Rota Negroni, M.; Francese, C.; Pavanello, A.; Sales, G.; Trainotti, L.
Plant proteomes contain evolutionarily conserved peptides with poorly conserved primary sequences, often hindering their identification and classification into families. Homology-based approaches and conventional annotation pipelines frequently fail to detect these family members, particularly in poorly characterized, but agronomically relevant plant species. CLE peptides (CLAVATA3/EMBRYO SURROUNDING REGION-related peptides) constitute a large and evolutionarily conserved family of plant signaling molecules, yet their characterization remains incomplete. Beyond a limited number of well-studied members, a substantial number of CLE peptides remain uncharacterized due to functional redundancy and the intrinsic features of CLE genes, which encode short pre-propeptides with only a small 12-residue conserved motif. Here, we present a novel framework leveraging state-of-the-art Protein Language Models (pLMs) to discover CLE peptides directly from 13 plant proteomes. By coupling sequence embeddings trained on large evolutionary datasets (ESM2 and ProtT5) with supervised machine learning, our dual-model approach captures deep semantic features of the CLE family that are missed by traditional alignment methods. The pipeline demonstrated robust generalization, achieving high classification accuracy (98.9-99.4%) on a held-out set of CLE peptides not used during training. Consequently, we identified a set of high-confidence, previously unannotated CLE candidates prioritized through a stringent consensus-based filtering strategy. This work demonstrates how AI-driven proteome analysis can overcome the limitations of homology-based methods and provides a scalable strategy for uncovering previously unidentified peptide-mediated signaling molecules across plant lineages.
Highlight: Leveraging Protein Language Models, our AI framework uncovers "hidden" signaling peptides missed by standard tools, revealing the elusive diversity of CLE regulators across plant proteomes.
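The consensus-based filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' actual rule, so everything here is an assumption: the function name `consensus_candidates`, the 0.9 probability threshold, the "both models must agree" criterion, and the protein identifiers are all hypothetical. The sketch only shows the general idea of keeping a candidate when two independent classifiers (for example, one trained on ESM2 embeddings and one on ProtT5 embeddings) both score it highly.

```python
from typing import Dict, List


def consensus_candidates(
    esm2_scores: Dict[str, float],
    prott5_scores: Dict[str, float],
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Return protein IDs that both classifiers score at or above `threshold`.

    Each dict maps a protein ID to a predicted probability of being a CLE
    peptide. IDs scored by only one model are ignored.
    """
    # Only proteins scored by both models can reach a consensus.
    shared_ids = esm2_scores.keys() & prott5_scores.keys()

    kept = [
        pid
        for pid in shared_ids
        if esm2_scores[pid] >= threshold and prott5_scores[pid] >= threshold
    ]

    # Rank by the weaker of the two scores (highest first), then by ID
    # so that the ordering is deterministic.
    return sorted(
        kept,
        key=lambda pid: (-min(esm2_scores[pid], prott5_scores[pid]), pid),
    )


# Hypothetical scores for three hypothetical proteins.
esm2 = {"prot_A": 0.95, "prot_B": 0.99, "prot_C": 0.50}
prott5 = {"prot_A": 0.97, "prot_B": 0.91, "prot_C": 0.99}
candidates = consensus_candidates(esm2, prott5)
```

Ranking by the weaker score is one conservative choice among several; averaging the two probabilities or requiring agreement at different per-model thresholds would be equally plausible variants.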
Matching journals
The top 7 journals account for 50% of the predicted probability mass.