An AI-driven pipeline for the discovery of hidden peptides in plant proteomes: the CLE family as a case study

Boschin, M.; Rota Negroni, M.; Francese, C.; Pavanello, A.; Sales, G.; Trainotti, L.

2026-02-03 plant biology
10.64898/2026.01.31.703007 bioRxiv

Plant proteomes contain evolutionarily conserved peptides with poorly conserved primary sequences, often hindering their identification and classification into families. Homology-based approaches and conventional annotation pipelines frequently fail to detect these family members, particularly in poorly characterized, but agronomically relevant plant species. CLE peptides (CLAVATA3/EMBRYO SURROUNDING REGION-related peptides) constitute a large and evolutionarily conserved family of plant signaling molecules, yet their characterization remains incomplete. Beyond a limited number of well-studied members, a substantial number of CLE peptides remain uncharacterized due to functional redundancy and the intrinsic features of CLE genes, which encode short pre-propeptides with only a small 12-residue conserved motif. Here, we present a novel framework leveraging state-of-the-art Protein Language Models (pLMs) to discover CLE peptides directly from 13 plant proteomes. By coupling sequence embeddings trained on large evolutionary datasets (ESM2 and ProtT5) with supervised machine learning, our dual-model approach captures deep semantic features of the CLE family that are missed by traditional alignment methods. The pipeline demonstrated robust generalization, achieving high classification accuracy (98.9-99.4%) on a held-out set of CLE peptides not used during training. Consequently, we identified a set of high-confidence, previously unannotated CLE candidates prioritized through a stringent consensus-based filtering strategy. This work demonstrates how AI-driven proteome analysis can overcome the limitations of homology-based methods and provides a scalable strategy for uncovering previously unidentified peptide-mediated signaling molecules across plant lineages.

Highlight: Leveraging Protein Language Models, our AI framework uncovers "hidden" signaling peptides missed by standard tools, revealing the elusive diversity of CLE regulators across plant proteomes.
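The abstract does not spell out how the consensus-based filtering works. As a minimal sketch of one plausible reading, assume each of the two embedding-based classifiers (one on ESM2 features, one on ProtT5 features) outputs a probability per protein, and a candidate is kept only when both exceed a threshold. The function name, the 0.9 threshold, the protein IDs, and the scores below are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def consensus_candidates(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    threshold: float = 0.9,  # assumed cutoff, not the authors' value
) -> List[str]:
    """Return IDs scored at or above `threshold` by BOTH models."""
    shared_ids = scores_a.keys() & scores_b.keys()
    return sorted(
        pid
        for pid in shared_ids
        if scores_a[pid] >= threshold and scores_b[pid] >= threshold
    )


# Hypothetical per-protein probabilities from the two classifiers.
esm2_scores = {"prot_001": 0.97, "prot_002": 0.95, "prot_003": 0.40}
prott5_scores = {"prot_001": 0.93, "prot_002": 0.55, "prot_003": 0.35}

kept = consensus_candidates(esm2_scores, prott5_scores)
print(kept)
```

Requiring agreement between both models trades recall for precision: a protein scored highly by only one classifier (here `prot_002`) is dropped rather than reported as a candidate.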

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Plant Communications | 35 | Top 0.1% | 17.9%
2 | Proceedings of the National Academy of Sciences | 2130 | Top 9% | 6.9%
3 | Cell Systems | 167 | Top 2% | 6.1%
4 | Nature Communications | 4913 | Top 31% | 6.1%
5 | Advanced Science | 249 | Top 3% | 6.1%
6 | Molecular & Cellular Proteomics | 158 | Top 0.5% | 6.1%
7 | eLife | 5422 | Top 21% | 4.2%
(50% of probability mass above)
8 | The Plant Journal | 197 | Top 2% | 3.5%
9 | Molecular Plant | 36 | Top 0.5% | 3.0%
10 | Plant Physiology | 217 | Top 1% | 3.0%
11 | Nature Plants | 84 | Top 0.8% | 2.5%
12 | Genomics, Proteomics & Bioinformatics | 171 | Top 3% | 2.3%
13 | PLOS Computational Biology | 1633 | Top 17% | 1.6%
14 | Genome Biology | 555 | Top 5% | 1.6%
15 | Nature Biotechnology | 147 | Top 5% | 1.4%
16 | Nature Machine Intelligence | 61 | Top 2% | 1.4%
17 | The Plant Cell | 141 | Top 1% | 1.4%
18 | Plant Biotechnology Journal | 56 | Top 0.9% | 1.2%
19 | PROTEOMICS | 35 | Top 0.5% | 1.2%
20 | Cell Reports | 1338 | Top 29% | 1.1%
21 | Molecular Systems Biology | 142 | Top 1% | 0.9%
22 | Computational and Structural Biotechnology Journal | 216 | Top 8% | 0.9%
23 | Communications Biology | 886 | Top 20% | 0.9%
24 | New Phytologist | 309 | Top 5% | 0.8%
25 | Cell | 370 | Top 16% | 0.8%
26 | Frontiers in Plant Science | 240 | Top 5% | 0.7%
27 | Genome Research | 409 | Top 4% | 0.7%
28 | Plant Phenomics | 17 | Top 0.3% | 0.7%
29 | Nucleic Acids Research | 1128 | Top 19% | 0.7%
30 | Science Advances | 1098 | Top 32% | 0.7%
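
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below uses the first eight values from the ranking; the function name is illustrative, not part of the site.

```python
from typing import Optional, Sequence


def entries_to_reach(probabilities: Sequence[float], target: float) -> Optional[int]:
    """Return how many leading entries are needed for the running sum to reach `target`."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= target:
            return count
    return None  # target never reached with the values given


# Predicted probabilities (in percent) for ranks 1-8, copied from the ranking.
top_probs = [17.9, 6.9, 6.1, 6.1, 6.1, 6.1, 4.2, 3.5]

n = entries_to_reach(top_probs, 50.0)
print(n)
```

The first six entries sum to about 49.2%, so the seventh (eLife, 4.2%) is the one that carries the cumulative total past 50%, to about 53.4%.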