CliPepPI: Scalable prediction of domain-peptide specificity using contrastive learning

Hochner-Vilk, T.; Stein, D.; Schueler-Furman, O.; Raveh, B.; Chook, Y. M.; Schneidman-Duhovny, D.

2026-03-20 bioinformatics
10.64898/2026.03.18.712595 bioRxiv

Domain-peptide interactions mediate a significant fraction of cellular protein networks, yet accurately predicting their specificity remains challenging. Peptide motifs typically have short, fuzzy sequence profiles, and their interactions are often weak and transient, limiting the size, coverage, and quality of experimentally validated domain-peptide datasets. Since true non-binders are rarely known, constructing negative examples often introduces bias. While structure-based prediction methods can achieve high accuracy, they are computationally demanding and difficult to scale to the proteome level. We introduce CLIPepPI, a dual-encoder model that leverages contrastive learning to embed domains and peptides into a shared space directly from sequence. Both encoders are initialized from a protein language model (ESM-C) and fine-tuned using lightweight LoRA adapters, enabling parameter-efficient training on positive pairs alone. To overcome data scarcity, we augment ~3K protein-peptide complexes from PPI3D with ~150K domain-peptide pairs derived from protein-protein interfaces. CLIPepPI further injects structural information by marking interface residues in the domain sequence, thus guiding the encoders toward binding regions and linking sequence-level learning with structural context. Competitive performance is achieved across three independent benchmarks: domain-peptide complexes from PPI3D, large-scale phage-library data from ProP-PD, and a curated dataset of nuclear export signal (NES) sequences. We demonstrate scalability and generalization through two applications: (i) proteome-wide NES scanning, and (ii) variant-effect prediction, where score changes in domain-peptide interactions between wild-type and mutant sequences discriminate pathogenic from benign variants. Together, CLIPepPI offers a scalable, structure-informed model for predicting domain-peptide specificity and generating meaningful embeddings suited for large-scale proteomic analyses.
CLIPepPI is available at: https://bio3d.cs.huji.ac.il/webserver/clipeppi/.
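
To make the "positive pairs alone" idea concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of domain and peptide embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so the symmetric InfoNCE form, the `temperature` value, and the function names `symmetric_contrastive_loss` and `interaction_score` are all hypothetical. The embeddings are plain NumPy arrays standing in for encoder outputs; no ESM-C or LoRA code is involved.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(domain_emb, peptide_emb, temperature=0.07):
    """Assumed CLIP-style loss: row i of each array is a known positive pair.

    All other pairings in the batch act as implicit negatives, so no
    explicit non-binder examples are needed. The temperature is a
    hypothetical default, not a value from the source.
    """
    d = l2_normalize(domain_emb)
    p = l2_normalize(peptide_emb)
    logits = d @ p.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Domain-to-peptide direction: each row should pick its own column.
    loss_d2p = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Peptide-to-domain direction: each column should pick its own row.
    loss_p2d = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_d2p + loss_p2d)

def interaction_score(domain_vec, peptide_vec):
    """Cosine similarity between one domain and one peptide embedding."""
    d = l2_normalize(domain_vec[None, :])
    p = l2_normalize(peptide_vec[None, :])
    return float((d @ p.T)[0, 0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    domains = rng.normal(size=(8, 16))
    peptides = domains + 0.1 * rng.normal(size=(8, 16))  # toy "true" partners
    print("loss on matched pairs:", symmetric_contrastive_loss(domains, peptides))
    # Variant-effect style comparison: score change for a perturbed peptide.
    wild_type = interaction_score(domains[0], peptides[0])
    mutant = interaction_score(domains[0], rng.normal(size=16))
    print("score change (mutant - wild type):", mutant - wild_type)
```

Under this assumed setup, the score difference between a wild-type and a mutant peptide embedding plays the role of the variant-effect signal described in the abstract; how the actual model computes and calibrates that score is not stated in the source.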

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Nature Methods | 336 | Top 0.3% | 18.6% |
| 2 | Cell Systems | 167 | Top 0.8% | 12.3% |
| 3 | Nature Communications | 4913 | Top 16% | 10.4% |
| 4 | Nature Machine Intelligence | 61 | Top 0.2% | 10.1% |
| | 50% of probability mass above | | | |
| 5 | Bioinformatics | 1061 | Top 3% | 8.4% |
| 6 | Nature Biotechnology | 147 | Top 1% | 6.8% |
| 7 | Proceedings of the National Academy of Sciences | 2130 | Top 25% | 2.6% |
| 8 | Advanced Science | 249 | Top 8% | 2.4% |
| 9 | Genome Biology | 555 | Top 3% | 2.1% |
| 10 | Nucleic Acids Research | 1128 | Top 9% | 2.1% |
| 11 | Genome Research | 409 | Top 2% | 1.8% |
| 12 | Briefings in Bioinformatics | 326 | Top 4% | 1.5% |
| 13 | Patterns | 70 | Top 1% | 1.3% |
| 14 | PLOS Computational Biology | 1633 | Top 21% | 0.9% |
| 15 | NAR Genomics and Bioinformatics | 214 | Top 3% | 0.9% |
| 16 | Molecular & Cellular Proteomics | 158 | Top 2% | 0.9% |
| 17 | Nature | 575 | Top 14% | 0.9% |
| 18 | Bioinformatics Advances | 184 | Top 4% | 0.9% |
| 19 | Cell Reports Methods | 141 | Top 5% | 0.7% |
| 20 | Cell Genomics | 162 | Top 7% | 0.7% |
| 21 | Genome Medicine | 154 | Top 9% | 0.6% |
| 22 | Molecular Systems Biology | 142 | Top 2% | 0.6% |
| 23 | Science | 429 | Top 21% | 0.6% |