GeoEPred: A Multimodal Structure-Aware Geometric Deep Learning Framework for Gram-Negative Bacterial Secreted Effector Prediction with Sequence Semantics

Song, S.; Shi, H.; Wu, H.; Liu, D.; Lin, Y.; Mat Isa, N. A.; Zou, Q.; Wei, L.

2026-05-20 genomics
10.64898/2026.05.18.725929 bioRxiv
Accurate prediction of effector proteins secreted by Gram-negative bacteria is important for elucidating bacterial pathogenic mechanisms and developing precise anti-infective strategies. Although existing methods have benefited from the strong sequence feature extraction capacity of pretrained protein language models, reliance on linear sequence information alone often fails to fully capture the three-dimensional conformational signals required for virulence functions. Meanwhile, conventional structure-based methods are limited by the scarcity of experimentally resolved protein structures. To address these challenges, we propose GeoEPred, a multimodal deep learning framework designed for the synergistic modeling of protein sequence and structure to identify Gram-negative bacterial effector proteins. Specifically, the model integrates sequence-contextual embeddings from a pretrained protein language model with three-dimensional structural representations predicted by ESMFold. A feature projection network refines fine-grained sequence signals associated with effector functions, while geometric vector perceptrons characterize inter-residue orientations, distances, and local spatial topology to capture potential structural conformational motifs. To further enable effective cross-modal fusion, we design a cross-modal alignment and feature-tokenized self-attention module. This module enhances consistency between the sequence-semantic and structural-geometric spaces through contrastive learning and models associations between linear functional motifs and spatial conformational patterns at a fine-grained token level. Extensive evaluations on multiple benchmark datasets show that GeoEPred achieves better predictive performance than existing leading models in T3SE, T4SE, and T6SE prediction tasks, while maintaining stable performance in remote homolog recognition scenarios. Moreover, the modular and extensible architecture of GeoEPred demonstrates strong generalization ability and substantial application potential for genome-scale effector protein discovery.

Author summary

Secreted effector proteins are central virulence factors used by many Gram-negative bacterial pathogens to execute infection strategies. Their functions are governed not only by secretion signals and short linear motifs in the amino acid sequence, but also by three-dimensional folds, local domains, and surface geometric patterns. However, current predictors mainly exploit sequence-contextual features, limiting their ability to model the correspondence between linear sequence signals and spatial conformational motifs, and thereby constraining accuracy and interpretability. Here, we present GeoEPred, a multimodal deep learning framework for secreted effector protein identification. GeoEPred couples sequence-semantic embeddings from a pretrained protein language model with structural representations learned by geometric vector perceptrons. A cross-modal alignment and interaction module uses contrastive learning to improve functional consistency between sequence and structure modalities, while feature-token attention captures fine-grained links between key linear and conformational motifs. Across benchmark datasets covering multiple effector types, GeoEPred outperforms existing state-of-the-art methods and provides interpretable evidence from sequence fragments, structural regions, and cross-modal associations, supporting functional annotation, pathogenic mechanism analysis, and experimental validation.
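
The contrastive alignment between sequence and structure embeddings mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which loss GeoEPred uses, so the symmetric InfoNCE objective below is an assumption, as are the function names, the temperature value, and the toy three-dimensional embeddings; it is meant only to show the general idea of pulling matched sequence/structure pairs together and pushing mismatched pairs apart.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax_at(scores, target):
    """Log-probability of index `target` under a softmax over `scores`."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z


def symmetric_info_nce(seq_embs, struct_embs, temperature=0.1):
    """Hypothetical alignment loss: protein i's sequence embedding should
    match protein i's structure embedding, and vice versa."""
    seq = [l2_normalize(v) for v in seq_embs]
    struct = [l2_normalize(v) for v in struct_embs]
    n = len(seq)
    # Cosine similarity between every sequence/structure pair, scaled.
    logits = [
        [sum(a * b for a, b in zip(seq[i], struct[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Sequence -> structure direction (softmax over each row).
    seq_to_struct = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Structure -> sequence direction (softmax over each column).
    struct_to_seq = -sum(
        log_softmax_at([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (seq_to_struct + struct_to_seq)


# Toy embeddings (illustrative only, not model outputs).
seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = symmetric_info_nce(seq, seq)
shuffled = symmetric_info_nce(seq, [seq[1], seq[2], seq[0]])
```

In this toy case the loss is near zero when each sequence embedding is paired with its own structure embedding and much larger when the pairs are shuffled, which is the behavior a contrastive alignment term is expected to reward.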

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature Machine Intelligence, 61 papers in training set, Top 0.1%, 23.1%
2. Proceedings of the National Academy of Sciences, 2130 papers in training set, Top 7%, 9.4%
3. Nature Communications, 4913 papers in training set, Top 21%, 9.4%
4. Nature Biotechnology, 147 papers in training set, Top 2%, 4.1%
5. Cell Genomics, 162 papers in training set, Top 1%, 3.7%
6. PLOS Computational Biology, 1633 papers in training set, Top 9%, 3.7%
50% of probability mass above
7. Nucleic Acids Research, 1128 papers in training set, Top 7%, 3.1%
8. Bioinformatics, 1061 papers in training set, Top 6%, 3.1%
9. Nature Methods, 336 papers in training set, Top 3%, 3.1%
10. Cell Systems, 167 papers in training set, Top 6%, 2.1%
11. Science, 429 papers in training set, Top 12%, 2.1%
12. Genome Biology, 555 papers in training set, Top 4%, 1.9%
13. Genome Research, 409 papers in training set, Top 2%, 1.9%
14. Advanced Science, 249 papers in training set, Top 9%, 1.9%
15. Genome Medicine, 154 papers in training set, Top 4%, 1.8%
16. Patterns, 70 papers in training set, Top 0.8%, 1.7%
17. Briefings in Bioinformatics, 326 papers in training set, Top 4%, 1.7%
18. Frontiers in Immunology, 586 papers in training set, Top 5%, 1.4%
19. Nature Computational Science, 50 papers in training set, Top 1.0%, 1.3%
20. Computational and Structural Biotechnology Journal, 216 papers in training set, Top 6%, 1.3%
21. Cell Reports Methods, 141 papers in training set, Top 4%, 1.0%
22. Communications Biology, 886 papers in training set, Top 16%, 1.0%
23. Cell Reports, 1338 papers in training set, Top 30%, 0.9%
24. iScience, 1063 papers in training set, Top 25%, 0.9%
25. Frontiers in Genetics, 197 papers in training set, Top 8%, 0.9%
26. Scientific Reports, 3102 papers in training set, Top 74%, 0.8%
27. Nature, 575 papers in training set, Top 15%, 0.8%
28. eLife, 5422 papers in training set, Top 59%, 0.7%
29. Science Advances, 1098 papers in training set, Top 31%, 0.7%
30. Cell Reports Medicine, 140 papers in training set, Top 9%, 0.7%
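
The 50% cutoff in the ranking can be checked with a short cumulative sum. The sketch below uses the listed probabilities for the first eight journals; the function name and the threshold parameter are illustrative choices, not part of the listing.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at the first rank whose running
    total of probabilities (in percent) meets the threshold, else None."""
    total = 0.0
    for rank, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None


# Probabilities (percent) for ranks 1-8, copied from the ranking.
top_probs = [23.1, 9.4, 9.4, 4.1, 3.7, 3.7, 3.1, 3.1]
rank, mass = journals_to_reach(top_probs)
print(rank, round(mass, 1))
```

The running total after five journals is 49.7%, just under the threshold, so the sixth journal is the first to push the cumulative mass past 50%.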