
DualLoc: Full-parameter fine-tuning of cascaded dual transformers for protein subcellular localization prediction

Chen, Y. G.; Chung, W.-Y.; Chang, K. Y.

2026-03-30 bioinformatics
10.64898/2026.03.27.714699 bioRxiv

Accurate protein subcellular localization is essential for biological function, and mislocalization is linked to numerous diseases. While current methods like DeepLoc 2.0 employ lightweight fine-tuning of protein language models (PLMs), their ability to predict multi-compartment localization remains limited. To address this, we introduce DualLoc, a multi-label localization predictor for ten compartments. DualLoc leverages full-parameter fine-tuning of a cascaded dual-transformer architecture, built upon foundational PLMs and augmented with attention and dropout layers. We evaluated this framework using three foundational PLMs (ProtBERT, ESM-2, and ProtT5) as backbones. Cross-validation on Swiss-Prot and independent validation on the Human Protein Atlas demonstrate consistent superiority over state-of-the-art baselines. The best-performing variant, DualLoc-ProtT5, achieves 0.5872 accuracy, 0.8271 micro-F1, and 0.7811 macro-F1, with substantial gains in the Matthews correlation coefficient for the nucleus (+0.13), cell membrane (+0.13), and extracellular space (+0.07). Pointwise mutual information (PMI) analysis of model outputs reveals biologically relevant compartment couplings, notably between the Golgi apparatus and endoplasmic reticulum (PMI = 0.25, P < 10^-6), accurately reflecting secretory pathway coordination. DualLoc provides both a highly accurate predictive tool and a robust framework for investigating protein multi-localization mechanisms.

Author summary

Where a protein resides within a cell determines what it does. When proteins end up in the wrong location, normal cellular function breaks down, and such misplacement is linked to diseases like cancer and Alzheimer's. While computational tools exist to predict these locations, accurately tracking proteins that multitask across multiple cellular compartments simultaneously remains a major challenge.

We developed DualLoc, a new approach that predicts protein locations across ten different cellular compartments, from the nucleus to the cell membrane. By training an advanced artificial intelligence model on large protein sequence databases, our method more accurately identifies where proteins go, especially in complex, multi-location scenarios. Importantly, our analysis revealed meaningful biological patterns. We found strong predictive links between compartments that work closely together, such as the Golgi apparatus and the endoplasmic reticulum, two organelles that coordinate protein processing and transport. This suggests our model captures genuine cellular logic rather than simply memorizing data. By improving how we predict protein localization, DualLoc helps researchers better understand normal cellular function and disease mechanisms. Our method is freely available to the biomedical community.
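
The multi-label metrics and the PMI analysis named in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The toy label matrices are invented, "accuracy" is assumed to mean exact-match (subset) accuracy, and the PMI uses a base-2 logarithm, which the abstract does not specify. All function names are hypothetical, and the significance test behind the reported P value is not reproduced.

```python
import math

# Toy multi-label data (invented for illustration): rows are proteins,
# columns are compartments in the order given by LABELS.
LABELS = ["Nucleus", "Golgi apparatus", "Endoplasmic reticulum"]
y_true = [[1, 0, 0], [0, 1, 1], [0, 1, 1], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [0, 1, 1], [1, 0, 0]]


def subset_accuracy(true, pred):
    """Fraction of proteins whose full label set is predicted exactly (assumed definition)."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def label_counts(true, pred, j):
    """Return (tp, fp, fn) for label column j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(true, pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(true, pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(true, pred))
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 score from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(true, pred):
    """Pool counts over all labels, then compute a single F1."""
    counts = [label_counts(true, pred, j) for j in range(len(true[0]))]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1(tp, fp, fn)


def macro_f1(true, pred):
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(true[0])
    return sum(f1(*label_counts(true, pred, j)) for j in range(n_labels)) / n_labels


def pmi(pred, a, b):
    """Pointwise mutual information (log base 2, an assumption) between predicted labels a and b."""
    n = len(pred)
    p_a = sum(row[a] for row in pred) / n
    p_b = sum(row[b] for row in pred) / n
    p_ab = sum(row[a] and row[b] for row in pred) / n
    if p_ab == 0:
        return float("-inf")  # the two labels are never predicted together
    return math.log2(p_ab / (p_a * p_b))


print("subset accuracy:", subset_accuracy(y_true, y_pred))
print("micro-F1:", round(micro_f1(y_true, y_pred), 4))
print("macro-F1:", round(macro_f1(y_true, y_pred), 4))
print("PMI(Golgi, ER):", pmi(y_pred, 1, 2))
```

A positive PMI means two compartments are predicted together more often than their individual frequencies would suggest under independence. In the toy data the Golgi apparatus and endoplasmic reticulum always co-occur, which is why their PMI is positive.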

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set, Top 1%, 22.3%
2. Bioinformatics Advances: 184 papers in training set, Top 0.1%, 12.2%
3. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 8%, 8.3%
4. PLOS Computational Biology: 1633 papers in training set, Top 4%, 8.1%

50% of probability mass above

5. Cell Systems: 167 papers in training set, Top 3%, 4.3%
6. BMC Bioinformatics: 383 papers in training set, Top 2%, 4.3%
7. Briefings in Bioinformatics: 326 papers in training set, Top 2%, 3.6%
8. Scientific Reports: 3102 papers in training set, Top 46%, 2.6%
9. Nature Communications: 4913 papers in training set, Top 47%, 2.1%
10. Nature Machine Intelligence: 61 papers in training set, Top 2%, 1.9%
11. Patterns: 70 papers in training set, Top 0.9%, 1.7%
12. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 5%, 1.7%
13. PLOS ONE: 4510 papers in training set, Top 57%, 1.5%
14. iScience: 1063 papers in training set, Top 22%, 1.2%
15. Journal of Proteome Research: 215 papers in training set, Top 2%, 1.1%
16. npj Systems Biology and Applications: 99 papers in training set, Top 2%, 0.9%
17. Frontiers in Genetics: 197 papers in training set, Top 8%, 0.9%
18. Advanced Science: 249 papers in training set, Top 16%, 0.9%
19. Genome Biology: 555 papers in training set, Top 6%, 0.9%
20. Nature Methods: 336 papers in training set, Top 6%, 0.9%
21. Communications Biology: 886 papers in training set, Top 22%, 0.8%
22. GigaScience: 172 papers in training set, Top 3%, 0.8%
23. NAR Genomics and Bioinformatics: 214 papers in training set, Top 3%, 0.8%
24. BioData Mining: 15 papers in training set, Top 1%, 0.6%
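
The 50% cutoff in the list above can be checked by accumulating the listed probabilities in rank order. The sketch below assumes the final percentage on each entry is that journal's predicted probability. The function name `journals_to_reach` is hypothetical, and only the ten highest-ranked values are copied in.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-10, copied from the list above.
probs = [22.3, 12.2, 8.3, 8.1, 4.3, 4.3, 3.6, 2.6, 2.1, 1.9]


def journals_to_reach(probs, threshold):
    """Return the smallest number of top-ranked journals whose summed probability meets the threshold."""
    for k, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return k
    return None  # threshold not reached with the values given


print(journals_to_reach(probs, 50.0))  # prints 4
```

The first three journals sum to about 42.8%, and adding the fourth brings the total to about 50.9%, which matches the statement that the top 4 journals account for 50% of the probability mass.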