Investigation of Protein Melting Temperature Prediction with Cross-Method Validation on Biophysical Data

Pailozian, K.; Kohout, P.; Damborsky, J.; Mazurenko, S.

2026-05-11 bioinformatics
10.64898/2026.05.07.723192 bioRxiv

Motivation: Protein melting temperature (Tm) prediction accelerates the discovery of thermostable enzymes, which are crucial for industrial biotechnology, where harsh reaction conditions are often required. Experimental determination of Tm remains labour-intensive and varies across techniques, motivating the development of in silico predictors. Mass-spectrometry datasets such as Meltome Atlas now enable large-scale Tm prediction with deep learning models, but model generalisation across diverse experimental datasets has not been systematically tested.

Results: We evaluated the generalisability of state-of-the-art deep learning approaches and explored ESM-based embeddings for Tm prediction. To this end, we assembled the ProMelt training dataset (45 441 proteins) and five independent biophysics-based validation datasets. Our analysis revealed substantial differences between proteomics- and biophysics-based Tm measurements, highlighting the challenge of cross-domain generalisation. Existing state-of-the-art predictors trained on large-scale proteomics datasets showed reduced performance on biophysics-based validation sets. Our fine-tuned embedding-based models, particularly LoRA-adapted ESM-2 (TmProt 1.0), outperformed state-of-the-art predictors in identifying thermostable proteins (Tm ≥ 60 °C) across heterogeneous datasets, achieving AUC scores of 0.75-0.77. We also demonstrated that the available models could be used efficiently in the sequence prioritisation task.

Availability: The TmProt web server is available at https://loschmidt.chemi.muni.cz/tmprot/. Source code and data are available at https://github.com/loschmidt/TmProt.
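The thermostability evaluation mentioned in the abstract (labelling proteins as thermostable when Tm ≥ 60 °C and scoring predictors by AUC) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `roc_auc` and all Tm values below are hypothetical and chosen only for illustration. The sketch assumes the predicted Tm is used directly as the ranking score.

```python
from typing import Sequence

THERMOSTABLE_CUTOFF_C = 60.0  # cutoff quoted in the abstract (Tm >= 60 °C)


def roc_auc(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical measured and predicted Tm values in °C (illustrative only).
measured_tm = [45.0, 52.0, 58.0, 61.0, 67.0, 75.0]
predicted_tm = [50.0, 48.0, 63.0, 59.0, 70.0, 72.0]

# Binary ground truth: is the measured Tm at or above the cutoff?
is_thermostable = [tm >= THERMOSTABLE_CUTOFF_C for tm in measured_tm]

# The predicted Tm serves as the ranking score for the thermostable class.
auc = roc_auc(is_thermostable, predicted_tm)
print(f"AUC = {auc:.3f}")
```

Because AUC depends only on how predictions rank thermostable proteins against the rest, it can stay informative even when absolute Tm values are shifted between measurement techniques.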

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set; Top 0.9%; 23.5%
2. Nature Communications: 4913 papers in training set; Top 2%; 23.5%
3. Molecular & Cellular Proteomics: 158 papers in training set; Top 0.3%; 8.8%
50% of probability mass above
4. Analytical Chemistry: 205 papers in training set; Top 0.4%; 6.7%
5. Journal of Proteome Research: 215 papers in training set; Top 0.5%; 5.1%
6. Nature Methods: 336 papers in training set; Top 2%; 4.5%
7. Nature Machine Intelligence: 61 papers in training set; Top 2%; 2.0%
8. Communications Chemistry: 39 papers in training set; Top 0.2%; 1.8%
9. Cell Systems: 167 papers in training set; Top 8%; 1.6%
10. Molecular Systems Biology: 142 papers in training set; Top 0.9%; 1.3%
11. Communications Biology: 886 papers in training set; Top 13%; 1.3%
12. Briefings in Bioinformatics: 326 papers in training set; Top 5%; 1.2%
13. PLOS Computational Biology: 1633 papers in training set; Top 20%; 1.2%
14. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 8%; 0.8%
15. Genome Biology: 555 papers in training set; Top 7%; 0.8%
16. Advanced Science: 249 papers in training set; Top 19%; 0.8%
17. Nature Biotechnology: 147 papers in training set; Top 8%; 0.8%
18. Cell Reports Methods: 141 papers in training set; Top 5%; 0.8%
19. Nucleic Acids Research: 1128 papers in training set; Top 19%; 0.7%
20. Protein Science: 221 papers in training set; Top 2%; 0.5%
21. GigaScience: 172 papers in training set; Top 4%; 0.5%
22. PROTEOMICS: 35 papers in training set; Top 1.0%; 0.5%
23. Scientific Reports: 3102 papers in training set; Top 79%; 0.5%