
muat: portable transformer-based method for tumour classification and representation learning from somatic variants

Sanjaya, P.; Pitkänen, E.

2026-04-03 bioinformatics
10.64898/2026.04.01.715762 bioRxiv

Motivation: Deep neural networks have proven effective in classifying tumour types using next-generation sequencing data. However, developing transferable models that work across heterogeneous operating environments remains challenging due to differences in cohort compositions and data generation protocols, privacy concerns, and limited computational capabilities.

Results: We introduce muat, a transformer-based software package for tumour classification using somatic variant data from whole-genome (WGS) and whole-exome sequencing (WES). Building on the previously developed MuAt and MuAt2 models, we distribute the software via Docker containers and Bioconda for deployment in high-performance computing (HPC) systems and Secure Processing Environments (SPEs). Using a downloadable MuAt checkpoint, we reproduce the performance reported in the original study on whole-genome (PCAWG; 89% accuracy in histological tumour typing) and whole-exome sequencing data (TCGA; 64% accuracy). Cross-cohort evaluation in the Genomics England SPE achieved 81% accuracy without retraining and 89% following fine-tuning. As a demonstration of the software's adaptability, we also deployed muat within the iCAN Digital Precision Cancer Medicine Flagship's SPE and integrated it into a Nextflow-managed workflow.

Availability and implementation: muat is available through conda (www.anaconda.org/bioconda/muat) and GitHub (https://github.com/primasanjaya/muat), under the Apache 2.0 License.

Contact: prima.sanjaya@helsinki.fi, esa.pitkanen@helsinki.fi; website: mlbiomed.net

Matching journals

The top 3 journals together account for at least 50% of the predicted probability mass (55.0% combined).

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Bioinformatics | 1061 | Top 0.4% | 40.1% |
| 2 | GigaScience | 172 | Top 0.1% | 8.4% |
| 3 | BMC Bioinformatics | 383 | Top 2% | 6.5% |
| 4 | Nature Communications | 4913 | Top 32% | 4.9% |
| 5 | Patterns | 70 | Top 0.2% | 3.7% |
| 6 | PLOS Computational Biology | 1633 | Top 12% | 2.6% |
| 7 | JCO Clinical Cancer Informatics | 18 | Top 0.4% | 1.9% |
| 8 | Bioinformatics Advances | 184 | Top 2% | 1.9% |
| 9 | Scientific Reports | 3102 | Top 55% | 1.8% |
| 10 | Genome Biology | 555 | Top 4% | 1.7% |
| 11 | Genome Medicine | 154 | Top 4% | 1.7% |
| 12 | Genomics, Proteomics & Bioinformatics | 171 | Top 3% | 1.7% |
| 13 | Briefings in Bioinformatics | 326 | Top 4% | 1.5% |
| 14 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.4% |
| 15 | Nature Methods | 336 | Top 5% | 1.2% |
| 16 | Genome Research | 409 | Top 3% | 1.1% |
| 17 | Nature Biotechnology | 147 | Top 6% | 0.9% |
| 18 | Nucleic Acids Research | 1128 | Top 15% | 0.9% |
| 19 | iScience | 1063 | Top 26% | 0.9% |
| 20 | PLOS ONE | 4510 | Top 67% | 0.8% |
| 21 | Computational and Structural Biotechnology Journal | 216 | Top 10% | 0.7% |
| 22 | Communications Medicine | 85 | Top 1% | 0.7% |
| 23 | Clinical and Translational Science | 21 | Top 1% | 0.7% |
| 24 | NAR Cancer | 36 | Top 0.2% | 0.7% |
| 25 | Scientific Data | 174 | Top 3% | 0.7% |
| 26 | Nature Machine Intelligence | 61 | Top 4% | 0.7% |
| 27 | Communications Biology | 886 | Top 28% | 0.7% |
| 28 | npj Systems Biology and Applications | 99 | Top 3% | 0.5% |