
DoFormer: Causal Transformer for Gene Perturbation

Karbalayghareh, A.; Paull, E.; Califano, A.

2026-05-04 bioinformatics
10.64898/2026.05.02.722054 bioRxiv

Learning causal gene regulatory mechanisms from single-cell data, and thereby predicting the effects of unseen perturbations, remains challenging. Observational RNA-seq data alone is insufficient for causal modeling; perturbational data is essential. Classical causal inference methods often rely on unrealistic directed acyclic graph (DAG) assumptions and are not well suited to integrating multimodal data. Current transcriptomic foundation models typically treat observational and perturbational data identically, which limits their ability to model perturbations. We present DoFormer, a causal multimodal Transformer that makes no DAG assumptions and leverages rich perturbational data to accurately predict previously unseen perturbations. DoFormer enables principled in silico perturbations by adapting the causal do-operator within the attention mechanism: the perturbed gene is set to the intervention value and prevented from attending to other genes, allowing the model to fully distinguish observational from interventional regimes. We train DoFormer using biologically informed loss functions and evaluate it with comprehensive perturbation prediction metrics. DoFormer substantially improves perturbation prediction relative to baseline and prior foundation models, underscoring the importance of intervention-aware architectures and biologically grounded objectives for causal modeling in single-cell genomics.
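The attention-level do-operator described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the single-head NumPy layout, the function name `do_attention`, and the idea of passing the intervention value as a ready-made embedding vector (`value_embedding`) are all assumptions made for illustration. The sketch only shows the two stated ingredients: the perturbed gene's token is clamped to the intervention value, and its attention row is masked so that it cannot attend to other genes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def do_attention(X, Wq, Wk, Wv, perturbed, value_embedding):
    """Single-head self-attention with a do-style intervention (hypothetical sketch).

    X               : (n_genes, d) gene token embeddings
    Wq, Wk, Wv      : (d, d) projection matrices
    perturbed       : index of the intervened gene
    value_embedding : (d,) embedding of the intervention value (assumed given)
    """
    X = X.copy()
    # Step 1: set the perturbed gene to its intervention value.
    X[perturbed] = value_embedding

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])

    # Step 2: block the perturbed gene's query from every other gene,
    # leaving only self-attention for that row. Other genes may still
    # attend to the perturbed gene, so its effect propagates downstream.
    mask = np.zeros(scores.shape, dtype=bool)
    mask[perturbed, :] = True
    mask[perturbed, perturbed] = False
    scores = np.where(mask, -np.inf, scores)

    A = softmax(scores, axis=-1)
    return A @ V, A
```

Under these assumptions, the perturbed gene's output depends only on the intervention value, no matter how the other genes' embeddings change, while every other gene's output can still depend on the perturbed gene. This mirrors how an intervention cuts the incoming edges of the intervened variable in a causal graph.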

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                           Papers in training set  Percentile  Probability
----  ------------------------------------------------  ----------------------  ----------  -----------
1     Cell Systems                                      167                     Top 0.3%    21.8%
2     Nature Communications                             4913                    Top 16%     12.0%
3     Nature Methods                                    336                     Top 1%      9.8%
4     Genome Research                                   409                     Top 0.3%    6.6%
      (50% of probability mass above)
5     Genome Biology                                    555                     Top 2%      4.7%
6     Nature Biotechnology                              147                     Top 2%      4.7%
7     Bioinformatics                                    1061                    Top 5%      4.7%
8     Science                                           429                     Top 10%     3.5%
9     Nature Machine Intelligence                       61                      Top 1%      3.5%
10    Nature                                            575                     Top 9%      2.3%
11    Proceedings of the National Academy of Sciences   2130                    Top 27%     2.3%
12    The American Journal of Human Genetics            206                     Top 2%      2.3%
13    PLOS Computational Biology                        1633                    Top 13%     2.3%
14    Nature Genetics                                   240                     Top 4%      1.7%
15    PLOS ONE                                          4510                    Top 55%     1.6%
16    Nucleic Acids Research                            1128                    Top 12%     1.4%
17    Nature Computational Science                      50                      Top 1%      0.9%
18    Genome Medicine                                   154                     Top 7%      0.9%
19    Briefings in Bioinformatics                       326                     Top 7%      0.7%
20    Development                                       440                     Top 4%      0.7%
21    BMC Bioinformatics                                383                     Top 7%      0.7%
22    Scientific Reports                                3102                    Top 79%     0.6%
23    Bioinformatics Advances                           184                     Top 5%      0.6%