ATLAS: Population-Level Disease Locus Discovery via Differential Attention in Genomic Language Models

Liu, Y.; Deng, K.; Ye, Y.; Zhan, J.; Wang, Z.; Chen, S.; Hu, X.; Chang, A.; Li, Z.; Jin, X.; Liu, S.; Chen, K.; Shen, H.; Qi, X.; Xu, X.; Zhang, H.

2026-02-10 genetics
10.64898/2026.02.09.704696 bioRxiv

Identifying disease-associated genetic variants remains a key challenge in genomics, especially in small cohorts or for rare and complex mutation types where genome-wide association studies (GWAS) often fall short. We introduce ATLAS, a population-level framework that leverages attention signals from pretrained genomic language models (gLMs) to detect disease-associated genes and loci directly from raw sequences, without requiring explicit variant calls or supervised training. ATLAS first performs gene-level differential attention analysis to prioritize candidate genes, followed by base-level analysis to localize disease-associated regions at single-haplotype resolution. We validate ATLAS on synthetic and β-thalassemia datasets, demonstrating robust performance across diverse allele frequencies (down to 10%), cohort sizes (below 200 individuals per group), and genomic scales. Compared to GWAS, ATLAS achieves higher recall of known loci and captures haplotype-specific signals missed by traditional methods. Cross-model benchmarking further shows that precise localization depends on both model size and pretraining on diverse human genomes. In summary, ATLAS offers a scalable, sequence-native alternative to traditional statistical genetics.
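
To make the idea of a gene-level differential attention comparison concrete, the following is a minimal sketch only, not the authors' implementation. It assumes each individual has already been reduced to one aggregated attention score per gene (a hypothetical preprocessing step), and it ranks genes by Welch's t-statistic between cases and controls. The choice of test statistic, the function names, and the toy numbers are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import statistics


def welch_t(cases, controls):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    mean_diff = statistics.mean(cases) - statistics.mean(controls)
    se = math.sqrt(
        statistics.variance(cases) / len(cases)
        + statistics.variance(controls) / len(controls)
    )
    if se == 0:
        # Degenerate case: no within-group spread at all.
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / se


def rank_genes(case_scores, control_scores):
    """Rank genes by |t| of per-individual attention scores, largest first.

    Both arguments map a gene name to a list holding one aggregated
    attention score per individual (a hypothetical input format).
    """
    stats = {g: welch_t(case_scores[g], control_scores[g]) for g in case_scores}
    return sorted(stats.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy, made-up numbers purely for illustration (not data from the paper).
case_scores = {
    "GENE_A": [0.82, 0.79, 0.85, 0.81],
    "GENE_B": [0.40, 0.42, 0.39, 0.41],
}
control_scores = {
    "GENE_A": [0.50, 0.52, 0.48, 0.51],
    "GENE_B": [0.41, 0.40, 0.42, 0.39],
}

ranking = rank_genes(case_scores, control_scores)
for gene, t in ranking:
    print(gene, round(t, 2))
```

In this toy input, GENE_A separates cases from controls and is ranked first, while GENE_B shows no group difference. A real analysis would also need multiple-testing correction and a principled way to aggregate attention across layers, heads, and positions, none of which is specified here.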

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Nature Genetics (240 papers in training set): Top 0.1%, 27.8%
2. The American Journal of Human Genetics (206 papers in training set): Top 0.3%, 14.4%
3. Nature Communications (4913 papers in training set): Top 23%, 8.3%
50% of probability mass above
4. Nature Biotechnology (147 papers in training set): Top 2%, 4.0%
5. Science (429 papers in training set): Top 8%, 4.0%
6. Nature (575 papers in training set): Top 6%, 4.0%
7. Genome Biology (555 papers in training set): Top 2%, 3.7%
8. Cell Genomics (162 papers in training set): Top 1%, 3.6%
9. Nature Methods (336 papers in training set): Top 3%, 2.7%
10. Genome Research (409 papers in training set): Top 2%, 2.4%
11. Genome Medicine (154 papers in training set): Top 3%, 2.1%
12. Bioinformatics (1061 papers in training set): Top 7%, 1.9%
13. Nature Human Behaviour (85 papers in training set): Top 2%, 1.9%
14. Nucleic Acids Research (1128 papers in training set): Top 9%, 1.9%
15. Science Translational Medicine (111 papers in training set): Top 3%, 1.7%
16. Nature Neuroscience (216 papers in training set): Top 5%, 1.3%
17. Proceedings of the National Academy of Sciences (2130 papers in training set): Top 36%, 1.3%
18. Nature Medicine (117 papers in training set): Top 4%, 0.9%
19. Briefings in Bioinformatics (326 papers in training set): Top 7%, 0.7%
20. Nature Computational Science (50 papers in training set): Top 2%, 0.7%
21. Genetics (225 papers in training set): Top 5%, 0.6%
22. PLOS Genetics (756 papers in training set): Top 17%, 0.6%
23. PLOS ONE (4510 papers in training set): Top 71%, 0.6%