ATLAS: Population-Level Disease Locus Discovery via Differential Attention in Genomic Language Models
Liu, Y.; Deng, K.; Ye, Y.; Zhan, J.; Wang, Z.; Chen, S.; Hu, X.; Chang, A.; Li, Z.; Jin, X.; Liu, S.; Chen, K.; Shen, H.; Qi, X.; Xu, X.; Zhang, H.
Identifying disease-associated genetic variants remains a key challenge in genomics, especially in small cohorts or for rare and complex mutation types where genome-wide association studies (GWAS) often fall short. We introduce ATLAS, a population-level framework that leverages attention signals from pretrained genomic language models (gLMs) to detect disease-associated genes and loci directly from raw sequences, without requiring explicit variant calls or supervised training. ATLAS first performs gene-level differential attention analysis to prioritize candidate genes, followed by base-level analysis to localize disease-associated regions at single-haplotype resolution. We validate ATLAS on synthetic and β-thalassemia datasets, demonstrating robust performance across diverse allele frequencies (down to 10%), cohort sizes (below 200 individuals per group), and genomic scales. Compared to GWAS, ATLAS achieves higher recall of known loci and captures haplotype-specific signals missed by traditional methods. Cross-model benchmarking further shows that precise localization depends on both model size and pretraining on diverse human genomes. In summary, ATLAS offers a scalable, sequence-native alternative to traditional statistical genetics.
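The gene-level prioritization step can be illustrated with a minimal sketch. This is not the ATLAS implementation: the abstract does not specify the statistic used, so the sketch assumes each individual has already been reduced to one attention summary per gene and substitutes Welch's t-statistic as the case-versus-control comparison. The function names, the input layout, and the toy values are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(cases, controls):
    """Welch's t-statistic for two independent samples (assumed statistic)."""
    se = sqrt(variance(cases) / len(cases) + variance(controls) / len(controls))
    return (mean(cases) - mean(controls)) / se if se > 0 else 0.0


def rank_genes(case_attn, control_attn):
    """Rank genes by absolute differential attention between two groups.

    case_attn / control_attn: dict mapping a gene name to a list holding one
    attention summary per individual (hypothetical input layout).
    """
    scores = {g: welch_t(case_attn[g], control_attn[g]) for g in case_attn}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy values invented for illustration; they are not results from the paper.
case_attn = {"GENE_A": [0.9, 0.8, 0.85, 0.95], "GENE_B": [0.5, 0.4, 0.6, 0.5]}
control_attn = {"GENE_A": [0.3, 0.35, 0.25, 0.3], "GENE_B": [0.5, 0.6, 0.4, 0.5]}

ranking = rank_genes(case_attn, control_attn)
for gene, t in ranking:
    print(f"{gene}\t{t:.2f}")
```

In this toy setup GENE_A ranks first because its attention summaries differ between groups, while GENE_B shows no difference. A base-level analysis in the same spirit would apply the comparison per position within the top-ranked genes, again under the assumption that per-base attention values are available for each haplotype.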