A fine-tuned genomic language model adds complementary nucleotide-context information to missense variant interpretation

Su, Y.; Lin, Y.-J.

2026-05-11 bioinformatics
10.64898/2026.05.06.723362 bioRxiv
Missense variant interpretation remains a central challenge in clinical genomics. Missense pathogenicity predictors achieve strong performance, but many emphasize protein-level consequences or overlapping annotation priors. Whether genomic language models add non-redundant nucleotide-context signal to missense interpretation remains unclear. Here, we systematically adapted genomic language models to ClinVar missense pathogenicity prediction across backbone architectures, representation strategies, classifier heads, and adaptation regimes. In our analysis, variant-position embeddings consistently outperformed pooled sequence representations, multi-species pretraining provided the strongest backbone-level advantage, and low-rank adaptation generalized better than full fine-tuning. The resulting fine-tuned model, GLM-Missense, substantially outperformed zero-shot scoring from the same pretrained model. To test whether GLM-Missense contributes information beyond existing methods, we built MetaMissense, an XGBoost ensemble combining GLM-Missense with AlphaMissense, ESM1b, REVEL, CADD, SIFT, and PolyPhen-2. GLM-Missense showed the lowest concordance with other predictors, retained the strongest partial association with pathogenicity after controlling for the other predictors, and ranked as the most informative non-ensemble input to MetaMissense. MetaMissense achieved the best performance in both cross-validation and held-out testing. Analyses of variants correctly classified by GLM-Missense but misclassified by several established predictors suggested two patterns. First, part of the GLM-Missense signal may reflect splice-relevant exonic context. Second, GLM-Missense appears to add value in settings where other predictors may overweight allele frequency, gene-level constraint, or amino-acid-change severity. However, these features explained only about 10% of the distinction between the GLM-Missense-correct subset and the background. Together, our results demonstrate that fine-tuned genomic language models contribute complementary nucleotide-context information to missense variant interpretation.
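The abstract does not say how the "partial association with pathogenicity" was computed. One common way to quantify such an association is a partial correlation, sketched below for the simplest case of a single control variable. The function names and the toy values are hypothetical and are not taken from the paper; with several control predictors one would instead correlate regression residuals.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z.

    Illustrative reading: x is one predictor's score, y is the
    pathogenicity label, z is a second predictor being controlled for.
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy values only (not data from the paper). Here z is uncorrelated with
# both x and y, so controlling for it leaves the correlation unchanged.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
z = [1, -1, -1, 1]
result = partial_corr(x, y, z)
```

A predictor whose partial correlation stays high after controlling for the others carries signal that the others do not already capture, which is the sense of "complementary" used in the abstract.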

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Genome Medicine (154 papers in training set): Top 0.1%, 23.1%
2. PLOS Computational Biology (1633 papers in training set): Top 4%, 7.3%
3. Genome Biology (555 papers in training set): Top 0.8%, 7.0%
4. The American Journal of Human Genetics (206 papers in training set): Top 0.7%, 6.5%
5. Cell Systems (167 papers in training set): Top 2%, 6.5%
50% of probability mass above
6. Nature Communications (4913 papers in training set): Top 35%, 4.4%
7. Bioinformatics (1061 papers in training set): Top 5%, 4.1%
8. Cell Genomics (162 papers in training set): Top 1%, 3.7%
9. Nature Machine Intelligence (61 papers in training set): Top 0.9%, 3.7%
10. BMC Genomics (328 papers in training set): Top 2%, 2.1%
11. PLOS Genetics (756 papers in training set): Top 7%, 2.1%
12. Genome Research (409 papers in training set): Top 2%, 1.7%
13. Briefings in Bioinformatics (326 papers in training set): Top 4%, 1.5%
14. Nature Genetics (240 papers in training set): Top 5%, 1.4%
15. Scientific Reports (3102 papers in training set): Top 65%, 1.3%
16. Cell Reports Medicine (140 papers in training set): Top 5%, 1.3%
17. PLOS ONE (4510 papers in training set): Top 59%, 1.3%
18. Nature Methods (336 papers in training set): Top 5%, 1.1%
19. Bioinformatics Advances (184 papers in training set): Top 4%, 1.1%
20. Nature Biotechnology (147 papers in training set): Top 6%, 1.0%
21. Frontiers in Genetics (197 papers in training set): Top 8%, 0.9%
22. Proceedings of the National Academy of Sciences (2130 papers in training set): Top 41%, 0.9%
23. Nucleic Acids Research (1128 papers in training set): Top 16%, 0.8%
24. BMC Bioinformatics (383 papers in training set): Top 7%, 0.8%
25. NAR Genomics and Bioinformatics (214 papers in training set): Top 4%, 0.8%
26. Human Genetics (25 papers in training set): Top 0.4%, 0.7%
27. eLife (5422 papers in training set): Top 59%, 0.7%
28. Advanced Science (249 papers in training set): Top 20%, 0.7%
29. npj Genomic Medicine (33 papers in training set): Top 1%, 0.7%
30. Computational and Structural Biotechnology Journal (216 papers in training set): Top 10%, 0.7%
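
The 50% cutoff marked in the list can be checked by summing the listed percentages in rank order. The sketch below uses the first six probabilities exactly as listed; the function name `journals_to_reach` is hypothetical and is not part of the site that produced this ranking.

```python
# Predicted probabilities (percent) for the six top-ranked journals,
# copied from the list above.
ranked = [
    ("Genome Medicine", 23.1),
    ("PLOS Computational Biology", 7.3),
    ("Genome Biology", 7.0),
    ("The American Journal of Human Genetics", 6.5),
    ("Cell Systems", 6.5),
    ("Nature Communications", 4.4),
]


def journals_to_reach(entries, threshold):
    """Return (rank, cumulative %) at which the running sum first reaches threshold."""
    total = 0.0
    for rank, (_, prob) in enumerate(entries, start=1):
        total += prob
        if total >= threshold:
            return rank, total
    return None  # threshold not reached within the supplied entries


rank, cumulative = journals_to_reach(ranked, 50.0)
```

With these values the running sum first reaches 50% at rank 5 (about 50.4%), which agrees with the statement that the top 5 journals account for 50% of the predicted probability mass.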