Benchmarking long-context genome language models on biosynthetic gene clusters

Hirota, K.; Higashi, K.; Kurokawa, K.; Yamada, T.

2026-05-15 bioinformatics
10.64898/2026.05.12.724296 bioRxiv
Recent advances in language models for natural language processing have spread to genomics, driving the development of genome language models (gLMs) that aim to decipher genomic information. Cutting-edge long-context gLMs are a promising approach to understanding and designing complex biological systems, but methods for evaluating them remain underdeveloped. In this study, we introduce BGCs-Bench, a unified benchmark focused on biosynthetic gene clusters that assesses long-range genomic modeling on three downstream tasks: biosynthetic class prediction, taxonomic classification, and coding sequence annotation. Using BGCs-Bench, we perform systematic, layer-wise evaluations of the embedding representations of long-context gLMs and demonstrate that layer selection is crucial for downstream task performance. In addition, a logit lens analysis of autoregressive gLMs suggests that, in StripedHyena-based models, earlier layers encode biologically meaningful information from input DNA sequences, whereas deeper layers optimize embeddings for sequence generation. These findings provide insights for more effective development and application of long-context gLMs.
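
The logit lens idea mentioned in the abstract can be illustrated with a toy sketch: the hidden state from each layer is projected through the model's output (unembedding) matrix to see which token that layer would predict. Everything below is an assumption for illustration only: the four-letter vocabulary, the three-layer hidden states, the unembedding weights, and the function names `softmax` and `logit_lens` are hypothetical and are not taken from the paper or from any gLM. A real analysis would read hidden states from the model itself and would usually apply the model's final normalization before projecting.

```python
import math

# Toy nucleotide vocabulary (an assumption, not a real gLM tokenizer).
VOCAB = ["A", "C", "G", "T"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_states, unembed):
    """Project each layer's hidden state through the unembedding matrix.

    hidden_states: one vector per layer for a single sequence position.
    unembed: len(VOCAB) rows, each the same length as a hidden vector.
    Returns one probability distribution over VOCAB per layer.
    """
    per_layer = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        per_layer.append(softmax(logits))
    return per_layer

# Hypothetical 3-layer model with hidden size 2; all values are made up.
hidden_states = [[0.1, 0.0], [0.5, 0.2], [2.0, -1.0]]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

layer_probs = logit_lens(hidden_states, unembed)
for layer, probs in enumerate(layer_probs):
    top = VOCAB[probs.index(max(probs))]
    print(f"layer {layer}: top token {top}, p={max(probs):.2f}")
```

Comparing how the per-layer distributions sharpen with depth is the kind of signal such an analysis inspects; the toy numbers here carry no biological meaning.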

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|------|---------|------------------------|------------|-----------------------|
| 1 | Briefings in Bioinformatics | 326 | Top 0.3% | 12.0% |
| 2 | Computational and Structural Biotechnology Journal | 216 | Top 0.2% | 9.8% |
| 3 | Bioinformatics | 1061 | Top 3% | 8.1% |
| 4 | BMC Bioinformatics | 383 | Top 2% | 6.2% |
| 5 | PLOS Computational Biology | 1633 | Top 6% | 6.2% |
| 6 | Nature Communications | 4913 | Top 34% | 4.7% |
| 7 | NAR Genomics and Bioinformatics | 214 | Top 0.7% | 3.6% |
50% of probability mass above
| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|------|---------|------------------------|------------|-----------------------|
| 8 | Frontiers in Genetics | 197 | Top 2% | 3.5% |
| 9 | Bioinformatics Advances | 184 | Top 2% | 3.5% |
| 10 | Nucleic Acids Research | 1128 | Top 6% | 3.5% |
| 11 | GigaScience | 172 | Top 0.6% | 3.5% |
| 12 | Journal of Chemical Information and Modeling | 207 | Top 2% | 2.5% |
| 13 | Advanced Science | 249 | Top 8% | 2.5% |
| 14 | Genome Biology | 555 | Top 4% | 2.0% |
| 15 | Nature Machine Intelligence | 61 | Top 2% | 1.8% |
| 16 | iScience | 1063 | Top 16% | 1.6% |
| 17 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.2% | 1.6% |
| 18 | PLOS ONE | 4510 | Top 55% | 1.6% |
| 19 | Genomics, Proteomics & Bioinformatics | 171 | Top 4% | 1.3% |
| 20 | npj Systems Biology and Applications | 99 | Top 2% | 1.2% |
| 21 | Cell Systems | 167 | Top 10% | 1.2% |
| 22 | Genome Research | 409 | Top 3% | 1.1% |
| 23 | Journal of Genetics and Genomics | 36 | Top 2% | 0.9% |
| 24 | Scientific Reports | 3102 | Top 73% | 0.8% |
| 25 | Cell Genomics | 162 | Top 7% | 0.7% |
| 26 | Patterns | 70 | Top 3% | 0.6% |
| 27 | BMC Genomics | 328 | Top 7% | 0.6% |
| 28 | Database | 51 | Top 1% | 0.6% |
| 29 | Horticulture Research | 43 | Top 2% | 0.6% |
| 30 | BioData Mining | 15 | Top 1% | 0.6% |
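
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the listed probabilities. This is a minimal sketch: the values are copied from the ranking above in rank order, and the helper name `ranks_to_reach` is a hypothetical one chosen for this example, not part of the site that produced the listing.

```python
# Predicted probabilities (in percent) for ranks 1-30, copied from the listing.
probs = [
    12.0, 9.8, 8.1, 6.2, 6.2, 4.7, 3.6, 3.5, 3.5, 3.5,
    3.5, 2.5, 2.5, 2.0, 1.8, 1.6, 1.6, 1.6, 1.3, 1.2,
    1.2, 1.1, 0.9, 0.8, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6,
]

def ranks_to_reach(probs, threshold):
    """Return (rank, cumulative mass) at the first rank where the running
    total meets or exceeds threshold, or (None, total) if it never does."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None, total

rank, mass = ranks_to_reach(probs, 50.0)
print(f"{rank} journals cover {mass:.1f}% of the predicted probability mass")
```

The running total first crosses 50% at rank 7, which matches the cutoff marked in the listing.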