Guidance for high-quality functional gene embeddings from large language models

Huang, R.; Hou, Y.; Zhao, W.; Zhang, J.; Lu, J.; Kong, Y.; Xu, P.

2026-05-04 bioinformatics

10.64898/2026.04.30.721875 bioRxiv

Large language models (LLMs) are increasingly used to generate gene embeddings, yet systematic benchmarks of prompting strategies and practical guidance for obtaining biologically meaningful representations remain limited. Here we present GEbench, an evaluation framework for assessing LLM-derived gene embeddings across different tasks, prompting strategies, and LLM architectures. GEbench revealed that embedding quality depends primarily on whether the input text contains explicit functional information, rather than on sparse gene identifiers or model size. Identifier-based embeddings showed weak biological organization, whereas embeddings derived from functional descriptions consistently achieved stronger functional separation and predictive performance. Notably, Self-Des, which extracts embeddings from model-generated gene function descriptions, enabled locally deployable LLMs to generate high-fidelity representations that approach the quality of expert-curated databases. Genome-scale analyses further supported these findings, indicating that explicit functional descriptions are an effective design principle for generating high-quality gene embeddings from LLMs.
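The contrast the abstract draws between identifier-based and description-based inputs can be illustrated with a toy sketch. Nothing below is GEbench or the authors' pipeline: the bag-of-words `embed` function is a deliberately simple stand-in for an LLM embedding, the gene names `GENE_A` and `GENE_B` are hypothetical placeholders, and the description strings are hand-written examples standing in for the model-generated text that Self-Des would use.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy embedding: token counts over a fixed vocabulary (stand-in for an LLM)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical inputs: two functionally related genes, represented either
# by a bare identifier or by a short functional description.
identifiers = {"GENE_A": "GENE_A", "GENE_B": "GENE_B"}
descriptions = {
    "GENE_A": "kinase that phosphorylates substrates during dna damage repair",
    "GENE_B": "nuclease that processes broken ends during dna damage repair",
}

# Shared vocabulary built from every input text.
all_texts = list(identifiers.values()) + list(descriptions.values())
vocab = sorted({tok for text in all_texts for tok in text.lower().split()})

sim_identifier = cosine(embed(identifiers["GENE_A"], vocab),
                        embed(identifiers["GENE_B"], vocab))
sim_description = cosine(embed(descriptions["GENE_A"], vocab),
                         embed(descriptions["GENE_B"], vocab))

print(f"identifier similarity:  {sim_identifier:.3f}")
print(f"description similarity: {sim_description:.3f}")
```

In this toy setting the two identifiers share no tokens, so their similarity is zero, while the descriptions overlap on functional terms and score higher. A real LLM embedding behaves less starkly, but the sketch shows why input text with explicit functional content gives an embedder more to organize genes by.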

Matching journals

● Non-profit ◐ University press ○ Commercial

The top 9 journals account for 50% of the predicted probability mass.

Nucleic Acids Research

◐ 1128 papers in training set

Nature Communications

○ 4913 papers in training set

○ 555 papers in training set

Genome Research

● 409 papers in training set

◐ 1061 papers in training set

○ 167 papers in training set

Briefings in Bioinformatics

◐ 326 papers in training set

Genome Medicine

○ 154 papers in training set

PLOS Computational Biology

● 1633 papers in training set

50% of probability mass above

Computational and Structural Biotechnology Journal

● 216 papers in training set

Bioinformatics Advances

◐ 184 papers in training set

○ 336 papers in training set

Nature Biotechnology

○ 147 papers in training set

Nature Machine Intelligence

○ 61 papers in training set

◐ 172 papers in training set

Advanced Science

○ 249 papers in training set

○ 162 papers in training set

NAR Genomics and Bioinformatics

◐ 214 papers in training set

Frontiers in Genetics

○ 197 papers in training set

Scientific Reports

○ 3102 papers in training set

BMC Bioinformatics

○ 383 papers in training set

IEEE Transactions on Computational Biology and Bioinformatics

● 17 papers in training set

Proceedings of the National Academy of Sciences

● 2130 papers in training set

Nature Computational Science

○ 50 papers in training set

● 4510 papers in training set

◐ 51 papers in training set

npj Systems Biology and Applications

○ 99 papers in training set

○ 70 papers in training set

○ 1063 papers in training set

Genomics, Proteomics & Bioinformatics

◐ 171 papers in training set