GenePT Revisited: Do Better Text Embeddings Make Better Gene Embeddings?

Hedley, J. G.; Torr, P. H. S.; Märtens, K.

2026-04-20 genomics

10.64898/2026.04.16.718976 bioRxiv


Abstract

GenePT introduced a simple recipe for gene representations: embed each gene's natural-language description with a general-purpose text embedding model and reuse the resulting vectors across downstream tasks. Since GenePT's release, embedding models have improved rapidly, with many strong open and commercial encoders benchmarked on suites such as the Massive Text Embedding Benchmark (MTEB). We present a controlled "leaderboard" study that keeps the GenePT pipeline fixed and varies only the embedding backbone. We benchmark contemporary encoders on four diverse gene embedding tasks: gene-gene interaction prediction, gene property classification, cell type classification, and prediction of transcriptomic responses to unseen genetic perturbations. Across these settings, newer backbones consistently outperform the original GenePT backbone (text-embedding-ada-002), achieving improvements of 1-17%, while enabling fully reproducible research by avoiding API dependencies.
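The "fixed pipeline, swappable backbone" design described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the gene names and descriptions are invented placeholders, `make_toy_backbone` is a hypothetical hashed bag-of-words stand-in for a real text embedding model, and the leave-one-out nearest-neighbour score stands in for the paper's actual gene property classification protocol.

```python
import hashlib
import math
from typing import Callable, Dict, List

# A "backbone" is any function mapping a text description to a vector.
Embedder = Callable[[str], List[float]]


def make_toy_backbone(dim: int, salt: str) -> Embedder:
    """Hypothetical stand-in for a text embedding model (hashed bag of words)."""
    def embed(text: str) -> List[float]:
        vec = [0.0] * dim
        for token in text.lower().split():
            digest = hashlib.sha256((salt + token).encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]  # unit-normalise so dot product = cosine
    return embed


def embed_genes(descriptions: Dict[str, str], backbone: Embedder) -> Dict[str, List[float]]:
    """Fixed pipeline step: one embedding per gene description."""
    return {gene: backbone(text) for gene, text in descriptions.items()}


def loo_accuracy(embeddings: Dict[str, List[float]], labels: Dict[str, str]) -> float:
    """Leave-one-out 1-nearest-neighbour accuracy using cosine similarity."""
    genes = sorted(embeddings)
    correct = 0
    for gene in genes:
        others = [g for g in genes if g != gene]
        nearest = max(
            others,
            key=lambda g: sum(a * b for a, b in zip(embeddings[gene], embeddings[g])),
        )
        correct += labels[nearest] == labels[gene]
    return correct / len(genes)


def run_leaderboard(backbones: Dict[str, Embedder],
                    descriptions: Dict[str, str],
                    labels: Dict[str, str]) -> Dict[str, float]:
    """Hold data and evaluation fixed; vary only the embedding backbone."""
    return {
        name: loo_accuracy(embed_genes(descriptions, backbone), labels)
        for name, backbone in backbones.items()
    }


# Invented placeholder data, for illustration only.
DESCRIPTIONS = {
    "GENE_A": "protein kinase that phosphorylates target proteins",
    "GENE_B": "serine threonine kinase involved in signalling",
    "GENE_C": "transcription factor that binds dna promoters",
    "GENE_D": "zinc finger transcription factor regulating expression",
}
LABELS = {"GENE_A": "kinase", "GENE_B": "kinase",
          "GENE_C": "tf", "GENE_D": "tf"}
BACKBONES = {
    "toy-backbone-a": make_toy_backbone(dim=32, salt="a"),
    "toy-backbone-b": make_toy_backbone(dim=256, salt="b"),
}

leaderboard = run_leaderboard(BACKBONES, DESCRIPTIONS, LABELS)
for name, accuracy in sorted(leaderboard.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {accuracy:.2f}")
```

Because the backbone is just a callable from text to vector, replacing the toy function with a wrapper around a locally hosted open encoder would leave the rest of the pipeline untouched, which is the property that makes such a comparison controlled.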

Matching journals

● Non-profit ◐ University press ○ Commercial

The top 7 journals account for 50% of the predicted probability mass.


Genome Research

● 409 papers in training set

○ 555 papers in training set

Nature Biotechnology

○ 147 papers in training set

○ 336 papers in training set

○ 575 papers in training set

Nature Machine Intelligence

○ 61 papers in training set

Nature Communications

○ 4913 papers in training set

50% of probability mass above

◐ 1061 papers in training set

Nucleic Acids Research

◐ 1128 papers in training set

Nature Genetics

○ 240 papers in training set

Bioinformatics Advances

◐ 184 papers in training set

○ 162 papers in training set

○ 167 papers in training set

Nature Computational Science

○ 50 papers in training set

Proceedings of the National Academy of Sciences

● 2130 papers in training set

● 429 papers in training set

Genome Medicine

○ 154 papers in training set

Frontiers in Genetics

○ 197 papers in training set

Nature Medicine

○ 117 papers in training set

Briefings in Bioinformatics

◐ 326 papers in training set

Scientific Reports

○ 3102 papers in training set

Nature Neuroscience

○ 216 papers in training set

IEEE Transactions on Computational Biology and Bioinformatics

● 17 papers in training set

BMC Bioinformatics

○ 383 papers in training set

● 4510 papers in training set

The American Journal of Human Genetics

○ 206 papers in training set