Benchmarking Large Language Models for Predicting Therapeutic Antisense Oligonucleotide Efficacy
Wei, Z.; Griesmer, S.; Sundar, A.
Antisense oligonucleotides (ASOs) are a promising class of therapeutic drugs that can target and modulate genes associated with various diseases. This study benchmarks Large Language Models (LLMs) for predicting ASO therapeutic efficacy through a two-stage approach: (1) molecular embedding-based fine-tuning using SMILES representations, and (2) prompt engineering with zero-shot and few-shot learning using DNA sequences with target gene information. We evaluated general-purpose models (GPT-3.5-Turbo, LLaMA2-7B, Galactica-6.7B) and chemistry-specific models (ChemBERTa, Molformer, BERT) across three datasets: PFRED (522 sequences), openASO (1708 sequences), and ASOptimizer (1267 sequences). DNA sequence inputs with target gene information outperformed SMILES representations. GPT-3.5-Turbo achieved R² values of 0.6381 (PFRED) and 0.6340 (ASOptimizer) with few-shot prompting using k=3 examples. Code and datasets are available at: https://github.com/asundar0128/IndependentStudy
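The few-shot prompting stage described above can be sketched as follows. This is a minimal illustration of assembling a k=3 prompt from labeled (DNA sequence, target gene, efficacy) examples plus an unlabeled query; the sequences, gene names, efficacy values, and prompt wording are hypothetical placeholders, not drawn from the PFRED, openASO, or ASOptimizer datasets or the authors' actual prompts.

```python
def build_few_shot_prompt(examples, query_seq, query_gene, k=3):
    """Assemble a few-shot prompt: k labeled (sequence, gene, efficacy)
    examples followed by the unlabeled query sequence.

    Note: the instruction text and example formatting here are
    illustrative assumptions, not the study's exact prompt template.
    """
    lines = ["Predict the knockdown efficacy (0-1) of each antisense "
             "oligonucleotide targeting the given gene."]
    for seq, gene, eff in examples[:k]:
        lines.append(f"Sequence: {seq}\nTarget gene: {gene}\nEfficacy: {eff}")
    # The query is left unlabeled so the model completes the efficacy value.
    lines.append(f"Sequence: {query_seq}\nTarget gene: {query_gene}\nEfficacy:")
    return "\n\n".join(lines)


# Placeholder examples (hypothetical sequences and values).
demo = [
    ("GCTATTCAGGTACTTAGCAA", "APOB", 0.72),
    ("TTCAGGCATAACGTTGCCAT", "APOB", 0.41),
    ("AACGTAGGCTTACCAGTTGC", "PCSK9", 0.58),
]
prompt = build_few_shot_prompt(demo, "CATTGGACGTAACCTGATCG", "PCSK9", k=3)
print(prompt)
```

The resulting string would then be sent to a chat model such as GPT-3.5-Turbo, with the model's completion parsed back into a numeric efficacy prediction.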