
Benchmarking Large Language Models for Predicting Therapeutic Antisense Oligonucleotide Efficacy

Wei, Z.; Griesmer, S.; Sundar, A.

2026-02-19 bioinformatics
10.64898/2026.02.17.706455 bioRxiv

Antisense oligonucleotides (ASOs) are a promising class of therapeutic drugs that can target and modulate genes associated with various diseases. This study benchmarks Large Language Models (LLMs) for predicting ASO therapeutic efficacy through a two-stage approach: (1) molecular embedding-based fine-tuning using SMILES representations, and (2) prompt engineering with zero-shot and few-shot learning using DNA sequences with target gene information. We evaluated general-purpose models (GPT-3.5-Turbo, LLaMA2-7B, Galactica-6.7B) and chemistry-specific models (ChemBERTa, Molformer, BERT) across three datasets: PFRED (522 sequences), openASO (1708 sequences), and ASOptimizer (1267 sequences). DNA sequence inputs with target gene information outperformed SMILES representations. With few-shot prompting using k=3 examples, GPT-3.5-Turbo achieved R² values of 0.6381 (PFRED) and 0.6340 (ASOptimizer). Code and datasets are available at: https://github.com/asundar0128/IndependentStudy
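To make the few-shot setup and the R² metric in the abstract concrete, here is a minimal sketch of how a k=3 prompt might be assembled and how R² is computed from predictions. It is not taken from the authors' repository: the prompt wording, the `build_few_shot_prompt` and `r_squared` helpers, the placeholder sequences, the gene names, and the efficacy values are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical example record: (DNA sequence, target gene, measured efficacy).
Example = Tuple[str, str, float]


def build_few_shot_prompt(examples: List[Example], query_seq: str, query_gene: str) -> str:
    """Assemble a k-shot prompt. The wording is illustrative, not the paper's."""
    lines = ["Predict the efficacy of each antisense oligonucleotide."]
    for seq, gene, efficacy in examples:
        lines.append(f"Sequence: {seq} | Target gene: {gene} | Efficacy: {efficacy:.2f}")
    # The query line is left open so the model completes the efficacy value.
    lines.append(f"Sequence: {query_seq} | Target gene: {query_gene} | Efficacy:")
    return "\n".join(lines)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Placeholder data (made up for this sketch, not drawn from PFRED or ASOptimizer).
demo_examples: List[Example] = [
    ("ACGTACGTACGTACGTACGT", "GENE_A", 0.42),
    ("TTGACCATGGCATTGACCAT", "GENE_A", 0.77),
    ("GGCATTAGCCGTAAGGCATT", "GENE_B", 0.15),
]
prompt = build_few_shot_prompt(demo_examples, "CCGGTTAACCGGTTAACCGG", "GENE_B")
print(prompt)
```

In a pipeline of this shape, the prompt would be sent to the model, the completion parsed into a number, and `r_squared` applied to the parsed predictions against the held-out measurements.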

Matching journals

The top 4 journals are the smallest set whose predicted probabilities together reach 50% of the probability mass (about 56.6% combined, from the listed values).

Rank  Journal                                             Training papers  Percentile  Probability
1     Bioinformatics                                      1061             Top 1%      22.7%
2     Journal of Chemical Information and Modeling        207              Top 0.4%    14.5%
3     Briefings in Bioinformatics                         326              Top 0.3%    12.5%
4     Bioinformatics Advances                             184              Top 0.3%    6.9%
----- 50% of probability mass above -----
5     Nucleic Acids Research                              1128             Top 3%      6.4%
6     Nature Communications                               4913             Top 38%     3.7%
7     Nature Machine Intelligence                         61               Top 1%      2.8%
8     PLOS Computational Biology                          1633             Top 13%     2.1%
9     Molecular Therapy Nucleic Acids                     32               Top 0.3%    1.8%
10    Computational and Structural Biotechnology Journal  216              Top 4%      1.7%
11    Advanced Science                                    249              Top 11%     1.7%
12    Scientific Reports                                  3102             Top 62%     1.5%
13    BMC Bioinformatics                                  383              Top 5%      1.2%
14    Cell Systems                                        167              Top 10%     1.1%
15    Nature Methods                                      336              Top 6%      0.8%
16    PLOS ONE                                            4510             Top 66%     0.8%
17    NAR Genomics and Bioinformatics                     214              Top 4%      0.8%
18    iScience                                            1063             Top 31%     0.8%
19    Nature Biotechnology                                147              Top 7%      0.8%
20    Journal of Cheminformatics                          25               Top 0.6%    0.6%
21    ACS Synthetic Biology                               256              Top 3%      0.6%
22    Proceedings of the National Academy of Sciences     2130             Top 47%     0.6%
23    Genome Medicine                                     154              Top 9%      0.6%
24    Patterns                                            70               Top 3%      0.6%
25    Frontiers in Genetics                               197              Top 12%     0.5%
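
The 50% cutoff in the journal list can be checked by accumulating the listed probabilities in rank order. The sketch below does that for the first five entries; the `journals_to_reach` helper is a hypothetical name introduced here, and the probabilities are the rounded values shown in the list, so the total is approximate.

```python
from typing import List, Tuple

# Rounded probabilities (percent) copied from the first five ranked journals.
ranked: List[Tuple[str, float]] = [
    ("Bioinformatics", 22.7),
    ("Journal of Chemical Information and Modeling", 14.5),
    ("Briefings in Bioinformatics", 12.5),
    ("Bioinformatics Advances", 6.9),
    ("Nucleic Acids Research", 6.4),
]


def journals_to_reach(probs: List[Tuple[str, float]], threshold: float) -> Tuple[int, float]:
    """Return (number of journals, cumulative %) at the first point the
    running total meets or exceeds the threshold."""
    total = 0.0
    for count, (_, prob) in enumerate(probs, start=1):
        total += prob
        if total >= threshold:
            return count, total
    # Threshold never reached: report everything that was summed.
    return len(probs), total


count, cumulative = journals_to_reach(ranked, 50.0)
print(f"{count} journals cover {cumulative:.1f}% of the probability mass")
```

The top three entries sum to 49.7%, just under the threshold, which is why the cutoff falls after the fourth journal.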