Skill-Augmented Frontier Agents Nearly Saturate BixBench-Verified-50

Zhang, X.

2026-05-01 bioinformatics
10.64898/2026.04.28.721523 bioRxiv
Large language model (LLM) agents are increasingly used for biological data analysis, but prior benchmark results have given a mixed picture of whether they are ready for routine bioinformatics work. The original BixBench study reported only ~17-21% accuracy for frontier agents on open-answer bioinformatics questions [1]. Subsequent curation into BixBench-Verified-50 removed or revised ambiguous items and revealed much higher performance for modern agents [2]. Here we evaluate three frontier-model configurations on the 50 verified questions using the same local benchmark, prompt structure, answer format, and grading pipeline: GPT-5.4 with Claude Scientific Skills and no web access, Claude Opus 4.7 with Claude Scientific Skills and no web access, and GPT-5.5 with Claude Scientific Skills, bioSkills, and web access. The three configurations achieve 88.0% (44/50), 84.0% (42/50), and 98.0% (49/50) accuracy, respectively. The single remaining GPT-5.5 error is not a clear analytical failure: the agent correctly computed Spearman correlations on the distributed CRISPRGeneEffect.csv values and selected CCND1, whereas the reference answer is recovered only when stronger essentiality is interpreted as the opposite sign of the raw gene-effect score. Errors in the two offline configurations occurred mainly when agents lacked pathway, organism-annotation, BUSCO, or PhyKIT-related resources. These results show that frontier agents equipped with high-quality scientific skills can nearly saturate a curated bioinformatics benchmark, and they also emphasize that question wording, score sign conventions, and access to current external resources remain decisive for reliable evaluation.
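The sign-convention issue described in the abstract can be illustrated with a minimal sketch. It assumes, as the abstract implies, that a more negative gene-effect score means stronger essentiality. The gene names GENE_A and GENE_B, the expression vector, and all numeric values are invented toy data, not values from CRISPRGeneEffect.csv, and the helper functions are hypothetical rather than part of the benchmark's grading pipeline.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over any run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data (assumption): one covariate and raw gene-effect scores per gene.
expression = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_effect = {
    "GENE_A": [-0.1, -0.3, -0.2, -0.8, -1.0],
    "GENE_B": [-1.0, -0.7, -0.8, -0.2, -0.1],
}

# Reading 1: correlate against the raw gene-effect score.
rho_raw = {g: spearman(expression, v) for g, v in gene_effect.items()}

# Reading 2: correlate against essentiality, taken here as the negated score.
rho_essentiality = {
    g: spearman(expression, [-s for s in v]) for g, v in gene_effect.items()
}

best_raw = max(rho_raw, key=rho_raw.get)
best_essentiality = max(rho_essentiality, key=rho_essentiality.get)
print("top gene, raw score:", best_raw)
print("top gene, essentiality:", best_essentiality)
```

Negating one variable flips the sign of every Spearman coefficient, so "the gene with the highest correlation" depends entirely on which reading the question intends. In this toy case the two readings pick different genes even though the underlying computation is the same.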

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 2% | 18.4%
2. BMC Bioinformatics | 383 papers in training set | Top 0.5% | 14.2%
3. Briefings in Bioinformatics | 326 papers in training set | Top 0.4% | 10.0%
4. Bioinformatics Advances | 184 papers in training set | Top 0.2% | 8.3%
(50% of probability mass above)
5. NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.2% | 6.3%
6. PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.9%
7. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 2% | 3.5%
8. GigaScience | 172 papers in training set | Top 0.5% | 3.5%
9. Nature Communications | 4913 papers in training set | Top 41% | 3.5%
10. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 3% | 2.3%
11. Genome Biology | 555 papers in training set | Top 4% | 1.9%
12. Nucleic Acids Research | 1128 papers in training set | Top 10% | 1.9%
13. Scientific Reports | 3102 papers in training set | Top 60% | 1.6%
14. Nature Methods | 336 papers in training set | Top 5% | 1.5%
15. Genome Research | 409 papers in training set | Top 3% | 1.5%
16. IEEE/ACM Transactions on Computational Biology and Bioinformatics | 32 papers in training set | Top 0.3% | 1.5%
17. International Journal of Molecular Sciences | 453 papers in training set | Top 11% | 1.1%
18. Patterns | 70 papers in training set | Top 2% | 0.9%
19. IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.6% | 0.8%
20. Journal of Molecular Biology | 217 papers in training set | Top 4% | 0.7%
21. Nature Machine Intelligence | 61 papers in training set | Top 4% | 0.7%
22. Cell Systems | 167 papers in training set | Top 12% | 0.7%
23. BMC Genomics | 328 papers in training set | Top 6% | 0.7%
24. iScience | 1063 papers in training set | Top 35% | 0.7%
25. PLOS ONE | 4510 papers in training set | Top 69% | 0.7%
26. Communications Biology | 886 papers in training set | Top 30% | 0.6%