Benchmarking Agentic Bioinformatics Systems for Complex Protein-Set Retrieval: A Coccolithophore Calcification Case Study

Zhang, X.

2026-04-02 bioinformatics
10.64898/2026.03.28.715041 bioRxiv

Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, a single run retrieved 2,118 proteins for Codex, 6,255 for DeerFlow, and 8,752 for Biomni. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates.
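The accession-level normalization and Venn decomposition described above can be sketched roughly as follows. This is an illustrative sketch, not the author's pipeline: the function names `accessions_from_fasta` and `venn_regions` are hypothetical, the system labels are just dictionary keys, and the accessions in the usage note are placeholders rather than real benchmark data. The only format assumption is the standard UniProt FASTA header layout (`>sp|ACCESSION|ENTRY_NAME ...` or `>tr|...`).

```python
import re
from itertools import combinations

# Standard UniProt FASTA headers start with ">sp|ACCESSION|..." or ">tr|ACCESSION|...".
HEADER_RE = re.compile(r"^>(?:sp|tr)\|([^|]+)\|")


def accessions_from_fasta(text):
    """Return the set of UniProt accessions found in the FASTA header lines of `text`."""
    found = set()
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            found.add(match.group(1))
    return found


def venn_regions(sets_by_system):
    """Split the union of all systems into exclusive Venn regions.

    Each key of the result is a tuple of system names; its value holds the
    accessions retrieved by exactly those systems and by no other system.
    """
    names = sorted(sets_by_system)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(sets_by_system[n] for n in combo))
            others = [sets_by_system[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions[combo] = inside - outside
    return regions
```

For one category, calling `venn_regions({"codex": a, "deerflow": b, "biomni": c})` with three accession sets yields seven regions whose sizes sum to the size of the union, which makes it easy to see which proteins only one system contributed.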
A second run of each system also revealed marked differences in repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
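The two repeatability figures can be illustrated with a short sketch. The abstract does not define micro-Jaccard, so the definition used here is an assumption: intersections and unions are summed over all categories before dividing, which weights large categories more heavily than the unweighted mean of per-category Jaccard values. The function names and the toy category labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union.

    Two empty sets are treated as identical (index 1.0).
    """
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def repeatability(run1, run2):
    """Compare two runs given as {category: set of accessions}.

    Returns (mean category Jaccard, micro-Jaccard). The micro value is an
    assumed definition: pooled intersection size over pooled union size.
    """
    categories = sorted(set(run1) | set(run2))
    if not categories:
        raise ValueError("at least one category is required")

    pairs = [(run1.get(c, set()), run2.get(c, set())) for c in categories]
    mean_jaccard = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

    pooled_intersection = sum(len(a & b) for a, b in pairs)
    pooled_union = sum(len(a | b) for a, b in pairs)
    micro_jaccard = pooled_intersection / pooled_union if pooled_union else 1.0
    return mean_jaccard, micro_jaccard
```

Under this assumed definition, a system whose small categories are stable but whose largest category changes between runs would score a higher mean category Jaccard than micro-Jaccard, the same direction of gap reported for DeerFlow and Biomni.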

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | PLOS Computational Biology | 1633 | Top 2% | 14.8%
2 | GigaScience | 172 | Top 0.1% | 10.5%
3 | Bioinformatics | 1061 | Top 3% | 10.1%
4 | Bioinformatics Advances | 184 | Top 0.3% | 6.8%
5 | NAR Genomics and Bioinformatics | 214 | Top 0.2% | 6.4%
6 | BMC Bioinformatics | 383 | Top 2% | 4.9%
(50% of probability mass above)
7 | Nucleic Acids Research | 1128 | Top 5% | 4.0%
8 | Computational and Structural Biotechnology Journal | 216 | Top 2% | 3.1%
9 | Genome Biology | 555 | Top 3% | 2.4%
10 | PLOS ONE | 4510 | Top 48% | 2.1%
11 | Protein Science | 221 | Top 0.7% | 1.9%
12 | Briefings in Bioinformatics | 326 | Top 3% | 1.9%
13 | PeerJ | 261 | Top 6% | 1.9%
14 | Nature Communications | 4913 | Top 48% | 1.9%
15 | Frontiers in Bioinformatics | 45 | Top 0.2% | 1.8%
16 | Scientific Reports | 3102 | Top 62% | 1.5%
17 | Scientific Data | 174 | Top 1% | 1.5%
18 | Cell Systems | 167 | Top 8% | 1.5%
19 | Nature Methods | 336 | Top 5% | 1.3%
20 | Genome Medicine | 154 | Top 5% | 1.3%
21 | Nature Biotechnology | 147 | Top 5% | 1.3%
22 | Journal of Molecular Biology | 217 | Top 3% | 1.1%
23 | Patterns | 70 | Top 2% | 0.9%
24 | Molecular Biology and Evolution | 488 | Top 4% | 0.8%
25 | Advanced Science | 249 | Top 19% | 0.7%
26 | Journal of Proteome Research | 215 | Top 2% | 0.7%
27 | npj Systems Biology and Applications | 99 | Top 2% | 0.7%
28 | eLife | 5422 | Top 57% | 0.7%
29 | BMC Genomics | 328 | Top 6% | 0.7%
30 | Journal of Chemical Information and Modeling | 207 | Top 3% | 0.6%