
Benchmarking Agentic Large Language Models for Complex Protein-Set Functional Annotation

Zhang, X.

bioRxiv (bioinformatics), 2026-04-21
doi:10.64898/2026.04.18.719404

Large language model (LLM) agents are increasingly used to synthesize heterogeneous bioinformatics evidence, but their reliability for high-volume biological annotation remains poorly characterized. We evaluated three agent configurations on a controlled protein annotation task: Claude App with Claude Opus 4.7, Claude Code CLI with Claude Opus 4.7 and Claude Scientific Skills, and Codex App with GPT-5.4 and Claude Scientific Skills. Each configuration was run three times on the same verbatim prompt, the same 73 selected orthogroup FASTA files (1,705 protein sequences), and the same local evidence: Swiss-Prot BLASTP output, Pfam/HMMER domain hits, DeepTMHMM topology predictions, and SignalP secretion predictions. We audited the nine outputs for coverage, biological correctness, missing evidence, hallucinated or over-specific annotations, and within-method consistency, then merged the best-supported evidence into a final orthogroup annotation table. All nine runs covered all 73 orthogroups, indicating that the agents could retrieve and organize the complete input set. However, normalized calcification-relevance calls were only moderately reproducible: within-method exact tier agreement ranged from 0.397 to 0.685 for Claude App (mean 0.562), 0.342 to 0.740 for Claude Code (mean 0.516), and 0.411 to 0.630 for Codex App (mean 0.539), and the per-run number of high-confidence calls varied from 0 to 12 across the nine runs. The final curated table retained 3 high-confidence, 9 moderate, 18 watchlist, and 43 low-relevance orthogroups. The most robust direct candidates were sulfatase (OG0017138) and sulfotransferase (OG0020703) families and an FG-GAP/integrin-like surface protein family (OG0018986), whereas common error modes included elevating pentapeptide-repeat orthogroups on motif evidence alone, treating weakly secreted housekeeping enzymes as matrix proteins, and taking low-complexity BLAST labels at face value. 
Skill-enabled agents improved file handling, evidence traceability, and reproducibility of computational checking, but they did not eliminate biological overinterpretation. These results support a best-practice workflow in which LLM agents draft annotations only after deterministic evidence tables are generated, with explicit scoring rules, provenance columns, run-to-run replication, and expert review of high-impact claims.
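The within-method exact tier agreement reported above can be computed as the mean pairwise fraction of orthogroups assigned the identical relevance tier across replicate runs of one agent configuration. A minimal Python sketch of that metric follows; the tier labels and orthogroup calls in the example are illustrative, not the study's actual data:

```python
from itertools import combinations

def exact_tier_agreement(runs: dict[str, dict[str, str]]) -> float:
    """Mean pairwise fraction of orthogroups given the identical
    relevance tier across replicate runs of one configuration."""
    scores = []
    for a, b in combinations(runs.values(), 2):
        shared = a.keys() & b.keys()  # orthogroups called in both runs
        scores.append(sum(a[og] == b[og] for og in shared) / len(shared))
    return sum(scores) / len(scores)

# Hypothetical tier calls for three replicate runs (values illustrative).
runs = {
    "run1": {"OG0017138": "high", "OG0020703": "high",     "OG0018986": "moderate"},
    "run2": {"OG0017138": "high", "OG0020703": "moderate", "OG0018986": "moderate"},
    "run3": {"OG0017138": "high", "OG0020703": "high",     "OG0018986": "low"},
}
print(round(exact_tier_agreement(runs), 3))  # mean over the 3 run pairs
```

With three runs this averages over the three run pairs; chance-corrected alternatives (e.g. Cohen's kappa) would penalize agreement driven by a dominant "low" tier, which raw exact agreement does not.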

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     Bioinformatics                                      1061                    Top 3%      10.4%
2     PLOS Computational Biology                          1633                    Top 3%      10.1%
3     GigaScience                                         172                     Top 0.1%    9.1%
4     BMC Bioinformatics                                  383                     Top 1%      6.8%
5     Nucleic Acids Research                              1128                    Top 4%      4.8%
6     Genome Biology                                      555                     Top 1%      4.8%
7     Bioinformatics Advances                             184                     Top 0.8%    4.3%
----- 50% of probability mass above this line -----
8     Nature Communications                               4913                    Top 35%     4.3%
9     NAR Genomics and Bioinformatics                     214                     Top 0.6%    3.7%
10    Cell Systems                                        167                     Top 4%      3.6%
11    Nature Methods                                      336                     Top 3%      3.6%
12    Nature Biotechnology                                147                     Top 3%      3.6%
13    Briefings in Bioinformatics                         326                     Top 2%      3.6%
14    Genome Medicine                                     154                     Top 3%      2.6%
15    PLOS ONE                                            4510                    Top 52%     1.8%
16    PeerJ                                               261                     Top 9%      1.3%
17    Computational and Structural Biotechnology Journal  216                     Top 6%      1.2%
18    BMC Genomics                                        328                     Top 4%      1.2%
19    Genome Research                                     409                     Top 3%      1.2%
20    Molecular Systems Biology                           142                     Top 1%      1.1%
21    Scientific Reports                                  3102                    Top 69%     0.9%
22    eLife                                               5422                    Top 53%     0.9%
23    Scientific Data                                     174                     Top 2%      0.9%
24    Protein Science                                     221                     Top 2%      0.8%
25    Frontiers in Bioinformatics                         45                      Top 0.9%    0.7%
26    Molecular & Cellular Proteomics                     158                     Top 2%      0.7%
27    Database                                            51                      Top 1%      0.7%
28    Life Science Alliance                               263                     Top 2%      0.6%
29    Genomics, Proteomics & Bioinformatics               171                     Top 7%      0.6%