
Set-up, validation, evaluation, and cost-benefit analysis of an AI-assisted assessment of responsible research practices in a sample of life science publications

Kniffert, S.; Kathoefer, B.; Emprechtinger, R.; Pellegrini, P.; Funk, E. M.; Dhamrait, I. S.; Zang, Y.; Bornmueller, A.; Toelch, U.

2026-02-02 · Scientific Communication and Education
bioRxiv · DOI: 10.64898/2026.01.23.701317
Abstract

The (semi-)automated screening of publications for diverse quality and transparency criteria is at the core of systematic literature assessment. Typically, the assessment process involves two initial reviewers and one additional reviewer for cases that require reconciliation. Here, we explore to what extent this process can be assisted by Large Language Models (LLMs), specifically whether LLMs can robustly assess responsible research practices (RRPs) in scientific papers. We employed proprietary LLMs to assess an initial set of 37 papers across ten RRPs. The same papers were also reviewed by three human reviewers. We iteratively redesigned prompts to increase model accuracy against human ratings, which we treated as the gold standard. The resulting pipeline was validated on an additional set of 15 papers. We show that LLM accuracy is comparable to single-human-reviewer performance (90% for the LLM vs 86% for a single human reviewer). However, performance depended strongly on the specific RRP, with accuracy ranging from 40% to 100%. LLMs exhibited an affirmative bias, making more errors when practices were not reported in the papers. Overall, we show how such an approach could replace one human reviewer, enabling AI-assisted assessment of research papers. We discuss how dataset imbalances, validation procedures, and implementation time limit the broad applicability of such approaches. Through this, we develop initial guidance on the utility of proprietary LLMs in evidence synthesis.
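The evaluation described above scores LLM labels against human ratings treated as the gold standard, both overall and per RRP. A minimal sketch of that kind of per-criterion accuracy computation, assuming binary reported/not-reported labels; all paper IDs, RRP names, and ratings below are invented placeholders, not the authors' data or pipeline:

```python
# Illustrative sketch: per-RRP accuracy of LLM ratings against a human
# gold standard. Paper IDs, RRP names, and labels are invented.
from collections import defaultdict

# (paper_id, rrp) -> "reported" / "not reported"
human_gold = {
    ("p1", "data_sharing"): "reported",
    ("p1", "blinding"): "not reported",
    ("p2", "data_sharing"): "not reported",
    ("p2", "blinding"): "reported",
}
llm_rating = {
    ("p1", "data_sharing"): "reported",
    ("p1", "blinding"): "reported",  # affirms a practice the paper omits
    ("p2", "data_sharing"): "not reported",
    ("p2", "blinding"): "reported",
}

hits, totals = defaultdict(int), defaultdict(int)
for (paper, rrp), gold in human_gold.items():
    totals[rrp] += 1
    hits[rrp] += llm_rating[(paper, rrp)] == gold

for rrp in sorted(totals):
    print(f"{rrp}: {hits[rrp] / totals[rrp]:.0%} accuracy")
```

Stratifying the errors by gold label ("reported" vs "not reported") would expose the affirmative bias the abstract reports.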

Matching journals

The top 4 journals account for 50% of the predicted probability mass; the short sketch after the table reproduces this cutoff from the listed probabilities.

Rank | Journal | Papers in training set | Match percentile | Probability
1 | PLOS Biology | 408 | Top 0.1% | 26.0%
2 | Journal of the American Medical Informatics Association | 61 | Top 0.3% | 10.1%
3 | eLife | 5422 | Top 8% | 8.4%
4 | Research Synthesis Methods | 20 | Top 0.1% | 8.4%
(50% of probability mass above this line)
5 | PLOS Computational Biology | 1633 | Top 4% | 8.2%
6 | Philosophical Transactions of the Royal Society B | 51 | Top 1% | 4.0%
7 | PLOS ONE | 4510 | Top 36% | 4.0%
8 | Nature Human Behaviour | 85 | Top 0.7% | 4.0%
9 | Scientific Reports | 3102 | Top 53% | 1.9%
10 | Scientific Data | 174 | Top 1% | 1.7%
11 | Journal of Clinical Epidemiology | 28 | Top 0.3% | 1.7%
12 | Wellcome Open Research | 57 | Top 1% | 1.5%
13 | Bioinformatics | 1061 | Top 8% | 1.5%
14 | Journal of Cell Biology | 333 | Top 3% | 1.2%
15 | BioData Mining | 15 | Top 0.5% | 1.2%
16 | FACETS | 11 | Top 0.2% | 0.9%
17 | FEBS Letters | 42 | Top 0.3% | 0.7%
18 | BMC Medicine | 163 | Top 7% | 0.7%
19 | GigaScience | 172 | Top 3% | 0.7%
20 | npj Digital Medicine | 97 | Top 4% | 0.7%
21 | eNeuro | 389 | Top 10% | 0.7%
22 | BMC Biology | 248 | Top 5% | 0.7%
23 | Annals of Internal Medicine | 27 | Top 1% | 0.6%
24 | PLOS Digital Health | 91 | Top 3% | 0.6%
25 | BMC Bioinformatics | 383 | Top 8% | 0.6%
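The 50% divider follows directly from the listed probabilities. A minimal check in Python, with the probabilities copied from the top rows of the table above:

```python
# Cumulative probability mass over the ranked journal predictions.
# Values are the top-10 probabilities from the table above, in percent.
probs = [26.0, 10.1, 8.4, 8.4, 8.2, 4.0, 4.0, 4.0, 1.9, 1.7]

cumulative = 0.0
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        break

print(f"Top {rank} journals cover {cumulative:.1f}% of the probability mass")
# -> Top 4 journals cover 52.9% of the probability mass
```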