Is One Run Enough? Reproducibility of Flagship Large Language Models Across Temperature and Reasoning Settings in Biomedical Text Processing

Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.

2026-02-03 health informatics
10.64898/2026.02.02.26345352 medRxiv
Purpose: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for biomedical trial-success classification across temperature and reasoning/thinking settings, and to assess whether single-run reporting is sufficient.

Methods: We utilized 250 randomized controlled oncology trial abstracts labeled POSITIVE/NEGATIVE based on primary endpoint success. With a fixed prompt requiring exactly "POSITIVE" or "NEGATIVE", we evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0 to 2.0, and GPT-5.2 across reasoning-effort levels (none to xhigh) with an additional temperature sweep when reasoning was disabled. Each setting was run three times. Reproducibility was quantified with Fleiss' κ across replicates, performance was summarized with F1 (per run and majority vote), and invalid-format outputs were recorded.

Results: Gemini showed near-perfect agreement across settings (κ = 0.942 to 1.000), including perfect agreement at temperature 0. Invalid outputs were uncommon (0 to 1.5%). GPT-5.2 reproducibility was similarly high (κ = 0.984 to 0.995) with no invalid outputs. Performance remained stable (mean/majority-vote F1 = 0.955 to 0.971), and majority voting offered only marginal gains.

Conclusion: For strict binary biomedical classification with tightly constrained outputs, both models were highly reproducible across common decoding and reasoning configurations, indicating that one run is often adequate while minimal replication provides a practical stability check.
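The metrics named in the abstract (Fleiss' kappa across replicates, majority voting, and F1) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy data, the choice of POSITIVE as the F1 positive class, and the handling of the degenerate single-category case are all assumptions made for illustration. Any label outside POSITIVE/NEGATIVE (for example an invalid-format output) is simply treated as its own category here, which may differ from how the study handled such outputs.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of replicate labels.

    Assumes every item has the same number of replicates (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Categories are derived from the data (assumption for this sketch).
    categories = sorted({label for item in ratings for label in item})
    counts = [[Counter(item)[c] for c in categories] for item in ratings]

    # Per-item agreement: fraction of replicate pairs that agree.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(categories))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_e >= 1.0:
        # Only one category was ever used; kappa is undefined.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)

def majority_vote(item):
    """Most frequent label among replicates (ties go to the first seen)."""
    return Counter(item).most_common(1)[0][0]

def f1_score(y_true, y_pred, positive="POSITIVE"):
    """F1 for the positive class; returns 0.0 when there are no positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data (hypothetical): four abstracts, three replicate runs each.
runs = [
    ["POSITIVE", "POSITIVE", "POSITIVE"],
    ["NEGATIVE", "NEGATIVE", "NEGATIVE"],
    ["POSITIVE", "POSITIVE", "NEGATIVE"],
    ["NEGATIVE", "NEGATIVE", "NEGATIVE"],
]
truth = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

kappa = fleiss_kappa(runs)
voted = [majority_vote(item) for item in runs]
print(f"kappa={kappa:.3f}, majority-vote F1={f1_score(truth, voted):.3f}")
```

With three replicates and two valid labels a majority always exists, so the tie rule only matters when invalid outputs appear.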

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. JCO Clinical Cancer Informatics; 18 papers in training set; Top 0.1%; 22.3%
2. Artificial Intelligence in Medicine; 15 papers in training set; Top 0.1%; 6.7%
3. BMC Bioinformatics; 383 papers in training set; Top 1%; 6.7%
4. Journal of Medical Internet Research; 85 papers in training set; Top 0.8%; 6.2%
5. Journal of the American Medical Informatics Association; 61 papers in training set; Top 0.5%; 6.2%
6. Scientific Reports; 3102 papers in training set; Top 29%; 4.1%
50% of probability mass above
7. Bioinformatics; 1061 papers in training set; Top 6%; 2.7%
8. International Journal of Medical Informatics; 25 papers in training set; Top 0.6%; 2.3%
9. BMC Medical Informatics and Decision Making; 39 papers in training set; Top 1%; 2.1%
10. PLOS ONE; 4510 papers in training set; Top 51%; 1.9%
11. Journal of Biomedical Informatics; 45 papers in training set; Top 0.7%; 1.9%
12. Biology Methods and Protocols; 53 papers in training set; Top 1.0%; 1.7%
13. Nature Communications; 4913 papers in training set; Top 52%; 1.7%
14. The Lancet Digital Health; 25 papers in training set; Top 0.4%; 1.7%
15. Frontiers in Digital Health; 20 papers in training set; Top 0.7%; 1.7%
16. JMIR Medical Informatics; 17 papers in training set; Top 0.8%; 1.7%
17. eBioMedicine; 130 papers in training set; Top 1%; 1.7%
18. npj Digital Medicine; 97 papers in training set; Top 2%; 1.7%
19. Cancer Medicine; 24 papers in training set; Top 0.8%; 1.6%
20. JAMIA Open; 37 papers in training set; Top 0.9%; 1.5%
21. Frontiers in Artificial Intelligence; 18 papers in training set; Top 0.4%; 1.3%
22. Informatics in Medicine Unlocked; 21 papers in training set; Top 0.7%; 1.2%
23. BMC Medical Research Methodology; 43 papers in training set; Top 0.9%; 1.1%
24. Computer Methods and Programs in Biomedicine; 27 papers in training set; Top 0.7%; 0.9%
25. Journal of Clinical Epidemiology; 28 papers in training set; Top 0.5%; 0.9%
26. iScience; 1063 papers in training set; Top 25%; 0.9%
27. Computers in Biology and Medicine; 120 papers in training set; Top 4%; 0.9%
28. JAMA Network Open; 127 papers in training set; Top 4%; 0.9%
29. PLOS Computational Biology; 1633 papers in training set; Top 25%; 0.7%
30. BMJ Health & Care Informatics; 13 papers in training set; Top 0.9%; 0.7%