Is One Run Enough? Reproducibility of Flagship Large Language Models Across Temperature and Reasoning Settings in Biomedical Text Processing

Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.

2026-02-03 health informatics
10.64898/2026.02.02.26345352
Abstract

Purpose: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for biomedical trial-success classification across temperature and reasoning/thinking settings, and to assess whether single-run reporting is sufficient.

Methods: We used 250 randomized controlled oncology trial abstracts labeled POSITIVE/NEGATIVE according to primary endpoint success. With a fixed prompt requiring exactly "POSITIVE" or "NEGATIVE", we evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0–2.0, and GPT-5.2 across reasoning-effort levels (none to xhigh), with an additional temperature sweep when reasoning was disabled. Each setting was run three times. Reproducibility was quantified with Fleiss' κ across replicates, performance was summarized with F1 (per run and by majority vote), and invalid-format outputs were recorded.

Results: Gemini showed near-perfect agreement across settings (κ = 0.942–1.000), including perfect agreement at temperature 0. Invalid outputs were uncommon (0–1.5%). GPT-5.2 reproducibility was similarly high (κ = 0.984–0.995) with no invalid outputs. Performance remained stable (mean and majority-vote F1 = 0.955–0.971), and majority voting offered only marginal gains.

Conclusion: For strict binary biomedical classification with tightly constrained outputs, both models were highly reproducible across common decoding and reasoning configurations, indicating that a single run is often adequate, while minimal replication provides a practical stability check.
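The two reproducibility statistics named in the Methods, Fleiss' κ across replicate runs and a per-item majority vote, can be sketched in a few lines. This is an illustrative implementation of the standard Fleiss' κ formula, not the authors' code; the toy labels below are invented for demonstration only.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for N subjects, each rated by the same number of
    raters (here: replicate model runs per abstract).

    `ratings` is a list of per-subject label lists."""
    n = len(ratings[0])  # raters (replicate runs) per subject
    N = len(ratings)     # subjects (abstracts)
    categories = sorted({label for row in ratings for label in row})
    per_subject = []     # observed agreement P_i for each subject
    cat_totals = Counter()
    for row in ratings:
        counts = Counter(row)
        cat_totals.update(counts)
        per_subject.append(
            (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
        )
    p_bar = sum(per_subject) / N                       # mean observed agreement
    p_j = [cat_totals[c] / (N * n) for c in categories]
    p_e = sum(p * p for p in p_j)                      # chance agreement
    return (p_bar - p_e) / (1 - p_e)

def majority_vote(ratings):
    """Most frequent label per subject across replicate runs."""
    return [Counter(row).most_common(1)[0][0] for row in ratings]

# Toy data: 3 replicate runs over 4 abstracts (labels are illustrative).
runs = [
    ["POSITIVE", "POSITIVE", "POSITIVE"],
    ["NEGATIVE", "NEGATIVE", "NEGATIVE"],
    ["POSITIVE", "POSITIVE", "NEGATIVE"],
    ["NEGATIVE", "NEGATIVE", "NEGATIVE"],
]
print(round(fleiss_kappa(runs), 3))  # → 0.657
print(majority_vote(runs))           # → ['POSITIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE']
```

With three runs and a binary label set, the majority vote is always well defined (no ties), which is one practical reason triplicate runs are a convenient minimal replication scheme.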

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association: 15.7% (top 0.5%; based on 53 papers)
2. JCO Clinical Cancer Informatics: 13.3% (top 0.1%; based on 14 papers)
3. BMC Medical Informatics and Decision Making: 6.5% (top 2%; based on 36 papers)
4. Journal of Medical Internet Research: 6.5% (top 2%; based on 81 papers)
5. npj Digital Medicine: 5.9% (top 3%; based on 85 papers)
6. JAMIA Open: 4.8% (top 2%; based on 35 papers)

50% of probability mass above

7. Journal of Biomedical Informatics: 2.9% (top 2%; based on 37 papers)
8. Scientific Reports: 2.9% (top 54%; based on 701 papers)
9. JAMA Network Open: 2.9% (top 6%; based on 125 papers)
10. BMC Medical Research Methodology: 2.5% (top 1%; based on 41 papers)
11. Nature Communications: 2.4% (top 26%; based on 483 papers)
12. JMIR Medical Informatics: 2.3% (top 3%; based on 16 papers)
13. The Lancet Digital Health: 1.8% (top 2%; based on 25 papers)
14. International Journal of Medical Informatics: 1.6% (top 4%; based on 25 papers)
15. PLOS ONE: 1.6% (top 89%; based on 1737 papers)
16. Nature Medicine: 1.6% (top 8%; based on 88 papers)
17. BMJ Health & Care Informatics: 1.3% (top 2%; based on 13 papers)
18. Journal of Clinical Epidemiology: 1.2% (top 2%; based on 29 papers)
19. PLOS Digital Health: 1.2% (top 10%; based on 88 papers)
20. Computers in Biology and Medicine: 0.8% (top 6%; based on 39 papers)
21. eBioMedicine: 0.8% (top 6%; based on 82 papers)
22. Scientific Data: 0.8% (top 3%; based on 30 papers)
23. Patterns: 0.7% (top 3%; based on 15 papers)
24. Cancers: 0.7% (top 7%; based on 57 papers)
25. Frontiers in Digital Health: 0.7% (top 5%; based on 18 papers)
26. Informatics in Medicine Unlocked: 0.7% (top 3%; based on 11 papers)
27. Annals of Internal Medicine: 0.7% (top 2%; based on 27 papers)
28. Research Synthesis Methods: 0.7% (top 1%; based on 17 papers)