Is One Run Enough? Reproducibility of Flagship Large Language Models Across Temperature and Reasoning Settings in Biomedical Text Processing
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for biomedical trial-success classification across temperature and reasoning/thinking settings, and to assess whether single-run reporting is sufficient.
Methods: We used 250 randomized controlled oncology trial abstracts labeled POSITIVE/NEGATIVE based on primary endpoint success. With a fixed prompt requiring exactly "POSITIVE" or "NEGATIVE", we evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0–2.0, and GPT-5.2 across reasoning-effort levels (none to xhigh), with an additional temperature sweep when reasoning was disabled. Each setting was run three times. Reproducibility was quantified with Fleiss κ across replicates, performance was summarized with F1 (per run and majority vote), and invalid-format outputs were recorded.
Results: Gemini showed near-perfect agreement across settings (κ = 0.942–1.000), including perfect agreement at temperature 0. Invalid outputs were uncommon (0–1.5%). GPT-5.2 reproducibility was similarly high (κ = 0.984–0.995), with no invalid outputs. Performance remained stable (mean/majority-vote F1 = 0.955–0.971), and majority voting offered only marginal gains.
Conclusion: For strict binary biomedical classification with tightly constrained outputs, both models were highly reproducible across common decoding and reasoning configurations, indicating that one run is often adequate while minimal replication provides a practical stability check.
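The reproducibility and majority-vote analyses described in the methods can be illustrated with a short sketch. This is not the authors' code; it assumes each abstract's three replicate labels are stored as a list of strings, and implements the standard Fleiss κ formula for n raters (here, the three runs) plus a simple per-item majority vote.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each labeled by the same number of runs.

    ratings: list of per-item label lists,
             e.g. [["POSITIVE", "POSITIVE", "NEGATIVE"], ...]
    """
    n = len(ratings[0])  # replicate runs per item (raters)
    categories = sorted({label for item in ratings for label in item})
    # Observed agreement: mean over items of P_i = (sum_j n_ij^2 - n) / (n(n-1))
    p_bar = sum(
        (sum(c * c for c in Counter(item).values()) - n) / (n * (n - 1))
        for item in ratings
    ) / len(ratings)
    # Chance agreement from overall category proportions
    total = len(ratings) * n
    props = [sum(item.count(cat) for item in ratings) / total for cat in categories]
    p_e = sum(p * p for p in props)
    return (p_bar - p_e) / (1 - p_e)

def majority_vote(ratings):
    """Per-item majority label across replicate runs."""
    return [Counter(item).most_common(1)[0][0] for item in ratings]
```

With perfect agreement across runs, `fleiss_kappa` returns 1.0, matching the temperature-0 Gemini result reported in the abstract; majority-vote labels could then be scored against the gold labels with any standard F1 implementation.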