
Don't stop the heart: a performance analysis of large language models and potassium dosing

Blotske, K.; Zhao, X.; Henry, K.; Murray, B.; Gao, Y.; Smith, S. E.; Wayne, N.; Ku, P.; Smith, B.; Moua, S.; Sikora, A.

2026-06-04 pharmacology and therapeutics
10.64898/2026.06.02.26354762 medRxiv

Background: Electrolyte replacement is ubiquitous in the acute care setting, but its familiarity should not obscure the fact that even small potassium dosing errors can cause lethal cardiac arrhythmias. Recently, MedAgentBench offered a benchmark for agentic artificial intelligence (AI) that includes the ability to correctly dose potassium based on a single rule; however, this does not adequately reflect the clinical complexity or safety concerns of a drug that has been used for lethal injection. The purpose of this analysis was to probe the capability of a leaderboard large language model (LLM) to follow basic dosing rules to safely replace potassium in a series of clinician-annotated cases.

Methods: Using a clinician panel, we developed a series of dosing principles and 20 clinical cases reflective of the complexity of potassium replacement. External clinicians were surveyed to assess practice variability and agreement with the clinician panel's answers. We tested GPT-5-Chat with each case in triplicate, with and without the clinician-curated dosing principles, and prompted the model to answer six questions involving potassium goals, dosing, route, lab frequency, concurrent interventions, and the model's perceived level of confidence in the output and complexity of the case. The primary outcome was the rate of appropriate recommendations in comparison to clinician answers.

Results: A total of 54 clinicians reviewed the 20 hypokalemia cases and the hypokalemia dosing guideline. When asked whether they agreed with the guideline-recommended management, clinicians answered "highly agree" or "somewhat agree" for 66.8% of the cases evaluated. When GPT-5-Chat was given the potassium dosing guideline, total errors dropped from 165 to 104, and average accuracy improved from 45% to 65%. GPT-5-Chat conveyed a high level of confidence for 100% of responses, while labeling 80% and 76% of cases as highly complex with and without the criteria, respectively. Potential harm scores were considerable in both groups; however, a notable reduction in severity scores occurred with the dosing guidance document. Recommendations on concurrent interventions and dosing had the highest rate of errors in both groups.

Conclusions: Benchmarks must appropriately reflect clinical complexity to be considered valuable for the deployment of agentic artificial intelligence tools in the healthcare domain. On a comprehensive medication management task for potassium replacement, GPT-5-Chat improved with dosing guidance, but its benchmark performance remained inadequate.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. JMIRx Med (31 papers in training set; Top 0.1%; 10.3%)
2. npj Digital Medicine (97 papers in training set; Top 0.7%; 7.3%)
3. BioData Mining (15 papers in training set; Top 0.1%; 7.3%)
4. Clinical Pharmacology & Therapeutics (25 papers in training set; Top 0.1%; 6.9%)
5. Journal of Biomedical Informatics (45 papers in training set; Top 0.2%; 6.9%)
6. PLOS ONE (4510 papers in training set; Top 26%; 6.5%)
7. Clinical and Translational Science (21 papers in training set; Top 0.2%; 3.9%)
8. F1000Research (79 papers in training set; Top 0.5%; 3.6%)

50% of probability mass above

9. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.8%; 3.6%)
10. Scientific Reports (3102 papers in training set; Top 47%; 2.4%)
11. Computational and Structural Biotechnology Journal (216 papers in training set; Top 3%; 2.1%)
12. Epilepsy Research (12 papers in training set; Top 0.2%; 2.1%)
13. JMIR Medical Informatics (17 papers in training set; Top 0.6%; 1.9%)
14. Pilot and Feasibility Studies (12 papers in training set; Top 0.2%; 1.8%)
15. Frontiers in Pharmacology (100 papers in training set; Top 2%; 1.7%)
16. BMC Medical Informatics and Decision Making (39 papers in training set; Top 2%; 1.5%)
17. Biology Methods and Protocols (53 papers in training set; Top 1%; 1.4%)
18. Pharmacoepidemiology and Drug Safety (13 papers in training set; Top 0.3%; 1.4%)
19. PLOS Computational Biology (1633 papers in training set; Top 18%; 1.4%)
20. npj Genomic Medicine (33 papers in training set; Top 0.5%; 1.2%)
21. British Journal of Clinical Pharmacology (21 papers in training set; Top 0.4%; 1.2%)
22. Frontiers in Psychiatry (83 papers in training set; Top 3%; 0.8%)
23. npj Systems Biology and Applications (99 papers in training set; Top 2%; 0.8%)
24. Frontiers in Digital Health (20 papers in training set; Top 1%; 0.8%)
25. Heliyon (146 papers in training set; Top 5%; 0.8%)
26. IEEE Journal of Biomedical and Health Informatics (34 papers in training set; Top 2%; 0.8%)
27. Healthcare (16 papers in training set; Top 2%; 0.8%)
28. Circulation (66 papers in training set; Top 2%; 0.8%)
29. BMC Medical Research Methodology (43 papers in training set; Top 1%; 0.8%)
30. International Journal of Medical Informatics (25 papers in training set; Top 2%; 0.8%)
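
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The following is a minimal sketch, not the site's own method: the function name `journals_to_reach` and the choice to use only the first ten listed probabilities are assumptions made for illustration.

```python
# Predicted probabilities (percent) for the ten top-ranked journals,
# copied from the ranking above. Later entries are not needed for the check.
probabilities = [10.3, 7.3, 7.3, 6.9, 6.9, 6.5, 3.9, 3.6, 3.6, 2.4]


def journals_to_reach(probs, threshold=50.0):
    """Return (count, cumulative) for the smallest number of top-ranked
    entries whose summed probability reaches the threshold.

    Hypothetical helper; returns (None, total) if the threshold is never met.
    """
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, round(total, 1)
    return None, round(total, 1)


count, cumulative = journals_to_reach(probabilities)
print(count, cumulative)  # prints: 8 52.7
```

With the listed values, the first seven journals sum to 49.1% and the eighth brings the total to 52.7%, which agrees with the position of the 50% marker in the ranking.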