Silent numerical failures in large language model-generated pharmacokinetic simulation code: a benchmark against target-controlled infusion validation criteria using the Marsh propofol model

Omote, M.

2026-04-28 health informatics
10.64898/2026.04.27.26351582 medRxiv

Background: Large language models (LLMs) are increasingly used by clinicians to generate executable code for pharmacokinetic (PK) simulation. Whether such code meets the accuracy standards of target-controlled infusion systems has not been systematically evaluated.

Methods: Five LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok) were prompted to generate Python code for the Marsh three-compartment propofol model under a standardized 120-minute bolus-plus-infusion regimen. Each LLM was tested in two phases: Phase 1, integrator free; Phase 2, fourth-order Runge-Kutta with 1-second step size mandated. Twenty runs per LLM per phase were collected (n = 200). Plasma concentrations were compared against a triple-validated reference using median prediction error (MDPE), median absolute prediction error (MDAPE), and Wobble. Runs were classified as Class A (MDAPE < 1 %), B (1-30 %), C (≥ 30 %), or D (failed).

Results: All 200 scripts were invokable and created a CSV file; 199/200 (99.5 %, 95 % CI 97.1-99.9 %) produced a valid time-concentration series. The remaining script (Gemini Phase 2 run 18) aborted during row formatting with ValueError and left a header-only CSV. Median MDAPE per LLM × phase ranged 0.0043-0.020 %, with 195/200 runs (97.5 %, 95 % CI 94.3-98.9 %) achieving Class A. Five runs (2.5 %, 95 % CI 1.1-5.7 %) were non-excellent or structurally defective: three were Class C due to time-scale/unit-handling errors (one DeepSeek run with a 6-second effective Euler step from a minute-as-second declaration, two Grok runs with min⁻¹ rate constants applied per second), one was Class D (the empty-CSV failure above), and one was Class B but reflected a duplicated-bolus implementation error rather than a benign numerical deviation. Kruskal-Wallis testing showed significant inter-LLM heterogeneity across all metrics and phases (all omnibus p < 0.01). Strict compliance with Phase 2 directives was 98 % (98/100 runs; 95 % CI 93.1-99.5 %); lenient compliance accepting RK4-adaptive implementations as a superset was 100 % (100/100 runs; 95 % CI 96.3-100 %). Yet all three numerically divergent Phase 2 runs occurred under nominally compliant RK4/dt = 1 s configurations; the fourth non-Class-A Phase 2 outcome was a formatting failure that produced no usable trajectory.

Conclusion: LLMs generate numerically accurate Marsh-model code in most runs but silently diverge in a clinically non-negligible minority. The Marsh model (the simplest fixed-parameter three-compartment propofol model) functioned here as a positive control: even so, three distinct classes of structural bug (unit/time-scale mismatch, duplicated-bolus event handling, malformed f-string formatting) slipped past apparent execution success. Two additional Phase 2 runs used an RK4-adaptive variant rather than classical RK4 and are therefore better interpreted as strict prompt non-compliance than as numerical failure. Prompt-level method specification substantially reduced algorithm-selection errors but did not eliminate unit or structural bugs. LLM-generated pharmacokinetic code requires reference-based validation before any safety-relevant use.
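
The reference-based check described in the abstract can be illustrated with a minimal Python sketch. It assumes the conventional definitions of the three metrics (percentage prediction error relative to the reference, MDPE as its median, MDAPE as the median of its absolute value, Wobble as the median absolute deviation of the error from MDPE); the abstract does not spell out its exact formulas, so these are assumptions. The function names `performance_metrics` and `classify_run` are hypothetical, and only the class thresholds are taken from the text above.

```python
from statistics import median


def performance_metrics(candidate, reference):
    """Return (MDPE, MDAPE, Wobble) in percent for two equal-length series.

    Assumed definitions (not stated in the abstract):
      PE_i   = 100 * (candidate_i - reference_i) / reference_i
      MDPE   = median(PE)
      MDAPE  = median(|PE|)
      Wobble = median(|PE - MDPE|)
    """
    if len(candidate) != len(reference) or len(reference) == 0:
        raise ValueError("series must be non-empty and of equal length")
    # Percentage prediction error at each time point.
    pe = [100.0 * (c - r) / r for c, r in zip(candidate, reference)]
    mdpe = median(pe)
    mdape = median(abs(e) for e in pe)
    wobble = median(abs(e - mdpe) for e in pe)
    return mdpe, mdape, wobble


def classify_run(mdape):
    """Map MDAPE (percent) to the classes used in the abstract.

    A: MDAPE < 1 %, B: 1-30 %, C: >= 30 %, D: failed run (no usable series,
    represented here by passing None).
    """
    if mdape is None:
        return "D"
    if mdape < 1.0:
        return "A"
    if mdape < 30.0:
        return "B"
    return "C"


if __name__ == "__main__":
    # Illustrative numbers only, not data from the study.
    reference = [1.0, 2.0, 4.0, 8.0]
    candidate = [1.5, 3.0, 6.0, 12.0]  # uniformly 50 % too high
    mdpe, mdape, wobble = performance_metrics(candidate, reference)
    print(mdpe, mdape, wobble, classify_run(mdape))
```

A run that completes and writes a plausible-looking CSV can still land in Class B or C under this kind of check, which is why execution success alone is not treated as validation.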

Matching journals

The top 11 journals account for 50% of the predicted probability mass.

1. Clinical and Translational Science | 21 papers in training set | Top 0.1% | 12.8%
2. PLOS ONE | 4510 papers in training set | Top 23% | 7.3%
3. JAMA Network Open | 127 papers in training set | Top 0.4% | 6.5%
4. British Journal of Clinical Pharmacology | 21 papers in training set | Top 0.1% | 4.3%
5. Clinical Pharmacology & Therapeutics | 25 papers in training set | Top 0.1% | 4.3%
6. BMJ Open | 554 papers in training set | Top 5% | 3.7%
7. npj Digital Medicine | 97 papers in training set | Top 1% | 3.7%
8. Scientific Reports | 3102 papers in training set | Top 40% | 3.1%
9. Journal of Antimicrobial Chemotherapy | 43 papers in training set | Top 0.2% | 1.9%
10. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.9%
11. Antimicrobial Agents and Chemotherapy | 167 papers in training set | Top 1% | 1.7%

50% of probability mass above

12. BMC Medicine | 163 papers in training set | Top 3% | 1.7%
13. JAMIA Open | 37 papers in training set | Top 0.8% | 1.7%
14. Pilot and Feasibility Studies | 12 papers in training set | Top 0.3% | 1.7%
15. Critical Care Explorations | 15 papers in training set | Top 0.3% | 1.5%
16. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.5% | 1.5%
17. European Respiratory Journal | 54 papers in training set | Top 1% | 1.4%
18. Journal of Infection | 71 papers in training set | Top 2% | 1.4%
19. Med | 38 papers in training set | Top 0.4% | 1.4%
20. BMC Bioinformatics | 383 papers in training set | Top 5% | 1.3%
21. Frontiers in Digital Health | 20 papers in training set | Top 0.9% | 1.3%
22. JMIR Public Health and Surveillance | 45 papers in training set | Top 2% | 1.3%
23. Frontiers in Pharmacology | 100 papers in training set | Top 3% | 1.3%
24. Annals of Internal Medicine | 27 papers in training set | Top 0.7% | 1.0%
25. eBioMedicine | 130 papers in training set | Top 3% | 0.9%
26. Nature Communications | 4913 papers in training set | Top 61% | 0.8%
27. Pharmacoepidemiology and Drug Safety | 13 papers in training set | Top 0.4% | 0.8%
28. British Journal of Anaesthesia | 14 papers in training set | Top 0.7% | 0.8%
29. Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.6% | 0.8%
30. Microbiology Spectrum | 435 papers in training set | Top 5% | 0.8%