Silent numerical failures in large language model-generated pharmacokinetic simulation code: a benchmark against target-controlled infusion validation criteria using the Marsh propofol model

Omote, M.

2026-04-28 health informatics

10.64898/2026.04.27.26351582 medRxiv

Show abstract

BackgroundLarge language models (LLMs) are increasingly used by clinicians to generate executable code for pharmacokinetic (PK) simulation. Whether such code meets the accuracy standards of target-controlled infusion systems has not been systematically evaluated. MethodsFive LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok) were prompted to generate Python code for the Marsh three-compartment propofol model under a standardized 120-minute bolus-plus-infusion regimen. Each LLM was tested in two phases: Phase 1, integrator free; Phase 2, fourth-order Runge-Kutta with 1-second step size mandated. Twenty runs per LLM per phase were collected (n = 200). Plasma concentrations were compared against a triple-validated reference using median prediction error (MDPE), median absolute prediction error (MDAPE), and Wobble. Runs were classified as Class A (MDAPE < 1 %), B (1-30 %), C ([≥] 30 %), or D (failed). ResultsAll 200 scripts were invokable and created a CSV file; 199/200 (99.5 %, 95 % CI 97.1-99.9 %) produced a valid time-concentration series. The remaining script (Gemini Phase 2 run 18) aborted during row formatting with ValueError and left a header-only CSV. Median MDAPE per LLM x phase ranged 0.0043-0.020 %, with 195/200 runs (97.5 %, 95 % CI 94.3-98.9 %) achieving Class A. Five runs (2.5 %, 95 % CI 1.1-5.7 %) were non-excellent or structurally defective: three were Class C due to time-scale/unit-handling errors (one DeepSeek run with a 6-second effective Euler step from a minute-as-second declaration, two Grok runs with min-{superscript 1} rate constants applied per second), one was Class D (the empty-CSV failure above), and one was Class B but reflected a duplicated-bolus implementation error rather than a benign numerical deviation. Kruskal-Wallis testing showed significant inter-LLM heterogeneity across all metrics and phases (all omnibus p < 0.01). Strict compliance with Phase 2 directives was 98 % (98/100 runs; 95 % CI 93.1-99.5 %); lenient compliance accepting RK4-adaptive implementations as a superset was 100 % (100/100 runs; 95 % CI 96.3-100 %). Yet all three numerically divergent Phase 2 runs occurred under nominally compliant RK4/dt = 1 s configurations; the fourth non-Class-A Phase 2 outcome was a formatting failure that produced no usable trajectory. ConclusionLLMs generate numerically accurate Marsh-model code in most runs but silently diverge in a clinically non-negligible minority. The Marsh model -- the simplest fixed-parameter three-compartment propofol model -- functioned here as a positive control: even so, three distinct classes of structural bug (unit/time-scale mismatch, duplicated-bolus event handling, malformed f-string formatting) slipped past apparent execution success. Two additional Phase 2 runs used an RK4-adaptive variant rather than classical RK4 and are therefore better interpreted as strict prompt non-compliance than as numerical failure. Prompt-level method specification substantially reduced algorithm-selection errors but did not eliminate unit or structural bugs. LLM-generated pharmacokinetic code requires reference-based validation before any safety-relevant use.

Silent numerical failures in large language model-generated pharmacokinetic simulation code: a benchmark against target-controlled infusion validation criteria using the Marsh propofol model

Matching journals