Closed-Loop Quality Assurance for Production Clinical AI Documentation

Napier, A.; Wiley, J.; Heslin, M.

2026-05-29 health informatics
10.64898/2026.05.27.26353977 medRxiv

A closed-loop quality system deployed across thirteen US hospital sites resolved physician complaints with zero regressions on 42 tracked cases across 1,089 optimization iterations, while a deterministic assembly-agent replacement cut H+P trace latency from 19.6 s to 10.8 s (-8.8 s, 95% CI [-10.5, -7.1] s; n = 100 pre, n = 100 post). We report four observations and an architectural follow-through. First, the same binary-check instrument produces opposite outcomes depending on the question asked: "maximize this score" produces structurally correct notes that physicians reject (Spearman rho = -0.077, 95% CI [-0.40, 0.26], n = 36); "did this specific fabrication stop?" produces rater-invariant deployment decisions. Second, in our pipeline, assembly-stage agents did not respond to prompt optimization the way reasoning agents did: four consecutive optimization attempts produced 18-28 point regressions. Third, physician preference is rater-fragile at typical clinical-AI calibration sample sizes (Cohen's kappa = 0.028 between two board-certified physicians, 95% CI [-0.30, 0.36] on n = 35 overlapping pairs). Fourth, the architectural punchline: six weeks after the prediction, the LLM call at the chart-assembly step was replaced with a deterministic renderer (sub-500-character template plus sandboxed scripting), lifting the defect-free rate on a 51-case holdout from 49% to 84%. We introduce a Pareto-with-absolute-floors acceptance rule (multi-axis commit with severity-class categorical vetoes) as a methodological contribution distinct from scalar-reward acceptance in standard prompt-optimization frameworks. Cross-iteration rejection memory prevents the loop from re-proposing edits already rejected three or more times. A reproducibility bundle (anonymized ablation per-case counts, bootstrap-CI data, analysis scripts) is released under CC BY 4.0 at github.com/sayvant/SQS-Auditor-paper-data.
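The acceptance rule and rejection memory named in the abstract can be illustrated with a minimal sketch. The abstract does not give the axes, floor values, veto classes, or how an edit is identified, so everything below is an assumption made for illustration: the names `Evaluation`, `accept`, and `RejectionMemory` are hypothetical, scores are assumed to be higher-is-better, and a candidate is assumed to need a strict improvement on at least one axis. Only the threshold of three rejections comes from the abstract.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Hypothetical per-candidate result: axis scores plus observed defect classes."""
    scores: dict[str, float]                      # axis name -> score (assumed higher is better)
    defect_classes: frozenset[str] = frozenset()  # severity classes seen in the outputs


def accept(baseline: Evaluation, candidate: Evaluation,
           floors: dict[str, float], veto_classes: frozenset[str]) -> bool:
    """Illustrative Pareto-with-absolute-floors check (not the authors' implementation)."""
    # Categorical veto: any defect in a vetoed severity class rejects outright.
    if candidate.defect_classes & veto_classes:
        return False
    # Absolute floors: every floored axis must meet its minimum, regardless of baseline.
    if any(candidate.scores[axis] < floor for axis, floor in floors.items()):
        return False
    # Pareto condition: no axis worse than baseline, at least one strictly better.
    axes = baseline.scores.keys()
    no_worse = all(candidate.scores[a] >= baseline.scores[a] for a in axes)
    better = any(candidate.scores[a] > baseline.scores[a] for a in axes)
    return no_worse and better


class RejectionMemory:
    """Counts rejections per edit key and blocks keys rejected `limit` or more times."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self.counts: Counter[str] = Counter()

    def record_rejection(self, edit_key: str) -> None:
        self.counts[edit_key] += 1

    def is_blocked(self, edit_key: str) -> bool:
        return self.counts[edit_key] >= self.limit
```

Unlike a scalar reward, this rule cannot trade a large gain on one axis for a small loss on another: a candidate that raises one score while lowering a second is rejected, as is one that improves everything but introduces a vetoed defect class.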

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Communications; 4913 papers in training set; Top 4%; 22.1%
2. npj Digital Medicine; 97 papers in training set; Top 0.3%; 18.3%
3. Journal of the American Medical Informatics Association; 61 papers in training set; Top 0.7%; 4.1%
4. eLife; 5422 papers in training set; Top 27%; 3.5%
5. Nature Medicine; 117 papers in training set; Top 0.9%; 3.5%

50% of probability mass above

6. Scientific Reports; 3102 papers in training set; Top 41%; 3.0%
7. PLOS ONE; 4510 papers in training set; Top 44%; 2.7%
8. Nature Biomedical Engineering; 42 papers in training set; Top 0.6%; 2.0%
9. PLOS Digital Health; 91 papers in training set; Top 1%; 2.0%
10. JCO Clinical Cancer Informatics; 18 papers in training set; Top 0.4%; 1.9%
11. Nature Methods; 336 papers in training set; Top 4%; 1.8%
12. Med; 38 papers in training set; Top 0.3%; 1.7%
13. iScience; 1063 papers in training set; Top 16%; 1.7%
14. Nature Computational Science; 50 papers in training set; Top 0.7%; 1.7%
15. Cell Systems; 167 papers in training set; Top 8%; 1.6%
16. Nature Machine Intelligence; 61 papers in training set; Top 2%; 1.6%
17. Philosophical Transactions of the Royal Society B; 51 papers in training set; Top 4%; 1.5%
18. Science; 429 papers in training set; Top 17%; 1.2%
19. Nature; 575 papers in training set; Top 13%; 1.2%
20. Bioinformatics; 1061 papers in training set; Top 8%; 1.2%
21. Nature Human Behaviour; 85 papers in training set; Top 3%; 1.2%
22. Patterns; 70 papers in training set; Top 2%; 1.1%
23. The Lancet Digital Health; 25 papers in training set; Top 0.8%; 0.9%
24. Science Advances; 1098 papers in training set; Top 27%; 0.9%
25. Annals of Internal Medicine; 27 papers in training set; Top 0.8%; 0.9%
26. Science Translational Medicine; 111 papers in training set; Top 6%; 0.8%
27. Nature Biotechnology; 147 papers in training set; Top 9%; 0.6%
28. Nature Genetics; 240 papers in training set; Top 9%; 0.6%