Operationalizing Eight-Dimensional Patient-Safety Risk Scoring at Scale: A Multi-Model Large Language Model Reliability Study

Lin, H.-M.; Lyu, J.; Wang, I.-L.

2026-06-01 health informatics

10.64898/2026.05.29.26354437 medRxiv


Background: Hospital incident risk scoring has long relied on two- or three-dimensional frameworks (Severity Assessment Codes or Risk Priority Numbers), even though root cause analysis standards recognize that clinical risk is multi-factorial. The obstacle has been mainly cognitive: human reviewers cannot reliably score many dimensions across high incident volumes, so richer assessment has not been operationalized at scale.

Objective: To extend the traditional three-dimensional FMEA to an eight-dimensional patient-safety risk feature framework, to establish a multi-model large language model (LLM) extraction pipeline that scores these dimensions automatically, and to demonstrate a variance-aware integer optimization (mean-variance integer programming, MV-IP) that provides a reproducible tie-breaking rule for incident prioritization under extraction uncertainty, rather than improved risk coverage.

Methods: An 8-dimensional framework covering harm severity, potential harm, frequency, detectability, systemic impact, vulnerable populations, regulatory relevance, and economic impact was applied to 213 synthetic and 196 real curated incident narratives. Three independent LLMs (GPT-5.4, Gemini 3.1 Pro, Grok-4.1 Fast) from different provider families extracted structured risk scores. Inter-model consistency was assessed via ICC(A,1). Among coverage-equivalent selections, MV-IP minimized inter-model variance to give a reproducible prioritization rule. An English-language sensitivity analysis was conducted on 31 AHRQ PSNet WebM&M cases.

Results: On real cases, seven of eight dimensions reached Fair or better inter-model reliability (ICC(A,1) 0.53 to 0.83); D5 (Systemic Impact) was the exception at Poor reliability (0.275), driven by little between-case variation rather than by wide model disagreement. Reliability was not uniform: two dimensions were Excellent (D1 actual harm 0.834, D8 economic impact 0.782), two Good, and three only Fair, so some dimensions are more readily extractable than others. The same anchors gave broadly similar results on English-language narratives. When deterministic top-K selection returned several equal-coverage solutions (11 on real cases, total inter-model variance 0.205 to 1.274), MV-IP selected the minimum-disagreement set, replacing ad hoc tie-breaking with an explicit rule without improving coverage. Bootstrap resampling found 74% to 90% of per-case variance estimates stable despite the three-model panel.

Conclusions: The eight-dimensional framework operationalizes patient-safety risk features that quality teams have considered only implicitly, and three independent LLM families produced reproducible scores on most dimensions of curated narratives. Inter-model agreement, however, measures reproducibility rather than clinical correctness, and high agreement does not by itself establish that a score is right; the dimensions that are reliably extractable today (notably D6 and D8) differ from those that are not yet (D5, and to a lesser degree D4 and D7), which has direct implications for incident-reporting form design. MV-IP contributes a reproducible, variance-aware tie-breaking rule rather than improved coverage. Validation against expert-prioritized RCA lists and deployment on raw institutional incident reports remain the next steps toward clinical use.
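
The two quantitative ideas in the abstract can be illustrated with a small sketch. It is not the authors' pipeline. The first function computes ICC(A,1) in its standard two-way, absolute-agreement, single-rater form from a cases-by-models score table. The second is a brute-force stand-in for the MV-IP tie-break: it enumerates K-subsets rather than solving an integer program, and it assumes "coverage" means the sum of per-case mean scores, which the abstract does not define. The names icc_a1 and min_variance_top_k and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance


def icc_a1(scores):
    """ICC(A,1): two-way, absolute agreement, single rater.

    scores: list of rows, one row per case, one column per model.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-case mean square
    msc = ss_cols / (k - 1)               # between-model mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    denom = msr + (k - 1) * mse + k * (msc - mse) / n
    if denom == 0:
        raise ValueError("ICC undefined: scores have no variation")
    return (msr - mse) / denom


def min_variance_top_k(case_scores, k):
    """Pick k cases with maximal coverage; break ties by lowest total variance.

    case_scores: dict mapping case id -> list of per-model scores.
    Coverage is ASSUMED here to be the sum of per-case mean scores.
    Brute-force enumeration, so only suitable for small inputs.
    """
    means = {c: mean(s) for c, s in case_scores.items()}
    variances = {c: variance(s) for c, s in case_scores.items()}
    candidates = list(combinations(sorted(case_scores), k))
    best = max(sum(means[c] for c in sel) for sel in candidates)
    tied = [sel for sel in candidates
            if abs(sum(means[c] for c in sel) - best) < 1e-9]
    # Among equal-coverage selections, minimize inter-model disagreement;
    # the selection tuple itself is a final deterministic tie-break.
    return min(tied, key=lambda sel: (sum(variances[c] for c in sel), sel))


if __name__ == "__main__":
    # Two models that differ by a constant offset: consistent but not in
    # absolute agreement, so ICC(A,1) falls below 1.
    print(icc_a1([[1, 2], [2, 3], [3, 4]]))
    # Cases B and C have the same mean score, but the models disagree on C.
    toy = {"A": [5, 5, 5], "B": [4, 4, 4], "C": [3, 4, 5]}
    print(min_variance_top_k(toy, 2))
```

In the toy data, selections (A, B) and (A, C) have the same coverage under the assumed definition, and the rule prefers (A, B) because the models agree on B. This mirrors the abstract's point that the tie-break changes which equal-coverage set is reported, not how much risk is covered.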
