The Independence of Discrimination and Calibration in Clinical Risk Prediction: Lessons from a Multi-Timeframe Diabetes Prediction Framework
OReilly, E.; Kurakovas, T.
Background: Clinical risk prediction models are typically evaluated by discrimination (area under the receiver operating characteristic curve, AUC), with calibration receiving less attention. We developed a multi-timeframe diabetes prediction framework emphasizing calibration and used synthetic data validation to investigate whether good discrimination guarantees good calibration.
Methods: We generated 500,000 synthetic patients using published epidemiological parameters from QDiabetes-2018, FINDRISC, and the Diabetes Prevention Program. The framework comprises a discrete-time survival ensemble with isotonic calibration, producing predictions at 6, 12, 24, and 36 months with bootstrap confidence intervals. We evaluated discrimination (AUC), bin-level calibration (expected calibration error, ECE), calibration-in-the-large (observed-to-expected ratio, O:E), and clinical utility (decision curve analysis). We compared performance against QDiabetes-2018 implemented on the same synthetic cohort.
Results: Despite achieving excellent discrimination (AUC = 0.844, 95% CI: 0.840-0.848) and low bin-level calibration error (ECE = 0.006), the framework systematically overpredicted risk by 50%: mean predicted probability was 8.4% versus an observed rate of 5.6% (O:E = 0.66, 95% CI: 0.65-0.67). This miscalibration occurred despite isotonic regression on a held-out calibration set. Overprediction was present in 9 of 10 risk deciles. Risk stratification remained valid (23.5-fold separation, 95% CI: 22.8-24.3, between the highest and lowest tiers), confirming that discrimination was preserved. QDiabetes-2018 achieved comparable discrimination (AUC = 0.831) with better calibration (O:E = 0.89). Decision curve analysis showed net benefit across the 5-30% threshold range, though recalibration would improve clinical utility.
Conclusions: Good discrimination does not guarantee good calibration.
Our primary finding is negative: isotonic calibration failed to produce well-calibrated predictions even on synthetic data from a single generator. This has important implications for model deployment, where distribution shift is inevitable. We recommend that prediction model studies report calibration-in-the-large alongside bin-level metrics, as ECE alone can be misleading when risk distributions are skewed. Recalibration on deployment populations will likely be necessary for any prediction model, regardless of development-phase calibration performance.

Key Messages

What is already known
- Clinical prediction models require both discrimination (ranking patients correctly) and calibration (accurate probability estimates)
- Isotonic regression is a recommended approach for post-hoc calibration
- Expected calibration error (ECE) is commonly reported as a summary calibration metric

What this study adds
- Demonstrates empirically that excellent discrimination (AUC = 0.844) can coexist with substantial miscalibration (50% overprediction)
- Shows that low ECE can be misleading when most patients fall in low-risk deciles
- Provides evidence that isotonic calibration on held-out data may not generalize even within synthetic data from one generator
- Demonstrates a discrete-time survival architecture that reduces monotonicity violations to <0.1%

How this study might affect research, practice, or policy
- Prediction model studies should report calibration-in-the-large (O:E ratio) alongside ECE
- Developers should expect recalibration to be necessary when deploying to new populations
- Claims of calibrated prediction should be viewed skeptically without comprehensive calibration assessment
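
The distinction between discrimination, bin-level calibration, and calibration-in-the-large can be illustrated with a small simulation. The sketch below is not the study's pipeline: the Beta(0.5, 8) risk distribution, the uniform 1.5x inflation of predictions, the sample size, the equal-width binning, and the helper names `auc`, `ece`, and `oe_ratio` are all assumptions chosen for illustration. It may help show why a monotone inflation of risk leaves ranking untouched while shifting the O:E ratio, and why ECE, which is measured in absolute probability units, stays numerically small on a skewed, low-prevalence cohort.

```python
import numpy as np

def auc(scores, y):
    """Rank-based AUC (Mann-Whitney U); assumes ties are negligible."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = float(y.sum())
    n_neg = float(len(y)) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ece(pred, y, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    bins = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Bin weight times |mean predicted - observed rate| in that bin.
            total += mask.mean() * abs(pred[mask].mean() - y[mask].mean())
    return total

def oe_ratio(pred, y):
    """Calibration-in-the-large: observed event rate / mean predicted risk."""
    return y.mean() / pred.mean()

# Hypothetical cohort: skewed true risks with a mean near 6% (an assumption).
rng = np.random.default_rng(0)
n = 200_000
p_true = rng.beta(0.5, 8.0, size=n)
y = (rng.random(n) < p_true).astype(int)

# Hypothetical miscalibrated model: every risk inflated by 50%, capped at 1.
pred = np.minimum(1.5 * p_true, 1.0)

auc_true = auc(p_true, y)
auc_pred = auc(pred, y)
ece_pred = ece(pred, y)
oe_pred = oe_ratio(pred, y)
overprediction = pred.mean() / y.mean() - 1.0

print(f"AUC (true risks):      {auc_true:.3f}")
print(f"AUC (inflated risks):  {auc_pred:.3f}")
print(f"ECE (inflated risks):  {ece_pred:.3f}")
print(f"O:E (inflated risks):  {oe_pred:.3f}")
print(f"Relative overprediction: {overprediction:.0%}")
```

Under these assumptions the two AUC values should be essentially identical, because the inflation preserves the ordering of patients, while the O:E ratio should sit near 2/3. The ECE remains a few hundredths in absolute terms even though every prediction is 50% too high in relative terms, which is the kind of mismatch that reporting O:E alongside ECE is meant to expose.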