
The Independence of Discrimination and Calibration in Clinical Risk Prediction: Lessons from a Multi-Timeframe Diabetes Prediction Framework

O'Reilly, E.; Kurakovas, T.

2026-02-14 | health informatics
medRxiv | DOI: 10.64898/2026.02.12.26346147

Background: Clinical risk prediction models are typically evaluated by discrimination (area under the receiver operating characteristic curve, AUC), with calibration receiving less attention. We developed a multi-timeframe diabetes prediction framework emphasizing calibration and used synthetic data validation to investigate whether good discrimination guarantees good calibration.

Methods: We generated 500,000 synthetic patients using published epidemiological parameters from QDiabetes-2018, FINDRISC, and the Diabetes Prevention Program. The framework comprises a discrete-time survival ensemble with isotonic calibration, producing predictions at 6, 12, 24, and 36 months with bootstrap confidence intervals. We evaluated discrimination (AUC), bin-level calibration (expected calibration error, ECE), calibration-in-the-large (observed-to-expected ratio), and clinical utility (decision curve analysis). We compared performance against QDiabetes-2018 implemented on the same synthetic cohort.

Results: Despite achieving excellent discrimination (AUC = 0.844, 95% CI: 0.840-0.848) and low bin-level calibration error (ECE = 0.006), the framework systematically overpredicted risk by 50%: mean predicted probability was 8.4% versus an observed rate of 5.6% (observed-to-expected ratio = 0.66, 95% CI: 0.65-0.67). This miscalibration occurred despite isotonic regression on a held-out calibration set. Overprediction was present in 9 of 10 risk deciles. Risk stratification remained valid (23.5-fold separation, 95% CI: 22.8-24.3, between highest and lowest tiers), confirming that discrimination was preserved. QDiabetes-2018 achieved comparable discrimination (AUC = 0.831) with better calibration (O:E = 0.89). Decision curve analysis showed net benefit across the 5-30% threshold range, though recalibration would improve clinical utility.

Conclusions: Good discrimination does not guarantee good calibration. Our primary finding is negative: isotonic calibration failed to produce well-calibrated predictions even on synthetic data from a single generator. This has important implications for model deployment, where distribution shift is inevitable. We recommend that prediction model studies report calibration-in-the-large alongside bin-level metrics, as ECE alone can be misleading when risk distributions are skewed. Recalibration on deployment populations will likely be necessary for any prediction model, regardless of development-phase calibration performance.
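
As a concrete reference for the two calibration metrics contrasted above, the sketch below (illustrative code, not the authors' implementation) computes bin-level ECE under the common equal-width-bin definition and calibration-in-the-large as an O:E ratio, using NumPy on an assumed, heavily skewed synthetic cohort in which predictions overshoot true risk by roughly 50%.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-level ECE: bin-size-weighted mean absolute gap between mean
    predicted risk and the observed event rate, using equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])  # bin index 0 .. n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

def observed_to_expected(y_true, y_prob):
    """Calibration-in-the-large: observed event rate / mean predicted risk.
    Values below 1 indicate systematic overprediction."""
    return y_true.mean() / y_prob.mean()

# Skewed illustrative cohort: true risks concentrated near zero, predictions
# inflated by 50% relative to the truth (roughly the pattern in the Results).
rng = np.random.default_rng(0)
true_risk = rng.beta(1.2, 20.0, size=200_000)
y_true = rng.binomial(1, true_risk).astype(float)
y_prob = np.clip(true_risk * 1.5, 0.0, 1.0)

print("ECE:", round(expected_calibration_error(y_true, y_prob), 3))
print("O:E:", round(observed_to_expected(y_true, y_prob), 2))
```

The abstract does not state which binning scheme underlies the reported ECE of 0.006, so the numbers printed here will not match the paper's; the point is only that ECE and the O:E ratio summarize different aspects of calibration and should be reported together.
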
Key Messages

What is already known
- Clinical prediction models require both discrimination (ranking patients correctly) and calibration (accurate probability estimates)
- Isotonic regression is a recommended approach for post-hoc calibration
- Expected calibration error (ECE) is commonly reported as a summary calibration metric

What this study adds
- Demonstrates empirically that excellent discrimination (AUC = 0.844) can coexist with substantial miscalibration (50% overprediction)
- Shows that low ECE can be misleading when most patients fall in low-risk deciles
- Provides evidence that isotonic calibration on held-out data may not generalize even within synthetic data from one generator
- Demonstrates a discrete-time survival architecture that reduces monotonicity violations to <0.1%

How this study might affect research, practice, or policy
- Prediction model studies should report calibration-in-the-large (O:E ratio) alongside ECE
- Developers should expect recalibration to be necessary when deploying to new populations
- Claims of calibrated prediction should be viewed skeptically without comprehensive calibration assessment
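
The Methods describe isotonic calibration fit on a held-out calibration set. The sketch below shows where that step sits in a typical pipeline using scikit-learn's IsotonicRegression, with assumed synthetic data and an O:E check on a separate evaluation split; it is a minimal illustration and does not reproduce the paper's negative finding, in which the recalibrated model still overpredicted.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# Assumed synthetic data (not the study cohort): raw risk scores that
# systematically overpredict a skewed true risk.
rng = np.random.default_rng(1)
true_risk = rng.beta(1.2, 20.0, size=100_000)
y = rng.binomial(1, true_risk)
raw = np.clip(true_risk * 1.5 + rng.normal(0.0, 0.01, true_risk.size), 0.0, 1.0)

# Hold out half of the data as the calibration split.
raw_cal, raw_eval, y_cal, y_eval = train_test_split(
    raw, y, test_size=0.5, random_state=0)

# Fit a monotone map from raw score to observed event rate on the calibration split.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_cal, y_cal)
p_eval = iso.predict(raw_eval)

# Calibration-in-the-large (O:E) on the evaluation split; values near 1 are ideal.
print("O:E before isotonic:", round(y_eval.mean() / raw_eval.mean(), 2))
print("O:E after isotonic: ", round(y_eval.mean() / p_eval.mean(), 2))
```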
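
Decision curve analysis is reported over the 5-30% threshold range. Below is a hedged sketch of the standard net-benefit calculation (treat patients whose predicted risk meets the threshold, penalizing false positives by the odds of the threshold), with placeholder data rather than study results.

```python
import numpy as np

# Placeholder outcomes and predicted risks (illustrative, not study data).
rng = np.random.default_rng(2)
true_risk = rng.beta(1.2, 20.0, size=50_000)
y_true = rng.binomial(1, true_risk)
y_prob = np.clip(true_risk * 1.5, 0.0, 1.0)

def net_benefit(y, p, threshold):
    """Net benefit of treating patients with predicted risk >= threshold:
    (TP - FP * threshold/(1-threshold)) / N."""
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    w = threshold / (1.0 - threshold)
    return (tp - fp * w) / y.size

def net_benefit_treat_all(y, threshold):
    """Reference strategy that treats everyone regardless of predicted risk."""
    prev = y.mean()
    return prev - (1.0 - prev) * threshold / (1.0 - threshold)

# Compare the model against treat-all across the reported threshold range.
for t in (0.05, 0.10, 0.20, 0.30):
    print(f"t={t:.2f}  model={net_benefit(y_true, y_prob, t):+.4f}  "
          f"treat-all={net_benefit_treat_all(y_true, t):+.4f}")
```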
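
The Key Messages cite a monotonicity violation rate below 0.1% across the 6-, 12-, 24-, and 36-month horizons. One simple way to measure that rate, assuming predictions are arranged as a patients-by-horizons matrix of cumulative risks, is sketched below with illustrative values.

```python
import numpy as np

def monotonicity_violation_rate(risk_by_horizon):
    """Fraction of patients whose predicted cumulative risk decreases at any
    later horizon (it should be non-decreasing from 6 to 36 months)."""
    diffs = np.diff(risk_by_horizon, axis=1)
    return np.mean(np.any(diffs < 0, axis=1))

# Columns ordered by horizon: 6, 12, 24, 36 months (illustrative values only).
risk = np.array([
    [0.02, 0.04, 0.07, 0.10],
    [0.05, 0.04, 0.08, 0.12],  # violation: 12-month risk below 6-month risk
])
print(monotonicity_violation_rate(risk))  # 0.5 in this toy example
```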

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | BMC Medical Research Methodology | 43 | Top 0.1% | 16.8%
2 | Frontiers in Artificial Intelligence | 18 | Top 0.1% | 13.8%
3 | Journal of the American Medical Informatics Association | 61 | Top 0.3% | 9.7%
4 | BMJ Health & Care Informatics | 13 | Top 0.1% | 6.1%
5 | PLOS ONE | 4510 | Top 30% | 6.1%
(50% of probability mass above this line)
6 | BMC Medical Informatics and Decision Making | 39 | Top 0.7% | 3.8%
7 | JCO Clinical Cancer Informatics | 18 | Top 0.3% | 3.4%
8 | PLOS Digital Health | 91 | Top 0.9% | 2.8%
9 | npj Digital Medicine | 97 | Top 2% | 2.6%
10 | JAMIA Open | 37 | Top 0.6% | 2.5%
11 | JMIR Medical Informatics | 17 | Top 0.6% | 2.0%
12 | Computer Methods and Programs in Biomedicine | 27 | Top 0.3% | 1.8%
13 | BMJ Open | 554 | Top 8% | 1.8%
14 | Journal of Medical Internet Research | 85 | Top 2% | 1.8%
15 | JMIR Public Health and Surveillance | 45 | Top 2% | 1.6%
16 | Scientific Reports | 3102 | Top 61% | 1.6%
17 | PLOS Computational Biology | 1633 | Top 18% | 1.4%
18 | Biology Methods and Protocols | 53 | Top 1% | 1.4%
19 | JAMA Network Open | 127 | Top 3% | 1.2%
20 | Frontiers in Digital Health | 20 | Top 1% | 1.2%
21 | Annals of Internal Medicine | 27 | Top 0.6% | 1.2%
22 | The Lancet Digital Health | 25 | Top 0.7% | 1.1%
23 | Frontiers in Public Health | 140 | Top 7% | 0.9%
24 | Physiological Measurement | 12 | Top 0.4% | 0.7%
25 | GENETICS | 189 | Top 2% | 0.7%
26 | Statistics in Medicine | 34 | Top 0.4% | 0.7%