Separation-like irregularity and sample size optimism in high-discrimination logistic prediction models

Liu, Z.; Liang, Y.; Wang, L. S.; Yu, J.; Liu, J.

2026-01-23 · health informatics · medRxiv · doi:10.64898/2026.01.21.26344587

Closed-form minimum sample size criteria for developing logistic prediction models, such as the Riley framework implemented in pmsampsize, are widely used but may become optimistic when anticipated discrimination is high. We conducted a Monte Carlo simulation study to compare the formula-based recommended development sample size, n_Riley, with an empirical required sample size, n_req, defined by out-of-sample calibration-slope stability under repeated development sampling. Scenarios fixed the candidate parameter dimension at p = 10 and crossed predictor distribution (normal, standardized skewed continuous, binary), signal density (dense versus sparse), prevalence (φ ∈ {0.05, 0.10, 0.20}), and target discrimination (AUC_target ∈ {0.70, 0.75, 0.80, 0.85, 0.90}), with intercept and signal strength calibrated to match targets. We defined n_req as the smallest n such that E(b_n) ≥ 0.90 and Pr(b_n < 0.80) ≤ 0.20, where b_n is the truth-based logit-scale calibration slope evaluated on a large fixed validation covariate set. At moderate discrimination, n_Riley approximated n_req, but as discrimination increased the formula increasingly underestimated the sample size required for calibration stability, with large deficits at AUC_target = 0.90. Separation-like behavior (extreme fitted risks and linear predictors) at n = n_Riley became common in high-discrimination settings despite nominal convergence, providing a plausible mechanism for formula optimism. These findings support augmenting formula-based planning with targeted simulation stress tests and instability diagnostics when high discrimination is anticipated.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | BMC Medical Research Methodology | 43 | Top 0.1% | 14.4%
2 | Scientific Reports | 3102 | Top 3% | 12.6%
3 | PLOS ONE | 4510 | Top 20% | 9.2%
4 | PLOS Computational Biology | 1633 | Top 4% | 8.5%
5 | Frontiers in Artificial Intelligence | 18 | Top 0.1% | 4.3%
6 | JMIR Public Health and Surveillance | 45 | Top 0.6% | 3.6%
— 50% of probability mass above —
7 | Journal of the American Medical Informatics Association | 61 | Top 0.9% | 2.9%
8 | npj Digital Medicine | 97 | Top 2% | 2.4%
9 | Computers in Biology and Medicine | 120 | Top 2% | 2.1%
10 | JAMIA Open | 37 | Top 0.6% | 2.1%
11 | Philosophical Transactions of the Royal Society B | 51 | Top 2% | 1.9%
12 | Medical Decision Making | 10 | Top 0.1% | 1.9%
13 | Physical Biology | 43 | Top 1% | 1.7%
14 | Computer Methods and Programs in Biomedicine | 27 | Top 0.3% | 1.7%
15 | Bulletin of Mathematical Biology | 84 | Top 1% | 1.7%
16 | Nature Communications | 4913 | Top 52% | 1.7%
17 | PLOS Digital Health | 91 | Top 2% | 1.3%
18 | Frontiers in Digital Health | 20 | Top 1% | 0.9%
19 | Computational and Structural Biotechnology Journal | 216 | Top 8% | 0.8%
20 | eLife | 5422 | Top 55% | 0.8%
21 | Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences | 15 | Top 0.7% | 0.8%
22 | Mathematical Biosciences | 42 | Top 1% | 0.8%
23 | PNAS Nexus | 147 | Top 1% | 0.8%
24 | BMC Bioinformatics | 383 | Top 7% | 0.8%
25 | Journal of Biomedical Informatics | 45 | Top 1% | 0.8%
26 | Journal of The Royal Society Interface | 189 | Top 5% | 0.8%
27 | npj Systems Biology and Applications | 99 | Top 3% | 0.7%
28 | PLOS Biology | 408 | Top 21% | 0.7%
29 | Journal of Medical Internet Research | 85 | Top 5% | 0.6%
30 | Physical Review X | 23 | Top 0.7% | 0.6%