Whole population cohorts versus sampled comparators designs for evaluating health and educational outcomes of children with inborn rare conditions: a simulation study

Tan, J.; Ruiz Nishiki, M.; Cortina-Borja, M.; Knowles, R. L.; Harron, K.; Peters, C.; Hardelid, P.

2026-01-11 pediatrics

10.64898/2026.01.09.26343758 medRxiv


Background: Linked administrative data covering whole populations are fundamental resources for longitudinal studies of children with rare conditions (cases) and unaffected peers (comparators). Data minimisation regulations sometimes limit the number of comparators per case (sampled comparators, SC), with unknown impact on study findings.

Methods: Using Monte Carlo draws, we simulated 100 000 children with and without an exemplar condition, congenital hypothyroidism (CHT), with covariates (sex and comorbidity). Three outcomes (Y: Maths test z-score at age 11 years; L: achieving expected Maths attainment (binary); T: months to neurodevelopmental disorder diagnosis) were modelled as linear combinations of CHT, sex and comorbidity. Factorially varying the parameters (comorbidity prevalence; comorbidity-CHT association; CHT effect on Y/L/T) produced 36 data-generating mechanisms (DGMs). We used regression coefficients (CHT effect), standard errors (SEs) and p-values from 1000 simulations to evaluate power, precision and bias, comparing SCn (n = 5/10/15/25/50/100 comparators per case) with the full cohort (FC).

Results: Mean p-values and SEs for SC25 generally deviated by ≤5% from FC with medium effects (z-score difference = 0.3; odds/hazard ratios ≥2), and by ≤2% with large effects (z-score difference ≥0.6; odds/hazard ratios ≥5). For all outcomes, neither SC nor FC had sufficient power (>80% of p-values ≤0.05) with small or medium effects, whilst all SC designs had sufficient power with large effects. Compared with FC, precision loss for SC25 was 2.0-4.3%, 5.0-8.9% and 6.7-15.5% for Y, L and T, respectively. SC was not associated with bias.

Conclusion: SC25 provided performance comparable to FC for rare disease studies under several scenarios, but small effects posed challenges regardless of sampling. This approach generates cost-effective recommendations for study design and data minimisation.
Key messages: What is the minimum ratio of children without disease (comparators) to children with disease (cases) needed to reliably quantify differences in health and educational outcomes when whole population data are not accessible? Sampling 25 comparators per case would generally provide inferences comparable to those from whole population data in typical scenarios likely to be encountered in longitudinal studies involving children with rare diseases. Decreasing sample sizes helps studies to fulfil data minimisation principles, guides negotiations with data providers and facilitates approvals by research governance bodies, without compromising the quality of research findings.
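
The comparison described in the Methods can be illustrated with a minimal sketch for the continuous outcome Y only. This is not the authors' code. The CHT prevalence, covariate prevalences, effect sizes and the function names (`simulate_cohort`, `fit_cht_effect`, `sample_comparators`, `run_simulation`) are all hypothetical choices made for illustration. Comorbidity is drawn independently of CHT here, whereas the study varied that association, and p-values use a normal approximation.

```python
import math
import numpy as np


def simulate_cohort(n, rng, cht_prev=0.002, comorb_prev=0.10,
                    beta_cht=-0.6, beta_sex=0.1, beta_comorb=-0.3):
    """Draw one cohort; all prevalences and effects are assumed values."""
    cht = (rng.random(n) < cht_prev).astype(float)
    sex = (rng.random(n) < 0.5).astype(float)
    # Simplification: comorbidity is independent of CHT in this sketch.
    comorb = (rng.random(n) < comorb_prev).astype(float)
    y = (beta_cht * cht + beta_sex * sex + beta_comorb * comorb
         + rng.standard_normal(n))
    return cht, sex, comorb, y


def fit_cht_effect(cht, sex, comorb, y):
    """OLS of Y on CHT, sex and comorbidity; returns (estimate, SE, p)."""
    X = np.column_stack([np.ones_like(y), cht, sex, comorb])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    est, se = beta[1], math.sqrt(cov[1, 1])
    # Two-sided p-value from a normal approximation to the t statistic.
    p = math.erfc(abs(est / se) / math.sqrt(2))
    return est, se, p


def sample_comparators(cht, ratio, rng):
    """Keep every case plus `ratio` randomly chosen comparators per case."""
    cases = np.flatnonzero(cht == 1)
    pool = np.flatnonzero(cht == 0)
    k = min(ratio * len(cases), len(pool))
    chosen = rng.choice(pool, size=k, replace=False)
    return np.concatenate([cases, chosen])


def run_simulation(n_sims=100, n=100_000, ratio=25, seed=0):
    """Average estimate, SE and power for full cohort (fc) vs sampled (sc)."""
    rng = np.random.default_rng(seed)
    fc, sc = [], []
    for _ in range(n_sims):
        cht, sex, comorb, y = simulate_cohort(n, rng)
        fc.append(fit_cht_effect(cht, sex, comorb, y))
        idx = sample_comparators(cht, ratio, rng)
        sc.append(fit_cht_effect(cht[idx], sex[idx], comorb[idx], y[idx]))
    fc, sc = np.array(fc), np.array(sc)
    return {
        "est_fc": fc[:, 0].mean(), "est_sc": sc[:, 0].mean(),
        "se_fc": fc[:, 1].mean(), "se_sc": sc[:, 1].mean(),
        "power_fc": (fc[:, 2] <= 0.05).mean(),
        "power_sc": (sc[:, 2] <= 0.05).mean(),
    }


if __name__ == "__main__":
    res = run_simulation()
    loss = 100 * (res["se_sc"] / res["se_fc"] - 1)
    print(res)
    print(f"Relative SE increase for SC25 vs FC: {loss:.1f}%")
```

Changing `ratio` to 5, 10, 15, 50 or 100 mirrors the other SCn designs. The binary (L) and time-to-event (T) outcomes would need logistic and survival models, which this sketch does not cover.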

