Comparison of the prevalence of all diagnosed diseases among Estonian Biobank participants against the general population
Pajusalu, M.; Oja, M.; Mooses, K.; Heinsar, S.; Aasmets, O.; Laisk, T.; Palta, P.; Org, E.; Magi, R.; Vosa, U.; Fischer, K.; Estonian Biobank Research Team; Tillmann, T.; Laur, S.; Reisberg, S.; Vilo, J.; Kolde, R.
Characterizing study sample representativeness is critical for the validity of biobank-derived findings, yet selection biases are rarely quantified across the full clinical spectrum. Here we systematically evaluate the Estonian Biobank (EstBB), which comprises ~20% of the adult population, by comparing two recruitment waves against a 30% national reference dataset (Est-Health-30). Analyzing prevalence ratios (PR) across 1,028 ICD-10 categories, we reveal a bifurcated landscape of representativeness. While EstBB achieves population parity (PR ≈ 1.0) for widespread chronic conditions like Type 2 diabetes, 47% of diagnostic categories exhibit significant deviations (>1.3-fold). We identify a distinct "managed symptomatic" phenotype: a systematic enrichment of outpatient diagnoses, such as melanocytic nevi (PR=2.07, 95% CI 1.93-2.21) and major depression (PR=1.53, 95% CI 1.4-1.66), coupled with a depletion of high-mortality conditions like lung cancer (PR=0.69, 95% CI 0.64-0.75) and vascular dementia (PR=0.45, 95% CI 0.38-0.54). These biases evolved across recruitment phases, with the later EstBB2 cohort exhibiting a healthier, prevention-oriented profile. To support research integrity, we provide an interactive open-access dashboard for phenotype refinement. Accounting for such selection-driven "clinical visibility" is essential to avoid collider bias in risk prediction and causal inference models.
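As a rough illustration of the quantities reported above, a prevalence ratio and its 95% confidence interval can be sketched as follows. This is a minimal sketch and not the authors' pipeline: it assumes a crude (unadjusted) ratio with a log-scale Wald interval, and it assumes the ">1.3-fold" criterion is applied symmetrically in both directions. The abstract does not state whether the estimates were standardized or how the threshold was tested. The function names and the counts in the usage lines are hypothetical.

```python
import math


def prevalence_ratio(cases_cohort, n_cohort, cases_ref, n_ref, z=1.96):
    """Crude prevalence ratio (cohort vs. reference) with a log-scale Wald CI.

    Assumption: no age/sex standardization is applied here; the source
    abstract does not describe the adjustment used.
    """
    p_cohort = cases_cohort / n_cohort
    p_ref = cases_ref / n_ref
    pr = p_cohort / p_ref
    # Standard error of log(PR) for two independent binomial proportions.
    se_log = math.sqrt(1 / cases_cohort - 1 / n_cohort + 1 / cases_ref - 1 / n_ref)
    lower = pr * math.exp(-z * se_log)
    upper = pr * math.exp(z * se_log)
    return pr, lower, upper


def deviates(pr, fold=1.3):
    """Flag a ratio that differs from parity by more than `fold` in either direction.

    Assumption: the threshold is symmetric on the ratio scale (PR > 1.3 or PR < 1/1.3).
    """
    return pr > fold or pr < 1 / fold


# Hypothetical counts, for illustration only (not taken from the study).
pr, lower, upper = prevalence_ratio(200, 10_000, 100, 10_000)
print(f"PR={pr:.2f}, 95% CI {lower:.2f}-{upper:.2f}, deviates={deviates(pr)}")
```

Under the symmetric-threshold assumption, the lung cancer estimate quoted above (PR=0.69) would be flagged as a depletion, because 0.69 is below 1/1.3.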
Matching journals
The top 7 journals account for 50% of the predicted probability mass.