Comparison of the prevalence of all diagnosed diseases among Estonian Biobank participants against the general population

Pajusalu, M.; Oja, M.; Mooses, K.; Heinsar, S.; Aasmets, O.; Laisk, T.; Palta, P.; Org, E.; Magi, R.; Vosa, U.; Fischer, K.; Estonian Biobank Research Team; Tillmann, T.; Laur, S.; Reisberg, S.; Vilo, J.; Kolde, R.

2026-02-06 health informatics
10.64898/2026.02.05.26345634 medRxiv
Characterizing study sample representativeness is critical for the validity of biobank-derived findings, yet selection biases are rarely quantified across the full clinical spectrum. Here we systematically evaluate the Estonian Biobank (EstBB), which comprises ~20% of the adult population, by comparing two recruitment waves against a 30% national reference dataset (Est-Health-30). Analyzing prevalence ratios (PR) across 1,028 ICD-10 categories, we reveal a bifurcated landscape of representativeness. While EstBB achieves population parity (PR ≈ 1.0) for widespread chronic conditions like Type 2 diabetes, 47% of diagnostic categories exhibit significant deviations (>1.3-fold). We identify a distinct "managed symptomatic" phenotype: a systematic enrichment of outpatient diagnoses, such as melanocytic nevi (PR=2.07, 95% CI 1.93–2.21) and major depression (PR=1.53, 1.4–1.66), coupled with a depletion of high-mortality conditions like lung cancer (PR=0.69, 0.64–0.75) and vascular dementia (PR=0.45, 0.38–0.54). These biases evolved across recruitment phases, with the later EstBB2 cohort exhibiting a healthier, prevention-oriented profile. To support research integrity, we provide an interactive open-access dashboard for phenotype refinement. Accounting for such selection-driven "clinical visibility" is essential to avoid collider bias in risk prediction and causal inference models.
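As a rough illustration of the quantity the abstract reports, the sketch below computes a prevalence ratio with a 95% confidence interval. The abstract does not say how the intervals were derived, so the log-scale Wald (Katz) interval used here is an assumption, as is treating the two samples as independent. The function name `prevalence_ratio`, the counts, and the reading of ">1.3-fold" as PR > 1.3 or PR < 1/1.3 are all hypothetical and not taken from the study.

```python
import math


def prevalence_ratio(cases_a, n_a, cases_b, n_b, z=1.96):
    """Return (PR, lower, upper) for group A relative to group B.

    Assumption: log-scale Wald (Katz) interval with independent samples;
    the preprint's actual method is not stated in the abstract.
    """
    pr = (cases_a / n_a) / (cases_b / n_b)
    # Standard error of log(PR) for two independent binomial proportions
    se_log = math.sqrt(1 / cases_a - 1 / n_a + 1 / cases_b - 1 / n_b)
    lower = math.exp(math.log(pr) - z * se_log)
    upper = math.exp(math.log(pr) + z * se_log)
    return pr, lower, upper


# Hypothetical counts for illustration only (not data from the study)
pr, lower, upper = prevalence_ratio(cases_a=300, n_a=10_000,
                                    cases_b=200, n_b=10_000)

# Assumed reading of ">1.3-fold": enrichment or depletion beyond 1.3x
deviates = pr > 1.3 or pr < 1 / 1.3
print(f"PR={pr:.2f}, 95% CI {lower:.2f}-{upper:.2f}, deviates={deviates}")
```

With these made-up counts the biobank-like group has 1.5 times the reference prevalence, so it would fall among the categories flagged as deviating under the assumed threshold.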

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Nature Communications | 4913 | Top 6% | 18.3% |
| 2 | Science Translational Medicine | 111 | Top 0.2% | 7.1% |
| 3 | Nature Medicine | 117 | Top 0.3% | 6.7% |
| 4 | npj Digital Medicine | 97 | Top 0.8% | 6.2% |
| 5 | Communications Medicine | 85 | Top 0.1% | 4.8% |
| 6 | Science Advances | 1098 | Top 2% | 4.8% |
| 7 | Nature Genetics | 240 | Top 2% | 4.2% |
| | 50% of probability mass above | | | |
| 8 | Scientific Reports | 3102 | Top 38% | 3.5% |
| 9 | The American Journal of Human Genetics | 206 | Top 1% | 3.5% |
| 10 | Nature Human Behaviour | 85 | Top 1% | 3.0% |
| 11 | Nature Biomedical Engineering | 42 | Top 0.5% | 2.3% |
| 12 | The Lancet Digital Health | 25 | Top 0.2% | 2.3% |
| 13 | Cell Reports Medicine | 140 | Top 4% | 1.7% |
| 14 | Nature | 575 | Top 11% | 1.7% |
| 15 | Science | 429 | Top 14% | 1.7% |
| 16 | eBioMedicine | 130 | Top 1% | 1.7% |
| 17 | eLife | 5422 | Top 48% | 1.3% |
| 18 | Journal of the American College of Cardiology | 12 | Top 0.5% | 1.2% |
| 19 | Communications Biology | 886 | Top 15% | 1.2% |
| 20 | PNAS Nexus | 147 | Top 1% | 0.9% |
| 21 | Neuron | 282 | Top 8% | 0.8% |
| 22 | Nature Neuroscience | 216 | Top 6% | 0.8% |
| 23 | PLOS ONE | 4510 | Top 66% | 0.8% |
| 24 | Cell | 370 | Top 17% | 0.7% |
| 25 | Cell Genomics | 162 | Top 7% | 0.7% |
| 26 | Med | 38 | Top 0.9% | 0.7% |
| 27 | Patterns | 70 | Top 3% | 0.7% |
| 28 | Proceedings of the National Academy of Sciences | 2130 | Top 46% | 0.7% |
| 29 | Genome Medicine | 154 | Top 9% | 0.6% |
| 30 | iScience | 1063 | Top 38% | 0.6% |
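
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked from the probabilities listed for the ten highest-ranked journals. The sketch below is a minimal cumulative-sum check; the helper name `journals_to_reach` is hypothetical, and the values are the rounded percentages shown on this page, so the running total is approximate.

```python
# Predicted probabilities (%) for ranks 1-10, as listed above
top_probs = [18.3, 7.1, 6.7, 6.2, 4.8, 4.8, 4.2, 3.5, 3.5, 3.0]


def journals_to_reach(probs, threshold=50.0):
    """Return (count, cumulative %) at the first rank reaching the threshold.

    Returns None if the listed probabilities never reach the threshold.
    """
    total = 0.0
    for count, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None


result = journals_to_reach(top_probs)
count, cumulative = result
print(f"{count} journals reach {cumulative:.1f}% of the probability mass")
```

The first six journals sum to about 47.9%, so the seventh is the one that carries the running total past 50%.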