
Multi-stage reweighting to correct for participation bias in a nationwide biobank with nested recruitment

Traeholt, J.; Didriksen, M.; Helenius, D.; Christoffersen, L. A. N.; Dinh, K. M.; Dowsett, J.; Mikkelsen, C.; Hindhede, L.; Quinn, L. J. E.; Bruun, M. T.; Aagaard, B.; Hansen, T. F.; Hjalgrim, H.; Rostgaard, K.; Sorensen, E.; Erikstrup, C.; Pedersen, O. B. V.; Hansen, T.; Schork, A. J.; Markussen, B.; Ostrowski, S. R.

2026-04-02 epidemiology
10.64898/2026.04.01.26349852 medRxiv

Selective participation in biobanks often compromises inference to the general population, particularly when selection occurs across multiple stages, whether at recruitment or during subsequent participation. Inverse probability (IP) weighting can reduce systematic differences using suitable external benchmarks, but most applications assume a single selection process. Here, we present a multi-stage IP-weighting framework and apply it to the Danish Blood Donor Study (DBDS), a nationwide biobank embedded in Denmark's blood-donation infrastructure. Using national registers, we estimated year-specific probabilities of (i) donation activity and (ii) DBDS enrolment conditional on donation activity, yielding two-stage inclusion weights for 169,893 participants. These weights reduced inclusion-associated imbalance across the 52 auxiliary variables in the probability models by 97.6% (median) and, despite strong health selection under donation-based recruitment, reduced relative-prevalence discrepancies across held-out prescription phenotypes by 69.7% (median). The effective sample size after weighting was 30,627 (18.0% of 169,893). Combining the inclusion weights with questionnaire-specific response weights across five DBDS questionnaires (>500 questions) produced the largest changes from unweighted to weighted responses for health behaviours and symptom severity, including tobacco and alcohol consumption, menstrual-pain severity, restless-legs severity, nocturia, sleep disturbance, and fatigue. These findings support multi-stage IP-weighting to improve population alignment in biobanks with staged selection.
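The arithmetic behind two-stage inclusion weights can be illustrated with a minimal sketch. It assumes each participant's weight is the inverse of the product of the two estimated probabilities (donation activity, and enrolment conditional on donation activity), and it assumes Kish's formula for the effective sample size; the abstract does not state which formula the authors used. The function names and the probabilities below are hypothetical and are not taken from the study.

```python
def two_stage_weights(p_donation, p_enrol_given_donation):
    """Return inverse-probability weights 1 / (p1 * p2) per participant.

    p_donation: estimated probability of donation activity (stage 1).
    p_enrol_given_donation: estimated probability of enrolment
        conditional on donation activity (stage 2).
    """
    weights = []
    for p1, p2 in zip(p_donation, p_enrol_given_donation):
        if not (0.0 < p1 <= 1.0 and 0.0 < p2 <= 1.0):
            raise ValueError("probabilities must lie in (0, 1]")
        # Joint inclusion probability is the product of the two stages.
        weights.append(1.0 / (p1 * p2))
    return weights


def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Hypothetical probabilities for four participants (illustration only).
p_donation = [0.10, 0.05, 0.20, 0.02]
p_enrol = [0.50, 0.40, 0.50, 0.25]

weights = two_stage_weights(p_donation, p_enrol)
ess = kish_effective_sample_size(weights)
print(weights, ess)
```

Under the same assumption, a questionnaire-specific response weight would enter as a further multiplicative factor, the inverse of the estimated response probability. A few participants with very small joint probabilities dominate the weight total, which is why the effective sample size can fall well below the number of participants.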

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature Communications | 4913 papers in training set | Top 6% | 18.4%
2. Nature Human Behaviour | 85 papers in training set | Top 0.2% | 9.1%
3. Nature Genetics | 240 papers in training set | Top 1.0% | 7.1%
4. BMC Medicine | 163 papers in training set | Top 0.4% | 7.1%
5. American Journal of Epidemiology | 57 papers in training set | Top 0.1% | 6.3%
6. Science Advances | 1098 papers in training set | Top 1% | 6.3%
(50% of probability mass above)
7. International Journal of Epidemiology | 74 papers in training set | Top 0.4% | 4.8%
8. The American Journal of Human Genetics | 206 papers in training set | Top 1% | 3.6%
9. Nature Medicine | 117 papers in training set | Top 0.9% | 3.6%
10. eLife | 5422 papers in training set | Top 26% | 3.6%
11. Epidemics | 104 papers in training set | Top 0.5% | 3.6%
12. PLOS Computational Biology | 1633 papers in training set | Top 13% | 2.4%
13. PLOS Medicine | 98 papers in training set | Top 2% | 2.1%
14. Scientific Reports | 3102 papers in training set | Top 54% | 1.9%
15. PLOS ONE | 4510 papers in training set | Top 55% | 1.6%
16. Epidemiology | 26 papers in training set | Top 0.3% | 1.5%
17. Nature | 575 papers in training set | Top 13% | 1.1%
18. Science | 429 papers in training set | Top 18% | 0.9%
19. Genome Medicine | 154 papers in training set | Top 7% | 0.9%
20. npj Digital Medicine | 97 papers in training set | Top 3% | 0.8%
21. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 43% | 0.8%
22. Cell Genomics | 162 papers in training set | Top 7% | 0.7%
23. PLOS Biology | 408 papers in training set | Top 20% | 0.7%
24. Communications Biology | 886 papers in training set | Top 24% | 0.7%
25. Science Translational Medicine | 111 papers in training set | Top 6% | 0.7%
26. PLOS Genetics | 756 papers in training set | Top 16% | 0.7%
27. BMC Medical Research Methodology | 43 papers in training set | Top 2% | 0.6%