Double Machine Learning for Causal Inference in High-Dimensional Electronic Health Records

Du, M.; Guo, Y.; Li, X.; Catala, M.; Prieto-Alhambra, D.

2025-07-22 epidemiology
10.1101/2025.07.21.25331944 medRxiv

Background: Estimating causal effects in observational health data is challenging due to confounding by indication. Traditional approaches such as inverse probability of treatment weighting (IPTW) rely on correct model specification, which is difficult in high-dimensional settings. We implemented a practical offset-based double machine learning (Offset-DML) framework for estimating binary treatment effects on the log-odds scale using logistic regression.

Methods: We conducted a plasmode simulation study based on real-world clinical data, varying sample size (5,000, 10,000, 20,000) and outcome prevalence (5%, 10%, 20%), with 200 repetitions. We compared IPTW, stabilised IPTW, Offset-DML (with and without cross-fitting), and high-dimensional DML (HD-DML). Performance was measured by absolute bias, empirical standard error, and root mean square error relative to the true average causal effect.

Results: Across most scenarios, DML-based approaches outperformed IPTW methods in terms of bias and empirical standard error, particularly in larger sample sizes. Offset-DML showed comparable performance to HD-DML while avoiding the convergence issues observed with HD-DML in sparse data settings. All DML methods had overlapping confidence intervals in most scenarios.

Conclusion: Offset-DML is a practical and robust alternative for causal inference in high-dimensional health data. Future work should investigate extensions to other outcomes and diagnostics to assess confounding control.

Key messages:
- Double machine learning-based methods consistently outperform IPTW regarding bias and empirical standard error, particularly in large sample sizes and sparse-data scenarios.
- Offset double machine learning is a practical and robust method for estimating binary treatment effects in high-dimensional settings.
- Unlike high-dimensional double machine learning, the offset-based double machine learning approach demonstrated consistent convergence across all scenarios, including those with low outcome prevalence and small sample sizes.
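
The three performance metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function name `simulation_metrics`, the toy numbers, and the exact definitions are assumptions: absolute bias is taken as the absolute difference between the mean estimate and the truth, empirical standard error as the sample standard deviation of the estimates across repetitions, and root mean square error as the square root of the mean squared deviation from the truth. The preprint may define them slightly differently.

```python
import math
import statistics


def simulation_metrics(estimates, true_effect):
    """Summarise repeated estimates from one simulation scenario.

    estimates   : one effect estimate per repetition (e.g. log-odds ratios)
    true_effect : the known true effect built into the simulation
    All definitions below are assumed, not taken from the preprint.
    """
    estimates = list(estimates)
    mean_estimate = statistics.fmean(estimates)

    # Absolute bias: distance of the average estimate from the truth.
    absolute_bias = abs(mean_estimate - true_effect)

    # Empirical standard error: sample SD (n - 1 denominator) across repetitions.
    empirical_se = statistics.stdev(estimates)

    # RMSE: combines bias and variability around the true effect.
    squared_errors = [(e - true_effect) ** 2 for e in estimates]
    rmse = math.sqrt(statistics.fmean(squared_errors))

    return {
        "absolute_bias": absolute_bias,
        "empirical_se": empirical_se,
        "rmse": rmse,
    }


# Toy example with three hypothetical estimates and a true log-odds ratio of 0.5.
toy = simulation_metrics([0.4, 0.5, 0.6], true_effect=0.5)
```

In a study like the one described, such a function would be called once per method and scenario, on the 200 estimates produced by the repetitions.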

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. BMC Medical Research Methodology | 43 papers in training set | Top 0.1% | 39.5%
2. International Journal of Epidemiology | 74 papers in training set | Top 0.1% | 10.1%
3. American Journal of Epidemiology | 57 papers in training set | Top 0.1% | 6.8%
(50% of probability mass above)
4. Epidemiology | 26 papers in training set | Top 0.1% | 6.8%
5. Journal of Biomedical Informatics | 45 papers in training set | Top 0.4% | 4.0%
6. PLOS ONE | 4510 papers in training set | Top 39% | 3.6%
7. BMJ Open | 554 papers in training set | Top 7% | 2.7%
8. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 1% | 2.4%
9. Statistics in Medicine | 34 papers in training set | Top 0.1% | 2.1%
10. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.9%
11. European Journal of Epidemiology | 40 papers in training set | Top 0.3% | 1.7%
12. Pharmacoepidemiology and Drug Safety | 13 papers in training set | Top 0.3% | 1.3%
13. Scientific Reports | 3102 papers in training set | Top 68% | 1.1%
14. BMC Medicine | 163 papers in training set | Top 5% | 1.1%
15. Journal of Clinical Epidemiology | 28 papers in training set | Top 0.5% | 0.9%
16. International Journal of Medical Informatics | 25 papers in training set | Top 1% | 0.9%
17. PeerJ | 261 papers in training set | Top 14% | 0.8%
18. npj Digital Medicine | 97 papers in training set | Top 3% | 0.7%
19. BMC Research Notes | 29 papers in training set | Top 0.7% | 0.7%
20. Biometrics | 22 papers in training set | Top 0.2% | 0.7%
21. JAMIA Open | 37 papers in training set | Top 2% | 0.7%
22. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 1% | 0.7%
23. Heliyon | 146 papers in training set | Top 7% | 0.7%
24. Journal of Medical Internet Research | 85 papers in training set | Top 5% | 0.6%