Double Machine Learning for Causal Inference in High-Dimensional Electronic Health Records

Du, M.; Guo, Y.; Li, X.; Catala, M.; Prieto-Alhambra, D.

2025-07-22 epidemiology

10.1101/2025.07.21.25331944 medRxiv


Background: Estimating causal effects in observational health data is challenging due to confounding by indication. Traditional approaches such as inverse probability of treatment weighting (IPTW) rely on correct model specification, which is difficult in high-dimensional settings. We implemented a practical offset-based double machine learning (Offset-DML) framework for estimating binary treatment effects on the log-odds scale using logistic regression.

Methods: We conducted a plasmode simulation study based on real-world clinical data, varying sample size (5,000, 10,000, 20,000) and outcome prevalence (5%, 10%, 20%), with 200 repetitions. We compared IPTW, stabilised IPTW, Offset-DML (with and without cross-fitting), and high-dimensional DML (HD-DML). Performance was measured by absolute bias, empirical standard error, and root mean square error relative to the true average causal effect.

Results: Across most scenarios, DML-based approaches outperformed IPTW methods in terms of bias and empirical standard error, particularly at larger sample sizes. Offset-DML performed comparably to HD-DML while avoiding the convergence issues observed with HD-DML in sparse-data settings. All DML methods had overlapping confidence intervals in most scenarios.

Conclusion: Offset-DML is a practical and robust alternative for causal inference in high-dimensional health data. Future work should investigate extensions to other outcomes and diagnostics to assess confounding control.

Key messages:
- Double machine learning-based methods consistently outperform IPTW in bias and empirical standard error, particularly in large samples and sparse-data scenarios.
- Offset double machine learning is a practical and robust method for estimating binary treatment effects in high-dimensional settings.
- Unlike high-dimensional double machine learning, the offset-based approach converged consistently across all scenarios, including those with low outcome prevalence and small sample sizes.
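
The three performance metrics named in the abstract can be illustrated with a minimal sketch. It assumes the conventional simulation-study definitions (absolute bias as the absolute difference between the mean estimate and the true effect, empirical standard error as the sample standard deviation of the estimates across repetitions, and RMSE as the root of the mean squared deviation from the true effect); the abstract does not spell out its exact formulas, and the function name `simulation_metrics` and the example numbers are hypothetical.

```python
import math
import statistics


def simulation_metrics(estimates, true_effect):
    """Summarise repeated estimates from one scenario against a known true effect.

    Assumed (conventional) definitions, not taken from the paper:
      abs_bias = |mean(estimates) - true_effect|
      emp_se   = sample standard deviation of the estimates
      rmse     = sqrt(mean((estimate - true_effect)^2))
    """
    estimates = list(estimates)
    mean_estimate = statistics.fmean(estimates)
    abs_bias = abs(mean_estimate - true_effect)
    emp_se = statistics.stdev(estimates)  # uses the n - 1 denominator
    squared_errors = [(e - true_effect) ** 2 for e in estimates]
    rmse = math.sqrt(statistics.fmean(squared_errors))
    return {"abs_bias": abs_bias, "emp_se": emp_se, "rmse": rmse}


# Hypothetical log-odds estimates from three repetitions of one scenario.
example = simulation_metrics([0.4, 0.5, 0.6], true_effect=0.5)
print(example)
```

In a plasmode design such as the one described, `estimates` would hold one estimate per repetition (200 per scenario) and `true_effect` would be the effect injected into the simulated outcomes.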

