A Machine Learning Based Causal Interface for Time-Varying Environmental Predictors of Substance Use Initiation in the ABCD Study
Wei, M.; Yadlapati, L.; Peng, Q.
Background: The Adolescent Brain Cognitive Development (ABCD) Study provides rich longitudinal data on environmental, genetic, and behavioral factors related to substance use initiation. Classical marginal structural models (MSMs) require selecting covariates for propensity models, which is challenging when there are many correlated predictors.

Methods: We analyzed longitudinal panel data from 11,868 ABCD participants with repeated observations over time. Interval-level binary outcomes were defined for initiation of alcohol, nicotine, cannabis, and any substance, including only participants at risk before initiation. All predictors were constructed as lagged variables to preserve temporal ordering. We used a two-stage machine learning-based causal framework. First, we performed graph discovery using a Granger-inspired lagged predictive modeling approach with elastic-net logistic regression to identify relationships between past predictors and future outcomes. Stable candidate edges were selected using subject-level bootstrap stability selection. Second, we estimated adjusted effects for stable predictors using double machine learning (DML) with partialling-out and cross-fitting. For each predictor, the lagged variable was treated as the exposure and adjusted for high-dimensional lagged covariates. Cross-fitting with group-based splitting accounted for within-subject dependence. Nuisance functions were estimated using random forests, and cluster-robust standard errors were used for inference.

Results: We identified stable predictors across multiple domains, including sleep patterns, family environment, peer relationships, behavioral traits, and genetic risk. Many predictors were shared across substance outcomes, while some were outcome-specific. Effect sizes were modest, typically ranging from -0.01 to 0.02 per standard deviation increase in the predictor. Both risk-increasing and protective associations were observed. Risk factors included sleep disturbance and behavioral risk indicators, while protective factors included parental monitoring and structured environments.

Conclusions: This study presents a practical framework for analyzing high-dimensional longitudinal data and identifying time-varying predictors of substance use initiation. The approach combines machine learning for variable selection with causal inference for effect estimation. The results highlight both shared and outcome-specific risk factors and identify modifiable targets, such as family environment and sleep, that may inform prevention strategies.
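
The interval-level outcome definition and lagged predictors described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the record layout, the field names (`subject`, `wave`, `initiated`, `sleep`), and the assumption that `initiated` is a cumulative 0/1 flag ("has initiated by this wave") are all hypothetical. Each output row pairs predictors from the previous wave with the outcome at the current wave, and a subject contributes no further rows once initiation has occurred.

```python
from collections import defaultdict


def build_at_risk_intervals(records, predictor_keys, outcome_key="initiated"):
    """Pair lag-1 predictors with the next wave's outcome, at-risk intervals only.

    `records` is a list of dicts with hypothetical keys "subject", "wave",
    the predictor names, and a cumulative 0/1 initiation flag (`outcome_key`).
    """
    by_subject = defaultdict(list)
    for record in records:
        by_subject[record["subject"]].append(record)

    rows = []
    for subject, visits in by_subject.items():
        visits = sorted(visits, key=lambda r: r["wave"])
        for prev, curr in zip(visits, visits[1:]):
            if prev[outcome_key]:
                # Already initiated at the earlier wave: no longer at risk.
                break
            row = {"subject": subject, "wave": curr["wave"],
                   "y": int(curr[outcome_key])}
            for key in predictor_keys:
                # Predictors come from the previous wave, so they precede y in time.
                row[f"{key}_lag1"] = prev[key]
            rows.append(row)
    return rows


# Toy illustration with made-up values (not ABCD data).
toy_records = [
    {"subject": "A", "wave": 0, "sleep": 1.0, "initiated": 0},
    {"subject": "A", "wave": 1, "sleep": 2.0, "initiated": 0},
    {"subject": "A", "wave": 2, "sleep": 3.0, "initiated": 1},
    {"subject": "A", "wave": 3, "sleep": 4.0, "initiated": 1},
    {"subject": "B", "wave": 0, "sleep": 0.5, "initiated": 0},
    {"subject": "B", "wave": 1, "sleep": 0.7, "initiated": 0},
]
toy_rows = build_at_risk_intervals(toy_records, ["sleep"])
```

In this toy input, subject A stops contributing rows after the interval in which initiation is first observed, which is one plausible reading of "including only participants at risk before initiation".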
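
The second stage (partialling-out DML with group-based cross-fitting and cluster-robust standard errors) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: ordinary least squares stands in for the random forests named in the Methods so that the sketch needs only NumPy, and all function and variable names are hypothetical. The estimator residualizes both the outcome and the lagged exposure on the covariates using out-of-fold predictions, regresses residual on residual, and sums the score within subjects for the sandwich variance.

```python
import numpy as np


def group_folds(groups, n_folds, rng):
    """Assign each subject to exactly one fold so no subject is split across folds."""
    shuffled = rng.permutation(np.unique(groups))
    fold_of_group = {g: i % n_folds for i, g in enumerate(shuffled)}
    return np.array([fold_of_group[g] for g in groups])


def ols_predict(X_train, y_train, X_test):
    """Stand-in nuisance learner: least-squares fit with an intercept (assumption;
    the abstract describes random forests)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def dml_partial_out(y, d, X, groups, n_folds=5, seed=0):
    """Partialling-out estimate of the effect of exposure d on y given covariates X."""
    rng = np.random.default_rng(seed)
    folds = group_folds(groups, n_folds, rng)
    y_res = np.empty(len(y), dtype=float)
    d_res = np.empty(len(d), dtype=float)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        # Cross-fitting: residuals use nuisance fits from the other folds only.
        y_res[test] = y[test] - ols_predict(X[train], y[train], X[test])
        d_res[test] = d[test] - ols_predict(X[train], d[train], X[test])
    denom = d_res @ d_res
    theta = (d_res @ y_res) / denom
    # Cluster-robust sandwich: sum the score within each subject before squaring.
    scores = d_res * (y_res - theta * d_res)
    _, cluster_index = np.unique(groups, return_inverse=True)
    cluster_sums = np.bincount(cluster_index, weights=scores)
    se = np.sqrt(np.sum(cluster_sums ** 2)) / denom
    return theta, se


# Simulated illustration (made-up data, continuous outcome for simplicity).
rng = np.random.default_rng(1)
n_subjects, n_waves, true_effect = 500, 4, 0.5
groups = np.repeat(np.arange(n_subjects), n_waves)
X = rng.normal(size=(len(groups), 5))
d = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(size=len(groups))
subject_noise = 0.5 * rng.normal(size=n_subjects)[groups]  # within-subject dependence
y = true_effect * d + X @ np.ones(5) + subject_noise + rng.normal(size=len(groups))
theta_hat, se_hat = dml_partial_out(y, d, X, groups)
```

With a binary initiation outcome, as in the study, a residual-on-residual coefficient of this kind would be read on the probability scale, which is consistent with the small per-standard-deviation effect sizes reported in the Results.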