
Methodological Guidance for Predictor Variable Selection for Adolescent Smoking Outcomes in Global Youth Tobacco Survey Using R and Python

Ng'ambi, W. F.; Zyambo, C.; Kazembe, L.

2026-02-17 epidemiology
10.64898/2026.02.14.26346305 medRxiv

Background: The Global Youth Tobacco Survey (GYTS) is widely used to monitor tobacco use among adolescents worldwide. However, inconsistent analytical approaches, particularly in handling complex survey designs and predictor selection, limit comparability across countries, survey waves, and software platforms. Although much of the GYTS literature relies on proprietary tools such as SAS and SPSS, practical and transparent guidance on implementing reproducible, theory-informed analyses remains limited. A unified workflow that respects the survey's design while supporting cross-platform implementation is needed.

Methods: We developed a reproducible, open-source workflow for analysing GYTS data using R and Python. In R, analyses were conducted using the survey package (svydesign and svyglm) with constrained stepwise selection via stepAIC. In Python, a custom constrained stepwise procedure was implemented using statsmodels generalized linear models. The workflow explicitly incorporates survey weights, stratification, and clustering; harmonises variables across countries; protects a priori demographic covariates; and ensures consistent treatment of categorical predictors. The approach is illustrated using data from Zambia (n = 2,959) and pooled data from Ghana, Mauritius, Seychelles, and Togo (n = 15,914). Predictor selection was guided by Social Cognitive Theory and evidence from systematic reviews.

Results: The constrained selection framework consistently retained key demographic variables (age, sex, and grade) while allowing data-driven selection of modifiable predictors using the Akaike Information Criterion. When identical constraints were applied, the R and Python implementations selected identical models and produced nearly equivalent point estimates (adjusted odds ratio differences <0.01), although Python-based confidence intervals did not account for clustering. Of 18 candidate predictors across individual, social, media, and policy domains, 14 were retained. The strongest independent predictors included awareness of tobacco products (OR = 5.61, 95% CI: 4.65-6.78), peer smoking (OR = 4.57, 95% CI: 3.34-6.25), and exposure to tobacco marketing (OR = 2.34, 95% CI: 1.89-2.91).

Conclusions: This study provides a generalisable, theory-informed framework for predictor selection in complex survey data using open-source tools. The workflow supports consistent analyses across countries, survey waves, and software platforms, and is transferable to other youth and adult population surveys. All code and harmonisation resources are openly available to support reproducibility and adaptation.

Plain-Language Summary:
- What we asked: Can we predict adolescent smoking using GYTS data in a way that is easy to follow and reproducible across software?
- What we did: Built a single workflow that respects survey design (weights, strata, clusters) and selects predictors using four explicit criteria: theoretical grounding in Social Cognitive Theory, empirical support from prior studies, relevance for intervention, and cross-country validity. Core demographics (age, sex, grade, region) were protected as essential confounders, while other predictors were selected based on statistical fit. The workflow runs equivalently in R and Python.
- Why it matters: Many GYTS studies use weights only and ignore clustering and stratification, which makes confidence intervals too narrow. More importantly, most analyses include variables arbitrarily or let software drop important confounders automatically. Our approach ensures theoretically meaningful, policy-relevant variables are retained, producing more reliable and actionable results for prevention programs.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Epidemiology: 26 papers in training set; Top 0.1%; 12.7%
2. American Journal of Epidemiology: 57 papers in training set; Top 0.1%; 12.3%
3. Addiction: 25 papers in training set; Top 0.1%; 10.4%
4. BMC Public Health: 147 papers in training set; Top 0.6%; 6.4%
5. BMC Medical Research Methodology: 43 papers in training set; Top 0.1%; 6.4%
6. PLOS ONE: 4510 papers in training set; Top 28%; 6.4%
(50% of probability mass above)
7. International Journal of Epidemiology: 74 papers in training set; Top 0.4%; 4.8%
8. PLOS Medicine: 98 papers in training set; Top 2%; 1.9%
9. Nicotine & Tobacco Research: 11 papers in training set; Top 0.1%; 1.7%
10. BMJ Open: 554 papers in training set; Top 9%; 1.7%
11. Preventive Medicine: 11 papers in training set; Top 0.1%; 1.7%
12. Drug and Alcohol Dependence: 37 papers in training set; Top 0.4%; 1.5%
13. JAMA Network Open: 127 papers in training set; Top 3%; 1.5%
14. Nature Human Behaviour: 85 papers in training set; Top 3%; 1.5%
15. BMC Medicine: 163 papers in training set; Top 4%; 1.3%
16. JMIR Public Health and Surveillance: 45 papers in training set; Top 3%; 1.2%
17. JMIR mHealth and uHealth: 10 papers in training set; Top 0.3%; 1.1%
18. Scientific Reports: 3102 papers in training set; Top 68%; 1.1%
19. Nature Communications: 4913 papers in training set; Top 58%; 0.9%
20. International Journal of Medical Informatics: 25 papers in training set; Top 1%; 0.9%
21. American Journal of Preventive Medicine: 11 papers in training set; Top 0.4%; 0.9%
22. Public Health Nutrition: 14 papers in training set; Top 0.5%; 0.9%
23. Preventive Medicine Reports: 14 papers in training set; Top 0.4%; 0.8%
24. BMJ Global Health: 98 papers in training set; Top 3%; 0.8%
25. International Journal of Drug Policy: 11 papers in training set; Top 0.3%; 0.7%
26. PLOS Global Public Health: 293 papers in training set; Top 6%; 0.7%
27. BMC Research Notes: 29 papers in training set; Top 0.7%; 0.7%
28. Developmental Cognitive Neuroscience: 81 papers in training set; Top 0.6%; 0.7%
29. JMIR Research Protocols: 18 papers in training set; Top 2%; 0.6%