
AI-Driven Feature Selection Using Only Survey Variable Descriptions: Large Language Models Identify Adolescent Vaping Predictors

Zhang, K.; Zhao, Z.; Hu, Y.; Le, T.

2026-03-09 health informatics
10.64898/2026.03.06.26347816 medRxiv

Objective: To evaluate the effectiveness of various large language models (LLMs) in identifying reliable predictors of electronic nicotine delivery systems (ENDS) initiation among adolescents, using only large-scale survey variable descriptions.

Methods: A cohort of 7,943 tobacco-naive adolescents aged 12-16 years from the Population Assessment of Tobacco and Health (PATH) Study was analyzed to predict ENDS use at wave 5. Four instruction-tuned LLMs (GPT-4o, LLaMA 3.1-70B, Qwen 2.5-72B-Instruct, and DeepSeek-V3) were systematically evaluated for text-based feature selection using only variable descriptions from wave 4.5. Selected features were used to train LightGBM classifiers, with model performance compared to a baseline.

Results: Our findings reveal notable consistency among the four instruction-tuned LLMs, with substantial overlap in the top predictors each model identified. These selected variables spanned critical domains such as peer and household influence, risk perception, and exposure to tobacco-related cues. LightGBM classifiers trained on PATH wave 4.5-5 data using features selected by the LLMs demonstrated strong predictive performance. Notably, Qwen 2.5-72B-Instruct achieved an AUC of 0.791 with 30 predictors, surpassing the baseline AUC of 0.768.

Discussion: The substantial overlap among the top predictors identified by different LLMs suggests a shared reasoning process, despite variations in model architecture and training. LightGBM classifiers trained on these LLM-selected features achieved performance comparable to, or exceeding, models trained on the full set of survey variables, underscoring the high quality of features selected solely from textual descriptions. Moreover, these findings are consistent with previous tobacco regulatory research, further validating the effectiveness of LLM-driven feature selection.

Conclusion: Instruction-tuned large language models can effectively perform text-based feature selection using survey variable descriptions alone, without accessing raw survey data. This scalable, interpretable, and privacy-preserving framework holds promise for behavioral health research and tobacco use surveillance.
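The core of the pipeline the abstract describes is that the LLM sees only variable descriptions, never raw survey data. A minimal sketch of that setup is below; the variable names and descriptions are illustrative placeholders in the style of a survey codebook, not actual PATH variables, and the LLM call is mocked with a canned reply standing in for GPT-4o / LLaMA / Qwen / DeepSeek output.

```python
import json

# Hypothetical codebook entries (illustrative, not actual PATH variables).
VARIABLE_DESCRIPTIONS = {
    "R04_PEER_USE": "Number of respondent's four closest friends who use e-cigarettes",
    "R04_HH_TOBACCO": "Whether anyone in the respondent's household uses a tobacco product",
    "R04_HARM_PERC": "Respondent's perceived harm of e-cigarettes relative to cigarettes",
    "R04_AD_EXPOSURE": "Frequency of seeing e-cigarette advertisements in the past 30 days",
}

def build_selection_prompt(descriptions, k):
    """Assemble a feature-selection prompt from variable descriptions only,
    so the model never touches individual-level survey responses."""
    lines = [f"- {name}: {desc}" for name, desc in descriptions.items()]
    return (
        f"From the survey variables below, select the {k} most predictive of "
        "adolescent e-cigarette initiation. Reply with a JSON list of variable names.\n"
        + "\n".join(lines)
    )

def parse_selection(llm_reply, valid_names):
    """Parse the LLM's JSON reply, keeping only names present in the codebook."""
    selected = json.loads(llm_reply)
    return [name for name in selected if name in valid_names]

prompt = build_selection_prompt(VARIABLE_DESCRIPTIONS, k=2)
# Mocked reply in place of a real API call; note the hallucinated name is dropped.
mock_reply = '["R04_PEER_USE", "R04_HARM_PERC", "NOT_A_VARIABLE"]'
features = parse_selection(mock_reply, VARIABLE_DESCRIPTIONS)
print(features)  # ['R04_PEER_USE', 'R04_HARM_PERC']
```

In the study's full pipeline, the surviving feature names would then index columns of the wave 4.5 data to train a LightGBM classifier against wave 5 ENDS use; validating returned names against the codebook, as above, guards against hallucinated variables.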

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Each entry: rank. Journal: predicted probability (papers in training set; percentile rank among journals).

1. JMIR Public Health and Surveillance: 14.7% (45 papers; top 0.1%)
2. Frontiers in Digital Health: 10.1% (20 papers; top 0.1%)
3. International Journal of Medical Informatics: 10.1% (25 papers; top 0.1%)
4. International Journal of Drug Policy: 4.9% (11 papers; top 0.1%)
5. Nicotine and Tobacco Research: 4.9% (13 papers; top 0.1%)
6. JAMIA Open: 3.6% (37 papers; top 0.4%)
7. PLOS ONE: 3.6% (4510 papers; top 39%)

(50% of probability mass above this point)

8. npj Digital Medicine: 3.6% (97 papers; top 1%)
9. Journal of Medical Internet Research: 2.7% (85 papers; top 2%)
10. Frontiers in Artificial Intelligence: 2.7% (18 papers; top 0.1%)
11. Scientific Reports: 2.4% (3102 papers; top 48%)
12. Journal of Biomedical Informatics: 2.1% (45 papers; top 0.6%)
13. BMJ Open: 1.8% (554 papers; top 9%)
14. Journal of the American Medical Informatics Association: 1.8% (61 papers; top 1%)
15. BMC Medical Informatics and Decision Making: 1.7% (39 papers; top 1%)
16. Nicotine & Tobacco Research: 1.7% (11 papers; top 0.1%)
17. Preventive Medicine Reports: 1.7% (14 papers; top 0.2%)
18. Frontiers in Public Health: 1.7% (140 papers; top 5%)
19. BMC Bioinformatics: 1.2% (383 papers; top 5%)
20. PLOS Digital Health: 1.2% (91 papers; top 2%)
21. The Lancet Digital Health: 1.2% (25 papers; top 0.6%)
22. BMC Medical Research Methodology: 1.2% (43 papers; top 0.9%)
23. JAMA Network Open: 0.9% (127 papers; top 4%)
24. JMIR mHealth and uHealth: 0.7% (10 papers; top 0.4%)
25. Bioinformatics: 0.7% (1061 papers; top 9%)
26. JMIR Research Protocols: 0.7% (18 papers; top 2%)
27. JMIR Medical Informatics: 0.7% (17 papers; top 2%)
28. Addiction: 0.6% (25 papers; top 0.4%)
29. BMJ Health & Care Informatics: 0.6% (13 papers; top 1%)
30. International Journal of Environmental Research and Public Health: 0.6% (124 papers; top 8%)