Identifying Effect Modification of Latent Population Characteristics on Risk Factors with a Sparse Varying Coefficient Regression

Wang, R.; Fang, L.; Wang, Y.; Jin, J.

2024-12-05 genetics
10.1101/2024.11.30.626101 bioRxiv

Leveraging observational data to understand the associations between risk factors and disease outcomes, and to predict disease risk, is a common task in epidemiology. Traditional linear regression and other machine learning models have been used extensively for this task, but they typically treat the associations between risk factors and disease outcomes as fixed. In many cases, however, such associations may vary with underlying features of the individuals, such as subpopulation characteristics and environmental factors. Although data on these latent features may not be available, the observed risk factors may capture some proportion of their variation. Thus, extracting latent factors from the risk factors and incorporating this effect modification into the model may better capture the underlying data structure and improve inference. We develop a novel regression model in which some coefficients vary as functions of latent features extracted from the risk factors. We demonstrate the superiority of our approach in various data settings via simulation studies. In an application to a dataset of lung cancer patients from The Cancer Genome Atlas (TCGA) Program, our approach led to a 6% to 118% increase in (AUC - 0.5) for distinguishing between lung cancer stages compared with the classic lasso and elastic net regressions, and it identified interesting latent effect modifications associated with certain gene pathways.
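The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' model or code: it assumes the latent feature is approximated by the first principal component of the risk factors, that each coefficient varies linearly with it (so effect modification appears as interaction columns x_j * z), and that sparsity comes from a plain lasso fitted by proximal gradient descent. All function names, the simulated data, and the penalty value are hypothetical.

```python
import numpy as np

def latent_factors(X, k):
    """Top-k principal component scores of X, scaled to unit variance.
    Assumption: PCA stands in for whatever latent-feature extraction is used."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, :k] * S[:k]
    return Z / Z.std(axis=0)

def varying_coef_design(X, Z):
    """Design matrix: main effects x_j, then interactions x_j * z_k.
    A coefficient on x_j * z_k lets the effect of x_j vary linearly with z_k."""
    blocks = [X] + [X * Z[:, [k]] for k in range(Z.shape[1])]
    return np.hstack(blocks)

def lasso_ista(D, y, lam, n_iter=2000):
    """Minimise (1/2n)||y - D b||^2 + lam * ||b||_1 on standardized columns
    by proximal gradient descent (ISTA). Assumes no constant columns in D."""
    n, p = D.shape
    Ds = (D - D.mean(axis=0)) / D.std(axis=0)
    yc = y - y.mean()
    step = n / np.linalg.norm(Ds, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        u = b - step * (Ds.T @ (Ds @ b - yc)) / n
        b = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)  # soft threshold
    return b

# Simulated data: one unobserved feature z_true drives all risk factors and
# modifies the effect of the first risk factor only.
rng = np.random.default_rng(0)
n, p = 2000, 5
z_true = rng.normal(size=n)
X = 0.8 * z_true[:, None] + 0.6 * rng.normal(size=(n, p))
y = X[:, 0] * (1.0 + 2.0 * z_true) + 0.5 * rng.normal(size=n)

Z = latent_factors(X, k=1)        # recovered from X alone; sign is arbitrary
D = varying_coef_design(X, Z)
b = lasso_ista(D, y, lam=0.05)
main, modif = b[:p], b[p:]        # main effects, effect-modification terms
print("main effects:       ", np.round(main, 2))
print("effect modification:", np.round(modif, 2))
```

In this toy setting the largest main effect and the largest effect-modification term should both fall on the first risk factor. The sign of the modification term depends on the arbitrary sign of the principal component.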

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. PLOS Genetics: 756 papers in training set, Top 1%, 10.0%
2. Genetic Epidemiology: 46 papers in training set, Top 0.1%, 8.3%
3. PLOS Computational Biology: 1633 papers in training set, Top 4%, 8.3%
4. Bioinformatics: 1061 papers in training set, Top 3%, 7.1%
5. Biometrics: 22 papers in training set, Top 0.1%, 7.1%
6. PLOS ONE: 4510 papers in training set, Top 28%, 6.3%
7. Frontiers in Genetics: 197 papers in training set, Top 0.9%, 6.3%

50% of probability mass above

8. International Journal of Epidemiology: 74 papers in training set, Top 0.4%, 4.8%
9. Nature Communications: 4913 papers in training set, Top 35%, 4.3%
10. BMC Bioinformatics: 383 papers in training set, Top 3%, 3.6%
11. Statistics in Medicine: 34 papers in training set, Top 0.1%, 2.1%
12. American Journal of Epidemiology: 57 papers in training set, Top 0.5%, 2.1%
13. The American Journal of Human Genetics: 206 papers in training set, Top 2%, 2.1%
14. Scientific Reports: 3102 papers in training set, Top 54%, 1.9%
15. Briefings in Bioinformatics: 326 papers in training set, Top 4%, 1.8%
16. European Journal of Human Genetics: 49 papers in training set, Top 0.6%, 1.7%
17. eLife: 5422 papers in training set, Top 44%, 1.6%
18. Biostatistics: 21 papers in training set, Top 0.1%, 1.5%
19. Journal of the American Medical Informatics Association: 61 papers in training set, Top 1%, 1.3%
20. Human Molecular Genetics: 130 papers in training set, Top 2%, 1.2%
21. Genome Research: 409 papers in training set, Top 4%, 0.8%
22. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 43%, 0.8%
23. BioData Mining: 15 papers in training set, Top 0.9%, 0.7%
24. Journal of The Royal Society Interface: 189 papers in training set, Top 5%, 0.7%
25. Communications Biology: 886 papers in training set, Top 27%, 0.7%
26. Genome Medicine: 154 papers in training set, Top 9%, 0.7%
27. iScience: 1063 papers in training set, Top 35%, 0.7%
28. NAR Genomics and Bioinformatics: 214 papers in training set, Top 4%, 0.7%
29. Physical Biology: 43 papers in training set, Top 2%, 0.7%
30. Journal of Computational Biology: 37 papers in training set, Top 0.6%... 
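
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked from the listed percentages. The sketch below assumes the cutoff is simply the smallest number of top-ranked journals whose probabilities sum to at least 50%, and it uses the rounded values shown in the list; the function name is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (percent) of the ten top-ranked journals, as listed.
probs = [10.0, 8.3, 8.3, 7.1, 7.1, 6.3, 6.3, 4.8, 4.3, 3.6]

def journals_to_reach(probs, threshold):
    """Return the smallest k such that the first k probabilities sum to at
    least `threshold`, or None if the threshold is never reached."""
    for k, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return k
    return None

k = journals_to_reach(probs, 50.0)
print(f"{k} journals reach 50% of the probability mass")
```

The first six journals sum to about 47.1% and the first seven to about 53.4%, which agrees with the cutoff placed after the seventh entry.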