Dirichlet process mixture models to estimate outcomes for individuals with missing predictor data: application to predict optimal type 2 diabetes therapy in electronic health record data

Cardoso, P.; Dennis, J. M.; Bowden, J.; Shields, B.; McKinley, T.; MASTERMIND Consortium

2022-07-29 epidemiology
10.1101/2022.07.26.22278066 medRxiv

Background: Missing data is a common problem in regression modelling. Much of the literature focuses on handling missing outcome variables, but there are also challenges when dealing with missing predictor information, particularly when trying to build prediction models for use in practice.

Methods: We develop a flexible Bayesian approach for handling missing predictor information in regression models. For prediction, this provides practitioners with full posterior predictive distributions for both the missing predictor information and the outcome variable, conditional on the observed predictors. We apply our approach to a previously proposed treatment selection model for type 2 diabetes second-line therapies. Our approach combines a regression model and a Dirichlet process mixture model (DPMM), where the former defines the treatment selection model and the latter provides a flexible way to model the joint distribution of the predictors.

Results: We show that under missing-completely-at-random (MCAR) and missing-at-random (MAR) assumptions (with respect to the missing predictors), the DPMM can model complex relationships between predictor variables and predict missing values conditionally on existing information. We also demonstrate that, in the presence of multiple missing predictors, the DPMM can be used to explore which variable(s), if collected, could provide the most additional information about the likely outcome.

Conclusions: Our approach can provide practitioners with supplementary information to aid treatment selection decisions in the presence of missing data, and can be readily extended to other types of response model.

Key Messages:
- Missing predictor variables present a significant challenge when building and implementing prediction models in clinical practice.
- Removing individuals with missing information and performing a complete case analysis can lead to imprecision and bias.
- Multiple imputation approaches typically translate uncertainty through prediction model parameter standard errors, as opposed to a consistent joint probability model.
- Alternatively, a Bayesian approach using Dirichlet process mixture models (DPMMs) offers a flexible way to model complex joint distributions of predictor variables, which can be used to estimate posterior (predictive) distributions for the missing predictors, conditional on the observed predictors.
- Using a DPMM in this way allows uncertainties around missing predictor data to be propagated through to a prediction model of interest using a Bayesian hierarchical framework. This allows prediction models to be developed using datasets with incomplete predictor information (assuming missing-completely-at-random/missing-at-random). Furthermore, predictions can be made on new individuals even if they have incomplete predictor information (under the same assumptions).
- This approach provides full posterior predictive probability distributions for both missing predictor variables and the outcome variable, allowing a wide range of probabilistic model outputs to be derived to support clinical decision making.
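
The idea of predicting a missing predictor conditionally on the observed ones, and then propagating that uncertainty to the outcome, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a fixed two-component Gaussian mixture standing in for a single posterior draw from a fitted DPMM, and a hypothetical linear response model. All numeric values (`weights`, `means`, `covs`, `beta`, `sigma`) are made up for illustration; in a full Bayesian analysis these would be sampled jointly, for example by MCMC.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)


def conditional_mixture(x, weights, means, covs):
    """Condition a Gaussian mixture on the observed entries of x (NaN = missing).

    Returns the updated component weights and, per component, the
    conditional mean and covariance of the missing entries.
    """
    miss = np.isnan(x)
    obs = ~miss
    log_w, cond_means, cond_covs = [], [], []
    for w, mu, S in zip(weights, means, covs):
        S_oo = S[np.ix_(obs, obs)]
        S_mo = S[np.ix_(miss, obs)]
        S_mm = S[np.ix_(miss, miss)]
        # Components that explain the observed values well gain weight.
        log_w.append(np.log(w) + gaussian_logpdf(x[obs], mu[obs], S_oo))
        gain = S_mo @ np.linalg.inv(S_oo)
        cond_means.append(mu[miss] + gain @ (x[obs] - mu[obs]))
        cond_covs.append(S_mm - gain @ S_mo.T)
    log_w = np.array(log_w)
    new_w = np.exp(log_w - log_w.max())
    return new_w / new_w.sum(), cond_means, cond_covs


def sample_missing(x, weights, means, covs, n_draws, rng):
    """Draw completed predictor vectors from the conditional mixture."""
    new_w, cond_means, cond_covs = conditional_mixture(x, weights, means, covs)
    miss = np.isnan(x)
    components = rng.choice(len(new_w), size=n_draws, p=new_w)
    draws = np.tile(x, (n_draws, 1))
    for i, k in enumerate(components):
        draws[i, miss] = rng.multivariate_normal(cond_means[k], cond_covs[k])
    return draws


# Hypothetical mixture over two correlated predictors (illustrative values only).
weights = np.array([0.5, 0.5])
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.array([[1.0, 0.8], [0.8, 1.0]])] * 2

# A new individual with the second predictor missing.
x_new = np.array([4.8, np.nan])
new_w, cond_means, cond_covs = conditional_mixture(x_new, weights, means, covs)

rng = np.random.default_rng(0)
completed = sample_missing(x_new, weights, means, covs, n_draws=2000, rng=rng)

# Hypothetical linear response model: y = beta0 + beta1*x1 + beta2*x2 + noise.
beta = np.array([1.0, 0.5, -0.3])
sigma = 0.5
outcome_draws = (
    beta[0] + completed @ beta[1:] + rng.normal(0.0, sigma, size=len(completed))
)

print("updated component weights:", new_w)
print("outcome 95% interval:", np.percentile(outcome_draws, [2.5, 97.5]))
```

The spread of `outcome_draws` reflects both the residual noise of the response model and the uncertainty about the missing predictor. Comparing that spread before and after a variable is observed is one simple way to ask which missing variable would be most informative to collect.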

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. BMC Medical Research Methodology: 43 papers in training set; Top 0.1%; 33.7%
2. BMC Medical Informatics and Decision Making: 39 papers in training set; Top 0.3%; 8.6%
3. Epidemiology: 26 papers in training set; Top 0.1%; 7.0%
4. BMJ Open: 554 papers in training set; Top 4%; 5.0%
(50% of probability mass above)
5. Journal of the American Medical Informatics Association: 61 papers in training set; Top 0.7%; 3.9%
6. PLOS ONE: 4510 papers in training set; Top 38%; 3.7%
7. American Journal of Epidemiology: 57 papers in training set; Top 0.4%; 3.1%
8. International Journal of Medical Informatics: 25 papers in training set; Top 0.5%; 2.7%
9. Statistics in Medicine: 34 papers in training set; Top 0.1%; 2.4%
10. Journal of Biomedical Informatics: 45 papers in training set; Top 0.6%; 2.1%
11. Biology Methods and Protocols: 53 papers in training set; Top 0.6%; 2.1%
12. Pharmacoepidemiology and Drug Safety: 13 papers in training set; Top 0.2%; 1.7%
13. PLOS Computational Biology: 1633 papers in training set; Top 18%; 1.4%
14. Frontiers in Artificial Intelligence: 18 papers in training set; Top 0.5%; 1.1%
15. Journal of Medical Internet Research: 85 papers in training set; Top 4%; 0.8%
16. Wellcome Open Research: 57 papers in training set; Top 2%; 0.8%
17. BMC Research Notes: 29 papers in training set; Top 0.4%; 0.8%
18. JAMIA Open: 37 papers in training set; Top 2%; 0.7%
19. Developmental Cognitive Neuroscience: 81 papers in training set; Top 0.7%; 0.7%
20. Heliyon: 146 papers in training set; Top 8%; 0.7%
21. Scientific Reports: 3102 papers in training set; Top 78%; 0.7%
22. Computers in Biology and Medicine: 120 papers in training set; Top 5%; 0.7%
23. International Journal of Epidemiology: 74 papers in training set; Top 3%; 0.7%
24. Psychiatry Research: 35 papers in training set; Top 2%; 0.5%
25. PLOS Digital Health: 91 papers in training set; Top 3%; 0.5%
26. Journal of Affective Disorders Reports: 10 papers in training set; Top 0.5%; 0.5%
27. Frontiers in Digital Health: 20 papers in training set; Top 2%; 0.5%
28. Trials: 25 papers in training set; Top 2%; 0.5%
29. Medical Decision Making: 10 papers in training set; Top 0.4%; 0.5%