
DiabetIA: Building Machine Learning Models for Type 2 Diabetes Complications

Tripp, J.; Santana-Quinteros, D.; Perez-Estrada, R.; Rodriguez-Moran, M. F.; Arcos-Gonzalez, C.; Mercado-Rios, J.; Cristobal-Perez, F.; Hernandez-Martinez, B. R.; Nava-Aguilar, M. A.; Gonzalez-Arroyo, G.; Salazar-Fernandez, E. P.; Quiroz-Armada, P. S.; Cortes-Vieyra, R.; Noriega-Cisneros, R.; Zinzun-Ixta, G.; Maldonado-Pichardo, M. C.; Flores-Alvarez, L. J.; Reyes-Granados, S. C.; Chagolla-Morales, R.; Paredes-Saralegui, J. G.; Flores-Garrido, M.; Garcia-Velazquez, L. M.; Figueroa-Mora, K. M.; Gomez-Garcia, A.; Alvarez-Aguilar, C.; Lopez-Pineda, A.

2023-10-23 health informatics
10.1101/2023.10.22.23297277 medRxiv

Background: Artificial intelligence (AI) models applied to diabetes mellitus research have grown in recent years, particularly in the field of medical imaging. However, little work has been done exploring real-world data (RWD) sources such as electronic health records (EHR), mostly due to the lack of reliable public diabetes databases. Yet with more than 500 million patients affected worldwide, complications of this condition have catastrophic consequences. In this manuscript we aim first to extract, clean, and transform a novel diabetes research database, DiabetIA, and second to train machine learning (ML) models to predict diabetic complications.

Methods: In this study, we used observational retrospective data from the Mexican Institute for Social Security (IMSS), extracting and de-identifying EHR data for almost 2 million patients seen at primary care facilities. After applying eligibility criteria for this study, we constructed a diabetes complications database. Next, we trained naive Bayesian models with various subsets of variables, including an expert-selected model.

Results: The DiabetIA database is composed of 136,674 patients (414,770 records and 447 variables), with 33,314 presenting diabetes (24.3%). The most frequent diabetic complications were diabetic foot with 2,537 patients, nephropathy with 1,914 patients, retinopathy with 1,829 patients, and neuropathy with 786 patients. These complications were accurately predicted by the Gaussian naive Bayesian models, with an average area under the curve (AUC) of 0.86. Our expert-selected model achieved an average AUC of 0.84 with 21 curated variables.

Conclusion: Our study offers the largest longitudinal research database from EHR data in Latin America. The DiabetIA database provides a useful resource to estimate the burden of diabetic complications on healthcare systems. Machine learning models can provide accurate estimations of the total cases presented in medical units. For patients and their clinicians, it is imperative to have a way to calculate this risk and start clinical interventions to slow down or prevent the complications of this condition.

Brief description: The study centers on establishing the DiabetIA database, a substantial repository encompassing de-identified electronic health records from 136,674 patients sourced from primary care facilities within the Mexican Institute for Social Security (IMSS). Our efforts involved curating, cleansing, and transforming this extensive dataset, and then employing machine learning models to predict diabetic complications with high accuracy.
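
As an illustration of the modelling approach named in the abstract, the following is a minimal sketch of a Gaussian naive Bayes classifier scored by AUC. It is not the authors' pipeline: the data are synthetic, the two features are invented, and the helper names (`fit_gaussian_nb`, `positive_score`, `roc_auc`) are hypothetical. It uses only the Python standard library so it can be run as-is.

```python
import math
import random

def fit_gaussian_nb(X, y):
    """Estimate the log prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def positive_score(model, x):
    """Log-odds of class 1 versus class 0 under the naive independence assumption."""
    def log_joint(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
    return log_joint(1) - log_joint(0)

def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in data (NOT DiabetIA): two made-up numeric features per patient.
rng = random.Random(0)

def make_rows(n, shift):
    return [[rng.gauss(shift, 1.0), rng.gauss(2 * shift, 1.5)] for _ in range(n)]

X_train = make_rows(300, 0.0) + make_rows(100, 1.0)
y_train = [0] * 300 + [1] * 100
X_test = make_rows(150, 0.0) + make_rows(50, 1.0)
y_test = [0] * 150 + [1] * 50

model = fit_gaussian_nb(X_train, y_train)
scores = [positive_score(model, x) for x in X_test]
auc = roc_auc(scores, y_test)
print(f"AUC on synthetic data: {auc:.2f}")
```

The AUC printed here reflects only the synthetic data and has no bearing on the 0.86 and 0.84 figures reported above. In practice a library implementation such as scikit-learn's `GaussianNB` would likely be preferred over hand-written code.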

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. JAMIA Open: 37 papers in training set, Top 0.1%, 22.3%
2. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 0.1%, 14.5%
3. PLOS ONE: 4510 papers in training set, Top 19%, 10.0%
4. JMIR Medical Informatics: 17 papers in training set, Top 0.1%, 7.1%
50% of probability mass above
5. JMIR Public Health and Surveillance: 45 papers in training set, Top 0.2%, 6.3%
6. Frontiers in Medicine: 113 papers in training set, Top 1%, 3.6%
7. International Journal of Medical Informatics: 25 papers in training set, Top 0.4%, 3.5%
8. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.8%, 3.2%
9. PLOS Digital Health: 91 papers in training set, Top 1%, 2.1%
10. BMJ Open: 554 papers in training set, Top 8%, 2.1%
11. Journal of Medical Internet Research: 85 papers in training set, Top 2%, 2.1%
12. BMJ Health & Care Informatics: 13 papers in training set, Top 0.3%, 2.1%
13. Scientific Reports: 3102 papers in training set, Top 54%, 1.9%
14. JMIR Formative Research: 32 papers in training set, Top 0.9%, 1.6%
15. Journal of Biomedical Informatics: 45 papers in training set, Top 1.0%, 1.3%
16. Cureus: 67 papers in training set, Top 4%, 1.2%
17. BMC Infectious Diseases: 118 papers in training set, Top 5%, 0.8%
18. npj Digital Medicine: 97 papers in training set, Top 4%, 0.7%
19. Frontiers in Cardiovascular Medicine: 49 papers in training set, Top 3%, 0.7%
20. BMC Health Services Research: 42 papers in training set, Top 2%, 0.7%
21. BMC Medical Research Methodology: 43 papers in training set, Top 2%, 0.7%
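
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a running sum of the listed percentages. The sketch below uses the first six values from the list; the function name `rank_reaching` is illustrative and not part of the source page.

```python
# Predicted probabilities (percent) for the six highest-ranked journals, copied from the list.
probabilities = [
    ("JAMIA Open", 22.3),
    ("BMC Medical Informatics and Decision Making", 14.5),
    ("PLOS ONE", 10.0),
    ("JMIR Medical Informatics", 7.1),
    ("JMIR Public Health and Surveillance", 6.3),
    ("Frontiers in Medicine", 3.6),
]

def rank_reaching(threshold, ranked):
    """Return the 1-based rank at which the running total first reaches the threshold."""
    total = 0.0
    for rank, (_, p) in enumerate(ranked, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None, total

rank, total = rank_reaching(50.0, probabilities)
print(rank, round(total, 1))  # prints: 4 53.9
```

The first three journals sum to 46.8%, so the threshold is first crossed at rank 4, consistent with the cutoff marked in the list.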