Temporal deep learning with clinically engineered biomarkers for the early prediction of type 2 diabetes

Naveed, I.; Noaeen, M.; AboArab, M. A.; Kaleem, M. F.; Keshavjee, K.; Guergachi, A.

2025-12-01 health informatics
10.1101/2025.11.26.25341040 medRxiv

Diabetes mellitus remains a major global health burden, causing an estimated 3.4 million deaths in 2024 and highlighting the need for accurate early identification of individuals at risk of developing type 2 diabetes (T2D). Electronic health records (EHRs) provide longitudinal clinical trajectories, yet many predictive frameworks fail to capture short-, intermediate-, and long-term temporal patterns or incorporate clinically validated metabolic biomarkers. This study introduces a hybrid deep learning framework that integrates hierarchical temporal modeling with clinically engineered predictors for early T2D risk estimation. The approach includes data preprocessing, temporal sequencing, and the incorporation of derived biomarkers such as triglyceride-to-high-density lipoprotein cholesterol ratio (TG/HDL-C), low-density lipoprotein to high-density lipoprotein cholesterol ratio (LDL/HDL-C), total cholesterol to high-density lipoprotein cholesterol ratio (TC/HDL-C), very low-density lipoprotein (VLDL), obesity status, and prediabetes indicators. A multilevel convolutional neural network (CNN) extracts low-, mid-, and high-level temporal features, which are processed in parallel by long short-term memory (LSTM) modules to capture multi-scale temporal dependencies. The fused temporal and biochemical representations form a unified CNN-LSTM architecture that is evaluated using standard classification metrics. Experiments conducted on 19,218 patients and 368,790 clinical visits from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN) achieved 93.2% accuracy, 75.7% sensitivity, 98.8% specificity, and an 84.4% F1 score, outperforming bidirectional long short-term memory (Bi-LSTM), support vector machine (SVM), k-nearest neighbor (KNN), and baseline CNN-LSTM models. Feature importance analysis identified fasting blood sugar (FBS), glycated hemoglobin (HbA1c), and lipid ratios as the strongest predictors. 
By combining temporal representation learning with clinically grounded biomarkers, the proposed framework provides an interpretable, scalable, and robust foundation for early diabetes risk prediction and can be extended to other chronic diseases characterized by longitudinal EHR data.

Author Summary

In this study, we focus on the growing challenge of type 2 diabetes, a condition that develops gradually and often remains undetected until significant health damage has occurred. Our goal was to create an approach that identifies individuals at increased risk much earlier by examining how their clinical measurements change over time. To achieve this, we analyzed routine health information collected during repeated medical visits and combined it with key biological markers known to reflect metabolic health, such as blood sugar levels, long-term glucose measures, and cholesterol-related indicators. We developed a computational model that learns how these factors evolve and how they relate to the future onset of diabetes. When tested on a large population dataset, our model detected risk patterns more accurately than several widely used prediction methods. We also found that variations in blood sugar, long-term glucose, and lipid measures played a particularly important role in identifying individuals likely to develop the disease. By offering earlier and more reliable risk assessment, our work supports more proactive and personalized preventive care. Ultimately, this approach has the potential to help clinicians intervene sooner and reduce the burden of diabetes-related complications.
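
The derived biomarkers named in the abstract (TG/HDL-C, LDL/HDL-C, TC/HDL-C, VLDL, obesity status, and a prediabetes indicator) can be illustrated with a minimal Python sketch. The abstract does not give the formulas or cutoffs the authors used, so everything here is an assumption: VLDL is estimated as triglycerides divided by 5 (the common Friedewald-style estimate, valid for mg/dL), obesity is taken as BMI of 30 or more, and the prediabetes range for fasting blood sugar is a configurable parameter defaulting to 6.1 to 6.9 mmol/L. The names `LipidPanel` and `derived_biomarkers` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LipidPanel:
    """Lipid values from one visit; all assumed to be in mg/dL."""
    triglycerides: float
    hdl_c: float
    ldl_c: float
    total_cholesterol: float


def derived_biomarkers(
    panel: LipidPanel,
    bmi: float,
    fbs_mmol_l: float,
    obesity_bmi_cutoff: float = 30.0,                   # assumption, not from the paper
    prediabetes_fbs_range: tuple = (6.1, 6.9),          # assumption, not from the paper
) -> dict:
    """Compute illustrative versions of the engineered predictors."""
    if panel.hdl_c <= 0:
        raise ValueError("HDL-C must be positive to form a ratio")
    low, high = prediabetes_fbs_range
    return {
        "tg_hdl_ratio": panel.triglycerides / panel.hdl_c,
        "ldl_hdl_ratio": panel.ldl_c / panel.hdl_c,
        "tc_hdl_ratio": panel.total_cholesterol / panel.hdl_c,
        # Assumed estimate: VLDL is roughly TG / 5 when TG is in mg/dL.
        "vldl_estimate": panel.triglycerides / 5.0,
        "obese": bmi >= obesity_bmi_cutoff,
        "prediabetes": low <= fbs_mmol_l <= high,
    }


if __name__ == "__main__":
    visit = LipidPanel(triglycerides=150.0, hdl_c=50.0,
                       ldl_c=100.0, total_cholesterol=180.0)
    print(derived_biomarkers(visit, bmi=31.0, fbs_mmol_l=6.5))
```

In a longitudinal setting, a function like this would be applied to each visit before the visits are ordered into per-patient sequences; the cutoffs would need to be replaced by whatever definitions the study actually used.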

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine | 97 papers in training set | Top 0.4% | 12.7%
2. Communications Medicine | 85 papers in training set | Top 0.1% | 8.4%
3. Journal of Biomedical Informatics | 45 papers in training set | Top 0.2% | 6.8%
4. JAMIA Open | 37 papers in training set | Top 0.2% | 6.4%
5. Scientific Reports | 3102 papers in training set | Top 24% | 4.8%
6. eBioMedicine | 130 papers in training set | Top 0.2% | 4.3%
7. Nature Machine Intelligence | 61 papers in training set | Top 0.8% | 4.0%
8. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.7% | 4.0%

50% of probability mass above

9. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 0.4% | 4.0%
10. Nature Communications | 4913 papers in training set | Top 40% | 3.6%
11. Bioinformatics | 1061 papers in training set | Top 5% | 3.6%
12. PLOS Digital Health | 91 papers in training set | Top 0.8% | 3.2%
13. Journal of Medical Internet Research | 85 papers in training set | Top 2% | 3.2%
14. JMIR Public Health and Surveillance | 45 papers in training set | Top 0.8% | 2.9%
15. Patterns | 70 papers in training set | Top 0.7% | 1.9%
16. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.9%
17. PLOS ONE | 4510 papers in training set | Top 52% | 1.8%
18. Expert Systems with Applications | 11 papers in training set | Top 0.2% | 1.2%
19. Cell Reports Medicine | 140 papers in training set | Top 6% | 1.1%
20. Communications Biology | 886 papers in training set | Top 17% | 0.9%
21. PLOS Computational Biology | 1633 papers in training set | Top 22% | 0.9%
22. eClinicalMedicine | 55 papers in training set | Top 2% | 0.8%
23. Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.7%
24. Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.8% | 0.7%
25. PNAS Nexus | 147 papers in training set | Top 2% | 0.7%
26. iScience | 1063 papers in training set | Top 32% | 0.7%
27. Frontiers in Digital Health | 20 papers in training set | Top 1% | 0.7%
28. Advanced Science | 249 papers in training set | Top 20% | 0.7%
29. eLife | 5422 papers in training set | Top 60% | 0.7%
30. JMIR Medical Informatics | 17 papers in training set | Top 2% | 0.6%
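
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the listed percentages. This is only a sketch: the function name `journals_to_reach` is hypothetical, and the inputs are the rounded percentages shown in the list, so the running total is approximate.

```python
def journals_to_reach(probabilities_pct, target_pct=50.0):
    """Return (count, cumulative) for the first rank where the running
    total of ranked probabilities meets or exceeds target_pct.
    Returns None if the target is never reached."""
    cumulative = 0.0
    for count, p in enumerate(probabilities_pct, start=1):
        cumulative += p
        if cumulative >= target_pct:
            return count, cumulative
    return None


if __name__ == "__main__":
    # Rounded probabilities of the ten highest-ranked journals, in rank order.
    top_ten = [12.7, 8.4, 6.8, 6.4, 4.8, 4.3, 4.0, 4.0, 4.0, 3.6]
    count, cumulative = journals_to_reach(top_ten)
    print(f"{count} journals, about {cumulative:.1f}%")  # 8 journals, about 51.4%
```

The first seven entries sum to about 47.4%, so the eighth is the first to push the running total past the 50% mark.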