Temporal deep learning with clinically engineered biomarkers for the early prediction of type 2 diabetes

Naveed, I.; Noaeen, M.; AboArab, M. A.; Kaleem, M. F.; Keshavjee, K.; Guergachi, A.

2025-12-01 health informatics

10.1101/2025.11.26.25341040 medRxiv


Diabetes mellitus remains a major global health burden, causing an estimated 3.4 million deaths in 2024 and highlighting the need for accurate early identification of individuals at risk of developing type 2 diabetes (T2D). Electronic health records (EHRs) provide longitudinal clinical trajectories, yet many predictive frameworks fail to capture short-, intermediate-, and long-term temporal patterns or incorporate clinically validated metabolic biomarkers. This study introduces a hybrid deep learning framework that integrates hierarchical temporal modeling with clinically engineered predictors for early T2D risk estimation. The approach includes data preprocessing, temporal sequencing, and the incorporation of derived biomarkers such as triglyceride-to-high-density lipoprotein cholesterol ratio (TG/HDL-C), low-density lipoprotein to high-density lipoprotein cholesterol ratio (LDL/HDL-C), total cholesterol to high-density lipoprotein cholesterol ratio (TC/HDL-C), very low-density lipoprotein (VLDL), obesity status, and prediabetes indicators. A multilevel convolutional neural network (CNN) extracts low-, mid-, and high-level temporal features, which are processed in parallel by long short-term memory (LSTM) modules to capture multi-scale temporal dependencies. The fused temporal and biochemical representations form a unified CNN-LSTM architecture that is evaluated using standard classification metrics. Experiments conducted on 19,218 patients and 368,790 clinical visits from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN) achieved 93.2% accuracy, 75.7% sensitivity, 98.8% specificity, and an 84.4% F1 score, outperforming bidirectional long short-term memory (Bi-LSTM), support vector machine (SVM), k-nearest neighbor (KNN), and baseline CNN-LSTM models. Feature importance analysis identified fasting blood sugar (FBS), glycated hemoglobin (HbA1c), and lipid ratios as the strongest predictors. 
By combining temporal representation learning with clinically grounded biomarkers, the proposed framework provides an interpretable, scalable, and robust foundation for early diabetes risk prediction and can be extended to other chronic diseases characterized by longitudinal EHR data.

Author Summary

In this study, we focus on the growing challenge of type 2 diabetes, a condition that develops gradually and often remains undetected until significant health damage has occurred. Our goal was to create an approach that identifies individuals at increased risk much earlier by examining how their clinical measurements change over time. To achieve this, we analyzed routine health information collected during repeated medical visits and combined it with key biological markers known to reflect metabolic health, such as blood sugar levels, long-term glucose measures, and cholesterol-related indicators. We developed a computational model that learns how these factors evolve and how they relate to the future onset of diabetes. When tested on a large population dataset, our model detected risk patterns more accurately than several widely used prediction methods. We also found that variations in blood sugar, long-term glucose, and lipid measures played a particularly important role in identifying individuals likely to develop the disease. By offering earlier and more reliable risk assessment, our work supports more proactive and personalized preventive care. Ultimately, this approach has the potential to help clinicians intervene sooner and reduce the burden of diabetes-related complications.
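The derived lipid biomarkers named in the abstract (TG/HDL-C, LDL/HDL-C, TC/HDL-C, and VLDL) can be illustrated with a minimal Python sketch. This is not the authors' code: the `LipidPanel` class, the `derived_biomarkers` function, and the output key names are hypothetical, all inputs are assumed to be in mg/dL, and VLDL is estimated as triglycerides divided by 5 (the common Friedewald-style approximation), which the abstract does not confirm as the formula used in the study.

```python
from dataclasses import dataclass


@dataclass
class LipidPanel:
    """Lipid measurements from one visit (hypothetical container, assumed mg/dL)."""
    triglycerides: float
    hdl: float
    ldl: float
    total_cholesterol: float


def derived_biomarkers(panel: LipidPanel) -> dict[str, float]:
    """Compute the lipid ratios listed in the abstract for a single visit."""
    if panel.hdl <= 0:
        # Every ratio divides by HDL-C, so a non-positive value is unusable.
        raise ValueError("HDL-C must be positive to form ratios")
    return {
        "tg_hdl": panel.triglycerides / panel.hdl,
        "ldl_hdl": panel.ldl / panel.hdl,
        "tc_hdl": panel.total_cholesterol / panel.hdl,
        # Assumption: VLDL estimated as TG / 5; the source does not state its formula.
        "vldl_est": panel.triglycerides / 5.0,
    }


if __name__ == "__main__":
    visit = LipidPanel(triglycerides=150.0, hdl=50.0, ldl=100.0, total_cholesterol=180.0)
    print(derived_biomarkers(visit))
```

In a longitudinal setting, a function like this would presumably be applied to each visit so that the ratios form additional channels of the per-patient time series. Obesity status and prediabetes indicators are left out because the abstract does not give the thresholds used to define them.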
