
Development of Explainable Machine Learning Framework for Early Detection and Risk Stratification of Diabetes in Age Specific Variations

Lukhele, N.; Mostafa, F.

2026-04-27 health informatics
10.64898/2026.04.25.26351733 medRxiv

Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset, and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods: A clinical dataset of 264 patient records was used to classify individuals as non-diabetic, prediabetic, or diabetic. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV with 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall, and F1-score. The area under the receiver operating characteristic curve (AUC) was calculated for the best-generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs >35 years) and gender. SHAP was used for model interpretability. Results: Ensemble methods outperformed linear or single-tree approaches, with Gradient Boosting showing the most stable generalization (test accuracy 0.981, cross-validation accuracy 0.972). AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic), and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged >35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion: This study presents an ML framework that integrates age-stratified modelling with explainable ML to improve interpretability. The findings are clinically relevant and can support clinical decision-making systems, individualized risk assessment, and potentially targeted intervention in diabetes progression.
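The abstract reports one AUC per class for a three-class problem, which is usually done one-vs-rest: each class in turn is treated as the positive label and the other two as negative. The sketch below illustrates that calculation in plain Python. It is not the authors' code: the function names, the toy labels, and the predicted probabilities are hypothetical and chosen only to show the mechanics.

```python
from typing import Dict, List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(
    y_true: Sequence[str],
    proba: Sequence[Sequence[float]],
    classes: List[str],
) -> Dict[str, float]:
    """Compute one AUC per class, treating that class as positive and the rest as negative."""
    result = {}
    for k, name in enumerate(classes):
        labels = [1 if y == name else 0 for y in y_true]
        scores = [row[k] for row in proba]  # predicted probability of class k
        result[name] = binary_auc(labels, scores)
    return result


# Class names follow the abstract; everything below is hypothetical toy data.
CLASSES = ["non-diabetic", "prediabetic", "diabetic"]
y_true = ["non-diabetic", "non-diabetic", "prediabetic",
          "prediabetic", "diabetic", "diabetic"]
proba = [
    [0.80, 0.15, 0.05],
    [0.60, 0.30, 0.10],
    [0.30, 0.50, 0.20],
    [0.50, 0.35, 0.15],
    [0.10, 0.30, 0.60],
    [0.05, 0.40, 0.55],
]

aucs = one_vs_rest_auc(y_true, proba, CLASSES)
for name in CLASSES:
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a dataset of a few hundred records. A library routine would normally be preferred in practice.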

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.1% | 18.5%
2 | JAMIA Open | 37 papers in training set | Top 0.1% | 10.0%
3 | Scientific Reports | 3102 papers in training set | Top 14% | 6.8%
4 | PLOS ONE | 4510 papers in training set | Top 28% | 6.3%
5 | Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.1% | 4.8%
6 | Computers in Biology and Medicine | 120 papers in training set | Top 0.8% | 3.6%
50% of probability mass above
7 | Journal of Medical Internet Research | 85 papers in training set | Top 1% | 3.6%
8 | JMIR Public Health and Surveillance | 45 papers in training set | Top 0.7% | 3.6%
9 | PLOS Digital Health | 91 papers in training set | Top 0.8% | 3.6%
10 | JMIR Medical Informatics | 17 papers in training set | Top 0.4% | 3.2%
11 | Frontiers in Public Health | 140 papers in training set | Top 4% | 2.1%
12 | International Journal of Environmental Research and Public Health | 124 papers in training set | Top 3% | 1.9%
13 | Frontiers in Digital Health | 20 papers in training set | Top 0.5% | 1.9%
14 | International Journal of Medical Informatics | 25 papers in training set | Top 0.7% | 1.9%
15 | BMC Medical Research Methodology | 43 papers in training set | Top 0.6% | 1.7%
16 | Informatics in Medicine Unlocked | 21 papers in training set | Top 0.5% | 1.5%
17 | Biology Methods and Protocols | 53 papers in training set | Top 1% | 1.3%
18 | Frontiers in Medicine | 113 papers in training set | Top 5% | 1.2%
19 | BioMed Research International | 25 papers in training set | Top 2% | 1.2%
20 | BMC Infectious Diseases | 118 papers in training set | Top 5% | 0.9%
21 | PeerJ | 261 papers in training set | Top 14% | 0.8%
22 | Briefings in Bioinformatics | 326 papers in training set | Top 6% | 0.8%
23 | npj Systems Biology and Applications | 99 papers in training set | Top 2% | 0.8%
24 | JMIR Formative Research | 32 papers in training set | Top 2% | 0.7%
25 | BMJ Health & Care Informatics | 13 papers in training set | Top 0.9% | 0.7%
26 | Biomedicines | 66 papers in training set | Top 3% | 0.7%
27 | Frontiers in Physiology | 93 papers in training set | Top 6% | 0.7%
28 | IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 2% | 0.7%
29 | GeroScience | 97 papers in training set | Top 2% | 0.6%
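
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below does this in integer tenths of a percent to avoid floating-point drift. The helper name `journals_needed_for_mass` is hypothetical, and the listed values are rounded to one decimal place, so the check is only approximate with respect to the underlying predictions.

```python
from typing import Sequence

# Listed probabilities in rank order, in tenths of a percent (18.5% -> 185).
LISTED_TENTHS = [
    185, 100, 68, 63, 48, 36, 36, 36, 36, 32,
    21, 19, 19, 19, 17, 15, 13, 12, 12, 9,
    8, 8, 8, 7, 7, 7, 7, 7, 6,
]


def journals_needed_for_mass(tenths: Sequence[int], target_tenths: int) -> int:
    """Return the smallest number of top-ranked entries whose sum reaches the target."""
    running = 0
    for count, value in enumerate(tenths, start=1):
        running += value
        if running >= target_tenths:
            return count
    raise ValueError("listed entries never reach the target mass")


top_k = journals_needed_for_mass(LISTED_TENTHS, target_tenths=500)  # 500 = 50.0%
cumulative_top6 = sum(LISTED_TENTHS[:6]) / 10
print(f"Journals needed for 50%: {top_k}")
print(f"Cumulative mass of top 6: {cumulative_top6}%")
```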