
Development of Explainable Machine Learning Framework for Early Detection and Risk Stratification of Diabetes in Age Specific Variations

Lukhele, N.; Mostafa, F.

2026-04-27 health informatics
10.64898/2026.04.25.26351733 medRxiv

Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification.

Methods: A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV with 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall and F1-score. The area under the receiver operating characteristic curve (AUC) was calculated for the best-generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs ≥35 years) and gender. SHAP (SHapley Additive exPlanations) was applied for model interpretability.

Results: Ensemble methods outperformed linear and single-tree approaches, with Gradient Boosting showing the most stable generalization: a test accuracy of 0.981 and a cross-validation accuracy of 0.972. AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic) and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged ≥35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as the dominant predictors.

Conclusion: This study presents an ML framework that integrates age-stratified modelling with explainable ML methods to improve interpretability. The findings offer clinically relevant results that can support clinical decision-making systems, individualized risk assessment, and potential applications for targeted intervention in diabetes progression.
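The training and evaluation steps named in the abstract (80:20 split, RandomizedSearchCV with 10-fold cross-validation, Gradient Boosting, per-class AUC) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' code: the data are synthetic stand-ins generated with `make_classification`, and the hyperparameter ranges, `n_iter`, the stratified splitting, and the random seeds are all assumptions not stated in the source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    train_test_split,
)
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for the clinical dataset: 264 records, 3 classes
# (0 = non-diabetic, 1 = prediabetic, 2 = diabetic). Not real patient data.
X, y = make_classification(
    n_samples=264, n_features=8, n_informative=5, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# 80:20 train-test split; stratifying on the label is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Hypothetical search space; the source does not list the tuned parameters.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}

# Randomized hyperparameter search scored by 10-fold cross-validated accuracy.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out 20%.
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")

# One-vs-rest AUC for each of the three classes.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
per_class_auc = {
    c: roc_auc_score(y_test_bin[:, c], y_proba[:, c]) for c in range(3)
}

print(f"CV accuracy:   {search.best_score_:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}, macro F1: {macro_f1:.3f}")
print("Per-class AUC:", {c: round(v, 3) for c, v in per_class_auc.items()})
```

Because the data here are synthetic, the printed scores will not match the values reported in the abstract; the sketch only shows how the cross-validation accuracy, test accuracy, and per-class AUC relate to one another.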

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank. Journal | Papers in training set | Percentile | Predicted probability
1. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.1% | 18.8%
2. JAMIA Open | 37 papers in training set | Top 0.1% | 10.2%
3. PLOS ONE | 4510 papers in training set | Top 26% | 6.4%
4. Scientific Reports | 3102 papers in training set | Top 17% | 6.4%
5. Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.1% | 4.9%
6. JMIR Public Health and Surveillance | 45 papers in training set | Top 0.3% | 4.4%
50% of probability mass above
7. Journal of Medical Internet Research | 85 papers in training set | Top 1% | 4.0%
8. PLOS Digital Health | 91 papers in training set | Top 0.7% | 3.7%
9. JMIR Medical Informatics | 17 papers in training set | Top 0.4% | 3.3%
10. Computers in Biology and Medicine | 120 papers in training set | Top 1% | 3.1%
11. International Journal of Medical Informatics | 25 papers in training set | Top 0.6% | 2.4%
12. Frontiers in Digital Health | 20 papers in training set | Top 0.5% | 1.9%
13. BMC Medical Research Methodology | 43 papers in training set | Top 0.5% | 1.9%
14. Biology Methods and Protocols | 53 papers in training set | Top 1% | 1.2%
15. International Journal of Environmental Research and Public Health | 124 papers in training set | Top 5% | 1.1%
16. BMC Infectious Diseases | 118 papers in training set | Top 4% | 1.1%
17. JMIR Formative Research | 32 papers in training set | Top 1% | 1.0%
18. Informatics in Medicine Unlocked | 21 papers in training set | Top 0.8% | 1.0%
19. Frontiers in Public Health | 140 papers in training set | Top 6% | 1.0%
20. BioMed Research International | 25 papers in training set | Top 2% | 0.9%
21. PeerJ | 261 papers in training set | Top 12% | 0.9%
22. Frontiers in Medicine | 113 papers in training set | Top 5% | 0.9%
23. npj Systems Biology and Applications | 99 papers in training set | Top 2% | 0.8%
24. Briefings in Bioinformatics | 326 papers in training set | Top 6% | 0.8%
25. BMJ Health & Care Informatics | 13 papers in training set | Top 0.9% | 0.8%
26. Cureus | 67 papers in training set | Top 5% | 0.8%
27. Frontiers in Physiology | 93 papers in training set | Top 7% | 0.7%
28. Biomedicines | 66 papers in training set | Top 4% | 0.7%
29. BMJ Open | 554 papers in training set | Top 13% | 0.7%
30. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 3% | 0.5%
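
The 50% cutoff stated for the journal ranking can be checked by accumulating the listed probabilities in rank order. The short sketch below does this for the first ten entries; the variable names are illustrative, and the values are copied from the list as percentages.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1 to 10, as listed.
probabilities = [18.8, 10.2, 6.4, 6.4, 4.9, 4.4, 4.0, 3.7, 3.3, 3.1]

# Running total of probability mass by rank.
cumulative = list(accumulate(probabilities))

# First rank at which the running total reaches at least 50%.
cutoff_rank = next(
    rank for rank, total in enumerate(cumulative, start=1) if total >= 50.0
)

print(cutoff_rank, round(cumulative[cutoff_rank - 1], 1))
```

The running total passes 50% at the sixth journal, which agrees with the statement that the top 6 journals account for 50% of the predicted probability mass.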