
Transcriptomic Architecture of Type 2 Diabetes in Human Pancreatic Islets: An Integrative Meta-Analysis and Machine Learning Framework for Biomarker Discovery

Romero, R.

2026-06-10 endocrinology
10.64898/2026.06.08.26355184 medRxiv

Background. Type 2 diabetes mellitus (T2D) is defined by progressive pancreatic β-cell dysfunction whose molecular underpinnings remain incompletely understood. Single-cohort transcriptomic analyses of donor islets have yielded heterogeneous gene lists of limited cross-study reproducibility, constraining both mechanistic interpretation and biomarker development. Methods. We combined two complementary analytical strategies applied to four public human islet transcriptomic cohorts (GSE25724, GSE20966, GSE38642, and GSE164416; n = 7 to 57 donors per contrast). For the integrative arm, three microarray datasets and one bulk RNA-seq dataset were processed independently and unified through gene-level random-effects meta-analysis, hallmark pathway scoring (GSVA/MSigDB), and iterative module refinement, yielding a two-axis disease framework. For the diagnostic arm, a consensus multi-method machine learning pipeline, combining LASSO penalized logistic regression, Support Vector Machine Recursive Feature Elimination (SVM-RFE), and Random Forest importance scoring, was applied to 184 differentially expressed genes from the RNA-seq cohort, with all normalization steps performed within leave-one-out cross-validation (LOOCV) folds to prevent data leakage. Machine learning classification of the RNA-seq cohort was additionally subjected to external transportability testing in the independent bulk human islet RNA-seq cohort GSE50244 using an overlap-restricted reduced score and a threshold fixed in the discovery cohort. Results. Meta-analysis across all four cohorts identified 337 high-confidence T2D-associated genes (96.1% directional concordance in beta-cell-enriched tissue).
These were distilled into two refined 14-gene modules: ImmuneStress (MICB, HLA-DRA, HLA-DPA1, IL1R2, and others) and BetaCellIdentitySecretion (RASGRP1, PPP1R1A, SLC2A2, and others), whose composite IsletDysfunctionScore provided the most stable cross-platform separation of non-diabetic from T2D islets (Hedges' g = 1.80, $p = 9.83 \times 10^{-17}$, $I^2 = 0\%$). Consistent with progressive disease, IsletDysfunctionScore increased monotonically from non-diabetic to impaired glucose tolerance to T2D. Separately, the machine learning pipeline derived a 10-gene diagnostic panel (GABRA2, SLC2A2, ARG2, DKK3, PRIMA1, TAFA4, HHATL, PARVG, RNU1-70P, and the novel lncRNA ENSG00000284653) that achieved perfect discrimination in LOOCV (AUC = 1.000, sensitivity = 1.000, specificity = 1.000, zero misclassifications across all 57 donors). A leakage-verification experiment confirmed that this performance reflected genuine biological signal: global quantile normalization prior to cross-validation collapsed AUC to 0.380. External testing showed that 8 of the 10 panel genes were measurable in GSE50244. The frozen 8-gene reduced score retained strong discrimination (external AUC = 0.907), with 6 of 8 genes preserving directional concordance, but the discovery-derived threshold did not transfer because the external score distribution was shifted upward and compressed, yielding complete sensitivity but zero specificity at the frozen cutoff. Conclusions. Integrating pathway-level meta-analysis with machine learning classification, we present a coherent two-axis model (immune/stress activation and loss of beta-cell identity/secretory competence) together with a compact, biologically interpretable 10-gene diagnostic signature. Panel genes converge on GABA signaling, glucose transport, arginine metabolism, WNT pathway inhibition, and a novel lncRNA, providing both mechanistic hypotheses and high-priority targets for external validation.
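The pooled statistics quoted in the Results (Hedges' g, $I^2$) come from a random-effects meta-analysis across cohorts. The abstract does not name the estimator, so the sketch below assumes the common DerSimonian-Laird approach. The function names `hedges_g` and `random_effects` and all summary numbers are hypothetical illustrations, not the study's data or code.

```python
import math

def hedges_g(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Return (g, variance of g) for group A minus group B."""
    df = n_a + n_b - 2
    pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df)
    d = (mean_a - mean_b) / pooled_sd              # Cohen's d
    j = 1 - 3 / (4 * df - 1)                       # small-sample correction factor
    var_d = (n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b))
    return j * d, j**2 * var_d

def random_effects(effects, variances):
    """DerSimonian-Laird pooling (assumed estimator).

    Returns (pooled effect, standard error, tau^2, I^2 in percent).
    """
    w = [1 / v for v in variances]                 # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed)**2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-cohort variance
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, se, tau2, i2

# Hypothetical per-cohort score summaries: (mean T2D, mean ND, sd T2D, sd ND, n T2D, n ND).
toy_cohorts = [
    (1.2, -0.4, 0.9, 1.0, 6, 7),
    (0.9, -0.6, 1.1, 0.8, 10, 10),
    (1.1, -0.5, 1.0, 1.0, 9, 48),
]
per_cohort = [hedges_g(*row) for row in toy_cohorts]
pooled, se, tau2, i2 = random_effects([g for g, _ in per_cohort],
                                      [v for _, v in per_cohort])
print(f"pooled g = {pooled:.2f}, SE = {se:.2f}, tau^2 = {tau2:.3f}, I^2 = {i2:.1f}%")
```

When the cohort-level effects agree closely, Cochran's Q falls at or below its degrees of freedom and $I^2$ is truncated to 0%, which is one way a result such as $I^2 = 0\%$ can arise.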
These findings offer a reproducible transcriptomic scaffold for future mechanistic, biomarker, and clinical translation studies of human islet dysfunction. They also support external transportability of the core biological signal, while indicating that absolute operating thresholds are cohort-dependent and would require recalibration before deployment in independent datasets.
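The gap between a transportable signal and a non-transportable threshold can be illustrated with a small sketch. AUC depends only on the ranking of scores, so a shifted and compressed score distribution can leave it unchanged while a cutoff frozen in the discovery cohort misclassifies every control. The scores, the affine shift, and the helper names (`auc`, `sens_spec`, `youden_threshold`) are hypothetical, and the Youden index is an assumed way of choosing the cutoff; none of this reproduces the study's data.

```python
def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called T2D."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    return tp / n_pos, tn / (len(labels) - n_pos)

def youden_threshold(scores, labels):
    """Observed score maximizing sensitivity + specificity (assumed cutoff rule)."""
    return max(sorted(set(scores)), key=lambda t: sum(sens_spec(scores, labels, t)))

# Toy discovery cohort: 4 controls (label 0) followed by 4 cases (label 1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
discovery = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]

# Toy external cohort: same ranking, but compressed (x0.5) and shifted upward (+3).
external = [0.5 * s + 3.0 for s in discovery]

frozen_cut = youden_threshold(discovery, labels)     # fixed in discovery only
recalibrated_cut = youden_threshold(external, labels)

print("discovery AUC:", auc(discovery, labels))
print("external AUC:", auc(external, labels))
print("external sens/spec at frozen cutoff:", sens_spec(external, labels, frozen_cut))
print("external sens/spec after recalibration:",
      sens_spec(external, labels, recalibrated_cut))
```

In this toy setting every external score lies above the frozen cutoff, so all cases and all controls are called T2D. Re-estimating the cutoff within the external cohort restores the separation, which is the kind of recalibration the conclusions call for.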

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Diabetologia: 36 papers in training set, Top 0.1%, 32.5%
2. Diabetes: 53 papers in training set, Top 0.1%, 9.9%
3. Cell Reports Medicine: 140 papers in training set, Top 0.2%, 8.3%
(50% of probability mass above)
4. Cell Metabolism: 49 papers in training set, Top 0.2%, 6.2%
5. Nature Communications: 4913 papers in training set, Top 34%, 4.8%
6. EMBO Molecular Medicine: 85 papers in training set, Top 0.6%, 3.5%
7. Advanced Science: 249 papers in training set, Top 6%, 3.2%
8. Molecular Metabolism: 105 papers in training set, Top 0.6%, 3.2%
9. JCI Insight: 241 papers in training set, Top 2%, 3.0%
10. Nature Medicine: 117 papers in training set, Top 1%, 2.8%
11. The Journal of Clinical Endocrinology & Metabolism: 35 papers in training set, Top 0.6%, 1.9%
12. eBioMedicine: 130 papers in training set, Top 1%, 1.7%
13. Cell Genomics: 162 papers in training set, Top 4%, 1.5%
14. Life Science Alliance: 263 papers in training set, Top 0.5%, 1.5%
15. Molecular Systems Biology: 142 papers in training set, Top 1.0%, 1.2%
16. BMC Medicine: 163 papers in training set, Top 5%, 0.9%
17. Science Advances: 1098 papers in training set, Top 31%, 0.7%
18. Scientific Reports: 3102 papers in training set, Top 75%, 0.7%
19. Nature Metabolism: 56 papers in training set, Top 3%, 0.7%
20. Nature Genetics: 240 papers in training set, Top 8%, 0.7%
21. Diabetes Care: 12 papers in training set, Top 0.3%, 0.7%
22. BMC Genomics: 328 papers in training set, Top 7%, 0.6%
23. eLife: 5422 papers in training set, Top 62%, 0.6%
24. Journal of Clinical Investigation: 164 papers in training set, Top 8%, 0.6%
25. Molecular Therapy - Nucleic Acids: 24 papers in training set, Top 0.5%, 0.6%
26. Cell Reports: 1338 papers in training set, Top 36%, 0.6%