A Hybrid Machine Learning Framework for Early Prediction of Chronic Kidney Disease Progression Using Longitudinal Claims Data: An XGBoost-LSTM Ensemble with Temporal Attention
Saxena, J. N.; Potturu, D. V. P.; Nagraj, A.
Background: Chronic kidney disease (CKD) affects approximately 850 million individuals worldwide and remains a leading cause of morbidity, premature mortality, and escalating healthcare costs. Despite the availability of clinical biomarkers, CKD progression to end-stage renal disease (ESRD) is frequently identified late, limiting opportunities for preventive intervention. Conventional predictive models have relied predominantly on static cross-sectional laboratory values and therefore fail to capture the temporal dynamics of disease trajectory that longitudinal claims data can provide. Objective: This study proposes a novel hybrid machine learning framework, XGBoost-LSTM-Attention (XLA), that integrates gradient-boosted feature selection with long short-term memory (LSTM) networks and a temporal attention mechanism to improve early prediction of CKD progression from Stage 3 to Stages 4/5 or ESRD using longitudinal claims-based features. Methods: We conducted two complementary analyses. Primary analysis: a cross-sectional validation using real NHANES 2015-2018 data (n=701 adults with CKD Stage 3), predicting significant proteinuria (urine albumin-to-creatinine ratio, UACR, ≥ 30 mg/g) from clinical features excluding UACR. Supplementary analysis: an NHANES-calibrated longitudinal cohort (n=8,412) with simulated quarterly measurements, used to evaluate XLA performance under conditions resembling real-world longitudinal data. All models were evaluated using 5-fold stratified cross-validation. Results: In the primary NHANES cross-sectional analysis, the XLA framework achieved an AUC-ROC of 0.684 (95% CI: 0.641 to 0.727), and all models performed comparably (AUC 0.684 to 0.710), indicating that cross-sectional clinical features alone provide limited signal for proteinuria prediction and underscoring the necessity of UACR measurement.
In the longitudinal supplementary analysis, XLA achieved an AUC-ROC of 0.994 versus 0.939 for the best cross-sectional baseline (an absolute gain of 0.055, or 5.5 percentage points), indicating that temporal trajectory features, particularly eGFR slope and RAAS adherence trends, confer substantial incremental predictive value when longitudinal data are available. Conclusion: The XLA framework demonstrates meaningful advantages over traditional approaches when applied to longitudinal claims data. The cross-sectional findings highlight the irreplaceable role of direct UACR measurement in CKD risk stratification. Together, these results provide actionable evidence for both the limitations of static prediction and the promise of trajectory-based approaches in value-based care programs managing large CKD populations. Keywords: chronic kidney disease, CKD progression, machine learning, XGBoost, LSTM, temporal attention, claims data, NHANES, proteinuria, healthcare informatics, value-based care.
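The abstract names eGFR slope as a key trajectory feature but does not say how it is computed. A minimal sketch, assuming an ordinary least-squares fit over quarterly measurements; the function name `egfr_slope` and the patient values below are hypothetical illustrations, not taken from the study:

```python
from statistics import mean


def egfr_slope(quarters, egfr_values):
    """Least-squares slope of eGFR over time, in mL/min/1.73 m^2 per year.

    Assumption: measurements are indexed by quarter (0, 1, 2, ...), so the
    time axis is converted to years by dividing by 4.
    """
    if len(quarters) != len(egfr_values) or len(quarters) < 2:
        raise ValueError("need at least two paired measurements")
    years = [q / 4.0 for q in quarters]
    t_bar = mean(years)
    y_bar = mean(egfr_values)
    # Sum of squared deviations of time, and cross-deviations of time and eGFR.
    sxx = sum((t - t_bar) ** 2 for t in years)
    if sxx == 0:
        raise ValueError("all measurements share the same time point")
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(years, egfr_values))
    return sxy / sxx


# Hypothetical patient losing 1 mL/min/1.73 m^2 per quarter.
quarters = [0, 1, 2, 3, 4]
egfr = [52.0, 51.0, 50.0, 49.0, 48.0]
slope = egfr_slope(quarters, egfr)
print(slope)  # -4.0
```

A per-patient slope like this is a static summary of the trajectory; a sequence model such as an LSTM would instead consume the quarterly values directly.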
Matching journals
The top 4 journals account for 50% of the predicted probability mass.