The Risk Factors, Detection and Classification of Esophageal Cancer Using Ensemble Machine Learning Models
Gaso, M. S.; Mekuria, R. R.; Cankurt, S.; Deybasso, H. A.; Abdo, A. A.; Abbas, G. H.
Esophageal cancer (EC) remains one of the most lethal malignancies worldwide, with poor survival outcomes largely attributable to late-stage diagnosis and limited treatment effectiveness. Early detection and accurate risk stratification are therefore essential for improving clinical management. In this study, we investigate the predictive value of socio-demographic, dietary, behavioral, environmental, and clinical variables collected from 312 individuals (104 EC cases and 208 controls) in the Arsi Zone, Ethiopia. An ensemble feature-ranking approach based on Random Forest was first applied to identify the most relevant predictive features. Subsequently, five machine learning models were evaluated: Histogram-based Gradient Boosting (Model I), Extreme Gradient Boosting (Model II), AdaBoost (Model III), Random Forest (Model IV), and k-Nearest Neighbors (Model V). These models were tested under multiple experimental settings using both full and reduced feature subsets. To enhance robustness and minimize variability, a multi-seed ensemble framework was employed. Different seed values generate distinct train-test splits and slight variations in model initialization and optimization, leading to minor differences in training outcomes; aggregating results across multiple seeds mitigates this variability and provides more stable and reliable performance estimates. The experimental results demonstrate that boosting-based ensemble models consistently outperform the other classifiers across all evaluation metrics. Model I achieved the highest overall performance, reaching an accuracy of 0.983, a precision of 0.982, a recall of 0.980, and an F1-score of 0.981 using the reduced feature set, while maintaining nearly identical performance with the full feature set.
Model II also showed stable and strong predictive capability, achieving accuracies of 0.963 and 0.961 for the full and reduced feature sets, respectively, with balanced precision, recall, and F1-score values. These findings indicate that feature importance-based dimensionality reduction preserves essential predictive information without compromising classification performance. Overall, the results highlight the significant predictive contribution of dietary and environmental risk factors and demonstrate that ensemble learning provides a reliable, efficient, and clinically meaningful approach for early EC detection. The proposed framework offers a promising direction for supporting diagnostic decision-making and risk stratification in resource-limited healthcare settings.

Highlights
- Machine Learning Framework for Esophageal Cancer Classification: A robust ensemble machine learning framework was developed to classify esophageal cancer using socio-demographic, dietary, behavioral, environmental, and clinical risk factors, enabling accurate and reliable disease prediction.
- Multi-Seed Ensemble Strategy for Improved Model Stability: A novel multi-seed ensemble classification approach was implemented to reduce model variance and improve robustness by aggregating predictions across multiple randomized training and testing splits.
- Ensemble Feature Ranking for Optimal Feature Selection: An ensemble Random Forest-based feature ranking framework was designed to identify the most predictive features, ensuring stable biomarker selection and improved model interpretability.
- High Classification Performance with Reduced Feature Set: The proposed ensemble HGBC model achieved outstanding performance with 98.3% accuracy, 98.2% precision, 98.0% recall, and 98.1% F1-score using a reduced feature subset, demonstrating efficient dimensionality reduction without performance loss.
- Exceptional Discriminative Ability with Near-Perfect AUC: The ensemble HGBC model achieved an AUC of 0.994, indicating excellent discrimination between cancer and non-cancer cases and confirming its suitability for high-precision clinical decision support.
- Zero False-Negative Predictions and Maximum Diagnostic Sensitivity: The proposed model achieved zero false negatives in evaluation, resulting in 100% statistical power and perfect sensitivity, ensuring reliable detection of esophageal cancer cases.
- Identification of Key Dietary and Environmental Risk Factors: Feature importance analysis revealed that dietary habits, hot food consumption, environmental exposures, and behavioral factors are among the most significant predictors of esophageal cancer risk.
- Ensemble Learning Outperforms Traditional Machine Learning Models: Boosting-based ensemble models, particularly HGBC and XGBoost, consistently outperformed other classifiers, demonstrating superior predictive accuracy, stability, and robustness.
- Efficient and Interpretable AI Framework for Clinical Decision Support: The proposed framework balances high predictive accuracy with interpretability, making it suitable for assisting clinicians in early diagnosis and risk stratification of esophageal cancer.
- AI-Driven Solution for Resource-Constrained Healthcare Settings: The proposed ensemble machine learning approach provides an effective and scalable diagnostic support tool, particularly valuable for healthcare systems with limited resources and access to specialized medical expertise.
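
The multi-seed idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 20% test fraction, the seed list, the helper names (`stratified_split`, `majority_baseline_accuracy`, `multi_seed_evaluate`), and the majority-class placeholder model are all assumptions made for illustration, and the sketch aggregates a metric across seeds, which is only one reading of "aggregating results". Only the class counts (104 cases, 208 controls) come from the abstract.

```python
import random
import statistics


def stratified_split(labels, test_fraction, seed):
    """Split indices into train/test, keeping class proportions roughly equal."""
    rng = random.Random(seed)  # private RNG, so each seed is reproducible
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


def majority_baseline_accuracy(labels, train, test):
    """Placeholder model (assumption): predict the most common training label."""
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test) / len(test)


def multi_seed_evaluate(labels, seeds, test_fraction=0.2):
    """Score one split per seed, then report the mean and spread of the scores."""
    scores = []
    for seed in seeds:
        train, test = stratified_split(labels, test_fraction, seed)
        scores.append(majority_baseline_accuracy(labels, train, test))
    return statistics.mean(scores), statistics.stdev(scores), scores


labels = [1] * 104 + [0] * 208  # class counts reported in the abstract
mean_acc, std_acc, per_seed = multi_seed_evaluate(labels, seeds=[0, 1, 2, 3, 4])
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over {len(per_seed)} seeds")
```

Because the placeholder ignores the features and the split is stratified, every seed yields the same score here. With a real classifier in place of `majority_baseline_accuracy`, the per-seed scores would differ, and the reported spread is what the aggregation is meant to expose.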
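
One plausible way to combine feature rankings from several runs is to average each feature's importance and keep the top entries. The sketch below shows only that aggregation step and assumes the per-run importance scores already exist (for example, from Random Forest fits). The function name `rank_features`, the feature names `f1` to `f3`, and all numbers are made-up placeholders, not values from the study.

```python
from statistics import mean


def rank_features(importances_per_run, top_k):
    """Average each feature's importance across runs and return the top_k names."""
    names = importances_per_run[0].keys()
    averaged = {name: mean(run[name] for run in importances_per_run) for name in names}
    ranked = sorted(averaged, key=averaged.get, reverse=True)
    return ranked[:top_k], averaged


# Hypothetical importance scores from two runs (placeholders, not study data).
runs = [
    {"f1": 0.5, "f2": 0.3, "f3": 0.2},
    {"f1": 0.4, "f2": 0.2, "f3": 0.4},
]
selected, averaged = rank_features(runs, top_k=2)
print(selected)
```

Averaging before ranking means a feature that scores high in a single run by chance is less likely to be selected than one that scores consistently well.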
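
The reported metrics all derive from the four confusion-matrix counts. The sketch below shows the standard definitions and makes the link between zero false negatives and perfect sensitivity (recall) explicit. The counts in the example are hypothetical and chosen only for illustration; they are not the study's confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall (sensitivity), and F1 from counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts with no false negatives (not taken from the study).
metrics = classification_metrics(tp=21, fp=1, fn=0, tn=41)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With `fn=0`, recall is exactly 1.0 regardless of the other counts, while precision and accuracy still depend on the number of false positives.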