The Risk Factors, Detection and Classification of Esophageal Cancer Using Ensemble Machine Learning Models

Gaso, M. S.; Mekuria, R. R.; Cankurt, S.; Deybasso, H. A.; Abdo, A. A.; Abbas, G. H.

2026-03-11 health informatics
10.64898/2026.03.09.26347944 medRxiv
Esophageal cancer (EC) remains one of the most lethal malignancies worldwide, with poor survival outcomes largely attributable to late-stage diagnosis and limited treatment effectiveness. Early detection and accurate risk stratification are therefore essential for improving clinical management. In this study, we investigate the predictive value of socio-demographic, dietary, behavioral, environmental, and clinical variables collected from 312 individuals (104 EC cases and 208 controls) in the Arsi Zone, Ethiopia. An ensemble feature-ranking approach based on Random Forest machine learning was first applied to identify the most relevant predictive features. Subsequently, multiple machine learning models, most of them ensemble methods, were evaluated, including Histogram-based Gradient Boosting (Model I), Extreme Gradient Boosting (Model II), AdaBoost (Model III), Random Forest (Model IV), and k-Nearest Neighbors (Model V). These models were tested under multiple experimental settings using both full and reduced feature subsets. To enhance robustness and minimize variability, a multi-seed ensemble framework was employed. Different seed values generate distinct train-test splits and slight variations in model initialization and optimization, leading to minor differences in training outcomes; aggregating results across multiple seeds mitigates this variability and provides more stable and reliable performance estimates. The experimental results demonstrate that boosting-based ensemble models consistently outperform other classifiers across all evaluation metrics. Model I achieved the highest overall performance, reaching an accuracy of 0.983, with precision of 0.982, recall of 0.980, and F1-score of 0.981 using the reduced feature set, while maintaining nearly identical performance with the full feature set.
Model II also showed stable and strong predictive capability, achieving accuracies of 0.963 and 0.961 for the full and reduced feature sets, respectively, with balanced precision, recall, and F1-score values. These findings indicate that feature importance-based dimensionality reduction preserves essential predictive information without compromising classification performance. Overall, the results highlight the significant predictive contribution of dietary and environmental risk factors and demonstrate that ensemble learning provides a reliable, efficient, and clinically meaningful approach for early EC detection. The proposed framework offers a promising direction for supporting diagnostic decision-making and risk stratification in resource-limited healthcare settings.

Highlights

- Machine Learning Framework for Esophageal Cancer Classification: A robust ensemble machine learning framework was developed to classify esophageal cancer using socio-demographic, dietary, behavioral, environmental, and clinical risk factors, enabling accurate and reliable disease prediction.
- Multi-Seed Ensemble Strategy for Improved Model Stability: A novel multi-seed ensemble classification approach was implemented to reduce model variance and improve robustness by aggregating predictions across multiple randomized training and testing splits.
- Ensemble Feature Ranking for Optimal Feature Selection: An ensemble Random Forest-based feature ranking framework was designed to identify the most predictive features, ensuring stable biomarker selection and improved model interpretability.
- High Classification Performance with Reduced Feature Set: The proposed ensemble HGBC model achieved outstanding performance with 98.3% accuracy, 98.2% precision, 98.0% recall, and 98.1% F1-score using a reduced feature subset, demonstrating efficient dimensionality reduction without performance loss.
- Exceptional Discriminative Ability with Near-Perfect AUC: The ensemble HGBC model achieved an AUC of 0.994, indicating excellent discrimination between cancer and non-cancer cases and confirming its suitability for high-precision clinical decision support.
- Zero False-Negative Predictions and Maximum Diagnostic Sensitivity: The proposed model achieved zero false negatives in evaluation, resulting in 100% statistical power and perfect sensitivity, ensuring reliable detection of esophageal cancer cases.
- Identification of Key Dietary and Environmental Risk Factors: Feature importance analysis revealed that dietary habits, hot food consumption, environmental exposures, and behavioral factors are among the most significant predictors of esophageal cancer risk.
- Ensemble Learning Outperforms Traditional Machine Learning Models: Boosting-based ensemble models, particularly HGBC and XGBoost, consistently outperformed other classifiers, demonstrating superior predictive accuracy, stability, and robustness.
- Efficient and Interpretable AI Framework for Clinical Decision Support: The proposed framework balances high predictive accuracy with interpretability, making it suitable for assisting clinicians in early diagnosis and risk stratification of esophageal cancer.
- AI-Driven Solution for Resource-Constrained Healthcare Settings: The proposed ensemble machine learning approach provides an effective and scalable diagnostic support tool, particularly valuable for healthcare systems with limited resources and access to specialized medical expertise.
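
The multi-seed idea described in the abstract (a different seed gives a different train-test split, and results are aggregated across seeds) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration and not the authors' pipeline: the data are synthetic (only the 104/208 class sizes echo the study), the classifier is a small hand-written k-Nearest Neighbors stand-in, the 20% test fraction and ten seeds are arbitrary, and the function names are hypothetical. The sketch aggregates per-seed metrics by mean and standard deviation, which is one possible reading of "aggregating results across multiple seeds".

```python
import math
import random
import statistics

def make_synthetic(n_cases=104, n_controls=208, seed=0):
    """Synthetic stand-in data: two numeric features per person, label 1 = case."""
    rng = random.Random(seed)
    cases = [([rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)], 1) for _ in range(n_cases)]
    controls = [([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], 0) for _ in range(n_controls)]
    return cases + controls

def stratified_split(data, test_fraction, seed):
    """Shuffle each class separately so both splits keep the case/control ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [row for row in data if row[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

def knn_predict(train, x, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / (tp + fp + fn + tn),
            "precision": precision, "recall": recall, "f1": f1}

def evaluate_over_seeds(data, seeds, test_fraction=0.2):
    """One split and one evaluation per seed, then mean and std of each metric."""
    per_seed = []
    for seed in seeds:
        train, test = stratified_split(data, test_fraction, seed)
        pairs = [(label, knn_predict(train, x)) for x, label in test]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        per_seed.append(metrics(tp, fp, fn, tn))
    summary = {name: (statistics.mean(m[name] for m in per_seed),
                      statistics.stdev(m[name] for m in per_seed))
               for name in per_seed[0]}
    return per_seed, summary

data = make_synthetic()
per_seed, summary = evaluate_over_seeds(data, seeds=range(10))
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.3f} std={std:.3f}")
```

The standard deviation across seeds gives a rough sense of how much a single split could mislead. Note also that zero false negatives (fn = 0) in a given evaluation yields a recall of exactly 1.0 in the `metrics` helper, which is the relationship the highlights refer to as perfect sensitivity.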

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Scientific Reports (3102 papers in training set, Top 4%): 12.4%
2. BMC Medical Informatics and Decision Making (39 papers in training set, Top 0.2%): 10.0%
3. Computers in Biology and Medicine (120 papers in training set, Top 0.2%): 8.3%
4. PLOS ONE (4510 papers in training set, Top 28%): 6.3%
5. Journal of Medical Internet Research (85 papers in training set, Top 1%): 4.3%
6. Frontiers in Artificial Intelligence (18 papers in training set, Top 0.1%): 4.1%
7. Biology Methods and Protocols (53 papers in training set, Top 0.2%): 3.9%
8. Computational and Structural Biotechnology Journal (216 papers in training set, Top 2%): 3.0%
50% of probability mass above
9. Informatics in Medicine Unlocked (21 papers in training set, Top 0.3%): 2.3%
10. Artificial Intelligence in Medicine (15 papers in training set, Top 0.2%): 2.1%
11. JMIR Medical Informatics (17 papers in training set, Top 0.5%): 2.1%
12. Cancer Medicine (24 papers in training set, Top 0.6%): 1.9%
13. PLOS Digital Health (91 papers in training set, Top 1%): 1.8%
14. Computer Methods and Programs in Biomedicine (27 papers in training set, Top 0.4%): 1.7%
15. Frontiers in Public Health (140 papers in training set, Top 5%): 1.7%
16. Frontiers in Bioinformatics (45 papers in training set, Top 0.2%): 1.7%
17. npj Digital Medicine (97 papers in training set, Top 2%): 1.7%
18. International Journal of Medical Informatics (25 papers in training set, Top 0.9%): 1.7%
19. Frontiers in Digital Health (20 papers in training set, Top 0.7%): 1.7%
20. IEEE Journal of Biomedical and Health Informatics (34 papers in training set, Top 1%): 1.2%
21. Expert Systems with Applications (11 papers in training set, Top 0.2%): 1.2%
22. JCO Clinical Cancer Informatics (18 papers in training set, Top 0.6%): 1.2%
23. JAMIA Open (37 papers in training set, Top 1%): 1.2%
24. International Journal of Environmental Research and Public Health (124 papers in training set, Top 6%): 0.9%
25. Diagnostics (48 papers in training set, Top 2%): 0.9%
26. Viruses (318 papers in training set, Top 4%): 0.9%
27. Journal of Biomedical Informatics (45 papers in training set, Top 1%): 0.9%
28. Life (27 papers in training set, Top 0.2%): 0.9%
29. BMJ Health & Care Informatics (13 papers in training set, Top 0.8%): 0.9%
30. JMIR Public Health and Surveillance (45 papers in training set, Top 3%): 0.8%
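
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum over the percentages listed for ranks 1 to 8. This is only an arithmetic check on the displayed (rounded) values; the function name is hypothetical and nothing here reflects how the page itself computes the cutoff.

```python
# Predicted probabilities (in percent) of the eight top-ranked journals,
# copied from the listing; these are the rounded values shown on the page.
top_probabilities = [12.4, 10.0, 8.3, 6.3, 4.3, 4.1, 3.9, 3.0]

def journals_needed(probabilities, threshold):
    """Return (k, cumulative) for the smallest k whose first k values sum to
    at least threshold, or (None, total) if the threshold is never reached."""
    cumulative = 0.0
    for k, p in enumerate(probabilities, start=1):
        cumulative += p
        if cumulative >= threshold:
            return k, cumulative
    return None, cumulative

k, mass = journals_needed(top_probabilities, 50.0)
print(k, round(mass, 1))  # 8 52.3
```

The first seven entries sum to 49.3%, just under the threshold, so the eighth journal is the one that carries the cumulative mass past 50%.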