
A Population-Specific Breast Cancer Risk Prediction Model for Indian Women (A Pilot Study): Advancing Beyond Traditional Assessment Tools

Prakash, P.; Arora, K.; Gupta, A.; Zjigyasu, E.; Saley, V. V.; Rathore, V. K.; Satia, A.; Arora, C.; Mausam; Rangarajan, K.; Gupta, A.; Singh, S.; Sagiraju, H.; Das, K. J.; Meena, J. K.; Gupta, I.

2025-07-21 public and global health
10.1101/2025.07.20.25331883 medRxiv

Background: Breast cancer is the most prevalent cancer among women in India, characterized by late-stage diagnoses and high mortality rates. Existing breast cancer risk prediction models, such as the Gail and Tyrer-Cuzick models, were primarily developed using Western datasets, limiting their applicability to the Indian context due to socio-demographic, genetic, and cultural differences.
Objective: This pilot study aims to develop and validate a machine learning (ML)-based breast cancer risk prediction model tailored specifically to the Indian population, addressing the limitations of traditional tools, with the potential for future methodological expansion to build more robust and generalizable models.
Methods: A retrospective case-control pilot study was conducted using data from the National Cancer Institute (NCI)-AIIMS, comprising 590 breast cancer cases and 1,366 controls. Data preprocessing included cleaning, missing value imputation, and feature engineering of 66 clinical, genetic, and lifestyle factors. To address class imbalance and multivariate complexities, the XGBoost ensemble model was employed. Model performance was evaluated using accuracy, recall, precision, F1-score, and AUC-ROC metrics. Gini index values were used to interpret model predictions and identify key features for risk stratification.
Results: The model demonstrated robust predictive performance with an accuracy of 0.89 and AUC-ROC > 0.9, sensitivity of 73.95%, and specificity of 94.90% on the test dataset. Feature importance analysis enabled the development of a reduced model using the top 20 features, maintaining high accuracy and clinical relevance. The reduced model simplifies risk assessment in resource-limited settings.
Conclusions: This pilot study introduces a population-specific ML-based breast cancer risk prediction tool tailored to the Indian demographic. By incorporating culturally relevant variables and leveraging advanced machine learning techniques, the model addresses key limitations of Western-centric risk prediction tools. While the current methodology serves as an initial framework, it can be further expanded and refined to develop more robust and generalizable models for broader population coverage. Integration into clinical workflows and further validation across diverse Indian populations could transform early detection and personalized intervention strategies, significantly reducing the burden of breast cancer in India.
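
The evaluation metrics named in the abstract (accuracy, recall/sensitivity, specificity, precision, F1-score) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' code. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not the study's test-set counts, which the abstract does not report.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp: cases correctly flagged, fn: cases missed,
    fp: controls wrongly flagged, tn: controls correctly cleared.
    """
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only (NOT taken from the study).
example = classification_metrics(tp=80, fn=20, fp=10, tn=190)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, as in a case-control set with more controls than cases, accuracy can stay high even when sensitivity is modest, which is why the abstract reports sensitivity and specificity separately.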

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | PLOS ONE | 4510 | Top 12% | 15.0%
2 | BMC Cancer | 52 | Top 0.1% | 12.6%
3 | Scientific Reports | 3102 | Top 16% | 6.5%
4 | BMC Research Notes | 29 | Top 0.1% | 4.9%
5 | BMC Medical Informatics and Decision Making | 39 | Top 0.6% | 4.4%
6 | Cancer Medicine | 24 | Top 0.3% | 4.0%
7 | Diagnostics | 48 | Top 0.4% | 3.6%
50% of probability mass above
8 | International Journal of Epidemiology | 74 | Top 0.6% | 3.6%
9 | PLOS Digital Health | 91 | Top 0.9% | 2.8%
10 | Frontiers in Public Health | 140 | Top 3% | 2.4%
11 | American Journal of Epidemiology | 57 | Top 0.5% | 2.1%
12 | International Journal of Cancer | 42 | Top 0.5% | 2.1%
13 | BMC Medical Research Methodology | 43 | Top 0.5% | 1.9%
14 | Cancers | 200 | Top 3% | 1.8%
15 | Frontiers in Oncology | 95 | Top 2% | 1.7%
16 | BMJ Open | 554 | Top 10% | 1.2%
17 | JMIR Public Health and Surveillance | 45 | Top 2% | 1.2%
18 | PeerJ | 261 | Top 12% | 0.9%
19 | International Journal of Medical Informatics | 25 | Top 1% | 0.9%
20 | BMC Medicine | 163 | Top 6% | 0.9%
21 | Journal of Medical Internet Research | 85 | Top 4% | 0.8%
22 | BioData Mining | 15 | Top 0.9% | 0.8%
23 | Cureus | 67 | Top 5% | 0.8%
24 | JAMA Network Open | 127 | Top 4% | 0.8%
25 | Cancer Epidemiology, Biomarkers & Prevention | 17 | Top 0.6% | 0.8%
26 | Communications Medicine | 85 | Top 1% | 0.8%
27 | JNCI Cancer Spectrum | 10 | Top 0.5% | 0.8%
28 | Frontiers in Medicine | 113 | Top 7% | 0.8%
29 | Epidemiology and Infection | 84 | Top 3% | 0.7%
30 | PLOS Global Public Health | 293 | Top 6% | 0.7%
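
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below does this for the first ten listed values; the function name `journals_needed` is hypothetical, and the probabilities are the rounded percentages shown in the list, so the cumulative total is approximate.

```python
# Predicted probabilities (in percent) for ranks 1-10, as listed above.
probabilities = [15.0, 12.6, 6.5, 4.9, 4.4, 4.0, 3.6, 3.6, 2.8, 2.4]


def journals_needed(probs, threshold=50.0):
    """Return (rank, cumulative mass) at which the running total first reaches threshold."""
    cumulative = 0.0
    for rank, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= threshold:
            return rank, cumulative
    return None  # threshold not reached within the supplied values


result = journals_needed(probabilities)
print(result)
```

The running total passes 50% at rank 7, which matches the position of the "50% of probability mass above" marker in the list.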