Advancing Breast Cancer Detection: A Comprehensive Evaluation of Machine Learning Models on Mammogram Imaging

Al Muttaki, M. R. R.; Afrin, S.; Anil, A. I. A.; Shawon, M. M. H.

2025-10-10 medical education
10.1101/2025.10.08.25337620 medRxiv
Breast cancer is among the leading causes of cancer-related death in women worldwide, which underscores the need for effective, rapid diagnostic tools, particularly for early diagnosis, to improve survival. Although machine learning (ML) is increasingly applied to medical imaging, limited dataset diversity and applicability, together with model interpretability and efficiency, remain obstacles to clinical use. This paper evaluates eight widely used ML models, namely Convolutional Neural Network (CNN), Kolmogorov-Arnold Network (KAN), k-Nearest Neighbors, Support Vector Machine, XGBoost, Random Forest, Naive Bayes, and a Hybrid model, on the Mammogram Mastery dataset from Iraq-Sulaymaniyah, which comprises 745 original and 9,685 augmented mammogram images. The Hybrid model achieves the best accuracy (0.9667) and F1 score (0.9444), while the KAN model achieves the best ROC AUC (0.9760) and log loss (0.1421), indicating strong discriminative power and well-calibrated probabilities. Random Forest, which has the fewest false negatives (3) when compared with Fast Multinomial and Fast Text, emerges as the safest choice for clinical screening because it balances sensitivity and computational efficiency. Two practical challenges remain: the slow inference time of the KAN model (0.323 seconds) and the high training cost of the Hybrid model (1009.10 seconds). These findings suggest that the Hybrid and KAN models are promising for improving diagnostic accuracy, while Random Forest can serve as a practical tool for reducing missed diagnoses. Future research should address multi-institution, multi-dataset validation, inference speed optimization, multi-class classification, and improved interpretability for clinical integration. Addressing these gaps would allow ML-based diagnostics to increase breast cancer detection rates, reduce diagnostic errors, and improve patient outcomes across clinical contexts, which could in turn help scale screening services worldwide.
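
The abstract compares models by accuracy, F1 score, ROC AUC, log loss, and false-negative count. As a rough illustration of how these quantities relate for a binary classifier, the following minimal sketch computes each one with the Python standard library. The labels, predicted probabilities, and the 0.5 decision threshold are hypothetical toy values chosen for this example; they are not taken from the paper, and the function names are illustrative rather than the authors' code.

```python
import math

# Hypothetical toy data (NOT from the paper): 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.3, 0.2, 0.1, 0.6, 0.4, 0.7]  # predicted P(malignant)
THRESHOLD = 0.5  # assumed decision threshold
y_pred = [1 if p >= THRESHOLD else 0 for p in y_prob]


def confusion_counts(truth, pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def log_loss(truth, prob, eps=1e-15):
    """Mean negative log-likelihood; lower means better-calibrated scores."""
    total = 0.0
    for t, p in zip(truth, prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(truth)


def roc_auc(truth, prob):
    """Probability that a random positive scores above a random negative."""
    pos = [p for t, p in zip(truth, prob) if t == 1]
    neg = [p for t, p in zip(truth, prob) if t == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("false negatives:", fn)
print("accuracy:", accuracy(tp, tn, fp, fn))
print("F1:", f1_score(tp, fp, fn))
print("ROC AUC:", roc_auc(y_true, y_prob))
print("log loss:", round(log_loss(y_true, y_prob), 4))
```

Accuracy, F1, and the false-negative count depend on the chosen threshold, whereas ROC AUC and log loss are computed from the predicted probabilities directly. That difference is one reason a model can lead on one group of metrics but not the other.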

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Frontiers in Medicine | 113 | Top 0.1% | 27.0% |
| 2 | PLOS ONE | 4510 | Top 11% | 15.5% |
| 3 | PLOS Digital Health | 91 | Top 0.2% | 8.9% |
| | 50% of probability mass above | | | |
| 4 | Scientific Reports | 3102 | Top 12% | 7.2% |
| 5 | Diagnostics | 48 | Top 0.6% | 2.7% |
| 6 | Cancers | 200 | Top 2% | 2.7% |
| 7 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.8% |
| 8 | Journal of Medical Internet Research | 85 | Top 2% | 1.8% |
| 9 | Medical Physics | 14 | Top 0.4% | 1.6% |
| 10 | Computers in Biology and Medicine | 120 | Top 2% | 1.6% |
| 11 | Frontiers in Public Health | 140 | Top 5% | 1.4% |
| 12 | PLOS Computational Biology | 1633 | Top 18% | 1.4% |
| 13 | Cureus | 67 | Top 3% | 1.3% |
| 14 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.3% |
| 15 | Informatics in Medicine Unlocked | 21 | Top 0.7% | 1.0% |
| 16 | npj Digital Medicine | 97 | Top 3% | 1.0% |
| 17 | BMC Medical Education | 20 | Top 0.7% | 1.0% |
| 18 | Genomics, Proteomics & Bioinformatics | 171 | Top 5% | 1.0% |
| 19 | BMC Bioinformatics | 383 | Top 6% | 1.0% |
| 20 | IEEE Access | 31 | Top 0.7% | 0.9% |
| 21 | BMJ Open | 554 | Top 11% | 0.9% |
| 22 | Nature Communications | 4913 | Top 58% | 0.9% |
| 23 | Bioengineering | 24 | Top 2% | 0.7% |
| 24 | Frontiers in Artificial Intelligence | 18 | Top 1.0% | 0.5% |
| 25 | Computer Methods and Programs in Biomedicine | 27 | Top 1% | 0.5% |
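
The statement that the top 3 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below uses the first four probabilities from the listing; the function name `journals_to_reach` is a hypothetical helper introduced for this example, not part of the site's method.

```python
# Predicted probabilities (in percent) for the four highest-ranked journals,
# copied from the listing above.
ranked = [
    ("Frontiers in Medicine", 27.0),
    ("PLOS ONE", 15.5),
    ("PLOS Digital Health", 8.9),
    ("Scientific Reports", 7.2),
]


def journals_to_reach(entries, threshold):
    """Return (count, cumulative) once the running sum reaches threshold.

    If the threshold is never reached, return the full count and total.
    """
    total = 0.0
    for count, (_, prob) in enumerate(entries, start=1):
        total += prob
        if total >= threshold:
            return count, total
    return len(entries), total


count, cumulative = journals_to_reach(ranked, 50.0)
print(f"{count} journals cover {cumulative:.1f}% of the probability mass")
```

With these values the running sum passes 50% at the third journal (27.0 + 15.5 + 8.9 = 51.4), consistent with the stated cutoff.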