Metabolomic Profiling of Dried Blood Spots for Breast Cancer Detection: A Multi-Classifier Validation Study in 2,734 Participants

Anctil, N.; Hauguel, P.; Noel, L.-P.

2026-04-27 oncology
10.64898/2026.04.24.26351695 medRxiv
Background. Breast cancer (BC) remains the most frequently diagnosed malignancy and the leading cause of cancer-related mortality in women worldwide. Although blood-based untargeted metabolomics has emerged as a promising modality for detecting early-stage BC, clinical translation has been held back by two unresolved issues: (i) the field has relied almost exclusively on serum or plasma, which require venipuncture and cold-chain logistics, and (ii) machine-learning models reported on such data are frequently validated with protocols that ignore analytical batch structure, producing optimistically biased performance estimates.

Methods. We present a breast cancer detection study based on dried blood spots (DBS), an analytical matrix that enables self-collection and ambient-temperature shipping. A cohort of 2,734 participants (114 biopsy-confirmed BC cases; 2,620 non-cancer controls) was profiled by untargeted LC-MS/MS on a Thermo Scientific Orbitrap IQ-X coupled to a Vanquish UHPLC. A 39-metabolite panel meeting MSI Level 1 identification criteria was pre-specified from the published breast-cancer metabolomics literature, frozen prior to LC-MS acquisition, and applied to the present cohort without any feature selection on the data. Six standard supervised-learning architectures (LASSO, Elastic Net, Linear SVM, PLS-DA, OPLS-DA, XGBoost) were evaluated on this pre-specified panel; OPLS-DA is reported only in the sex-matched subgroup analysis, where a single-seed 5-fold stratified protocol permits a directly comparable fit. Per-batch control-median normalization was applied upstream; kNN imputation, log transform, and robust scaling were fit within each training fold. The evaluation battery comprises batch-aware StratifiedGroupKFold CV at a single seed (seed=42) with inter-seed SD quantified across 10 independent seeds, batch-aware nested CV, a 100-seed held-out 20%-batch validation with disjoint-batch isotonic probability calibration (30% calibration partition), PPV/NPV reporting at multiple operating points and three deployment prevalences, subgroup analyses by TNM stage and tumor grade, pathway-ablation sensitivity analysis, and a 1,000-iteration permutation test.

Results. Under batch-aware evaluation (StratifiedGroupKFold, seed=42), AUC ranged from 0.914 to 0.949 across classifiers, with LASSO achieving 0.928 and XGBoost 0.949; inter-seed SD across 10 seeds was 0.002-0.006. At 95% specificity, LASSO reached 75.4% sensitivity and XGBoost 81.6%. Held-out batch validation (100 seeds) yielded mean AUC 0.912 for Elastic Net and 0.935 for XGBoost, indicating that performance generalizes to batches unseen during training. All 39 panel features showed high coefficient stability, and permutation testing on representative classifiers (LASSO, Linear SVM, PLS-DA) yielded p <= 0.001. Subgroup analyses showed weaker detection of stage IIA tumors (AUC 0.87, n=40) compared with stage IIB/IIIA (AUC 0.95), consistent with stronger metabolic signatures in more advanced disease. Bootstrap coefficient consistency of the Elastic Net classifier confirmed that all 39 panel features received a non-zero multivariate weight in >=80% of 100 stratified bootstraps.

Conclusions. On this cohort of diagnosed, pre-treatment breast-cancer cases, DBS LC-MS metabolomic profiling delivers classification performance (AUC 0.928 for LASSO and 0.949 for XGBoost under batch-aware StratifiedGroupKFold CV at seed=42; held-out AUC 0.912-0.935) that is robust across classifier families and biological pathways. The DBS matrix is radiation-free, self-collectable by finger-prick, and mailable at ambient temperature. Performance is weaker on stage IIA than on more advanced disease, and prospective validation in an independent asymptomatic screening cohort is required before clinical positioning as a decentralized triage modality.
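
The abstract reports sensitivity at 95% specificity and mentions PPV/NPV at three deployment prevalences without listing them. As a rough illustration of how strongly predictive values depend on prevalence, the sketch below applies Bayes' rule to the two reported operating points. The 0.5% and 5% prevalences are hypothetical values chosen for illustration, not the ones used in the study; the third is simply the case fraction of the study cohort (114 of 2,734).

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' rule."""
    tp = sensitivity * prevalence                   # true-positive mass
    fp = (1.0 - specificity) * (1.0 - prevalence)   # false-positive mass
    tn = specificity * (1.0 - prevalence)           # true-negative mass
    fn = (1.0 - sensitivity) * prevalence           # false-negative mass
    return tp / (tp + fp), tn / (tn + fn)


# Operating points taken from the abstract: sensitivity at 95% specificity.
OPERATING_POINTS = {
    "LASSO": (0.754, 0.95),
    "XGBoost": (0.816, 0.95),
}

# Prevalences: the first two are ASSUMED for illustration only;
# the last is the case fraction of the study cohort (114 / 2,734).
PREVALENCES = {
    "assumed 0.5%": 0.005,
    "assumed 5%": 0.05,
    "study cohort": 114 / 2734,
}

if __name__ == "__main__":
    for model, (sens, spec) in OPERATING_POINTS.items():
        for label, prev in PREVALENCES.items():
            ppv, npv = ppv_npv(sens, spec, prev)
            print(f"{model:8s} {label:14s} PPV={ppv:.3f} NPV={npv:.4f}")
```

At a fixed specificity, PPV falls quickly as prevalence drops, which is one reason the cohort's case-control performance cannot be read directly as screening performance.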

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Breast Cancer Research: 32 papers in training set, Top 0.1%, 10.7%
2. Molecular & Cellular Proteomics: 158 papers in training set, Top 0.3%, 9.4%
3. Analytical Chemistry: 205 papers in training set, Top 0.4%, 6.6%
4. Metabolites: 50 papers in training set, Top 0.1%, 6.6%
5. PLOS ONE: 4510 papers in training set, Top 30%, 5.0%
6. Nature Communications: 4913 papers in training set, Top 34%, 4.4%
7. Scientific Reports: 3102 papers in training set, Top 26%, 4.4%
8. Cancers: 200 papers in training set, Top 1%, 4.4%

50% of probability mass above

9. Journal of Proteome Research: 215 papers in training set, Top 0.7%, 4.1%
10. Diagnostics: 48 papers in training set, Top 0.7%, 2.4%
11. The Journal of Clinical Endocrinology & Metabolism: 35 papers in training set, Top 0.6%, 1.9%
12. Endocrinology: 38 papers in training set, Top 0.3%, 1.5%
13. Frontiers in Oncology: 95 papers in training set, Top 2%, 1.4%
14. Journal of Translational Medicine: 46 papers in training set, Top 1%, 1.4%
15. EMBO Molecular Medicine: 85 papers in training set, Top 2%, 1.4%
16. Clinical Cancer Research: 58 papers in training set, Top 1%, 1.3%
17. Cell Reports Medicine: 140 papers in training set, Top 5%, 1.3%
18. Journal of Magnetic Resonance Imaging: 14 papers in training set, Top 0.5%, 1.0%
19. JNCI Cancer Spectrum: 10 papers in training set, Top 0.4%, 1.0%
20. PeerJ: 261 papers in training set, Top 11%, 1.0%
21. iScience: 1063 papers in training set, Top 25%, 0.9%
22. eLife: 5422 papers in training set, Top 52%, 0.9%
23. eBioMedicine: 130 papers in training set, Top 3%, 0.9%
24. Cancer Research Communications: 46 papers in training set, Top 0.9%, 0.9%
25. BMC Cancer: 52 papers in training set, Top 2%, 0.8%
26. ACS Sensors: 45 papers in training set, Top 1%, 0.8%
27. International Journal of Molecular Sciences: 453 papers in training set, Top 13%, 0.8%
28. PLOS Computational Biology: 1633 papers in training set, Top 24%, 0.8%
29. Molecular Metabolism: 105 papers in training set, Top 2%, 0.8%
30. mSystems: 361 papers in training set, Top 7%, 0.8%