Metabolomic Profiling of Dried Blood Spots for Breast Cancer Detection: A Multi-Classifier Validation Study in 2,734 Participants

Anctil, N.; Hauguel, P.; Noel, L.-P.

2026-04-27 oncology
10.64898/2026.04.24.26351695 medRxiv

Background: Breast cancer (BC) remains the most commonly diagnosed malignancy and the leading cause of cancer-related mortality in women worldwide. Although blood-based untargeted metabolomics has emerged as a promising modality for detecting early-stage BC, clinical translation has been held back by two unresolved issues: (i) the field has relied almost exclusively on serum or plasma, which require venipuncture and cold-chain logistics, and (ii) machine-learning models reported on such data are frequently validated with protocols that ignore analytical batch structure, producing optimistically biased performance estimates.

Methods: We present a breast cancer detection study based on dried blood spots (DBS), an analytical matrix that enables self-collection and ambient-temperature shipping. A cohort of 2,734 participants (114 biopsy-confirmed BC cases; 2,620 non-cancer controls) was profiled by untargeted LC-MS/MS on a Thermo Scientific Orbitrap IQ-X coupled to a Vanquish UHPLC. A 39-metabolite panel meeting MSI Level 1 identification criteria [1] was pre-specified from the published breast-cancer metabolomics literature, frozen prior to LC-MS acquisition, and applied to the present cohort without any data-driven feature selection. Six standard supervised-learning classifiers (LASSO, Elastic Net, Linear SVM, PLS-DA, OPLS-DA, XGBoost) were evaluated on this pre-specified panel; OPLS-DA, whose pyopls implementation does not integrate cleanly into the repeated multi-seed batch-aware protocol, is reported only in the sex-matched subgroup analysis, where a single-seed 5-fold stratified protocol permits a directly comparable fit. Per-batch control-median normalization was applied upstream, following the protocol of the companion same-lab study [2], to remove batch-specific intensity shifts at the data-preparation stage; kNN imputation, log transformation, and robust scaling were then fit within each training fold.
The evaluation battery comprises batch-aware StratifiedGroupKFold cross-validation (CV), reported at a single seed (seed=42) with inter-seed SD quantified across 10 independent seeds; batch-aware nested CV; a 100-seed held-out 20%-batch validation with disjoint-batch isotonic probability calibration (30% calibration partition); PPV/NPV reporting at multiple operating points and three deployment prevalences; subgroup analyses by TNM stage and tumor grade; a pathway-ablation sensitivity analysis; and a 1,000-iteration permutation test.

Results: Under batch-aware evaluation (StratifiedGroupKFold, seed=42), AUC ranged from 0.914 to 0.949 across classifiers, with LASSO achieving 0.928 and XGBoost 0.949; inter-seed SD across 10 seeds was 0.002-0.006. At 95% specificity, LASSO reached 75.4% sensitivity and XGBoost 81.6%. Held-out batch validation (100 seeds) yielded mean AUC 0.912 for Elastic Net and 0.935 for XGBoost, supporting generalization to unseen batches. Subgroup analyses showed weaker detection of stage IIA tumors (AUC 0.87, n=40) than of stage IIB/IIIA tumors (AUC 0.95), consistent with stronger metabolic signatures in more advanced disease. Bootstrap coefficient consistency analysis of the Elastic Net classifier showed that all 39 panel features received a non-zero multivariate weight in ≥80% of 100 stratified bootstraps. Permutation testing on the three representative classifiers subjected to this analysis (LASSO, Linear SVM, PLS-DA) yielded p ≤ 0.001 in all three cases.

Conclusions: On this cohort of diagnosed, pre-treatment breast-cancer cases, DBS LC-MS metabolomic profiling delivers classification performance (AUC 0.928 for LASSO and 0.949 for XGBoost under batch-aware StratifiedGroupKFold CV at seed=42; held-out AUC 0.912-0.935) that is robust across classifier families and biological pathways.
DBS collection involves no radiation exposure, can be performed by the participant with a finger-prick, and yields samples that can be mailed at ambient temperature. The approach complements the established venous-blood workflow while addressing an infrastructural gap identified over nearly a decade of preliminary work [3, 4]. Performance is weaker on stage IIA than on more advanced disease, and prospective validation in an independent asymptomatic screening cohort is required before the method can be positioned clinically as a decentralized triage modality.
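To make the PPV/NPV reporting at deployment prevalences concrete, here is a small sketch of the standard Bayes calculation that converts sensitivity and specificity into predictive values. The operating point (75.4% sensitivity at 95% specificity, LASSO) is taken from the abstract; the three prevalences are hypothetical placeholders, since the abstract does not state which values the authors used, and the function name `ppv_npv` is likewise an assumption.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence                  # true-positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    tn = specificity * (1.0 - prevalence)          # true-negative fraction
    fn = (1.0 - sensitivity) * prevalence          # false-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Operating point reported in the abstract for LASSO.
SENS, SPEC = 0.754, 0.95

# Hypothetical deployment prevalences, for illustration only.
for prev in (0.005, 0.01, 0.05):
    ppv, npv = ppv_npv(SENS, SPEC, prev)
    print(f"prevalence={prev:.3f}  PPV={ppv:.3f}  NPV={npv:.4f}")
```

The case fraction in the study cohort (114 of 2,734, about 4.2%) is set by the study design rather than by a screening population, which is why predictive values are recomputed at assumed deployment prevalences instead of being read directly from the cohort.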

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1 | Analytical Chemistry | 205 papers in training set | Top 0.2% | 12.5%
2 | Metabolites | 50 papers in training set | Top 0.1% | 12.5%
3 | Nature Communications | 4913 papers in training set | Top 22% | 8.5%
4 | PLOS ONE | 4510 papers in training set | Top 26% | 6.5%
5 | Molecular & Cellular Proteomics | 158 papers in training set | Top 0.5% | 4.4%
6 | Scientific Reports | 3102 papers in training set | Top 30% | 4.0%
7 | Diagnostics | 48 papers in training set | Top 0.4% | 3.7%
50% of probability mass above
8 | Journal of Proteome Research | 215 papers in training set | Top 0.8% | 3.6%
9 | Breast Cancer Research | 32 papers in training set | Top 0.2% | 3.6%
10 | Cancers | 200 papers in training set | Top 2% | 2.1%
11 | International Journal of Molecular Sciences | 453 papers in training set | Top 6% | 1.8%
12 | Journal of Translational Medicine | 46 papers in training set | Top 0.9% | 1.7%
13 | EMBO Molecular Medicine | 85 papers in training set | Top 2% | 1.3%
14 | Frontiers in Oncology | 95 papers in training set | Top 3% | 1.1%
15 | ACS Sensors | 45 papers in training set | Top 1% | 1.0%
16 | Biosensors and Bioelectronics | 52 papers in training set | Top 1% | 1.0%
17 | PLOS Computational Biology | 1633 papers in training set | Top 22% | 0.9%
18 | Endocrinology | 38 papers in training set | Top 0.5% | 0.9%
19 | Journal of Magnetic Resonance Imaging | 14 papers in training set | Top 0.5% | 0.9%
20 | The Analyst | 15 papers in training set | Top 0.4% | 0.9%
21 | Molecular Metabolism | 105 papers in training set | Top 2% | 0.8%
22 | Analytica Chimica Acta | 17 papers in training set | Top 0.6% | 0.8%
23 | Clinical Proteomics | 10 papers in training set | Top 0.2% | 0.8%
24 | Journal of Proteomics | 27 papers in training set | Top 0.5% | 0.7%
25 | Interface Focus | 14 papers in training set | Top 0.3% | 0.7%
26 | PeerJ | 261 papers in training set | Top 15% | 0.7%
27 | The Journal of Clinical Endocrinology & Metabolism | 35 papers in training set | Top 1% | 0.7%
28 | iScience | 1063 papers in training set | Top 36% | 0.7%
29 | Cancer Research Communications | 46 papers in training set | Top 1% | 0.7%
30 | Communications Biology | 886 papers in training set | Top 28% | 0.7%