
Metabolomic Fingerprinting from Dried Blood Spots Enables Individual Identification Across 1,257 Participants at 94% User-Level Accuracy

Hauguel, P.; Anctil, N.; Noel, L. P.

2026-04-11 bioinformatics
10.64898/2026.04.08.717286 bioRxiv

Background: Constructing digital twins in healthcare requires biological data sources that are simultaneously informative, dynamic, and practical for routine collection. Dried blood spot (DBS) sampling combined with untargeted metabolomics is well suited to these requirements: DBS can be self-collected at home and mailed at ambient temperature, while untargeted LC-MS/MS captures thousands of metabolites reflecting individual physiology, lifestyle, and exposures. We previously demonstrated proof-of-concept individual identification from DBS-derived metabolomic profiles in 277 volunteers (80-92% accuracy). Here, we report a large-scale validation on a substantially expanded cohort.

Methods: We collected 18,288 DBS samples from 1,257 individuals across 134 analytical batches over 15 months. Samples were self-collected at home, mailed via standard postal service, and analyzed by untargeted LC-MS/MS on a high-resolution Orbitrap platform in positive ESI mode. Our classification pipeline comprises batch-aware normalization, supervised feature selection, biological signal filtering, dimensionality reduction, and user-level majority voting across all available samples. This voting reflects the real-world use case: participants contribute multiple self-collected DBS cards over time, taken at different times of day and under varying conditions. We employed GroupKFold cross-validation with group=batch to ensure zero batch leakage between training and testing sets.

Results: In 10-fold GroupKFold cross-validation (group=batch, zero batch leakage), our pipeline achieved 94.1% user-level identification accuracy (85.5% sample-level). In a fully held-out validation on 17 future batches, with all feature selection, normalization, and model fitting performed exclusively on training data, performance was even stronger: 96.1% user-level and 92.6% sample-level across 1,134 classes (chance level: 0.088%). Feature selection stability was confirmed via bootstrap analysis. We identified batch leakage as a critical methodological pitfall for the field: under naive random splitting, 92.8% of test samples shared their (user, batch) pair with the training set, which inflated accuracy. The top discriminative metabolites span biologically relevant pathways including amino acid metabolism, fatty acid transport, and sphingolipid biosynthesis.

Conclusions: Untargeted metabolomics from dried blood spots supports batch-aware, closed-set individual identification in a single-laboratory setting, with potential relevance for longitudinal sample-to-person linkage in future digital twin workflows.
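
The user-level majority voting, the (user, batch) leakage check, and the chance-level figure mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names and the toy data are hypothetical, the classifier itself is omitted, and ties in the vote are simply resolved in favor of the label seen first.

```python
from collections import Counter, defaultdict


def user_level_vote(sample_owners, sample_predictions):
    """Collapse per-sample predictions into one call per user by majority vote.

    sample_owners[i] is the true user of sample i and sample_predictions[i]
    is the predicted user for that sample (hypothetical inputs).
    Ties go to the predicted label that was encountered first.
    """
    votes = defaultdict(Counter)
    for owner, predicted in zip(sample_owners, sample_predictions):
        votes[owner][predicted] += 1
    return {owner: counts.most_common(1)[0][0] for owner, counts in votes.items()}


def leaked_fraction(train_pairs, test_pairs):
    """Fraction of test samples whose (user, batch) pair also occurs in training."""
    seen = set(train_pairs)
    return sum(pair in seen for pair in test_pairs) / len(test_pairs)


def chance_level(n_classes):
    """Accuracy of uniform random guessing in a closed set of n_classes users."""
    return 1.0 / n_classes


# Toy data: three users with three samples each.
owners = ["u1", "u1", "u1", "u2", "u2", "u2", "u3", "u3", "u3"]
preds = ["u1", "u1", "u2", "u2", "u2", "u2", "u1", "u3", "u3"]

sample_acc = sum(o == p for o, p in zip(owners, preds)) / len(owners)
calls = user_level_vote(owners, preds)
user_acc = sum(calls[u] == u for u in calls) / len(calls)
print(f"sample-level accuracy: {sample_acc:.3f}")
print(f"user-level accuracy:   {user_acc:.3f}")

# A random split can place the same (user, batch) pair on both sides;
# a split grouped by batch cannot.
train = [("u1", "b1"), ("u2", "b1"), ("u3", "b2")]
random_test = [("u1", "b1"), ("u2", "b1"), ("u3", "b3")]
grouped_test = [("u1", "b3"), ("u2", "b3")]
print(f"leaked (random split):  {leaked_fraction(train, random_test):.3f}")
print(f"leaked (grouped split): {leaked_fraction(train, grouped_test):.3f}")

# 1 / 1,134 classes, expressed as a percentage.
print(f"chance level: {100 * chance_level(1134):.3f}%")
```

In the toy data, individual misclassified samples are outvoted by each user's correct samples, so user-level accuracy exceeds sample-level accuracy, which is the same direction as the gap reported in the abstract. For the cross-validation itself, scikit-learn's `GroupKFold` with batch identifiers passed as `groups` keeps each batch within a single fold.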

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     Analytical Chemistry                                205                     Top 0.1%    15.5%
2     Nature Communications                               4913                    Top 9%      15.5%
3     PLOS ONE                                            4510                    Top 26%     6.6%
4     Metabolites                                         50                      Top 0.1%    5.1%
5     Scientific Reports                                  3102                    Top 29%     4.2%
6     Alzheimer's & Dementia                              143                     Top 1%      3.8%
50% of probability mass above
7     Bioinformatics                                      1061                    Top 5%      3.8%
8     Genome Medicine                                     154                     Top 3%      2.2%
9     Journal of Proteome Research                        215                     Top 1.0%    2.2%
10    BMC Bioinformatics                                  383                     Top 4%      1.8%
11    Clinical and Translational Science                  21                      Top 0.4%    1.8%
12    Communications Biology                              886                     Top 7%      1.8%
13    Molecular & Cellular Proteomics                     158                     Top 1.0%    1.8%
14    Computational and Structural Biotechnology Journal  216                     Top 4%      1.7%
15    Cancer Research Communications                      46                      Top 0.8%    1.0%
16    Genome Biology                                      555                     Top 6%      1.0%
17    Bioinformatics Advances                             184                     Top 4%      0.9%
18    Clinical Chemistry                                  22                      Top 0.6%    0.9%
19    PLOS Computational Biology                          1633                    Top 23%     0.8%
20    Cell Reports Methods                                141                     Top 4%      0.8%
21    Journal of Clinical Microbiology                    120                     Top 1%      0.8%
22    npj Digital Medicine                                97                      Top 3%      0.8%
23    Nature Methods                                      336                     Top 6%      0.8%
24    mSystems                                            361                     Top 8%      0.7%
25    Advanced Science                                    249                     Top 21%     0.7%
26    BMC Biology                                         248                     Top 6%      0.5%
27    Journal of Translational Medicine                   46                      Top 4%      0.5%
28    The Journal of Clinical Endocrinology & Metabolism  35                      Top 1%      0.5%
29    Briefings in Bioinformatics                         326                     Top 8%      0.5%
30    GigaScience                                         172                     Top 4%      0.5%