Towards A Foundation Model for Clinical Voice Biomarkers

Elemento, O.; Sigaras, A.; Colonel, J.; Hajirasouliha, I.; Ghosh, S.; Bensoussan, Y.; Bridge2AI-Voice Consortium; Rameau, A.

2026-05-30 health informatics
10.64898/2026.05.28.26354346 medRxiv
Vocal biomarkers, encompassing voice and speech, have largely been developed for individual conditions in isolation, limiting their generalizability across diseases and recording settings. To address this, we introduce VoiceFM, a contrastive model that learns general-purpose clinical voice representations by aligning audio embeddings with rich clinical metadata. Using the Bridge2AI-Voice dataset (984 primarily English-speaking adult participants, 846 used for training and 138 held out as a temporally separated validation cohort, 40,056 recordings totaling 176 hours across 5 academic medical centers), VoiceFM pairs a fine-tuned Whisper large-v2 encoder with a tabular transformer over 44 clinical features via symmetric InfoNCE loss. Linear probes on frozen VoiceFM embeddings achieve mean AUROC 0.952 +/- 0.005 across five evaluation tasks (control vs disease screening plus four disease categories), significantly outperforming Frozen Whisper (0.926 +/- 0.013, p = 0.013), Frozen HuBERT (0.885 +/- 0.017, p = 0.0009), and the contrastively trained VoiceFM-HuBERT (0.938 +/- 0.006, p = 0.012). On the 138-participant held-out cohort, VoiceFM-Whisper achieves AUROCs of 0.99 for Alzheimer's/dementia/MCI and 0.89 for airway stenosis, demonstrating that the learned representations generalize to participants the model has never seen. VoiceFM representations transfer to three external datasets without retraining and improve few-shot classification. Recording task attribution identifies a small set of speech tasks that match or exceed the full battery's performance, suggesting shorter screening protocols are feasible. Trained predominantly on English audio, VoiceFM transfers without fine-tuning to Spanish-language Parkinson's disease (PD) detection (NeuroVoz, 107 participants, AUROC 0.93 +/- 0.02), with the signal dominated by articulatory rather than phonatory features. 
A fine-tuned classifier achieves participant-level AUROC 0.87 (sustained 0.85, countdown 0.80) on the mPower smartphone study (585 held-out participants). Together, these results show that contrastive alignment between voice and rich clinical metadata can serve as the basis for a clinical voice foundation model, producing a single set of transferable representations that generalize across diseases, languages, recording conditions, and patients enrolled after model freeze.
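
The abstract names a symmetric InfoNCE loss as the objective that aligns audio embeddings with clinical-metadata embeddings. The following is a minimal NumPy sketch of that general technique, not the authors' implementation: the function name `symmetric_info_nce`, the temperature of 0.07, and the assumption that both encoders project to the same embedding dimension are illustrative choices that the abstract does not specify.

```python
import numpy as np


def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(audio_emb, clinical_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    audio_emb, clinical_emb: arrays of shape (batch, dim); row i of each
    is assumed to come from the same recording/participant pair.
    temperature: hypothetical value, not taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    c = clinical_emb / np.linalg.norm(clinical_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; matching pairs sit on the diagonal.
    logits = a @ c.T / temperature
    idx = np.arange(len(a))

    # Audio -> clinical: each row must pick out its own column.
    loss_a2c = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Clinical -> audio: each column must pick out its own row.
    loss_c2a = -_log_softmax(logits, axis=0)[idx, idx].mean()

    return 0.5 * (loss_a2c + loss_c2a)
```

Averaging the two directions is what makes the loss symmetric: swapping the two inputs leaves the value unchanged, and the loss is lower when paired rows are more similar to each other than to the other rows in the batch.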

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Biomedical Engineering | 42 papers in training set | Top 0.1% | 12.8%
2. Nature Machine Intelligence | 61 papers in training set | Top 0.1% | 12.4%
3. npj Digital Medicine | 97 papers in training set | Top 0.4% | 12.4%
4. Scientific Reports | 3102 papers in training set | Top 9% | 8.5%
5. Nature Communications | 4913 papers in training set | Top 33% | 4.9%
(50% of probability mass above)
6. Science Translational Medicine | 111 papers in training set | Top 0.7% | 4.0%
7. Frontiers in Digital Health | 20 papers in training set | Top 0.3% | 3.1%
8. Communications Biology | 886 papers in training set | Top 5% | 2.1%
9. Med | 38 papers in training set | Top 0.2% | 1.9%
10. PLOS ONE | 4510 papers in training set | Top 50% | 1.9%
11. Nature Medicine | 117 papers in training set | Top 2% | 1.9%
12. Journal of Neural Engineering | 197 papers in training set | Top 1% | 1.8%
13. European Respiratory Journal | 54 papers in training set | Top 0.9% | 1.7%
14. Nature Neuroscience | 216 papers in training set | Top 4% | 1.7%
15. Advanced Science | 249 papers in training set | Top 12% | 1.5%
16. Patterns | 70 papers in training set | Top 1% | 1.5%
17. eBioMedicine | 130 papers in training set | Top 2% | 1.5%
18. NeuroImage: Clinical | 132 papers in training set | Top 3% | 1.3%
19. Communications Medicine | 85 papers in training set | Top 0.6% | 1.0%
20. Philosophical Transactions of the Royal Society B | 51 papers in training set | Top 5% | 0.9%
21. iScience | 1063 papers in training set | Top 26% | 0.9%
22. eLife | 5422 papers in training set | Top 55% | 0.8%
23. Cell Reports Medicine | 140 papers in training set | Top 7% | 0.8%
24. Bioinformatics | 1061 papers in training set | Top 9% | 0.8%
25. Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.8%
26. Science Advances | 1098 papers in training set | Top 30% | 0.8%
27. PLOS Digital Health | 91 papers in training set | Top 3% | 0.8%
28. Nature | 575 papers in training set | Top 16% | 0.8%
29. Computers in Biology and Medicine | 120 papers in training set | Top 5% | 0.6%
30. Frontiers in Aging Neuroscience | 67 papers in training set | Top 4% | 0.5%