Corpus for Benchmarking Clinical Speech De-identification

Dai, H.-J.; Fang, L.-C.; Mir, T. H.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.

2026-04-03 health informatics
10.64898/2026.03.31.26349906 medRxiv
Objectives: Publicly available datasets dedicated to clinical speech de-identification remain scarce due to privacy constraints and the complexity of speech-level annotation. To address this gap, we compiled the SREDH-AICup sensitive health information (SHI) speech corpus, a time-aligned clinical speech dataset annotated across 38 SHI categories.

Methods: Two publicly available English medical-domain datasets were adapted to support speech-level de-identification through script reformulation and controlled re-recording by 25 participants. Additional Mandarin Chinese clinical-style materials were incorporated to extend linguistic coverage. All audio data were annotated with millisecond-level, time-aligned SHI spans using Label Studio. Inter-annotator agreement was evaluated using Cohen's kappa, following iterative calibration rounds. The resulting corpus supports both automatic speech recognition (ASR) and speech-level recognition of SHIs.

Results: The final dataset comprises 20 hours of annotated audio, divided into training (10 hours, 1,539 files), validation (5 hours, 775 files), and test (5 hours, 710 files) subsets, totalling 7,830 SHI entities. The language distribution reflects the composition of the selected source materials, with 19.36 hours of English and 0.89 hours of Mandarin Chinese speech.

Discussion: The corpus exhibits a long-tail distribution consistent with clinical documentation patterns and highlights the limited availability of Chinese medical speech resources. These characteristics underscore both the realism of the dataset and the structural challenges associated with multilingual speech de-identification.

Conclusion: The SREDH-AICup SHI speech corpus provides a clinically grounded, time-aligned speech dataset that supports research on automated medical speech de-identification and facilitates future development of multilingual speech-based privacy protection systems.
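The abstract reports that inter-annotator agreement was measured with Cohen's kappa but does not say how the comparison units were defined. The following is a minimal sketch of the statistic itself, assuming two annotators have each assigned one label to the same sequence of units (for example, fixed audio segments). The function name `cohens_kappa` and the labels `NAME`, `DATE`, and `O` are hypothetical illustrations, not the corpus's actual categories or the authors' procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same units.

    Assumption: both lists are aligned, i.e. position i in each list
    refers to the same unit (the abstract does not specify the unit).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")

    n = len(labels_a)
    # Observed agreement: fraction of units with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two
    # annotators' marginal frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six units; "O" marks a unit with no SHI.
annotator_a = ["NAME", "DATE", "O", "O", "NAME", "O"]
annotator_b = ["NAME", "O", "O", "O", "NAME", "DATE"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")  # kappa = 0.455
```

In this toy case the annotators agree on 4 of 6 units (observed agreement 0.667) while chance agreement is 14/36 (0.389), so kappa is 10/22, about 0.455. Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (here `O`) dominates.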

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank. Journal; papers in training set; percentile; predicted probability

1. Scientific Data; 174 papers in training set; Top 0.1%; 14.3%
2. Frontiers in Digital Health; 20 papers in training set; Top 0.1%; 10.1%
3. Scientific Reports; 3102 papers in training set; Top 7%; 10.1%
4. Philosophical Transactions of the Royal Society B; 51 papers in training set; Top 0.3%; 8.4%
5. Journal of Medical Internet Research; 85 papers in training set; Top 0.8%; 6.3%
6. Journal of Biomedical Informatics; 45 papers in training set; Top 0.3%; 4.8%

50% of probability mass above

7. PLOS ONE; 4510 papers in training set; Top 34%; 4.3%
8. Computers in Biology and Medicine; 120 papers in training set; Top 0.8%; 3.7%
9. PLOS Digital Health; 91 papers in training set; Top 1%; 1.8%
10. Frontiers in Neurology; 91 papers in training set; Top 3%; 1.7%
11. JAMIA Open; 37 papers in training set; Top 0.9%; 1.5%
12. npj Digital Medicine; 97 papers in training set; Top 2%; 1.5%
13. BMJ Open; 554 papers in training set; Top 11%; 1.2%
14. IEEE Journal of Biomedical and Health Informatics; 34 papers in training set; Top 1%; 1.2%
15. BMC Research Notes; 29 papers in training set; Top 0.3%; 1.1%
16. International Journal of Medical Informatics; 25 papers in training set; Top 1%; 0.9%
17. Data in Brief; 13 papers in training set; Top 0.3%; 0.9%
18. JAMA Pediatrics; 10 papers in training set; Top 0.1%; 0.9%
19. Journal of the American Medical Informatics Association; 61 papers in training set; Top 2%; 0.9%
20. Journal of the American Heart Association; 119 papers in training set; Top 4%; 0.8%
21. JMIR Formative Research; 32 papers in training set; Top 2%; 0.7%
22. Journal of Personalized Medicine; 28 papers in training set; Top 1%; 0.7%
23. iScience; 1063 papers in training set; Top 32%; 0.7%
24. BioData Mining; 15 papers in training set; Top 1%; 0.7%
25. Healthcare; 16 papers in training set; Top 2%; 0.7%
26. Artificial Intelligence in Medicine; 15 papers in training set; Top 0.8%; 0.7%
27. Brain Sciences; 52 papers in training set; Top 3%; 0.6%
28. Journal of Alzheimer’s Disease; 39 papers in training set; Top 1%; 0.6%
29. Patterns; 70 papers in training set; Top 3%; 0.6%
30. Informatics in Medicine Unlocked; 21 papers in training set; Top 1%; 0.6%
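
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below is illustrative only: the helper name `journals_to_reach` is hypothetical, and the inputs are the rounded percentages shown in the list, so the cumulative total is approximate.

```python
def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative_mass) at which the running sum of
    ranked probabilities first reaches the threshold, or None if the
    threshold is never reached."""
    total = 0.0
    for rank, probability in enumerate(probabilities, start=1):
        total += probability
        if total >= threshold:
            return rank, total
    return None


# Predicted probabilities (percent) of the eight top-ranked journals,
# copied from the list above in rank order.
top_probabilities = [14.3, 10.1, 10.1, 8.4, 6.3, 4.8, 4.3, 3.7]

rank, mass = journals_to_reach(top_probabilities)
print(f"{rank} journals, cumulative mass {mass:.1f}%")
# 6 journals, cumulative mass 54.0%
```

The first five journals sum to 49.2%, just short of the threshold, and adding the sixth brings the total to 54.0%, which matches the position of the "50% of probability mass above" marker.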