Benchmarking And Datasets For Ambient Clinical Documentation: A Review Of Existing Frameworks And Metrics For AI-Assisted Medical Note Generation

Gebauer, S.

2025-01-29 · health informatics · medRxiv · doi:10.1101/2025.01.29.25320859

Background: The increasing adoption of ambient artificial intelligence (AI) scribes in healthcare has created an urgent need for robust evaluation frameworks to assess their performance and clinical utility. While these tools show promise in reducing documentation burden, there is still no standardized approach for measuring their effectiveness and safety.

Objective: To systematically review existing evaluation frameworks and metrics used to assess AI-assisted medical note generation from doctor-patient conversations, and to provide recommendations for future evaluation approaches.

Methods: A scoping review following PRISMA guidelines was conducted across PubMed, IEEE Xplore, Scopus, Web of Science, and Embase to identify studies evaluating ambient scribe technology published between 2020 and 2025. Studies were included if they were peer-reviewed, focused on evaluating clinical ambient scribes from spoken conversation through note production, and described an evaluation approach. Extracted data included evaluation metrics, benchmarking approaches, dataset characteristics, and model performance.

Results: Seven studies met the inclusion criteria. Evaluation approaches varied widely, from traditional natural language processing metrics such as ROUGE and BERTScore to domain-specific measures such as clinical accuracy and bias. Critical gaps identified include: 1) a wide diversity of evaluation metrics that makes cross-study comparison challenging, 2) limited integration of clinical relevance into automated metrics, 3) a lack of standardized approaches for crucial metrics such as hallucinations and errors, and 4) minimal diversity in the clinical specialties evaluated. Only two datasets were publicly available for benchmarking.

Conclusions: This review reveals significant heterogeneity in how ambient scribes are evaluated, highlighting the need for standardized evaluation frameworks. We propose recommendations for developing comprehensive evaluation approaches that combine automated metrics with clinical quality measures. Future work should focus on creating public benchmarks across diverse clinical settings and establishing consensus on critical safety and quality metrics.

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

 #   Journal                                             Papers in training set   Top %   Predicted probability
 1   npj Digital Medicine                                97                       0.3%    17.9%
 2   Journal of Medical Internet Research                85                       0.2%    16.9%
 3   Frontiers in Digital Health                         20                       0.1%    16.9%
     ---- 50% of predicted probability mass above this line ----
 4   JMIR Formative Research                             32                       0.3%     4.1%
 5   Scientific Reports                                  3102                     39%      3.5%
 6   Journal of the American Medical Informatics
     Association                                         61                       0.8%     3.1%
 7   Healthcare                                          16                       0.4%     2.0%
 8   JMIR Medical Informatics                            17                       0.6%     2.0%
 9   BMJ Health & Care Informatics                       13                       0.4%     1.8%
10   International Journal of Medical Informatics        25                       0.8%     1.8%
11   Journal of NeuroEngineering and Rehabilitation      28                       0.5%     1.8%
12   PLOS Digital Health                                 91                       1%       1.7%
13   BMJ Open                                            554                      9%       1.7%
14   Artificial Intelligence in Medicine                 15                       0.3%     1.6%
15   BMC Medical Informatics and Decision Making         39                       2%       1.6%
16   JAMIA Open                                          37                       0.9%     1.6%
17   Journal of Biomedical Informatics                   45                       0.9%     1.6%
18   DIGITAL HEALTH                                      12                       0.4%     1.3%
19   PLOS ONE                                            4510                     59%      1.3%
20   Biology Methods and Protocols                       53                       2%       1.1%
21   JMIR mHealth and uHealth                            10                       0.4%     0.9%
22   IEEE Journal of Biomedical and Health Informatics   34                       2%       0.9%
23   Computers in Biology and Medicine                   120                      4%       0.9%
24   Frontiers in Public Health                          140                      8%       0.7%
25   BJPsych Open                                        25                       0.8%     0.7%
26   Computer Methods and Programs in Biomedicine        27                       1%       0.7%
27   Cancer Medicine                                     24                       2%       0.7%