AI-Generated Clinical Summaries: Errors and Susceptibility to Speech and Speaker Variability
Draper, T. C.; Leake, J.; Cox, T.; Lamb-Riddell, K.; Johns, B. E.; McCormick, J.; Trowell, S.; Kiely, J.; Luxton, R.
Summary Box

What is already known on this topic
- Clinical AI Scribe outputs can contain errors, and the impact of human factors (e.g. communication style, accents, speech impairments) in clinical contexts remains under-characterised.

What this study adds
- In controlled simulations, patient personality and accent did not significantly alter total CAIS errors, with omissions predominating and hallucinations/inaccuracies remaining low.
- Speech-impairment effects were highly varied, with near-perfect recognition for cleft palate and vowel disorders, whereas phonological impairment substantially reduced accuracy.

How this study might affect research, practice or policy
- Supports clinician-in-the-loop deployment with local validation across representative accents and impairment profiles, prioritising detection of clinically critical errors.
- Routine governance should include subgroup performance reporting (accents, impairments) and ongoing audit of error rates.

Objective
The study aims to evaluate whether variability in patients' communication style (personality, international English accents, and speech impairments) affects the accuracy of a Clinical AI Scribe (CAIS), and to identify where performance degrades.

Methods and Analysis
We conducted simulated primary-care consultations in a purpose-built laboratory using trained actors. To investigate personality types, four scenarios were enacted, each with five patient personality types. For accents, human-verified transcripts of consultations were used to generate all doctor/patient combinations of seven accents (including a synthetic reference voice) across five scenarios. The CAIS produced SOAP-structured summaries that were compared with the transcripts, and errors were classified as omissions, factual inaccuracies, or hallucinations. For speech impairments, public recordings representing five profiles were transcribed and word-recognition accuracy was calculated.
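The word-recognition accuracy mentioned above is typically computed from the word-level edit distance between a verified reference transcript and the machine transcript. A minimal sketch, assuming the conventional definition (accuracy = 1 − edit distance / reference length); the function name and tokenisation are illustrative assumptions, not the authors' code:

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word-level recognition accuracy between a human-verified reference
    transcript and an AI-generated transcript (hypothetical helper)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over word tokens: counts substitutions,
    # insertions, and deletions needed to turn hyp into ref.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)
```

Under this definition a perfect transcript scores 1.0, and each missed or altered word lowers the score in proportion to the reference length, which is why phonological impairment (many word-level substitutions) degrades the metric sharply.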
Results
Personality types showed no statistically significant differences in errors (all p>0.05). Extraversion had the highest total errors (median 3.5), while conscientiousness and agreeableness were lower (1.5 and 2.0, respectively). Across accents, both pairwise tests and group comparisons were non-significant for both patient and doctor voices (patients: p=0.851; doctors: p=0.98). Omissions predominated, with low rates of hallucinations and factual inaccuracies. Omissions were slightly higher for Chinese- and Indian-accented doctors (both medians 3.0). In contrast, speech impairments differed: cleft palate and vowel disorders yielded near-perfect recognition, whereas phonological impairment markedly reduced recognition (p<0.001).

Conclusions
Under controlled conditions, CAIS performance was broadly stable across communication styles and most accents but remained vulnerable to specific speech characteristics, particularly phonological impairment. Future evaluations using real-world, multi-speaker clinical audio are needed to confirm performance.