
Language-dependent diagnostic safety of medical AI systems: a cross-lingual benchmarking and prospective clinical study

Wang, Y.; He, H.; Zhu, R.; Lu, Y.; Phadungsaksawasdi, P.; Peng, M.; Liu, Z.; Zou, K.; Zhang, Y.; Chew, S. P.; Tham, Y. C.; Khorasani, A.; Deng, H.; Cheng, C.-Y.; Yang, J.; Liu, D.

2026-05-21 health informatics
10.64898/2026.05.19.26353490 medRxiv

Background: Patients worldwide receive healthcare in many languages, yet medical AI systems are validated almost exclusively in high-resource languages such as English and Chinese, exposing patients in other linguistic settings to unquantified diagnostic risk. Existing multilingual evaluations rely on translated research-style benchmarks that fail to capture authentic clinical work. We aimed to characterise the patient safety consequences of multilingual medical AI deployment in real-world clinical settings and to develop an auditable method for detecting unsafe outputs.

Methods: We evaluated large language models (LLMs) and visual language models (VLMs) on four real-world clinical tasks (conversational QA, radiology report generation, glaucoma diagnosis, ICU re-intubation prediction) in five languages (English, Chinese, Malay, Thai, Persian). We developed a token-level uncertainty toolkit to localise reasoning instability, compared three inference paradigms (native-language, English chain-of-thought, back-translation pivot), and conducted a prospective study (50 dialogues, 150 physician-reviewed records).

Findings: LLM/VLM performance degraded consistently from high- to low-resource languages across all tasks. Key gaps included a decline in HealthBench score from 0.3743 to 0.3180; in radiology macro-F1 from 0.2938 to 0.2149-0.2424, consistent with selective pathology suppression; in glaucoma accuracy from 50.7% to 32.7%; and in ICU parameter recall from 100.0% to 48.5%. Multimodal inputs amplified degradation. Qwen3 VL 235B showed attenuated decline with no resource-ordered pattern in glaucoma classification. Token-level analysis localised instability to mid-chain stages (40-70% of the normalised trajectory); perplexity-based confidence failed to flag errors (AUROC 0.41-0.66). Back-translation pivot consistently restored performance. In the prospective study, 98.7% of records required physician edits (overall modification score 53.6%); Thai-pivot correction burden (59.0%) exceeded English-pivot (50.7%, p=0.003) and Chinese-direct (51.0%, p=0.004).

Interpretation: Multilingual deployment produced clinically consequential failures, including missed pathology, distorted physiological extraction, and amplified multimodal misclassification, that were invisible to monolingual validation and not reliably flagged by model confidence. Pretraining data composition may contribute to multilingual safety risk. Language-specific safety auditing should precede deployment in non-dominant-language healthcare settings; the open-source detection toolkit enables this without model retraining.
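The abstract reports that perplexity-based confidence did not reliably flag errors (AUROC 0.41-0.66). As a rough illustration of what such a check measures, the sketch below scores each output by its perplexity and computes AUROC against error labels. It is not the authors' toolkit: the function names `perplexity` and `auroc` and the toy log-probabilities and labels are assumptions made up for this example.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one output from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def auroc(scores, labels):
    """AUROC in its pairwise (Mann-Whitney) form; ties count as half a win.

    scores: higher should mean "more likely an error".
    labels: 1 = output was an error, 0 = output was correct.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one error and one correct output")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: (per-token log-probabilities, error label).
outputs = [
    ([-0.1, -0.2, -0.1], 0),
    ([-1.2, -0.9, -1.5], 1),
    ([-0.3, -0.4, -0.2], 1),  # a confident (low-perplexity) error
    ([-0.8, -0.7, -0.9], 0),  # an uncertain but correct output
]
scores = [perplexity(logprobs) for logprobs, _ in outputs]
labels = [label for _, label in outputs]
print(round(auroc(scores, labels), 2))  # 0.75 on this toy data
```

An AUROC of 0.5 means the score ranks errors no better than chance, so the reported range of 0.41-0.66 sits close to, and partly below, chance level.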

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine | 97 papers in training set | Top 0.1% | 28.3%
2. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.4% | 7.0%
3. BMJ Health & Care Informatics | 13 papers in training set | Top 0.1% | 6.5%
4. Scientific Reports | 3102 papers in training set | Top 22% | 5.0%
5. Frontiers in Digital Health | 20 papers in training set | Top 0.2% | 4.1%
50% of probability mass above
6. BMJ Open | 554 papers in training set | Top 6% | 3.1%
7. The Lancet Digital Health | 25 papers in training set | Top 0.2% | 3.1%
8. PLOS Digital Health | 91 papers in training set | Top 0.8% | 3.1%
9. Journal of Biomedical Informatics | 45 papers in training set | Top 0.6% | 2.1%
10. JMIR Medical Informatics | 17 papers in training set | Top 0.5% | 2.1%
11. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 1% | 1.7%
12. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.4% | 1.7%
13. Journal of Medical Internet Research | 85 papers in training set | Top 2% | 1.7%
14. JAMIA Open | 37 papers in training set | Top 0.8% | 1.7%
15. Healthcare | 16 papers in training set | Top 0.8% | 1.5%
16. International Journal of Medical Informatics | 25 papers in training set | Top 0.9% | 1.5%
17. eBioMedicine | 130 papers in training set | Top 2% | 1.3%
18. PLOS ONE | 4510 papers in training set | Top 62% | 1.0%
19. iScience | 1063 papers in training set | Top 25% | 0.9%
20. European Heart Journal - Digital Health | 15 papers in training set | Top 0.5% | 0.8%
21. JAMA | 17 papers in training set | Top 0.3% | 0.8%
22. Biology Methods and Protocols | 53 papers in training set | Top 2% | 0.8%
23. Computers in Biology and Medicine | 120 papers in training set | Top 4% | 0.8%
24. Journal of General Internal Medicine | 20 papers in training set | Top 1.0% | 0.8%
25. Philosophical Transactions of the Royal Society B | 51 papers in training set | Top 6% | 0.7%
26. British Journal of Ophthalmology | 14 papers in training set | Top 0.3% | 0.7%
27. Inflammatory Bowel Diseases | 15 papers in training set | Top 0.3% | 0.7%
28. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.9% | 0.5%
29. Frontiers in Neurology | 91 papers in training set | Top 6% | 0.5%
30. JMIR Public Health and Surveillance | 45 papers in training set | Top 5% | 0.5%
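
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below does this for ranks 1-7; the helper name `ranks_to_reach` is an assumption made up for this example, and the page's own procedure for placing the cut-off is not documented here.

```python
def ranks_to_reach(probabilities, threshold):
    """Count how many top-ranked entries are needed for their sum to reach threshold.

    Returns (count, cumulative_sum), or (None, total) if the threshold is never reached.
    """
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total


# Percentages for ranks 1-7, copied from the list above.
top_probs = [28.3, 7.0, 6.5, 5.0, 4.1, 3.1, 3.1]
count, mass = ranks_to_reach(top_probs, 50.0)
print(count, round(mass, 1))  # 5 50.9
```

The first four entries sum to 46.8%, so the fifth is the one that crosses the 50% line.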