Asymmetry between warmth and clinical substance in multilingual consumer health AI
Ariel, D.; Grumberg, L. R.; Supakul, S.; Wannasri, S.; Mitchnik, I. Y.; Lev, A.; Ariyamethanon, W.; Agbarieh, M.; Miari, S.; Laban, G.; Hasid, B.
The same patient question can yield different clinical quality across languages. Across 504 forum-derived patient queries in six languages and four chatbots, language-matched clinicians rated responses on five clinical dimensions (1,008 ratings; 5,040 dimension scores). Patient language outweighed chatbot identity across the four clinical-substance dimensions (composite language partial η² = 0.275 vs chatbot 0.035; robust to investigator-rating exclusion: η² = 0.260) but not for empathy (η² = 0.029): clinical substance was language-associated; warmth was relatively preserved. Catastrophic safety ratings varied 4.3-fold by language (3.6% English, 15.5% Thai and Hebrew); 62% of catastrophic ratings exceeded the English baseline (descriptive disparity). Failures were systematic and silent: none of 24 stroke responses conveyed time-criticality framing, none of 24 CO-poisoning responses challenged the family's stress framing, and 120 sentinel responses contained no confident errors. Warmth did not discriminate clinical danger (response-level empathy AUC = 0.49): consumer health AI can deliver fluent, caring tone with degraded clinical substance.
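The summary statistics in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the sums of squares passed to `partial_eta_squared`, and the toy empathy scores are all hypothetical and chosen only to show how each quantity is defined. The only figures taken from the abstract are 1,008 ratings, five dimensions, and the 3.6% and 15.5% catastrophic rates.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Standard definition: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


def fold_ratio(high_pct, low_pct):
    """Ratio between the highest and lowest rate."""
    return high_pct / low_pct


def auc(scores_pos, scores_neg):
    """Rank-based (Mann-Whitney) AUC: the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative
    case, with ties counted as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Figures from the abstract.
ratings = 1008
dimensions = 5
dimension_scores = ratings * dimensions       # 5,040 dimension scores
ratio = fold_ratio(15.5, 3.6)                 # Thai/Hebrew vs English

# Hypothetical sums of squares, chosen only to illustrate the formula.
eta_example = partial_eta_squared(27.5, 72.5)

# Hypothetical empathy scores: identical distributions for dangerous
# and safe responses give an AUC at chance level.
auc_example = auc([3, 4, 5], [3, 4, 5])

print(dimension_scores, round(ratio, 1), eta_example, auc_example)
```

An AUC near 0.5, like the reported 0.49, means the empathy score ranks a clinically dangerous response above a safe one about as often as the reverse, which is why warmth carries no usable signal about danger.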