Large Language Models Readability Classification: A Variability Analysis of Sources and Metrics
Corrale de Matos, H. G.; Wasmann, J.-W. A.; Catalani Morata, T.; de Freitas Alvarenga, K.; Bornia Jacob, L. C.
Abstract
Accurate health information is ineffective if patients cannot understand it. Research on Large Language Models (LLMs) in health prioritizes factual accuracy; however, linguistic accessibility remains an under-examined component of output quality and usability. This study investigated two sources of variability in readability classification: differences across LLM systems and differences across readability metrics. The analysis tested 1,120 data points from seven systems in English and Portuguese, comparing baseline responses with a Wikipedia-grounded condition. Content was assessed using five standard readability metrics that measure distinct aspects of text complexity. Systems were statistically homogeneous at baseline but became significantly heterogeneous under Wikipedia grounding, indicating that Retrieval-Augmented Generation introduces variability: the same source-grounding instruction produced different readability effects across systems. Significant metric variability was observed in all conditions, showing that readability metrics are not interchangeable. Although retrieval grounding is commonly used to improve accuracy, our findings reveal a trade-off: verified-source grounding can yield inconsistent readability. Therefore, evaluation protocols should use transparent, vendor-agnostic criteria with metric-specific and language-aware thresholds, and should be reapplied whenever models or grounding configurations change, to support accessible cross-language health communication.
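The abstract's claim that readability metrics are not interchangeable can be illustrated with a minimal sketch. The block below computes two well-known metrics, Flesch Reading Ease and the Gunning Fog index, using their published formulas and a deliberately naive vowel-group syllable heuristic; it is illustrative only and is not the paper's evaluation pipeline, which is not described here.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups (not dictionary-accurate).
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def _split(text: str):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    return sentences, words

def flesch_reading_ease(text: str) -> float:
    # FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences, words = _split(text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) \
           - 84.6 * (syllables / len(words))

def gunning_fog(text: str) -> float:
    # Fog = 0.4 * (words/sentences + 100 * complex_words/words),
    # where "complex" means three or more syllables.
    sentences, words = _split(text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

Because the two formulas weight sentence length and word complexity differently, the same text receives different scores on different scales, which is why the study argues for metric-specific thresholds rather than a single readability cutoff.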