Large Language Models in Radiology Reporting - A Systematic Review of Performance, Limitations, and Clinical Implications
Artsi, Y.; Klang, E.; Collins, J. D.; Glicksberg, B. S.; Korfiatis, P.; Nadkarni, G.; Sorin, V.
Background: Large language models (LLMs) have emerged as potential tools for automated radiology reporting. However, concerns regarding their fidelity, reliability, and clinical applicability remain. This systematic review examines the current literature on LLM-generated radiology reports.
Methods: We conducted a systematic search of MEDLINE, Google Scholar, Scopus, and Web of Science to identify studies published between January 2015 and February 2025. Studies evaluating LLM-generated radiology reports were included. The review follows PRISMA guidelines. Risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool.
Results: Nine studies met the inclusion criteria. Of these, six evaluated full radiology reports, while three focused on impression generation. Six studies assessed base LLMs, and three evaluated fine-tuned models. Fine-tuned models demonstrated better alignment with expert evaluations and achieved higher performance on natural language processing metrics compared to base models. All LLMs exhibited hallucinations, misdiagnoses, and inconsistencies.
Conclusion: LLMs show promise in radiology reporting. However, limitations in diagnostic accuracy and hallucinations necessitate human oversight. Future research should focus on improving evaluation frameworks, incorporating diverse datasets, and prospectively validating AI-generated reports in clinical workflows.