From Concept to Clinic: Real World Evidence for Autonomous AI Deployment in Primary Care Telemedicine

Saenz, A. D.; Schumacher, E.; Naik, D.; Khosla, N.; Kannan, A.

2026-03-20 health informatics
10.64898/2026.03.18.26348749 medRxiv

Systems powered by large language models are widely used for health information and advice, yet robust evidence for their safety and effectiveness in real-world clinical care remains lacking. Most existing studies evaluate general-purpose chatbots in artificial settings, failing to account for the critical role of system design, deployment context, and integrated safety mechanisms. Here, we report, to our knowledge, the first large-scale, clinician-blinded, real-world evaluation of a multi-agent LLM-based system deployed within a nationwide U.S. primary care telemedicine platform, assessing readiness for task-specific autonomous deployment. In 2,379 real patient encounters, where users actively sought medical care and completed full visits with licensed clinicians, we compared the AI system's intake diagnoses and disposition suggestions to those of treating clinicians, who were blinded to the AI's outputs. The AI's top-1 diagnosis matched the clinician's diagnosis in 91.3% of cases overall, increasing to 96.3% among cases meeting a pre-specified safety confidence threshold, and 97.9% in common, lower-complexity conditions that met the same confidence threshold. Disposition accuracy was similarly high, with an overall error rate of 2.5% and no errors in suggestions to emergency room or home management. These results demonstrate that purposeful system architecture, rather than model capability alone, is essential for safe and effective autonomous clinical AI. We propose a staged, task-calibrated deployment framework, in which AI can be introduced autonomously for well-defined tasks with explicit safety gating and continuous monitoring, expanding scope as real-world evidence accrues. Our findings provide the first real-world evidence of readiness for safe autonomous clinical AI and offer a practical roadmap for its responsible deployment at scale.
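
The abstract's headline metrics are top-1 diagnostic agreement, reported both overall and within the subset of cases that meet a safety confidence threshold. The following is a minimal sketch of how such a gated agreement rate could be computed, not the authors' implementation. The `Encounter` fields, the `top1_agreement` helper, the 0.8 threshold, and the toy records are all hypothetical and chosen only for illustration; the study's actual matching criteria and threshold are not given in the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Encounter:
    """One patient encounter. All field names are illustrative assumptions."""
    ai_top1: str        # AI system's top-ranked intake diagnosis
    clinician_dx: str   # diagnosis recorded by the blinded treating clinician
    confidence: float   # AI safety confidence score, assumed to lie in [0, 1]


def top1_agreement(
    encounters: Iterable[Encounter],
    min_confidence: Optional[float] = None,
) -> Tuple[int, int, Optional[float]]:
    """Return (matches, eligible, rate) for top-1 agreement.

    If min_confidence is given, only encounters whose confidence is at or
    above it are counted (the "gated" subset). The rate is None when no
    encounter is eligible, to avoid dividing by zero.
    """
    eligible = [
        e for e in encounters
        if min_confidence is None or e.confidence >= min_confidence
    ]
    # Exact string equality stands in for whatever matching rule a real
    # evaluation would use (e.g. clinician-adjudicated equivalence).
    matches = sum(1 for e in eligible if e.ai_top1 == e.clinician_dx)
    rate = matches / len(eligible) if eligible else None
    return matches, len(eligible), rate


# Toy records for illustration only; these are not study data.
TOY_ENCOUNTERS = [
    Encounter("urinary tract infection", "urinary tract infection", 0.95),
    Encounter("sinusitis", "sinusitis", 0.90),
    Encounter("migraine", "tension headache", 0.40),
    Encounter("pharyngitis", "pharyngitis", 0.85),
    Encounter("reflux", "reflux", 0.55),
]

if __name__ == "__main__":
    print("overall:", top1_agreement(TOY_ENCOUNTERS))
    print("gated at 0.8:", top1_agreement(TOY_ENCOUNTERS, min_confidence=0.8))
```

Reporting the eligible count alongside the rate matters in this kind of analysis: a gated accuracy figure is only interpretable together with the share of encounters that pass the gate.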

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.