
Hippocrates-o1: A Guideline-Aware, Orchestrated, Self-Refining Protocol for Specialty-Specific Clinical Reasoning

Wang, B.; Schaefer, E.; Aguirre, N.; Huang, J.; Li, X.; Kolamuri, S. R.; Ying, L.; Li, X.; Chao, G.; Ang, S. S.; Vallabhajosyula, P.; Krumholz, H.; Gibbs, K. E.; Pai, S. I.; Schneider, E. B.; Cohan, A.; Ong, C. S.

2025-12-05 surgery
10.64898/2025.12.04.25341678 medRxiv

Background: Clinical decision support requires language models that provide guideline-aligned, context-aware reasoning with clear justification. Many existing benchmarks emphasize multiple-choice or short-form question answering and mainly capture factual recall rather than longitudinal clinical reasoning from extended clinical notes. Hippocrates-o1 is a family of domain-tailored clinical reasoning pipelines that combine structured prompts, guideline-informed retrieval, and iterative self-refinement across oncology, general surgery, and vascular surgery.

Methods: Real-world head and neck cancer cases were drawn from the MIMIC-IV-Note database, with a subset (n=20) randomly selected for detailed annotation. Six physicians adjudicated treatment phase and intent using structured criteria and rated model outputs. For each case, we generated outputs using both a general-purpose baseline model (VanillaLLM) and our oncology-specific reasoning model, Hippocrates-Karkinos-o1. Experts evaluated the outputs across five dimensions on a scale of 1 to 5: Clinical Knowledge Application, Contextual Understanding, Reasoning Transparency, Chain-of-Thought Quality, and Hallucination Audit. Overall Reasoning was the mean of domain scores. To explore whether the approach could extend beyond oncology, we also processed inguinal hernia and aortic aneurysm cases through Hippocrates-Chirurgos-o1 and Hippocrates-Angios-o1 domain adaptations.

Results: Across paired ratings, Hippocrates-Karkinos-o1 improved Overall Reasoning from 3.40 ± 0.90 to 4.00 ± 0.73 (p < 0.001). Domain scores increased for Clinical Knowledge Application (2.87 ± 1.20 to 3.70 ± 1.03), Contextual Understanding (3.48 ± 0.95 to 3.98 ± 0.95), Hallucination Audit (3.90 ± 1.32 to 4.74 ± 0.76), Reasoning Transparency (3.45 ± 1.02 to 3.86 ± 0.87), and Chain-of-Thought Quality (3.32 ± 1.04 to 3.69 ± 1.00), all p ≤ 0.001. Surgical and vascular adaptations showed parallel qualitative improvements.

Conclusions: The Hippocrates-o1 protocol improved reasoning fidelity, guideline alignment, and factual grounding relative to a general-purpose model and generalized across oncology, surgery, and vascular care. Orchestrated retrieval and self-refinement provide a reproducible template for evaluating and enhancing clinical reasoning in medical AI.
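
The Methods define Overall Reasoning as the mean of the five domain scores. The sketch below is not the authors' analysis code. It only illustrates that definition by averaging the domain means reported in the abstract. The names `DOMAINS`, `baseline_means`, `karkinos_means`, and `overall_reasoning` are hypothetical. It also assumes every rating scored all five domains, in which case the mean of the domain means equals the mean of the per-rating overall scores.

```python
from statistics import mean

# The five rating dimensions named in the abstract.
DOMAINS = (
    "Clinical Knowledge Application",
    "Contextual Understanding",
    "Reasoning Transparency",
    "Chain-of-Thought Quality",
    "Hallucination Audit",
)

# Domain means as reported in the abstract (1 to 5 scale).
baseline_means = {
    "Clinical Knowledge Application": 2.87,
    "Contextual Understanding": 3.48,
    "Reasoning Transparency": 3.45,
    "Chain-of-Thought Quality": 3.32,
    "Hallucination Audit": 3.90,
}
karkinos_means = {
    "Clinical Knowledge Application": 3.70,
    "Contextual Understanding": 3.98,
    "Reasoning Transparency": 3.86,
    "Chain-of-Thought Quality": 3.69,
    "Hallucination Audit": 4.74,
}


def overall_reasoning(domain_scores):
    """Unweighted mean of the five domain scores (assumed equal weighting)."""
    return mean(domain_scores[d] for d in DOMAINS)


if __name__ == "__main__":
    print(f"VanillaLLM overall:  {overall_reasoning(baseline_means):.3f}")
    print(f"Karkinos-o1 overall: {overall_reasoning(karkinos_means):.3f}")
```

Under these assumptions the averages come to roughly 3.40 and 3.99. Those values match the reported Overall Reasoning means of 3.40 and 4.00 to within the rounding of the published domain means.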

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.