How Agent Role Structure Alters Operating Characteristics of Large Language Model Clinical Classifiers: A Comparative Study of Specialist and Deliberative Multi-Agent Protocols
Anderson, C. G.
Large language models (LLMs) are increasingly deployed in structured clinical decision support, yet the architectural effects of internal role decomposition within multi-agent systems remain poorly isolated. Prior comparisons of single-agent and multi-agent prompting frequently confound workflow structure with changes in model configuration, training, or decoding. We present a controlled architectural study of role-structured inference under fixed model parameters, isolating internal role decomposition as the sole manipulated variable. Two deterministic multi-agent protocols, Generic Deliberative (GD) and Feature-Specialist (FS), are evaluated under identical base weights, decoding settings, computational budget, and adjudication logic. Across two tabular clinical benchmarks (UCI Cleveland Heart Disease and Pima Indians Diabetes), altering role structure alone systematically reshapes operating characteristics. On Cleveland, FS improves accuracy by 0.07 and macro-F1 by 0.06 relative to GD, while shifting the operating point toward higher specificity (+0.22) and lower sensitivity (-0.13), substantially reducing false positives. On Pima, architectural effects reverse direction: GD achieves the strongest macro performance (accuracy 0.68, macro-F1 0.64), whereas FS induces pronounced class asymmetry (recall 0.95 for the positive class and 0.27 for the negative class). These findings demonstrate that internal role decomposition functions as a structured inductive bias that can materially alter error distributions without modifying model parameters. Multi-agent prompt architecture should therefore be treated as an explicit mechanism for controlling sensitivity-specificity trade-offs in safety-sensitive LLM decision systems.
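The abstract compares architectures in terms of accuracy, macro-F1, sensitivity, and specificity. As a minimal sketch of how those operating characteristics relate to a binary confusion matrix, the following uses standard metric definitions with illustrative counts (the numbers are hypothetical, not the paper's data):

```python
def operating_characteristics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts.

    These are the generic textbook definitions; they are not taken
    from the paper's evaluation code.
    """
    sensitivity = tp / (tp + fn)               # recall on the positive class
    specificity = tn / (tn + fp)               # recall on the negative class
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Per-class F1, then the unweighted (macro) average.
    prec_pos = tp / (tp + fp)
    f1_pos = 2 * prec_pos * sensitivity / (prec_pos + sensitivity)
    prec_neg = tn / (tn + fn)
    f1_neg = 2 * prec_neg * specificity / (prec_neg + specificity)
    return {
        "accuracy": accuracy,
        "macro_f1": (f1_pos + f1_neg) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }

# Illustrative only: shifting errors from false positives toward false
# negatives raises specificity and lowers sensitivity, the kind of
# operating-point shift the abstract attributes to role structure.
gd_like = operating_characteristics(tp=40, fp=20, fn=10, tn=30)
fs_like = operating_characteristics(tp=33, fp=8, fn=17, tn=42)
```

This makes concrete why two systems with similar accuracy can occupy very different operating points: the same total error count, distributed differently across false positives and false negatives, yields different sensitivity-specificity pairs.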