clickBrick Prompt Engineering: Optimizing Large Language Model Performance in Clinical Psychiatry
Verhees, F. G.; Huth, F.; Meyer, V.; Wolf, F.; Bauer, M.; Pfennig, A.; Ritter, P.; Kather, J. N.; Wiest, I. C.; Mikolas, P.
Background
Prompt engineering has the potential to enhance large language models' (LLMs') ability to solve tasks through improved in-context learning. In clinical research, the use of LLMs has shown expert-level performance on a variety of tasks, ranging from pathology slide classification to identifying suicidality. We introduce clickBrick, a modular prompt-engineering framework, and rigorously test its effectiveness.

Methods
Here, we explore the effect of increasingly structured prompts, built with the clickBrick framework, on a comprehensive psychopathological assessment of 100 index patients from psychiatric electronic health records. We compare the performance of a locally run LLM (Llama-3.1-70B-Instruct) against an expert-labelled ground truth for a variety of successively built-up prompts for the extraction of 12 transdiagnostic psychopathological criteria. Potential clinical value was explored by training linear support vector machines on outputs from the strongest and weakest prompts to predict discharge ICD-10 main diagnoses for a historical sample of 1,692 patients.

Outcomes
We could reliably extract information across 12 distinct psychopathological classification tasks from unstructured clinical text, with balanced accuracies spanning 71% to 94%. Across tasks, we observed substantially improved extraction accuracy (between +19% and +36%) using clickBrick. The comparison revealed large variation between prompts, with a reasoning prompt performing best in 7 out of 12 domains. Clinical value and internal validity were approximated by downstream classification of eventual psychiatric diagnoses for 1,692 patients. Here, clickBrick led to an improvement in overall classification accuracy from 71% to 76%.

Interpretation
clickBrick prompt engineering, i.e. iterative, expert-led design and testing, is critical for unlocking LLMs' clinical potential. The framework offers a reproducible pathway for deploying trustworthy generative AI across mental health and other clinical fields.

Funding
The German Ministry of Research, Technology and Space and the German Research Foundation.

Research in context

Evidence before this study
We searched PubMed/MEDLINE for articles without language restrictions published before June 25, 2025 that combined three concept blocks: "prompt engineering" or related synonyms; "large language model/LLM" or specific model names (e.g., ChatGPT, GPT-4, LLaMA); and psychiatric or mental-health terms (e.g., psychiatry, psychotherapy, depression, anxiety). Additionally, we asked ChatGPT o3, given only our manuscript title, to design and execute a systematic review strategy to also capture relevant preprints that have not yet been peer reviewed. After manual de-duplication and abstract screening, three of the 23 identified studies offered at least some information on their prompting strategies and were conducted on real-world clinical data, either from psychotherapy transcripts (one non-peer-reviewed study on multi-dimensional counselling therapy) or from online patient portal queries (two peer-reviewed studies on (a) empathy evaluation and (b) provider satisfaction and use of generated responses, with partial integration into electronic health records). None systematically structured their prompts in a transparent way or tested reasoning prompts. Beyond psychiatry, one study analyzing automated echocardiography reports did compare two different prompts and employed an expert-led design strategy.
A single study used structured and transparent prompt engineering to generate automated responses for simulated problem-solving therapy sessions. None of the highlighted studies reported both head-to-head comparisons of competing prompt strategies for full reproducibility and their application in real-world care, e.g. on electronic health records. Collectively, the existing literature suggests growing interest but reveals a paucity of rigorous evidence on how prompt engineering impacts large language model performance in clinical psychiatry, particularly in real-world settings.

Added value of this study
We demonstrate reliable information extraction from electronic health records across 12 distinct psychopathological classification tasks on unstructured clinical text, and substantially improved extraction accuracy (between +19% and +36%) using clickBrick, our prompt-engineering framework. The rationale for such an approach is underscored by the surprising finding that zero-shot, few-shot and reasoning prompts each performed best on different tasks, with a Chain-of-Thought reasoning prompt performing best in 7 out of 12 tasks. While most studies rely on proprietary language models such as OpenAI's ChatGPT, our locally run instance of a popular open-weight model (Llama-3.1-70B-Instruct) safeguards the privacy of sensitive patient data, which is essential for ethical clinical application.

Implications of all the available evidence
Generative artificial intelligence is poised to benefit psychiatric patients greatly, powering advances from therapy delivery to decision support and patient outreach. Rigorous prompt engineering with tools like clickBrick heightens the reliability and credibility of such systems, making clickBrick a cornerstone for bringing AI into everyday psychiatric care.
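To make the prompt comparison in Methods concrete, below is a minimal sketch of how two successively structured prompt variants (zero-shot vs. Chain-of-Thought) for a single psychopathological criterion could be sent to a locally hosted Llama-3.1-70B-Instruct through an OpenAI-compatible endpoint. The endpoint URL, the prompt wording, and the "depressed mood" criterion are illustrative assumptions; the abstract does not specify clickBrick's actual prompt text or serving stack.

```python
"""Illustrative sketch (not the authors' code): querying a locally hosted
Llama-3.1-70B-Instruct via an assumed OpenAI-compatible endpoint with two
of the prompt styles compared in the study (zero-shot vs. Chain-of-Thought)."""
import requests

LLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server (e.g., vLLM)
MODEL = "meta-llama/Llama-3.1-70B-Instruct"

# Hypothetical prompt "bricks" for a single criterion; the real clickBrick
# wording is not given in the abstract.
ZERO_SHOT = (
    "You are a psychiatric documentation assistant. Decide whether the "
    "clinical note below describes depressed mood. Answer with JSON: "
    '{"depressed_mood": true/false}.'
)
CHAIN_OF_THOUGHT = (
    ZERO_SHOT
    + " First list the relevant statements from the note, then reason step "
      "by step, and only then output the JSON verdict on the last line."
)

def classify(note: str, system_prompt: str) -> str:
    """Send one note with one prompt variant and return the raw model reply."""
    payload = {
        "model": MODEL,
        "temperature": 0,  # deterministic output for reproducible extraction
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": note},
        ],
    }
    response = requests.post(LLM_URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    note = "Patient reports persistent low mood and loss of interest for weeks."
    for name, prompt in [("zero-shot", ZERO_SHOT), ("chain-of-thought", CHAIN_OF_THOUGHT)]:
        print(name, "->", classify(note, prompt))
```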
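The downstream evaluation described in Methods (balanced accuracy of extracted criteria against expert labels, then a linear support vector machine on the criteria to predict the ICD-10 main discharge diagnosis) could be sketched with scikit-learn as below. The synthetic data, diagnosis groupings, and hyperparameters are assumptions for illustration only, not the authors' pipeline.

```python
"""Illustrative sketch (not the authors' pipeline): scoring extracted
criteria with balanced accuracy, then training a linear SVM on 12 binary
psychopathology features to predict the ICD-10 main discharge diagnosis.
All data here are randomly generated stand-ins."""
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Per-criterion agreement with the expert-labelled ground truth (assumed arrays).
expert_labels = rng.integers(0, 2, size=100)           # ground truth for one criterion
llm_labels = np.where(rng.random(100) < 0.85,           # simulated extractor, ~85% agreement
                      expert_labels, 1 - expert_labels)
print("balanced accuracy:", balanced_accuracy_score(expert_labels, llm_labels))

# Downstream classification: 12 binary criteria per patient, diagnosis group as target.
X = rng.integers(0, 2, size=(1692, 12))
y = rng.choice(["F2", "F3", "F4"], size=1692)           # assumed ICD-10 diagnosis groups
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = LinearSVC(C=1.0).fit(X_train, y_train)
print("held-out diagnosis accuracy:", clf.score(X_test, y_test))
```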