
clickBrick Prompt Engineering: Optimizing Large Language Model Performance in Clinical Psychiatry

Verhees, F. G.; Huth, F.; Meyer, V.; Wolf, F.; Bauer, M.; Pfennig, A.; Ritter, P.; Kather, J. N.; Wiest, I. C.; Mikolas, P.

Posted 2025-06-30 | psychiatry and clinical psychology
DOI: 10.1101/2025.06.28.25330267 | medRxiv

Background: Prompt engineering has the potential to enhance large language models' (LLMs') ability to solve tasks through improved in-context learning. In clinical research, LLMs have shown expert-level performance on a variety of tasks ranging from pathology slide classification to identifying suicidality. We introduce clickBrick, a modular prompt-engineering framework, and rigorously test its effectiveness.

Methods: We explore the effects of increasingly structured prompts, built with the clickBrick framework, for a comprehensive psychopathological assessment of 100 index patients from psychiatric electronic health records. We compare the performance of a locally run LLM (Llama-3.1-70B-Instruct) against an expert-labelled ground truth across a series of successively built-up prompts for the extraction of 12 transdiagnostic psychopathological criteria. Potential clinical value was explored by training linear support vector machines on outputs from the strongest and weakest prompts to predict discharge ICD-10 main diagnoses for a historical sample of 1,692 patients.

Outcomes: We reliably extracted information across 12 distinct psychopathological classification tasks from unstructured clinical text, with balanced accuracies spanning 71% to 94%. Across tasks, clickBrick substantially improved extraction accuracy (by +19% to +36%). The comparison revealed substantial variation between prompts, with a reasoning prompt performing best in 7 of 12 domains. Clinical value and internal validity were approximated by downstream classification of eventual psychiatric diagnoses for 1,692 patients, where clickBrick improved overall classification accuracy from 71% to 76%.

Interpretation: clickBrick prompt engineering, i.e. iterative, expert-led design and testing, is critical for unlocking LLMs' clinical potential. The framework offers a reproducible pathway for deploying trustworthy generative AI across mental health and other clinical fields.

Funding: The German Ministry of Research, Technology and Space and the German Research Foundation.

Research in context

Evidence before this study: We searched PubMed/MEDLINE for articles without language restrictions published before June 25, 2025, that combined three concept blocks: "prompt engineering" or related synonyms; "large language model"/"LLM" or specific model names (e.g., ChatGPT, GPT-4, LLaMA); and psychiatric or mental-health terms (e.g., psychiatry, psychotherapy, depression, anxiety). Additionally, we asked ChatGPT o3 to design and execute a systematic review strategy, given only our manuscript title, so as to also capture relevant pre-prints that had not yet been peer reviewed. After manual de-duplication and abstract screening, three of 23 identified studies offered at least some information on their prompting strategies and were conducted on real-world clinical data: from psychotherapy transcripts (one non-peer-reviewed study on multi-dimensional counselling therapy) or from online patient portal queries (two peer-reviewed studies, on (a) empathy evaluation and (b) provider satisfaction and use of generated responses, with partial integration with electronic health records). None systematically structured their prompts in a transparent way, and none tested reasoning prompts. Beyond psychiatry, one study analysing automated echocardiography reports did employ a comparison between two different prompts and an expert-led design strategy. A single study used structured, transparent prompt engineering to generate automated responses for simulated problem-solving therapy sessions. None of the highlighted studies reported both head-to-head comparisons of competing prompt strategies for full reproducibility and their application in real-world care, e.g. on electronic health records. Collectively, the existing literature suggests growing interest but reveals a paucity of rigorous evidence on how prompt engineering affects large language model performance in clinical psychiatry, particularly in real-world settings.

Added value of this study: We demonstrate reliable information extraction from electronic health records across 12 distinct psychopathological classification tasks in unstructured clinical text, with substantially improved extraction accuracy (by +19% to +36%) using clickBrick, our prompt-engineering framework. The rationale for such an approach is supported by the surprising finding that zero-shot, few-shot and reasoning prompts each perform best on different tasks, with a Chain-of-Thought reasoning prompt performing best in 7 of 12 tasks. And while most studies rely on proprietary language models such as OpenAI's ChatGPT, our locally run version of a popular open-weight model (Llama-3.1-70B-Instruct) safeguards the privacy of sensitive patient data, which is essential for ethical clinical application.

Implications of all the available evidence: Generative artificial intelligence is poised to benefit psychiatric patients greatly, powering advances from therapy delivery to decision support and patient outreach. Rigorous prompt engineering with tools like clickBrick heightens its reliability and credibility, making clickBrick a cornerstone for bringing AI into everyday psychiatric care.
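The abstract's head-to-head prompt comparison rests on balanced accuracy, i.e. the mean of per-class recall, which is robust to the class imbalance typical of symptom labels. A minimal sketch of how two prompt variants could be scored against expert labels for one extraction task — all labels and predictions below are invented for illustration, not data from the study:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical expert labels for one psychopathological criterion
# (1 = symptom present, 0 = absent) and two prompt variants' outputs.
ground_truth   = [1, 1, 1, 0, 0, 0, 0, 0]
zero_shot_pred = [1, 0, 0, 0, 0, 0, 0, 1]   # weaker prompt
reasoning_pred = [1, 1, 0, 0, 0, 0, 0, 0]   # stronger (reasoning) prompt

for name, pred in [("zero-shot", zero_shot_pred),
                   ("reasoning", reasoning_pred)]:
    score = balanced_accuracy(ground_truth, pred)
    print(f"{name}: balanced accuracy = {score:.2f}")
# prints: zero-shot: balanced accuracy = 0.57
#         reasoning: balanced accuracy = 0.83
```

In the study's pipeline, per-task scores like these would be computed for every prompt variant across all 12 criteria; the strongest and weakest variants' outputs then feed the downstream linear SVM diagnosis classifier.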

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | npj Digital Medicine | 97 | Top 0.3% | 16.8%
2 | Nature Medicine | 117 | Top 0.1% | 16.8%
3 | Acta Psychiatrica Scandinavica | 10 | Top 0.1% | 9.7%
4 | Frontiers in Psychiatry | 83 | Top 0.8% | 4.7%
5 | Journal of the American Medical Informatics Association | 61 | Top 0.8% | 3.4%
(50% of probability mass above)
6 | Frontiers in Digital Health | 20 | Top 0.4% | 2.6%
7 | Nature Communications | 4913 | Top 45% | 2.5%
8 | Nature | 575 | Top 9% | 2.5%
9 | European Psychiatry | 10 | Top 0.3% | 1.8%
10 | Computational Psychiatry | 12 | Top 0.1% | 1.8%
11 | Proceedings of the National Academy of Sciences | 2130 | Top 30% | 1.8%
12 | Psychological Medicine | 74 | Top 1% | 1.6%
13 | Communications Psychology | 20 | Top 0.1% | 1.6%
14 | PLOS Digital Health | 91 | Top 2% | 1.6%
15 | Scientific Reports | 3102 | Top 63% | 1.4%
16 | European Journal of Human Genetics | 49 | Top 0.7% | 1.4%
17 | JAMA Network Open | 127 | Top 3% | 1.2%
18 | PLOS ONE | 4510 | Top 61% | 1.2%
19 | Nature Human Behaviour | 85 | Top 3% | 1.2%
20 | BMC Medicine | 163 | Top 5% | 1.1%
21 | Nature Neuroscience | 216 | Top 6% | 0.9%
22 | Journal of Medical Internet Research | 85 | Top 5% | 0.7%
23 | Journal of Affective Disorders | 81 | Top 2% | 0.7%
24 | PLOS Computational Biology | 1633 | Top 26% | 0.7%
25 | eBioMedicine | 130 | Top 5% | 0.7%
26 | Genome Medicine | 154 | Top 10% | 0.6%
27 | Translational Psychiatry | 219 | Top 5% | 0.6%
28 | BJPsych Open | 25 | Top 0.8% | 0.6%
29 | JAMA Psychiatry | 13 | Top 0.7% | 0.6%
30 | Schizophrenia Bulletin | 29 | Top 0.7% | 0.6%