Combining Clinician Expertise with Prompt Engineering enhances Small Language Models Reliability for Cancer Entity Recognition in Electronic Health Records

Corso, F.; Peppoloni, V.; Mazzeo, L.; Leone, G.; Passos, L.; Miskovic, V.; Armanini, J.; Ferrarin, A.; Wiest, I. C.; Wolf, F.; Montelatici, G.; Romano', R.; Ambrosini, P.; Capoccia, T.; Natangelo, S.; Rota, S.; Andena, P.; De Ponti, M.; Russo, A.; Stasi, G.; Provenzano, L.; Spagnoletti, A.; Meazza Prina, M.; Cavalli, C.; Giani, C.; Serino, R.; Borraccino, M.; Bonalume, C.; Di Mauro, R. M.; Agosta, C.; Dumitrascu, A. D.; Di Liberti, G.; Corrao, G.; Beninato, T.; Ganzinelli, M.; Occhipinti, M.; Brambilla, M.; Proto, C.; Kather, J. N.; Pedrocchi, A. L. G.; De Braud, F.; Lo Russo, G.; Baili, P.; P

2025-10-21 oncology
10.1101/2025.10.16.25337917 medRxiv
Real-world data (RWD), largely stored in unstructured electronic health records (EHRs), are critical for understanding complex diseases like cancer. However, extracting structured information from these narratives is challenging due to linguistic variability, semantic complexity, and privacy concerns. This study evaluates the performance of four small, locally deployable language models (SLMs), LLaMA, Mistral, BioMistral, and MedLLaMA, for information extraction (IE) from Italian EHRs within the APOLLO 11 trial on non-small cell lung cancer (NSCLC). We examined three prompting strategies (zero-shot, few-shot, and annotated few-shot) across English and Italian, involving clinicians with varying expertise to assess prompt design's impact on accuracy. Results show that general-purpose models (e.g., LLaMA 3.1 8B) outperform biomedical models in most tasks, particularly in extracting binary features. Multiclass variables such as TNM staging, PD-L1, and ECOG were more difficult due to implicit language and lack of standardization. Few-shot prompting and native-language inputs significantly improved performance and reduced hallucinations. Clinical expertise enhanced consistency in annotation, particularly among students using annotated examples. The study confirms that privacy-preserving SLMs can be deployed locally for efficient and secure cancer data extraction. Findings highlight the need for hybrid systems combining SLMs with expert input and underline the importance of aligning clinical documentation practices with SLM capabilities. This is the first study to benchmark SLMs on Italian EHRs and investigate the role of clinical expertise in prompt engineering, offering valuable insights for the future integration of SLMs into real-world clinical workflows.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine | 97 papers in training set | Top 0.3% | 17.2%
2. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.1% | 12.3%
3. Journal of Biomedical Informatics | 45 papers in training set | Top 0.1% | 12.1%
4. Scientific Reports | 3102 papers in training set | Top 15% | 6.7%
5. Biology Methods and Protocols | 53 papers in training set | Top 0.1% | 6.3%
50% of probability mass above
6. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.1% | 6.2%
7. Database | 51 papers in training set | Top 0.1% | 4.8%
8. iScience | 1063 papers in training set | Top 4% | 3.8%
9. PLOS ONE | 4510 papers in training set | Top 40% | 3.5%
10. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 1% | 2.0%
11. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.7%
12. Frontiers in Bioinformatics | 45 papers in training set | Top 0.3% | 1.6%
13. Computers in Biology and Medicine | 120 papers in training set | Top 3% | 1.3%
14. Frontiers in Oncology | 95 papers in training set | Top 3% | 0.9%
15. International Journal of Medical Informatics | 25 papers in training set | Top 1% | 0.9%
16. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 44% | 0.8%
17. BMC Bioinformatics | 383 papers in training set | Top 7% | 0.7%
18. JAMA Network Open | 127 papers in training set | Top 5% | 0.7%
19. JMIR Formative Research | 32 papers in training set | Top 2% | 0.6%
20. EClinicalMedicine | 21 papers in training set | Top 1% | 0.6%
21. NAR Genomics and Bioinformatics | 214 papers in training set | Top 4% | 0.6%
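
The 50% cutoff in the ranking can be checked with a running total over the listed probabilities. The following is a minimal sketch, assuming the cutoff is placed at the first rank where the cumulative probability reaches 50%; the page does not state how it is actually computed, and the function name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (percent) of the six top-ranked journals,
# copied from the ranking above.
probabilities = [17.2, 12.3, 12.1, 6.7, 6.3, 6.2]


def journals_to_reach(probs, threshold):
    """Return how many top-ranked entries are needed before the running
    total first reaches `threshold`, or None if it never does.

    Assumption: this first-crossing rule is illustrative only.
    """
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


cutoff = journals_to_reach(probabilities, 50.0)
print(f"Journals needed to reach 50%: {cutoff}")
```

Under this assumption the first four journals sum to 48.3% and the fifth brings the total to 54.6%, which agrees with the marker placed after rank 5.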