
Combining Clinician Expertise with Prompt Engineering enhances Small Language Models Reliability for Cancer Entity Recognition in Electronic Health Records

Corso, F.; Peppoloni, V.; Mazzeo, L.; Leone, G.; Passos, L.; Miskovic, V.; Armanini, J.; Ferrarin, A.; Wiest, I. C.; Wolf, F.; Montelatici, G.; Romano', R.; Ambrosini, P.; Capoccia, T.; Natangelo, S.; Rota, S.; Andena, P.; De Ponti, M.; Russo, A.; Stasi, G.; Provenzano, L.; Spagnoletti, A.; Meazza Prina, M.; Cavalli, C.; Giani, C.; Serino, R.; Borraccino, M.; Bonalume, C.; Di Mauro, R. M.; Agosta, C.; Dumitrascu, A. D.; Di Liberti, G.; Corrao, G.; Beninato, T.; Ganzinelli, M.; Occhipinti, M.; Brambilla, M.; Proto, C.; Kather, J. N.; Pedrocchi, A. L. G.; De Braud, F.; Lo Russo, G.; Baili, P.; P

2025-10-21 oncology

10.1101/2025.10.16.25337917 medRxiv


Real-world data (RWD), largely stored in unstructured electronic health records (EHRs), are critical for understanding complex diseases like cancer. However, extracting structured information from these narratives is challenging due to linguistic variability, semantic complexity, and privacy concerns. This study evaluates the performance of four locally deployable small language models (SLMs), namely LLaMA, Mistral, BioMistral, and MedLLaMA, for information extraction (IE) from Italian EHRs within the APOLLO 11 trial on non-small cell lung cancer (NSCLC). We examined three prompting strategies (zero-shot, few-shot, and annotated few-shot) in English and Italian, involving clinicians with varying expertise to assess the impact of prompt design on accuracy. Results show that general-purpose models (e.g., LLaMA 3.1 8B) outperform biomedical models in most tasks, particularly in extracting binary features. Multiclass variables such as TNM staging, PD-L1, and ECOG were more difficult to extract due to implicit language and lack of standardization. Few-shot prompting and native-language inputs significantly improved performance and reduced hallucinations. Clinical expertise enhanced consistency in annotation, particularly among students using annotated examples. The study confirms that privacy-preserving SLMs can be deployed locally for efficient and secure cancer data extraction. The findings highlight the need for hybrid systems combining SLMs with expert input and underline the importance of aligning clinical documentation practices with SLM capabilities. This is the first study to benchmark SLMs on Italian EHRs and investigate the role of clinical expertise in prompt engineering, offering valuable insights for the future integration of SLMs into real-world clinical workflows.

