Advancements in Multilingual Biomedical Natural Language Processing: exploring Large Language Models for Named Entity Recognition and Linking

Mazzucato, S.; Seinen, T. M.; Moccia, S.; Micera, S.; Bandini, A.; van Mulligen, E. M.

2026-01-23 health informatics
10.64898/2026.01.22.26344605 medRxiv
Objective: Named Entity Recognition (NER) and Biomedical Entity Linking (BEL) are essential for transforming unstructured Electronic Health Records (EHRs) into structured information. However, tools for these tasks are limited for non-English biomedical texts, such as those in Dutch and Italian. This study investigates the use of prompt-based learning with Large Language Models (LLMs) to perform multilingual NER and BEL using minimal domain-specific data, while addressing annotation preservation during corpus translation.

Methods: An English-annotated corpus from the ShARe/CLEF dataset was translated into Dutch and Italian using a strategy that embeds annotations directly into the text prior to translation and retrieves them afterwards. GPT-4o was applied in zero-shot and few-shot settings to extract biomedical entities, which were then mapped to Unified Medical Language System Concept Unique Identifiers using contextual word embeddings. Performance was evaluated with precision, recall, and F1-score, and compared with gold-standard clinician annotations.

Results: The multilingual NER pipeline achieved strong performance, with an overall F1-score of 0.98 across languages. BEL experiments showed reliable entity normalization, with an overall accuracy of 0.91 and a mean reciprocal rank of 0.95. The combined performance of NER and BEL reached 0.90, supporting the utility of LLMs in standardizing biomedical concepts across languages.

Conclusion: Prompt-based LLMs can effectively perform NER and BEL in languages with fewer annotated resources, even with limited annotated training data. The proposed annotation-preserving translation method, combined with generative and discriminative LLM capabilities, provides a scalable approach to multilingual clinical information extraction. These findings highlight the potential for broader adoption of LLM-based natural language processing systems to support multilingual healthcare data harmonization.

Graphical Abstract: [Figure 1]

Highlights
- This study shows the feasibility of using prompt-based learning with large language models (LLMs) to perform multilingual named entity recognition (NER) and biomedical entity linking (BEL) in Dutch and Italian, two languages with fewer annotated resources.
- An annotation-preserving translation strategy was proposed to adapt the ShARe/CLEF eHealth corpus, enabling consistent evaluation across English, Dutch, and Italian without loss of gold-standard annotations.
- The multilingual NER pipeline achieved strong overall performance (F1-score: 0.89), while BEL experiments showed reliable entity normalization (F1-score: 0.64, MRR: 0.68) to standardized clinical concepts.
- The approach highlights the potential of generative and discriminative LLM capabilities for scalable multilingual clinical information extraction, supporting broader European initiatives for cross-lingual health data harmonization.
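
The annotation-preserving translation strategy described in the abstract (embed annotations in the text, translate, then retrieve them) can be illustrated with a minimal sketch. The abstract does not specify the marker format or the code used, so the `[[E1]]...[[/E1]]` tags, the function names `embed_annotations` and `retrieve_annotations`, and the placeholder labels are all assumptions made for illustration. The translation step itself is not shown; the sketch only assumes the translator leaves the tags intact.

```python
import re

# Hypothetical inline marker: [[E<n>]]surface text[[/E<n>]] (not from the paper).
TAG = re.compile(r"\[\[E(\d+)\]\](.*?)\[\[/E\1\]\]", re.DOTALL)


def embed_annotations(text, spans):
    """Wrap each annotated span in numbered markers before translation.

    spans: non-overlapping (start, end, label) character offsets.
    Returns the tagged text and a mapping from marker number to label.
    """
    ordered = sorted(spans)
    pieces, labels, cursor = [], {}, 0
    for i, (start, end, label) in enumerate(ordered, start=1):
        pieces.append(text[cursor:start])
        pieces.append(f"[[E{i}]]{text[start:end]}[[/E{i}]]")
        labels[i] = label
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), labels


def retrieve_annotations(tagged, labels):
    """Strip markers after translation and recompute character offsets."""
    pieces, spans, cursor, offset = [], [], 0, 0
    for match in TAG.finditer(tagged):
        before = tagged[cursor:match.start()]
        surface = match.group(2)
        pieces.append(before)
        offset += len(before)
        spans.append((offset, offset + len(surface), labels[int(match.group(1))]))
        pieces.append(surface)
        offset += len(surface)
        cursor = match.end()
    pieces.append(tagged[cursor:])
    return "".join(pieces), spans
```

Because the offsets are recomputed from the markers in the translated text, the labels stay attached to the translated surface forms even when word order or span length changes. A real pipeline would also need to handle markers that the translator drops or alters, which this sketch does not attempt.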

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

Rank. Journal | Papers in training set | Percentile | Predicted probability

1. Journal of Biomedical Informatics | 45 papers in training set | Top 0.1% | 33.2%
2. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.1% | 14.5%
3. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.1% | 6.4%
(50% of probability mass above)
4. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.5% | 4.9%
5. Biology Methods and Protocols | 53 papers in training set | Top 0.2% | 4.0%
6. JAMIA Open | 37 papers in training set | Top 0.4% | 3.9%
7. International Journal of Medical Informatics | 25 papers in training set | Top 0.4% | 3.6%
8. Computers in Biology and Medicine | 120 papers in training set | Top 1% | 3.1%
9. Frontiers in Digital Health | 20 papers in training set | Top 0.4% | 2.1%
10. JMIR Medical Informatics | 17 papers in training set | Top 0.6% | 1.9%
11. Scientific Reports | 3102 papers in training set | Top 58% | 1.7%
12. IEEE Journal of Biomedical and Health Informatics | 34 papers in training set | Top 1% | 1.5%
13. Bioinformatics | 1061 papers in training set | Top 8% | 1.2%
14. Journal of Medical Internet Research | 85 papers in training set | Top 3% | 1.2%
15. npj Digital Medicine | 97 papers in training set | Top 3% | 1.2%
16. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 0.7% | 1.0%
17. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 7% | 1.0%
18. BMJ Health & Care Informatics | 13 papers in training set | Top 0.7% | 0.9%
19. PLOS Digital Health | 91 papers in training set | Top 3% | 0.8%
20. Informatics in Medicine Unlocked | 21 papers in training set | Top 1% | 0.8%
21. PLOS ONE | 4510 papers in training set | Top 67% | 0.8%
22. BMC Medical Research Methodology | 43 papers in training set | Top 2% | 0.6%
23. Patterns | 70 papers in training set | Top 3% | 0.6%
24. Cureus | 67 papers in training set | Top 6% | 0.5%