
Named Entity Recognition and Linking: a Portuguese and Spanish Oncological Parallel Corpus

Andrade, V. D. T.; Ruas, P.; Couto, F. M.

2021-09-22 bioinformatics
10.1101/2021.09.16.460605 bioRxiv

Biomedical literature is the main means of communication for researchers to share their findings. Because this literature consists of a large collection of natural-language text, text mining tools that automatically extract information from it are of utmost importance. However, most state-of-the-art tools were not developed to handle languages other than English, a limitation that is especially critical in the biomedical domain because a significant part of health-related texts is written in the authors' native language. To address this issue, this work presents a deep learning NERL (Named Entity Recognition and Linking) system and a parallel corpus for the Spanish and Portuguese languages, focused on the oncological domain. Both the system and the corpus are available at https://github.com/lasigeBioTM/ICERL_system-ICR_Corpus.

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Database: 51 papers in training set, Top 0.1%, 49.1%
2. PLOS ONE: 4510 papers in training set, Top 34%, 4.3%
50% of probability mass above
3. Journal of Biomedical Informatics: 45 papers in training set, Top 0.3%, 4.3%
4. Bioinformatics: 1061 papers in training set, Top 6%, 2.7%
5. BMC Bioinformatics: 383 papers in training set, Top 3%, 2.4%
6. Scientific Reports: 3102 papers in training set, Top 49%, 2.1%
7. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.2%, 2.1%
8. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 1%, 2.1%
9. Biology Methods and Protocols: 53 papers in training set, Top 0.8%, 1.7%
10. BioData Mining: 15 papers in training set, Top 0.4%, 1.5%
11. Scientific Data: 174 papers in training set, Top 1%, 1.5%
12. IEEE Access: 31 papers in training set, Top 0.5%, 1.3%
13. GigaScience: 172 papers in training set, Top 2%, 1.3%
14. JMIR Medical Informatics: 17 papers in training set, Top 1%, 1.3%
15. Computers in Biology and Medicine: 120 papers in training set, Top 3%, 1.1%
16. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 7%, 0.9%
17. Nucleic Acids Research: 1128 papers in training set, Top 15%, 0.9%
18. Bioinformatics Advances: 184 papers in training set, Top 4%, 0.9%
19. Neuroinformatics: 40 papers in training set, Top 0.8%, 0.8%
20. Informatics in Medicine Unlocked: 21 papers in training set, Top 1%, 0.8%
21. iScience: 1063 papers in training set, Top 31%, 0.8%
22. JAMIA Open: 37 papers in training set, Top 2%, 0.7%
23. Frontiers in Physiology: 93 papers in training set, Top 6%, 0.7%
24. BMJ Health & Care Informatics: 13 papers in training set, Top 1%, 0.7%
25. PLOS Digital Health: 91 papers in training set, Top 3%, 0.7%
26. Bioengineering: 24 papers in training set, Top 2%, 0.7%