Multi-label topic classification for COVID-19 literature annotation using an ensemble model based on PubMedBERT

Tian, S.; Zhang, J.

2021-10-29 bioinformatics
10.1101/2021.10.26.465946 bioRxiv
BioCreative VII Track 5 calls for participants to tackle a multi-label classification task: automated topic annotation of COVID-19 literature. In our participation, we evaluated several deep learning models built on PubMedBERT, a pre-trained language model, with different strategies for addressing the challenges of the task. Specifically, multi-instance learning was used to handle the large variation in article length, and a focal loss function was used to address the imbalance in the distribution of topics. The ensemble model performed best among all the models we tested. On the test set, our approach achieved an F1 score of 0.9247, which is significantly better than the baseline model (F1 score: 0.8678) and the mean of all submissions (F1 score: 0.8931).
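
As an illustration of the focal loss mentioned in the abstract, here is a minimal sketch for the multi-label setting, where each topic is treated as an independent binary label. The function name `binary_focal_loss` is hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are the commonly cited values from the original focal loss paper, not values reported by the authors; their actual implementation and hyperparameters are not given here.

```python
import math


def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean focal loss over independent binary labels.

    probs   -- predicted probabilities, one per label, each in [0, 1]
    targets -- ground-truth labels, each 0 or 1
    gamma   -- focusing parameter; 0 reduces to weighted cross-entropy (assumed default)
    alpha   -- weight for the positive class (assumed default)
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, targets):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        p_t = min(max(p_t, eps), 1.0)
        # The (1 - p_t)^gamma factor shrinks the loss of well-classified labels,
        # so rare, hard labels contribute relatively more.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With `gamma=0.0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, which makes a convenient sanity check.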

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Database | 51 papers in training set | Top 0.1% | 19.0%
2. Scientific Data | 174 papers in training set | Top 0.1% | 19.0%
3. Nucleic Acids Research | 1128 papers in training set | Top 2% | 8.4%
4. Bioinformatics | 1061 papers in training set | Top 4% | 7.0%
50% of probability mass above
5. BMC Bioinformatics | 383 papers in training set | Top 2% | 4.1%
6. Scientific Reports | 3102 papers in training set | Top 45% | 2.7%
7. GigaScience | 172 papers in training set | Top 1% | 1.9%
8. PLOS ONE | 4510 papers in training set | Top 53% | 1.7%
9. BioData Mining | 15 papers in training set | Top 0.3% | 1.7%
10. Bioinformatics Advances | 184 papers in training set | Top 3% | 1.7%
11. Journal of Biomedical Informatics | 45 papers in training set | Top 0.8% | 1.7%
12. Artificial Intelligence in the Life Sciences | 11 papers in training set | Top 0.1% | 1.3%
13. Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.3%
14. IEEE Access | 31 papers in training set | Top 0.5% | 1.3%
15. Research Synthesis Methods | 20 papers in training set | Top 0.2% | 1.1%
16. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.5% | 1.1%
17. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 1.0%
18. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 5% | 0.9%
19. Journal of Proteome Research | 215 papers in training set | Top 2% | 0.9%
20. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 41% | 0.9%
21. Computers in Biology and Medicine | 120 papers in training set | Top 4% | 0.8%
22. Nature Communications | 4913 papers in training set | Top 60% | 0.8%
23. Genome Biology | 555 papers in training set | Top 7% | 0.8%
24. Frontiers in Genetics | 197 papers in training set | Top 9% | 0.8%
25. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 9% | 0.8%
26. International Journal of Molecular Sciences | 453 papers in training set | Top 16% | 0.7%
27. Communications Biology | 886 papers in training set | Top 25% | 0.7%
28. Advanced Science | 249 papers in training set | Top 23% | 0.5%
29. Genomics | 60 papers in training set | Top 3% | 0.5%
30. Journal of the American Medical Informatics Association | 61 papers in training set | Top 2% | 0.5%
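
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The sketch below does that for the first five entries of the list; the function name `journals_to_reach` is hypothetical, and how the site itself computes the cutoff is not documented here.

```python
def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative) for the first ranked entries whose
    summed probability reaches the threshold, or None if it never does."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None


# Predicted probabilities (in percent) of the five top-ranked journals above.
top_five = [19.0, 19.0, 8.4, 7.0, 4.1]
result = journals_to_reach(top_five, 50.0)
```

For these values the threshold is first crossed at the fourth journal, with a cumulative share of about 53.4%; the first three sum to only 46.4%.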