VaxLLM: Leveraging Fine-tuned Large Language Model for automated annotation of Brucella Vaccines

Li, X.; Zheng, Y.; Hu, J.; Zheng, J.; Wang, Z.; He, Y.

2024-11-26 bioinformatics
10.1101/2024.11.25.625209 bioRxiv
Background: Vaccines play a vital role in enhancing immune defense and protecting hosts against a wide range of diseases. However, vaccine annotation remains a labor-intensive task due to the ever-increasing volume of scientific literature. This study explores the application of Large Language Models (LLMs) to automate the classification and annotation of scientific literature on vaccines, exemplified by Brucella vaccines.

Results: We developed a pipeline that automatically classifies and annotates Brucella vaccine-related articles using their titles and abstracts. The pipeline includes VaxLLM (Vaccine Large Language Model), a fine-tuned Llama 3 model. VaxLLM classifies articles by identifying the presence of vaccine formulations and extracts key information about each vaccine, including the vaccine antigen, vaccine formulation, vaccine platform, host species used as animal models, and the experiments used to investigate the vaccine. The model demonstrated high performance in classification (Precision: 0.90, Recall: 1.0, F1-score: 0.95) and annotation accuracy (97.9%), significantly outperforming the corresponding non-fine-tuned Llama 3 model. The outputs from VaxLLM are presented in a structured format to facilitate integration into databases such as the VIOLIN vaccine knowledgebase. To further enhance the accuracy and depth of the Brucella vaccine annotations, the pipeline also incorporates PubTator, enabling cross-comparison with VaxLLM annotations and supporting downstream analyses such as gene enrichment.

Conclusion: VaxLLM rapidly and accurately extracts detailed, itemized vaccine information from publications, significantly outperforming traditional annotation methods in both speed and precision. VaxLLM also shows great potential for automating knowledge extraction in vaccine research.

Availability: All data are available at https://github.com/xingxianli/VaxLLM, and the model has also been uploaded to Hugging Face (https://huggingface.co/Xingxian123/VaxLLM).
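The reported F1-score can be sanity-checked from the reported precision and recall, since F1 is their harmonic mean. A minimal sketch (the `f1_score` helper below is illustrative, not from the paper):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported in the abstract.
f1 = f1_score(precision=0.90, recall=1.0)
print(round(f1, 2))  # 0.95, matching the reported F1-score
```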

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | GigaScience | 172 | Top 0.1% | 22.5%
2 | Database | 51 | Top 0.1% | 10.1%
3 | BMC Bioinformatics | 383 | Top 2% | 6.4%
4 | Bioinformatics | 1061 | Top 4% | 6.3%
5 | Briefings in Bioinformatics | 326 | Top 1% | 4.3%
6 | PLOS ONE | 4510 | Top 34% | 4.3%
(50% of probability mass above)
7 | Nucleic Acids Research | 1128 | Top 5% | 4.0%
8 | Scientific Data | 174 | Top 0.5% | 3.6%
9 | Computers in Biology and Medicine | 120 | Top 1% | 2.6%
10 | Journal of the American Medical Informatics Association | 61 | Top 1% | 2.4%
11 | Scientific Reports | 3102 | Top 50% | 2.1%
12 | Computational and Structural Biotechnology Journal | 216 | Top 3% | 2.1%
13 | Gigabyte | 60 | Top 0.7% | 1.7%
14 | Journal of Medical Internet Research | 85 | Top 3% | 1.7%
15 | PeerJ | 261 | Top 7% | 1.7%
16 | Research Synthesis Methods | 20 | Top 0.1% | 1.7%
17 | BioData Mining | 15 | Top 0.5% | 1.2%
18 | Bioinformatics Advances | 184 | Top 4% | 0.9%
19 | F1000Research | 79 | Top 5% | 0.7%
20 | Genomics, Proteomics & Bioinformatics | 171 | Top 6% | 0.7%
21 | Proceedings of the National Academy of Sciences | 2130 | Top 44% | 0.7%
22 | JAMIA Open | 37 | Top 2% | 0.7%
23 | JMIRx Med | 31 | Top 2% | 0.7%
24 | PLOS Digital Health | 91 | Top 3% | 0.7%
25 | Bioengineering | 24 | Top 2% | 0.7%
26 | Journal of Translational Medicine | 46 | Top 3% | 0.6%
27 | Journal of Public Health | 23 | Top 1% | 0.6%
28 | Vaccines | 196 | Top 3% | 0.6%
29 | BMC Biology | 248 | Top 6% | 0.6%
30 | PLOS Computational Biology | 1633 | Top 27% | 0.6%
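The "top 6 journals account for 50% of the predicted probability mass" claim can be reproduced by walking the ranked probabilities until their running sum reaches 50%. A minimal sketch, using the probabilities listed above:

```python
# Predicted probabilities (in %), in rank order, copied from the table.
probs = [22.5, 10.1, 6.4, 6.3, 4.3, 4.3, 4.0, 3.6, 2.6, 2.4,
         2.1, 2.1, 1.7, 1.7, 1.7, 1.7, 1.2, 0.9, 0.7, 0.7,
         0.7, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6]

# Find the smallest rank whose cumulative probability reaches 50%.
cumulative = 0.0
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        break

print(rank, round(cumulative, 1))  # 6 journals cover 53.9% of the mass
```

Note that the list is truncated at rank 30, so the probabilities shown sum to well under 100%; the remaining mass is spread over journals below the cutoff.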