
cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature

Campbell, J.; Lain, A. D.; Simpson, T. I.

2026-05-19 bioinformatics
10.64898/2026.05.16.725623 bioRxiv

cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It uses programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain text files, standardising them for downstream biomedical text mining. During retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of retrieved full texts, we used ScispaCy to compute the similarity of paired documents, comparing 44,264 open-access PubMed Central files with the corresponding files retrieved by cadmus; the average cosine similarity was 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts relative to abstracts, giving better access to the depth of the available biomedical information.

Availability and implementation: cadmus is freely available for non-commercial research at https://github.com/biomedicalinformaticsgroup/cadmus and is released under the MIT License.
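The fidelity check described in the abstract (cosine similarity between paired documents, averaged over all pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract states that ScispaCy was used, whereas the sketch below substitutes plain bag-of-words count vectors so that it runs with the standard library alone. The function names `token_counts`, `cosine_similarity`, and `mean_pairwise_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def token_counts(text):
    """Lowercased word counts; a crude stand-in for ScispaCy document vectors."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    vec_a, vec_b = token_counts(text_a), token_counts(text_b)
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(vec_a[tok] * vec_b[tok] for tok in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(pairs):
    """Average similarity over (reference_text, retrieved_text) pairs."""
    scores = [cosine_similarity(ref, got) for ref, got in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this stand-in, a retrieved file whose text matches its reference scores close to 1 and one with no shared vocabulary scores 0. Embedding-based similarity, as in the paper, would also credit related but non-identical wording.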

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Bioinformatics; 1061 papers in training set; Top 0.8%; 28.3%
2. Nucleic Acids Research; 1128 papers in training set; Top 3%; 6.5%
3. BMC Bioinformatics; 383 papers in training set; Top 1%; 6.4%
4. Nature Communications; 4913 papers in training set; Top 32%; 5.0%
5. GigaScience; 172 papers in training set; Top 0.2%; 5.0%
50% of probability mass above
6. Genome Biology; 555 papers in training set; Top 1%; 5.0%
7. Database; 51 papers in training set; Top 0.1%; 4.4%
8. Nature Methods; 336 papers in training set; Top 2%; 4.4%
9. Genome Medicine; 154 papers in training set; Top 2%; 4.4%
10. Bioinformatics Advances; 184 papers in training set; Top 2%; 2.4%
11. PLOS ONE; 4510 papers in training set; Top 49%; 1.9%
12. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 3%; 1.7%
13. Nature Biotechnology; 147 papers in training set; Top 5%; 1.7%
14. Journal of the American Medical Informatics Association; 61 papers in training set; Top 1%; 1.5%
15. Scientific Data; 174 papers in training set; Top 1%; 1.3%
16. Scientific Reports; 3102 papers in training set; Top 68%; 1.0%
17. The American Journal of Human Genetics; 206 papers in training set; Top 3%; 0.9%
18. NAR Genomics and Bioinformatics; 214 papers in training set; Top 3%; 0.8%
19. BioData Mining; 15 papers in training set; Top 0.8%; 0.8%
20. Briefings in Bioinformatics; 326 papers in training set; Top 6%; 0.8%
21. Proceedings of the National Academy of Sciences; 2130 papers in training set; Top 45%; 0.7%
22. European Journal of Human Genetics; 49 papers in training set; Top 1%; 0.7%
23. Advanced Science; 249 papers in training set; Top 21%; 0.7%
24. Science; 429 papers in training set; Top 22%; 0.5%
25. PLOS Computational Biology; 1633 papers in training set; Top 28%; 0.5%
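The 50% cutoff stated for the list above can be checked by accumulating the listed probabilities in rank order. The sketch below does this for the first ten entries, with values copied from the list. The helper name `cutoff_rank` is hypothetical and is not part of any tool mentioned on this page.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1 to 10, copied from the list.
probabilities = [28.3, 6.5, 6.4, 5.0, 5.0, 5.0, 4.4, 4.4, 4.4, 2.4]


def cutoff_rank(probs, threshold=50.0):
    """Return (rank, cumulative mass) at the first rank reaching the threshold."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank, total
    return None  # the threshold is never reached by the given entries


rank, total = cutoff_rank(probabilities)
```

The running totals are 28.3, 34.8, 41.2, 46.2, and 51.2, so the threshold is first crossed at rank 5, in line with the statement that the top 5 journals account for 50% of the predicted probability mass.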