Automated Calculation of the Disruption Index: A Reproducible Computational Workflow for Large-Scale Bibliometric Analyses

Braga Apolinario, A.; Vieira, K. V.; Costa, A. K. M. M.; Freitas, L. C.; Pinheiro, I. S.; Vitral, R. W. F.; Campos, M. J. d. S.

2026-02-16 bioinformatics
10.64898/2026.02.12.705484 bioRxiv
Bibliometric analyses have become essential for understanding scientific production and innovation dynamics; however, large-scale applications remain limited by challenges related to data extraction, preprocessing, citation network reconstruction, and reproducibility, particularly when using PubMed-indexed records. This study presents a fully automated and reproducible computational workflow for large-scale bibliometric analyses based on the Disruption Index (DI). The pipeline enables systematic retrieval of PubMed data, standardized metadata processing, construction of citation networks, and calculation of DI values within a fixed post-publication citation window. Implemented in Python, the workflow integrates automated querying, XML parsing, data consolidation, and network-based citation classification, allowing scalable and transparent analyses that are infeasible through manual approaches. In a demonstrative application focused on orthodontic literature, the pipeline processed more than 67,000 articles and reconstructed over 300,000 citation relationships, resulting in a final analytical sample of 3,234 articles with indexed references and citations. The automated framework ensures methodological transparency, facilitates replication, and substantially reduces the time and technical barriers associated with advanced bibliometric studies. By providing an open and extensible solution for calculating the Disruption Index at scale, this workflow supports robust assessments of scientific innovation and consolidation and can be readily adapted to other biomedical research domains indexed in PubMed.
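
The citation classification step behind the Disruption Index can be illustrated with a minimal sketch. It assumes the commonly used definition DI = (n_i - n_j) / (n_i + n_j + n_k), where n_i counts later papers citing only the focal article, n_j counts those citing both the focal article and at least one of its references, and n_k counts those citing only its references. The abstract does not state which DI variant or window length the pipeline uses, so the function name `disruption_index`, the `cites` and `years` structures, and the `window` parameter are all hypothetical and are not taken from the authors' code.

```python
from typing import Dict, Optional, Set


def disruption_index(
    focal: str,
    cites: Dict[str, Set[str]],
    years: Optional[Dict[str, int]] = None,
    window: Optional[int] = None,
) -> Optional[float]:
    """Illustrative DI calculation (assumed definition, not the authors' code).

    cites  maps each paper ID to the set of paper IDs it cites.
    years  maps paper IDs to publication years (only needed with a window).
    window is an assumed citation window in years after the focal paper.
    """
    refs = cites.get(focal, set())
    n_i = n_j = n_k = 0

    for paper, cited in cites.items():
        if paper == focal:
            continue
        # Optionally keep only papers published inside the citation window.
        if window is not None and years is not None:
            y0, y = years.get(focal), years.get(paper)
            if y0 is None or y is None or not (y0 <= y <= y0 + window):
                continue

        cites_focal = focal in cited
        cites_refs = bool(cited & refs)

        if cites_focal and not cites_refs:
            n_i += 1  # cites the focal paper only
        elif cites_focal and cites_refs:
            n_j += 1  # cites the focal paper and its references
        elif cites_refs:
            n_k += 1  # cites the references but not the focal paper

    total = n_i + n_j + n_k
    # DI is undefined when no paper cites the focal article or its references.
    return (n_i - n_j) / total if total else None


# Toy citation network: focal paper "F" cites "R1" and "R2".
cites = {
    "F": {"R1", "R2"},
    "A": {"F"},
    "B": {"F", "R1"},
    "C": {"R2"},
    "D": {"F"},
}
years = {"F": 2010, "A": 2012, "B": 2013, "C": 2014, "D": 2020}

print(disruption_index("F", cites))                          # 0.25
print(disruption_index("F", cites, years=years, window=5))   # 0.0
```

In the toy network, restricting the count to a five-year window drops paper "D" (published ten years later), which moves the index from 0.25 to 0.0. This shows why the window length needs to be fixed and reported for results to be comparable.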

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics | 383 papers in training set | Top 0.6% | 12.6%
2. Bioinformatics | 1061 papers in training set | Top 2% | 12.6%
3. GigaScience | 172 papers in training set | Top 0.1% | 9.2%
4. PLOS ONE | 4510 papers in training set | Top 25% | 6.9%
5. Nucleic Acids Research | 1128 papers in training set | Top 4% | 4.9%
6. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.6% | 4.3%
(50% of probability mass above)
7. Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 2% | 4.0%
8. Scientific Reports | 3102 papers in training set | Top 36% | 3.6%
9. Bioinformatics Advances | 184 papers in training set | Top 1% | 3.6%
10. Nature Communications | 4913 papers in training set | Top 42% | 3.1%
11. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 3% | 2.1%
12. PeerJ | 261 papers in training set | Top 6% | 1.9%
13. Database | 51 papers in training set | Top 0.3% | 1.8%
14. Research Synthesis Methods | 20 papers in training set | Top 0.1% | 1.7%
15. Scientific Data | 174 papers in training set | Top 1% | 1.7%
16. Advanced Science | 249 papers in training set | Top 13% | 1.3%
17. SoftwareX | 15 papers in training set | Top 0.2% | 1.2%
18. eLife | 5422 papers in training set | Top 51% | 1.0%
19. Computers in Biology and Medicine | 120 papers in training set | Top 3% | 1.0%
20. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.9%
21. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 41% | 0.9%
22. Genome Biology | 555 papers in training set | Top 6% | 0.9%
23. BioData Mining | 15 papers in training set | Top 0.7% | 0.9%
24. Bioengineering | 24 papers in training set | Top 1% | 0.8%
25. Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.8%
26. IEEE Access | 31 papers in training set | Top 1% | 0.7%
27. Cell Systems | 167 papers in training set | Top 13% | 0.7%
28. Journal of Proteome Research | 215 papers in training set | Top 2% | 0.7%
29. Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.9% | 0.6%
30. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 1% | 0.5%