
nleval: A Python Toolkit for Generating Benchmarking Datasets for Machine Learning with Biological Networks

Liu, R.; Krishnan, A.

2023-01-12 bioinformatics
10.1101/2023.01.10.523485 bioRxiv

Over the past decades, network biology has been a major driver of computational methods developed to better understand the functional roles of each gene in the human genome in their cellular context. Following the application of traditional semi-supervised and supervised machine learning (ML) techniques, the next wave of advances in network biology will come from leveraging graph neural networks (GNNs). However, a systematic and comprehensive benchmarking resource for testing new GNN-based approaches, one that spans a diverse selection of biomedical networks and gene classification tasks, is lacking. Here, we present the Open Biomedical Network Benchmark (OBNB), a collection of benchmarking datasets derived using networks from 15 sources and tasks that include predicting genes associated with a wide range of functions, traits, and diseases. The accompanying Python package, obnb, contains reusable modules that enable researchers to download source data from public databases or archived versions and set up ML-ready datasets that are compatible with popular GNN frameworks such as PyG and DGL. Our work lays the foundation for novel GNN applications in network biology. obnb will also help network biologists easily set up custom benchmarking datasets for answering new questions of interest and collaboratively engage with graph ML practitioners to enhance our understanding of the human genome. OBNB is released under the MIT license and is freely available on GitHub: https://github.com/krishnanlab/obnb

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set; Top 1%; 22.7%
2. Bioinformatics Advances: 184 papers in training set; Top 0.1%; 22.7%
3. BMC Bioinformatics: 383 papers in training set; Top 1%; 8.3%

50% of probability mass above

4. NAR Genomics and Bioinformatics: 214 papers in training set; Top 0.2%; 6.4%
5. Frontiers in Genetics: 197 papers in training set; Top 1%; 4.3%
6. PLOS Computational Biology: 1633 papers in training set; Top 11%; 3.1%
7. Patterns: 70 papers in training set; Top 0.5%; 2.4%
8. GigaScience: 172 papers in training set; Top 0.9%; 2.1%
9. Nucleic Acids Research: 1128 papers in training set; Top 8%; 2.1%
10. Genome Research: 409 papers in training set; Top 2%; 1.7%
11. PLOS ONE: 4510 papers in training set; Top 53%; 1.7%
12. Briefings in Bioinformatics: 326 papers in training set; Top 5%; 1.3%
13. Scientific Reports: 3102 papers in training set; Top 68%; 1.1%
14. Database: 51 papers in training set; Top 0.7%; 0.9%
15. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set; Top 0.5%; 0.9%
16. Genome Medicine: 154 papers in training set; Top 7%; 0.8%
17. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 9%; 0.8%
18. Journal of Computational Biology: 37 papers in training set; Top 0.7%; 0.6%
19. Genome Biology: 555 papers in training set; Top 8%; 0.6%
20. Nature Communications: 4913 papers in training set; Top 65%; 0.6%
21. iScience: 1063 papers in training set; Top 40%; 0.5%
22. BMC Genomics: 328 papers in training set; Top 8%; 0.5%
23. BioData Mining: 15 papers in training set; Top 1%; 0.5%
24. Cell Systems: 167 papers in training set; Top 14%; 0.5%
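
The 50% cutoff stated above the list can be checked by accumulating the listed probabilities in rank order. The following is a minimal sketch of that arithmetic only. The function name `ranks_to_reach` and the variable names are hypothetical, and nothing here reflects how the page actually computes its predictions. Only the first five probabilities from the list are copied in.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-5, copied from the list above.
probabilities = [22.7, 22.7, 8.3, 6.4, 4.3]


def ranks_to_reach(probs, threshold=50.0):
    """Return how many top-ranked entries are needed for the running
    total to reach `threshold`, or None if it is never reached.

    Hypothetical helper written for this illustration."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank
    return None


cutoff = ranks_to_reach(probabilities)
print(cutoff)  # 3
```

The running total is 22.7% after one journal, 45.4% after two, and 53.7% after three, so the third entry is the first to carry the total past 50%.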