
Bi-level diversity optimisation for representative protein panel selection

Ou, Z.; James, K.; Charnock, S.; Wipat, A.

2026-04-21 bioinformatics
10.64898/2026.04.17.719243 bioRxiv

Selecting representative subsets from large protein sequence datasets is a common challenge in enzyme discovery and related tasks under limited screening capacity. In practice, candidate panels are often constructed using clustering-based redundancy reduction or manual selection guided by phylogenetic or similarity-network analyses, which do not directly optimise subset diversity and require threshold tuning or expert interpretation. Here, we present a bi-level diversity-optimisation framework for representative protein panel selection implemented using a local search heuristic that iteratively updates panel composition to improve diversity. The method formulates panel design as a combinatorial optimisation problem over pairwise distance matrices, combining a MaxMin objective to enforce minimum separation between selected sequences with a MaxSum objective to increase global dispersion. This formulation enables the direct construction of fixed-cardinality panels while remaining independent of the similarity representation used to compute pairwise distances. Benchmarking across four Pfam families shows that the bi-level formulation consistently reduces redundancy among selected sequences, lowering maximum pairwise identity by 43-46% relative to the previous MaxSum-based formulation, while maintaining comparable or improved EC-label coverage. The framework can incorporate sequence- or structure-based similarity measures, providing a flexible strategy for constructing diverse representative panels across homologous protein families.
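The abstract's combination of a MaxMin and a MaxSum objective inside a local search can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "bi-level" can be read as a lexicographic comparison (first maximise the smallest pairwise distance in the panel, then break ties by the sum of pairwise distances) and that the local search is a simple one-for-one swap. The function names `panel_score` and `select_panel`, and the parameters `seed` and `max_iter`, are hypothetical.

```python
import itertools
import random


def panel_score(dist, panel):
    """Return (min pairwise distance, sum of pairwise distances) for a panel."""
    pairs = [dist[i][j] for i, j in itertools.combinations(panel, 2)]
    return (min(pairs), sum(pairs))


def select_panel(dist, k, seed=0, max_iter=1000):
    """Swap-based local search over a square pairwise distance matrix.

    Assumption: the two objectives are compared lexicographically,
    MaxMin first and MaxSum as the tie-breaker. Python tuple comparison
    gives exactly that ordering.
    """
    n = len(dist)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of sequences")
    rng = random.Random(seed)
    panel = rng.sample(range(n), k)  # random fixed-cardinality start
    best = panel_score(dist, panel)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in panel:
                    continue
                # Try replacing one panel member with one outside candidate.
                trial = panel[:pos] + [cand] + panel[pos + 1:]
                score = panel_score(dist, trial)
                if score > best:
                    panel, best = trial, score
                    improved = True
        if not improved:
            break  # no single swap improves the panel: local optimum
    return sorted(panel), best


if __name__ == "__main__":
    # Toy data: six points on a line, distance = absolute difference.
    points = [0, 1, 2, 10, 11, 20]
    toy = [[abs(a - b) for b in points] for a in points]
    print(select_panel(toy, 3))
```

The sketch only needs a distance matrix, which mirrors the abstract's point that the formulation is independent of how pairwise distances were computed. Like any local search, it returns a local optimum that may depend on the starting panel.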

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set; Top 2%; 14.7%
2. PLOS Computational Biology: 1633 papers in training set; Top 4%; 8.4%
3. Journal of Chemical Information and Modeling: 207 papers in training set; Top 0.7%; 8.4%
4. Nature Communications: 4913 papers in training set; Top 26%; 6.8%
5. BMC Bioinformatics: 383 papers in training set; Top 2%; 4.9%
6. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 1%; 4.3%
7. Cell Systems: 167 papers in training set; Top 3%; 4.3%
50% of probability mass above
8. Briefings in Bioinformatics: 326 papers in training set; Top 2%; 4.0%
9. Bioinformatics Advances: 184 papers in training set; Top 1%; 3.7%
10. Journal of Molecular Biology: 217 papers in training set; Top 0.6%; 3.6%
11. Journal of Cheminformatics: 25 papers in training set; Top 0.1%; 3.6%
12. NAR Genomics and Bioinformatics: 214 papers in training set; Top 0.9%; 3.1%
13. PLOS ONE: 4510 papers in training set; Top 45%; 2.6%
14. Scientific Reports: 3102 papers in training set; Top 53%; 1.9%
15. Cell Reports Methods: 141 papers in training set; Top 2%; 1.8%
16. Nucleic Acids Research: 1128 papers in training set; Top 11%; 1.7%
17. Protein Science: 221 papers in training set; Top 1%; 1.5%
18. International Journal of Molecular Sciences: 453 papers in training set; Top 11%; 1.2%
19. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 39%; 1.1%
20. Journal of Proteome Research: 215 papers in training set; Top 2%; 0.9%
21. Proteins: Structure, Function, and Bioinformatics: 82 papers in training set; Top 0.9%; 0.8%
22. Communications Biology: 886 papers in training set; Top 21%; 0.8%
23. Chemical Science: 71 papers in training set; Top 2%; 0.8%
24. Genomics, Proteomics & Bioinformatics: 171 papers in training set; Top 6%; 0.7%
25. Frontiers in Molecular Biosciences: 100 papers in training set; Top 5%; 0.7%
26. Nature Biotechnology: 147 papers in training set; Top 8%; 0.7%
27. Nature Methods: 336 papers in training set; Top 7%; 0.6%