PDBe-SIFTS: an open-source tool for Structure Integration with Function, Taxonomy, and Sequences, featuring improved alignment, scoring scheme, and accelerated search

Bellaiche, A.; Choudhary, P.; Nair, S.; Harrus, D.; Yu, C. W.-H.; Tanweer, S. A.; Evans, G. L.; Lo, S. W.; Martin, M.; Fleming, J. R.; Velankar, S.

2026-05-04 bioinformatics
10.64898/2026.04.30.721839 bioRxiv

Structure Integration with Function, Taxonomy and Sequences (SIFTS) provides residue-level mappings between UniProt Knowledgebase sequences and Protein Data Bank structures and has historically been generated through internal Protein Data Bank in Europe (PDBe) pipelines. Here, PDBe-SIFTS is presented as a fully open-source, locally deployable implementation of this mapping framework. The pipeline combines fast, scalable sequence search using MMseqs2, an improved bounded scoring scheme for ranking candidate mappings, and residue-level mapping refinement based on backbone connectivity. PDBe-SIFTS is distributed as a Python package with command-line tools for 1) building a sequence search database, 2) identifying the best sequence-structure match, 3) one-to-one mapping at the residue level, and 4) generating SIFTS annotations in PDBx/mmCIF format. Benchmarking on the complete Protein Data Bank archive showed that MMseqs2 reduced archive-scale UniProtKB searches from hours with BLASTP to minutes, approximately 22-36 times faster, while curated mappings were recovered at top rank in 93.1% of cases. The remaining discrepancies mainly involved biologically ambiguous cases such as highly conserved proteins, chimeric constructs, or closely related orthologs. These results show that PDBe-SIFTS enables fast mapping, improving structural coherence in residue-level alignments while delivering the most up-to-date and accurate mappings, comparable to expert curation.

Tool: https://github.com/PDBeurope/SIFTS
Quick start notebook with example: https://github.com/PDBeurope/SIFTS/tree/master/notebooks

Broader audience statement

Matching protein sequences to their three-dimensional structures, and mapping annotations across both, is essential for understanding protein function, interactions, and molecular mechanisms. This integrated view enables richer interpretation of biological data and underpins advances in drug discovery, disease research, and protein engineering.
PDBe-SIFTS provides an open and functional framework for structure-sequence mapping, allowing researchers and databases to run, inspect, and extend these mappings locally, while benefiting from faster searches, transparent scoring, and structurally informed residue-level alignments.

[Graphical abstract: Figure 1]
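
The one-to-one residue-level mapping mentioned in the abstract can be pictured with a minimal sketch: given a gapped pairwise alignment between a UniProtKB sequence and the sequence of a PDB chain, pair up the 1-based residue positions that are aligned to each other. This is a generic illustration only and is not taken from the PDBe-SIFTS code base. The function name `residue_mapping` and the toy sequences are hypothetical, and the backbone-connectivity refinement described in the abstract is not modelled here.

```python
def residue_mapping(aligned_uniprot, aligned_pdb, gap="-"):
    """Pair 1-based residue positions from two gapped, equal-length alignment rows.

    Hypothetical helper for illustration; not part of PDBe-SIFTS.
    """
    if len(aligned_uniprot) != len(aligned_pdb):
        raise ValueError("aligned sequences must have the same length")

    mapping = []
    uniprot_pos = 0  # running 1-based index into the ungapped UniProt sequence
    pdb_pos = 0      # running 1-based index into the ungapped PDB chain sequence

    for u_res, p_res in zip(aligned_uniprot, aligned_pdb):
        if u_res != gap:
            uniprot_pos += 1
        if p_res != gap:
            pdb_pos += 1
        # Only columns with a residue on both rows yield a one-to-one pair.
        if u_res != gap and p_res != gap:
            mapping.append((uniprot_pos, pdb_pos))
    return mapping


# Toy alignment: the PDB chain lacks the first two residues and one internal residue.
pairs = residue_mapping("MKTAYIAK", "--TAY-AK")
print(pairs)
```

Each tuple is (UniProt position, PDB chain position). Columns that contain a gap on either row advance only one counter and produce no pair, which keeps the mapping strictly one-to-one.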

Matching journals

The top 5 journals together account for just over 50% (about 51.4%) of the predicted probability mass.

1. Computational and Structural Biotechnology Journal (216 papers in training set; Top 0.1%; 14.9%)
2. Journal of Chemical Information and Modeling (207 papers in training set; Top 0.4%; 14.5%)
3. Bioinformatics (1061 papers in training set; Top 3%; 10.6%)
4. Journal of Molecular Biology (217 papers in training set; Top 0.2%; 6.5%)
5. Protein Science (221 papers in training set; Top 0.2%; 4.9%)

50% of probability mass above

6. Journal of Cheminformatics (25 papers in training set; Top 0.1%; 3.6%)
7. Briefings in Bioinformatics (326 papers in training set; Top 2%; 3.6%)
8. Communications Biology (886 papers in training set; Top 5%; 2.1%)
9. PLOS Computational Biology (1633 papers in training set; Top 13%; 2.1%)
10. Nature Communications (4913 papers in training set; Top 51%; 1.7%)
11. Advanced Science (249 papers in training set; Top 11%; 1.7%)
12. Communications Chemistry (39 papers in training set; Top 0.3%; 1.7%)
13. Nucleic Acids Research (1128 papers in training set; Top 12%; 1.5%)
14. Bioinformatics Advances (184 papers in training set; Top 3%; 1.3%)
15. ACS Omega (90 papers in training set; Top 2%; 1.3%)
16. PLOS ONE (4510 papers in training set; Top 58%; 1.3%)
17. SoftwareX (15 papers in training set; Top 0.2%; 1.2%)
18. BMC Bioinformatics (383 papers in training set; Top 5%; 1.2%)
19. Cell Reports Methods (141 papers in training set; Top 4%; 1.0%)
20. Proteins: Structure, Function, and Bioinformatics (82 papers in training set; Top 0.7%; 1.0%)
21. Structure (175 papers in training set; Top 3%; 1.0%)
22. Journal of Structural Biology (58 papers in training set; Top 1%; 0.9%)
23. Patterns (70 papers in training set; Top 2%; 0.9%)
24. Journal of Proteome Research (215 papers in training set; Top 2%; 0.8%)
25. Database (51 papers in training set; Top 0.8%; 0.8%)
26. The Journal of Physical Chemistry Letters (58 papers in training set; Top 2%; 0.8%)
27. International Journal of Molecular Sciences (453 papers in training set; Top 15%; 0.8%)
28. Frontiers in Molecular Biosciences (100 papers in training set; Top 5%; 0.8%)
29. Genomics, Proteomics & Bioinformatics (171 papers in training set; Top 6%; 0.7%)
30. The Journal of Physical Chemistry B (158 papers in training set; Top 2%; 0.7%)
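
The 50% cutoff in the ranking above can be checked with a short cumulative sum. The sketch below is only an arithmetic check on the rounded percentages shown for the first seven journals; the helper name `journals_to_reach` is hypothetical, and the underlying unrounded model outputs are not available here.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-7, copied from the list above.
probabilities = [14.9, 14.5, 10.6, 6.5, 4.9, 3.6, 3.6]


def journals_to_reach(probs, threshold):
    """Return (count, cumulative_total) for the first rank at which the
    running sum of probs reaches threshold, or None if it never does."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count, total
    return None


count, total = journals_to_reach(probabilities, 50.0)
print(count, round(total, 1))
```

With these rounded values the running sum is 46.5% after four journals and 51.4% after five, which is why the cutoff line sits below rank 5.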