ProtSpace: Protein Universe in Your Browser
Senoner, T.; Vahidi, P.; Olenyi, T.; Senoner, F.; Sisman, G.; Kahl, E.; Rost, B.; Koludarov, I.
Abstract
Protein Language Models (pLMs) generate per-protein embeddings that encode functional, structural, and evolutionary information, yet the relationships captured in these representations remain difficult to explore systematically. ProtSpace (https://protspace.app) is a web application for interactive visualization of pLM embedding spaces, enabling hypothesis generation directly in the browser without installation. Unlike traditional network-based tools, which visualize only amino acid sequence similarity, ProtSpace explores embedding spaces and reveals relationships that sequence comparisons often miss. Users provide protein sequences or pre-computed embeddings through a Google Colab notebook or the Python CLI. The pipeline applies dimensionality reduction, retrieves 38 annotation types spanning UniProt, InterPro, NCBI Taxonomy, TED structural domains, and sequence-based predictors served via Biocentral, and produces a portable binary file for the browser-based viewer. WebGL-accelerated rendering supports interactive exploration of over 570,000 proteins. Distinctive features include per-point pie charts for multi-label annotations and integrated 3D structure viewing through AlphaFold2 predictions. All computation happens on the user's machine, ensuring data privacy. We demonstrate the utility of ProtSpace through a progressive zoom-in across biological scales: from global proteome organization of Swiss-Prot, through cross-species comparison revealing conserved and lineage-specific families, to functional hypothesis generation within the beta-lactamase superfamily. ProtSpace is freely available at https://protspace.app under the Apache 2.0 license.

Key Points
- ProtSpace is a free, open-source web application that visualizes protein language model (pLM) embeddings as interactive maps, scaling to 570,000 proteins entirely client-side.
- A zero-installation Google Colab notebook and a Python CLI prepare visualization-ready bundles from FASTA files, UniProt queries, or pre-computed HDF5 embeddings, automatically retrieving 38 annotation types from five sources (UniProt, InterPro, NCBI Taxonomy, TED structural domains, and Biocentral sequence predictors) alongside custom CSV metadata.
- Application examples demonstrate that embedding visualizations generate testable biological hypotheses at multiple scales, from proteome-wide organization through species-level comparison to family-level functional discovery, and that these hypotheses complement traditional sequence-based analyses.
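
To make the dimensionality-reduction step concrete, the following is a minimal, dependency-free sketch of projecting per-protein embeddings to 2D map coordinates. It is not ProtSpace's code: the abstract does not name the reduction method, so PCA (computed here by power iteration) is used only as a simple stand-in, and the names `center`, `top_component`, and `pca_2d` as well as the toy 4-dimensional "embeddings" are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean from every embedding."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def top_component(rows, iters=200, seed=0):
    """Estimate the leading principal axis of centered rows by power iteration."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        # Apply X^T X to v without building the d x d matrix.
        scores = [sum(a * b for a, b in zip(r, v)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    return v


def pca_2d(embeddings):
    """Return one (x, y) coordinate per embedding using the top two axes."""
    x = center(embeddings)
    columns = []
    for _ in range(2):
        v = top_component(x)
        scores = [sum(a * b for a, b in zip(r, v)) for r in x]
        columns.append(scores)
        # Deflate: remove the variance explained by this axis.
        x = [[a - s * b for a, b in zip(r, v)] for r, s in zip(x, scores)]
    return list(zip(columns[0], columns[1]))


# Toy data: two well-separated groups standing in for two protein families.
toy_embeddings = [
    [0.0, 0.1, 0.0, 0.2],
    [0.1, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.2, 0.1],
    [5.0, 5.1, 0.0, 0.1],
    [5.1, 5.0, 0.1, 0.0],
    [5.0, 4.9, 0.2, 0.2],
]
coords = pca_2d(toy_embeddings)
for point in coords:
    print(f"{point[0]:8.3f} {point[1]:8.3f}")
```

In this toy case the two groups land on opposite sides of the first axis, which is the kind of separation an embedding map is meant to expose. Real pLM embeddings have hundreds to thousands of dimensions, so a production pipeline would likely rely on an optimized numerical library and possibly a nonlinear method.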
Matching journals
The top 3 journals account for 50% of the predicted probability mass.