Database
◐ Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match Database's content profile, based on 51 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Heavner, B. D.; Wheeler, M. M.; Bengtsson, J. D.; Carvalho, C. M. B.; Cheung, W. A.; Conomos, M. P.; Delot, E. C.; DiTroia, S.; Ganesh, V. S.; Gogarten, S. M.; Grochowski, C. M.; Jhangiani, S. N.; King, C. H.; LeMaster, C.; Marvin, C. T.; Marwaha, S.; Miller, D. E.; O'Donnell-Luria, A.; Pais, L.; Patterson, K.; Qi, G.; Richardson, M.; Smail, C.; Stilp, A. M.; Tong, C. C.; Ungar, R. A.; Weisburd, B.; Bamshad, M. J.; Bernstein, J. A.; Eichler, E. E.; Gibbs, R. A.; Lupski, J. R.; May, S. J.; Montgomery, S. B.; Pastinen, T.; Posey, J.; Rehm, H. L.; Shojaie, A.; Talkowski, M. E.; Vilain, E.; Wei, C
Rare disease research and diagnosis rely on the integration of genomic and phenotypic data generated across diverse clinical sites; however, the absence of widely adopted standards for representing genomic data and associated metadata has limited data interoperability, reuse, and cross-study analysis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was established to investigate challenging rare disease cases and evaluate emerging multi-omic technologies for clinical translation. To support coordinated data integration across distributed research sites, we developed a common Consortium Data Model in partnership with domain experts to standardize the capture of participant-, family-, phenotype- and assay-level metadata, with a particular emphasis on using a modular architecture to support linking of multiple data versions from multiple omic technologies to a single individual and attribution of a genetic finding to the specific technology used for its initial discovery. Adoption of the GREGoR Data Model has enabled continued generation and public release of a harmonized, analysis-ready Consortium Dataset. The most recent release includes phenotypic, family and multi-omic data from 12,292 participants in 5,029 families. Other rare disease data-sharing efforts are beginning to adopt this data model, which will facilitate cross-consortium analyses and empower rare disease research. This work demonstrates that a collaborative, flexible, and scalable data model can enable large-scale rare disease research, facilitate cross-center data harmonization, and enable data interoperability.
Campbell, J.; Lain, A. D.; Simpson, T. I.
cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It utilises programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain text files, standardising them for downstream biomedical text mining. During the retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of retrieved full texts, we used ScispaCy to infer the similarity of paired documents from 44,264 open-access PubMed Central files and the corresponding files retrieved by cadmus, resulting in an average cosine similarity score of 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts over abstracts, resulting in better access to the depth of the biomedical information available. Availability and implementation: cadmus is freely available for non-commercial research at https://github.com/biomedicalinformaticsgroup/cadmus and is released under the MIT License.
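The fidelity check described above compares paired documents by cosine similarity. The following is a minimal sketch of that measure, assuming simple whitespace-tokenised word counts in place of the ScispaCy representations the abstract mentions; the function name and the tokenisation are illustrative and are not part of cadmus.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using raw word counts.

    Assumption: bag-of-words vectors stand in for the ScispaCy
    document vectors used in the abstract.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the tokens the two texts share.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference = "the gene is expressed in cardiac tissue"
    retrieved = "the gene is expressed in cardiac tissue samples"
    print(round(cosine_similarity(reference, retrieved), 3))
```

A score near 1 indicates that the retrieved text closely matches the reference copy; averaging the score over all document pairs gives a corpus-level figure of the kind reported.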
Nakagawa, S.; Yamamoto, A.
Cross-national alignment of branded food databases is essential for international nutritional epidemiology but lacks standardized methods. Existing approaches - including food ontologies, domain-specific fine-tuned language models, and manual expert mapping - either require substantial infrastructure or do not scale to thousands of items. We propose an unsupervised evaluation framework for large language model (LLM)-based food database alignment that requires no ground-truth labels. Using the Japan Branded Food Database (JBFD; 9,519 items, 71 mid-level categories) and USDA FoodData Central (448 categories) as a case study, we introduce two complementary metrics: weighted centroid distance (nutritional proximity between matched category pairs) and dominant category share (structural consistency of category-level assignments). We then conducted a systematic ablation study across eight input conditions (A-H), varying combinations of product name, nutrient profile, and semantic category label. Results showed that nutrient-only inputs yielded poor structural consistency despite low centroid distances, while semantic category labels achieved the highest dominant category share (89.3%) but introduced circularity due to their LLM-derived origin. Among circularity-free conditions, product name combined with minimal nutrient information (energy, protein, salt; condition E) achieved the best balance of centroid distance (0.471) and dominant category share (65.8%). Model comparison across Claude Haiku, Sonnet, and Opus confirmed that NO_MATCH rates were consistent across model sizes (12-14%), suggesting that prompt design contributes more to alignment quality than model scale. These findings provide practical guidance for input design in LLM-based food database alignment without ground-truth annotation.
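The abstract does not spell out how dominant category share is computed. The sketch below shows one plausible reading, assumed here for illustration only: for each source category, count the items assigned to its most frequent target category, then divide the sum of those counts by the total number of items. The function name and the toy category labels are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_category_share(assignments):
    """Share of items that fall in the modal target category of their source category.

    assignments: iterable of (source_category, target_category) pairs, one per item.
    Assumption: this item-weighted definition is a guess at the metric,
    not the authors' published formula.
    """
    targets_by_source = defaultdict(Counter)
    for source, target in assignments:
        targets_by_source[source][target] += 1
    total_items = sum(sum(c.values()) for c in targets_by_source.values())
    if total_items == 0:
        raise ValueError("no assignments given")
    # For each source category, keep only the count of its most common target.
    dominant_items = sum(c.most_common(1)[0][1] for c in targets_by_source.values())
    return dominant_items / total_items


if __name__ == "__main__":
    toy = (
        [("green tea", "Beverages")] * 3
        + [("green tea", "Snacks")]
        + [("miso soup", "Soups")] * 2
    )
    print(dominant_category_share(toy))
```

Under this reading, a value close to 1 means that items from the same source category are mapped consistently to a single target category.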
Kurozumi, A.; Otsuka, N.; Masamichi, I.; Kawakami, T.; Isagawa, T.; Kodera, S.; Takeda, N.
Background: Although advances in next-generation sequencing have accelerated the identification of genetic variants in cardiomyopathy, interpreting variants of uncertain significance (VUS) remains a clinical challenge. Evo 2 is a high-resolution genomic artificial intelligence model capable of predicting pathogenicity across large sequence contexts and enabling mechanistic interpretation; however, its application in cardiovascular genetics is limited. Here, we evaluated the utility of Evo 2 for assessing the pathogenicity and underlying mechanisms of cardiomyopathy-associated variants. Methods: We used Evo 2 to predict the pathogenicity of single-nucleotide variants in cardiomyopathy-related genes listed on ClinVar. We assessed the ability of the model to identify characteristic structural features in both coding and noncoding regions using internal representations such as embeddings, and to infer the molecular mechanisms of variants within these regions. Results: Evo 2 demonstrated high predictive accuracy for pathogenicity, achieving an AUROC of 0.983 and an AUPRC of 0.915. Notably, sparse autoencoders (SAEs) trained on embeddings identified features corresponding to higher-order structural features, including coiled-coil and actin-binding domains characteristic of cardiomyopathy-related proteins, and accurately detected mutations known to disrupt these domains. The model recognized the binding motif of the cardiac-enriched transcription factor TBX5 with SAEs and accurately predicted a single-nucleotide polymorphism affecting TBX5 binding affinity after supervised fine-tuning. Conclusions: Evo 2 demonstrated strong performance for both predicting pathogenicity and extracting biological features of cardiomyopathy-associated variants. It may represent a powerful emerging tool for evaluating VUS in cardiovascular medicine.
Bansal, N.; Parsodkar, A. P.; Pathak, A.; Narayanan, M.
Identifying causal relationships and distinguishing them from associations is a central scientific endeavor with many applications; knowing causal links between genes and diseases, for instance, can focus drug discovery on curing diseases beyond just symptom management. Despite several studies on automatically extracting relations between entities from large biomedical literature corpora like PubMed, only a few studies extract causal relations from abstracts and even fewer summarize corpus-level evidence for causal links. Recently, Large Language Models (LLMs) have been increasingly deployed to summarize biomedical information and extract relations; however, there is a distinct lack of explicit benchmarking comparing these generalized LLM-based methods against specialized, domain-aware frameworks for corpus-wide causal inference. In this work, we develop a method to infer Corpus-Wide Causal Score (CWCS) of a gene-disease (G-D) pair by integrating two pieces of evidence: (i) network-based causal signals in a prior gene regulatory network, quantified as a CWCS-Net score using an existing multilayer network centrality algorithm; and (ii) corpus-wide literature evidence, quantified as a CWCS-TD (TD for Truth Discovery) score using a newly-developed TD algorithm. Our CWCS-TD (scoring) algorithm jointly and iteratively estimates causal scores for multiple G-D pairs while modeling the reliability of PubMed abstracts co-mentioning them; and represents an advance in the field of TD algorithms due to its incorporation of bibliometric features of publications to address the challenge of sparsity of abstracts that assert a G-D causal relation. 
Using OMIM as an external expert-curated reference to evaluate classifications of G-D pairs as causal or not, our CWCS method achieved a causal class F1 score of 0.600 across ten diseases, outperforming both LLMs, GPT-4o and MMed-Llama 3 (this performance trend also persists when using area under the precision-recall curve as the evaluation metric). Both LLMs exhibit high recall accompanied by comparatively low precision, resulting in lower causal class F1 scores (0.505 for GPT-4o and 0.522 for MMed-Llama 3) due to a large number of false positive predictions. Taken together, these evaluations and other ablation studies show the promise of our carefully designed algorithm in collating and integrating evidence of biomedical causal relations from both network- and literature-based sources, thereby supporting its broader applicability.
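The comparison above turns on how F1 penalises false positives. A minimal sketch of the causal class F1 computation follows; the confusion counts are hypothetical and chosen only to illustrate the effect, since the abstract reports F1 scores but not the underlying counts.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 for the positive (here: causal) class from confusion counts."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    balanced = f1_score(true_pos=60, false_pos=40, false_neg=40)
    high_recall_low_precision = f1_score(true_pos=90, false_pos=150, false_neg=10)
    print(round(balanced, 3), round(high_recall_low_precision, 3))
```

In this toy case the second classifier recovers more true causal pairs (recall 0.9) but still scores lower on F1, because its many false positives pull precision down to 0.375.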
Upadhayaya, R.; Pradhan, M. M.; Metzger, V. T.; Malec, S. A.
Background: Variable selection for causal inference from observational biomedical data is challenging, as overlooking confounders or conditioning on colliders leads to biased estimates. While vast causal knowledge exists in biomedical literature, manually extracting this information for principled variable selection is impractical at scale. Methods: We developed CausalKnowledgeTrace, a Python-based computational framework with a Django web interface that systematically leverages structured causal knowledge from the Semantic MEDLINE Database (SemMedDB) to inform variable selection in causal studies. The system implements a six-stage analysis pipeline using NetworkX for graph operations: graph parsing, basic analysis, comprehensive cycle detection, systematic generic node removal, post-removal analysis, and formal causal inference with bias detection. Results: Analysis of the hypertension-Alzheimer's relationship across three degree neighborhoods (1-3) demonstrated systematic scaling of causal complexity: 361-866 variables and 429-1,442 relationships, with graph densities of 0.0033-0.0019. The analysis revealed complex cyclic structures with 54-606 baseline cycles across degree levels. Processing times ranged from 0.3 to 1.0 seconds across all three degrees, demonstrating computational efficiency for complex biomedical networks. Key confounders identified across all degrees included inflammation, diabetes, insulin resistance, obesity, and ischemia. In the third-degree graph, the pipeline structurally identified 39 confounders, 11 mediators, and 3 colliders. The key identified confounders and mediators, including obesity, oxidative stress, ischemia, and vascular diseases, were all found to have strong supporting evidence in established epidemiological and pathophysiological literature.
Conclusions: CausalKnowledgeTrace provides a scalable, evidence-based approach to causal graph construction that systematically identifies confounders and bias structures often missed by conventional approaches. The Python-Django architecture enables both standalone analysis and integration into larger computational workflows, representing a significant advance in computational support for causal inference in biomedical research. Statement of Significance. Problem or Issue: Selecting proper confounders and variables for causal inference from observational biomedical datasets is challenging and often biased by limited expertise or manual review. What is Already Known: Existing approaches rely on domain experts, statistical variable screening, or manual construction of causal graphs, but these often overlook literature-documented confounders and complex biases. What this Paper Adds: This paper introduces an automated, literature-based framework for synthesizing and validating causal graphs, identifying critical variables and complex bias structures, such as M-bias and butterfly bias, with full evidentiary traceability. Who would benefit from the new knowledge in this paper? Epidemiologists, biomedical researchers, informaticians, and clinical investigators seeking reliable and transparent causal modeling for observational studies.
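Two ideas in this abstract lend themselves to a short sketch: the density of a directed graph, and the structural roles of variables relative to an exposure and an outcome. The code below is a simplified illustration in plain Python rather than the NetworkX-based pipeline the authors describe. The role classifier only inspects direct neighbours, and the toy edges are illustrative and are not drawn from SemMedDB.

```python
from collections import defaultdict


def directed_density(n_nodes: int, n_edges: int) -> float:
    """Density of a directed graph without self-loops: m / (n * (n - 1))."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))


def classify_roles(edges, exposure, outcome):
    """Label variables by their direct links to the exposure and the outcome.

    Simplifying assumption: only one-step relationships are inspected,
    so longer causal paths are ignored.
    """
    children = defaultdict(set)
    for cause, effect in edges:
        children[cause].add(effect)
    nodes = {n for edge in edges for n in edge} - {exposure, outcome}
    roles = {}
    for node in sorted(nodes):
        if exposure in children[node] and outcome in children[node]:
            roles[node] = "confounder"  # common cause of exposure and outcome
        elif node in children[exposure] and outcome in children[node]:
            roles[node] = "mediator"  # exposure -> node -> outcome
        elif node in children[exposure] and node in children[outcome]:
            roles[node] = "collider"  # common effect of exposure and outcome
    return roles


if __name__ == "__main__":
    # Densities for the reported graph sizes (first- and third-degree neighborhoods).
    print(round(directed_density(361, 429), 4), round(directed_density(866, 1442), 4))
    toy_edges = [
        ("obesity", "hypertension"), ("obesity", "alzheimers"),
        ("hypertension", "ischemia"), ("ischemia", "alzheimers"),
        ("hypertension", "hospitalisation"), ("alzheimers", "hospitalisation"),
    ]
    print(classify_roles(toy_edges, "hypertension", "alzheimers"))
```

With the node and edge counts quoted in the Results, the directed-density formula reproduces the reported densities of 0.0033 and 0.0019, which suggests the graphs were treated as directed.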
Bhagwat, N.; Wang, M.; Dugre, M.; Pfarr, J.-K.; Dai, A.; Urchs, S.; McPherson, B.; Gau, R.; van Heese, E. M.; d'Angremont, E.; Laansma, M. A.; Prasad, S.; Sanz-Robinson, J.; Torabi, M.; Jahanpour, A.; Danyluik, M.; Joubert, A.; Macdonald, A.; Waller, L.; Stewart, A.; Joulot, M.; Dickie, E.; Devenyi, G. A.; Bouix, S.; Bollmann, S.; Jahanshad, N.; Thompson, P. M.; Burgos, N.; Chakravarty, M. M.; Halchenko, Y. O.; van der Werf, Y. D.; Poline, J.-B.
Neuroimaging data management and processing are tedious and error-prone, prompting reproducibility concerns. Globally, studies with heterogeneous infrastructure and governance policies lead to eclectic data processing and sharing, necessitating standardization of data workflows to ensure reusability and comparability of multi-centric datasets. The Nipoppy neuroinformatics framework facilitates such standardization by combining specification, protocol, and software to manage study-level data workflows. With its adoption, researchers can share standardized, derived datasets enabling efficient, reproducible, and inclusive research.
Ramola, R.; De Paolis Klauza, M. C.; Piovesan, D.; Peng, Y.; Joshi, P.; Mehdiabadi, M.; Quaglia, F.; Pancsa, R.; Chemes, L. B.; Ahmadi, M.; Ahn, H.; Altenhoff, A. M.; Asgari, E.; Aspromonte, M. C.; Atalay, V.; Babbi, G.; Baldazzi, D.; Barot, M. M.; Ben-Hur, A.; Benso, A.; Berenberg, D.; Bjorne, J.; Boecker, F.; Boldi, P.; Bonello, J.; Bordin, N.; Borole, P.; Ebrahimpour Boroojeny, A.; Cao, R.; Di Carlo, S.; Casadio, R.; Casiraghi, E.; Chang, J.-M.; Chen, C.; Chen, T.-M.; Cheng, J.; Chiu, S.; Dalkiran, A.; Davidovic, R. S.; Dessimoz, C.; Diao, R.; Djeddi, W. E.; Dogan, T.; Flannery, S. T.; Font
Background: The Critical Assessment of Functional Annotation (CAFA) is a community effort held to understand the field of computational protein function prediction. Every three years, since 2010, the organizers initiate an experiment to collect function predictions on a large set of proteins and then evaluate the performance of prediction methods on a subset of proteins that have accumulated experimental annotations between the submission deadline and the evaluation time. CAFA provides an independent and rigorous assessment of the current state of the art, thus leveling the playing field, highlighting successes, revealing bottlenecks, and offering a forum for the exchange of ideas in protein science. Here, we report the results of the fourth CAFA experiment (CAFA4). Results: CAFA4 featured the participation of 148 methods from 70 research groups on a total of 46,205 unique proteins over a 5-year annotation accumulation phase, the longest in any CAFA. In a comparison across CAFA2-CAFA4 methods, the prediction of Gene Ontology (GO) terms has clearly improved across all three GO aspects and traditional evaluation settings. While not achieving the first rank, several CAFA2 and CAFA3 methods featured in the top ten methods in many evaluations, suggesting that earlier methods still hold relevance. The performance is weaker in the newly introduced "partial knowledge" evaluation category (proteins with experimental annotations before submission deadline that gained additional annotations in the same GO aspect during the annotation accumulation phase), highlighting the need for a new class of methods. The rankings of the methods were stable over the years in traditional evaluation settings, but less so in the new partial knowledge evaluation. Overall, the field continues to progress with some influx of new participants. Sustained efforts will be necessary to substantially advance it.
Raposo, P.; Martinez Marin, J. S.; Kim, G.; Insana, G.; Jyothi, D.; Luo, J.; Tunstall, T.; UniProt Consortium; Orchard, S.; Steinegger, M.; Martin, M.
Motivation: The ongoing revolution in genome sequencing is delivering an unprecedented number of genome assemblies to global repositories, resulting in an overwhelming amount of data imported to UniProt in the form of proteomes. To manage this growth sustainably, there is a need for a systematic workflow to select the best proteomes. Results: We propose a novel pipeline for cellular organisms to select the best Reference Proteomes, i.e. those that best represent the protein space of a species. The pipeline uses a clustering algorithm based on MMseqs2 to select the minimum number of Reference Proteomes whilst maximising the representation of the protein space for each species. Additionally, we aligned our viral Reference Proteomes with the exemplar genome set defined by the International Committee on Taxonomy of Viruses. Because this method ensures that all species are represented with at least one Reference Proteome, the number of Reference Proteomes in the UniProt Knowledgebase increased by 36%, covering 34% more species in the Tree of Life. The UniProt Knowledgebase will mainly retain proteins from Reference Proteomes, and this method therefore reduces the overall number of proteins by 43%, leading to a more concise yet representative knowledgebase. Availability and Implementation: https://www.uniprot.org/proteomes Contact: raposo@ebi.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Frongia Mancini, D.; Alabed, H. B. R.; Pellegrino, R. M.
Background/Objectives: Human plasma lipidomics provides valuable information on dietary and metabolic phenotypes, but the interpretation of high-dimensional lipid datasets remains challenging. We developed the Nutritional-Metabolic Lipid Profile (NMLP) module within LipidOne to translate plasma lipidomics data into interpretable nutritional-metabolic indices, functional categories, visual outputs, and biological statements. Subjects/Methods: NMLP calculates lipid indices reflecting cardiometabolic lipid status, fatty acid remodelling, overall lipid quality, oxidative protection, and omega-3/essential fatty acid status. The module was applied to three public human plasma lipidomics datasets: a randomized crossover glycemic-load feeding study, a eucaloric high-fat diet intervention in normal-weight women, and a large public dataset stratified by insulin sensitivity. Results: Across datasets, NMLP converted complex lipidomic matrices into coherent nutritional-metabolic profiles. In the glycemic-load study, the module highlighted metabolic lipid shifts not captured by standard clinical lipid panels, mainly involving cardiometabolic lipid status, oxidative protection, and fatty acid remodelling. In the high-fat diet intervention, NMLP tracked temporal lipid remodelling across pre-diet, on-diet, and post-diet states, consistent with metabolic adaptation to increased dietary fat exposure. In the insulin-sensitivity dataset, insulin-resistant subjects showed a storage-oriented lipid phenotype characterized by increased neutral lipid storage indices and altered lipid quality and oxidative-protection features. Category-level clustering further revealed heterogeneous nutritional-metabolic states within insulin-resistant subjects. Conclusions: NMLP provides a deeper and clearer interpretative framework for human plasma lipidomics in nutrition and metabolic health research. 
By translating lipid species into functional indices and category-level readouts, the module may facilitate the use of lipidomics in clinical nutrition, metabolic phenotyping, and precision nutrition studies. NMLP is freely accessible as part of the online LipidOne platform.
Muneeb, M.; Ascher, D. B.
Motivation: Identifying phenotype-associated genes is a common first step in polygenic risk score construction, enrichment testing, target prioritisation and variant interpretation, but relevant evidence is distributed across heterogeneous databases with different interfaces, formats and evidence models. Results: We present PhenotypeToGeneDownloaderR, a phenotype-guided R/Python pipeline for automated gene retrieval, harmonisation, symbol validation and cross-source summary analysis. Given a phenotype term, the pipeline queries integrated biological databases, standardises per-source outputs, combines gene lists, validates retrieved symbols against the NCBI human gene reference and generates summary tables and visualisations. Across 13 clinically relevant phenotypes and 13 databases, PhenotypeToGeneDownloaderR generated 136,487 raw gene retrievals, with at least one source returning genes for every phenotype. Across all 13 phenotypes, 100,175 of 114,345 combined input symbols were retained after direct or synonym-based validation, corresponding to an 87.6% validation rate. Cross-source overlap was low, supporting the complementarity of integrated evidence sources. Against an HPO/ClinVar/OMIM-derived gold standard, the pipeline recovered 1,039 of 1,056 known phenotype-associated genes, corresponding to 98.4% recall. PhenotypeToGeneDownloaderR provides a lightweight, reproducible upstream framework for generating candidate gene sets for downstream prioritisation and interpretation. Availability and implementation: PhenotypeToGeneDownloaderR is implemented in R and Python, released under the MIT licence, and available at https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR. Supplementary information: Supplementary data are available online.
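The validation step and the two headline percentages above can be sketched as follows. This is a toy illustration, not the pipeline's code: the function names, the reference set and the synonym table are hypothetical stand-ins for the NCBI human gene reference the abstract mentions.

```python
def validate_symbols(symbols, reference, synonyms):
    """Split gene symbols into validated (canonical) and rejected lists.

    reference: set of canonical symbols (hypothetical stand-in for the NCBI reference).
    synonyms: mapping from alias to canonical symbol (also hypothetical).
    """
    validated, rejected = [], []
    for raw in symbols:
        symbol = raw.strip()
        if symbol in reference:
            validated.append(symbol)  # direct match
        elif symbol in synonyms:
            validated.append(synonyms[symbol])  # synonym-based match
        else:
            rejected.append(symbol)
    return validated, rejected


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    kept, dropped = validate_symbols(
        ["BRCA1", "P53", "FAKE1"],
        reference={"BRCA1", "TP53"},
        synonyms={"P53": "TP53"},
    )
    print(kept, dropped)
    # Counts quoted in the abstract.
    print(percent(100_175, 114_345), percent(1_039, 1_056))
```

Applied to the counts in the abstract, the percentage helper gives 87.6 for the validation rate and 98.4 for recall, matching the reported figures.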
Al Sium, S. M.; Banu, T. A.; Goswami, B.; Naser, S. R.; Habib, M. A.; Akter, S.; Ara, M. H.; Al Din, S. M. S.; Nafisa, A.; Nayem, M. R.; Rabbi, M. F. A.; Sarkar, M. M. H.; Khan, M. S.
Background: Population-relevant BRCA1/BRCA2 data from Bangladesh are scarce, creating challenges for hereditary breast and ovarian cancer variant interpretation, counseling, and follow-up testing. We examined a clinically referred Bangladeshi cohort to characterize assay-derived BRCA1/BRCA2 short variants, sequencing-depth performance, and copy-number findings in a conservative pilot framework. Methods: Twenty-three de-identified blood-derived DNA samples were assessed using a targeted BRCA1/BRCA2 next-generation sequencing workflow. Downstream analysis used assay-generated short-variant, coverage, and CNV outputs, with coordinates reported on hg19/GRCh37. Short variants were evaluated from high-confidence PASS/VCC-H calls, and CNV review incorporated both target-region and amplicon-level copy-number patterns. Results: After removal of four low-VAF review observations, the primary germline-compatible dataset comprised 304 short-variant observations representing 34 unique variants. Both BRCA1 and BRCA2 contributed comparable variant burdens, while the overall profile was mainly composed of missense and synonymous changes. Six sample-specific heterozygous BRCA1 truncating candidates were observed, including five frameshift variants and one stop-gain variant. Protein-level mapping placed these events across the central-to-C-terminal portion of BRCA1. Sequencing depth was consistently high across the targeted regions, with all 4,255 amplicon-sample measurements exceeding 280x and 99.91% reaching at least 500x. Copy-number analysis highlighted one candidate BRCA1 multi-exon deletion-like event involving exons 15-20 in BCSIR-BRCA-21, with unresolved partial exon 14 involvement. Conclusions: This study provides an initial Bangladesh-focused targeted BRCA1/BRCA2 dataset and identifies candidate short-variant and CNV findings for validation. 
These findings should be interpreted as analytical candidates only and require confirmatory testing and expert clinical curation before any clinical application. The cohort is referral-enriched and should not be used to infer population prevalence.
Smith, B. S.; Smith, L. A.; Lee, J.-H.; Cahill, J. A.; Graim, K.
A plethora of studies have identified shared molecular mechanisms involved in tumor development across humans and other mammalian species. While these two-species analyses advance understanding of human disease, extending them across many species would provide evolutionary insight into molecular mechanisms driving human cancers. However, this expansion requires knowledge transfer and harmonization across species. Genomic differences between species, including variation in genome annotation quality, have historically hindered multi-species large-scale atlas creation. To overcome these challenges, we present Paipu, a comprehensive pipeline designed to streamline querying, preprocessing, harmonization, and retrieval of large-scale RNA-seq data and associated metadata from the NCBI Sequence Read Archive (SRA). Paipu facilitates multi-species analysis by creating a harmonized atlas from user-defined search terms and species. It consists of three components: reference genome preparation, SRA metadata retrieval, and RNA-seq data processing. We apply Paipu to 188 cancer-related terms in 239 non-human mammalian species, creating a harmonized atlas of 3,484 RNA-seq samples spanning 17 species and 35 cancers. This pan-mammalian pan-cancer atlas enables myriad comparative genomics analyses that leverage genetic variation to better understand rare human cancers. As such, Paipu serves as a resource for cross-species cancer genomics and supports atlas creation for any set of species and search terms.
Karthik, A. S. P.; Das, A. B.
We developed ExIT (Exon-Intron Transformer), a lightweight codon-based DNA Transformer equipped with multi-head self-attention and an adaptive classifier head. The model uses codon tokenization and achieves high accuracy in exon-intron classification and moderate accuracy in CDS classification and splice site recognition. It was validated on the human genome, with external validation on the chimpanzee genome. Further benchmarking suggests that ExIT outperforms existing models on these tasks.
Ranallo-Benavidez, T. R.; Chen, Y.-A.; Potapova, T. A.; Alanko, J. N.; Loucks, H.; Lucas, J.; Human Pangenome Reference Consortium; Guarracino, A.; Puglisi, S. J.; Marchet, C.; Miga, K. H.; Gerton, J. L.; Barthel, F. P.
The pangenome era is producing long-read sequencing data and complete genome assemblies (1-3) at a pace that current annotation methods cannot match. Existing tools were each built for a single feature class (repeats, centromeric satellites, or genes) and falter precisely where the genome is most variable and harbours clinically important variation: the centromeres, subtelomeres, and acrocentric short arms. Here we present KaryoScope, an alignment-free method to annotate an assembly at base-pair resolution across any desired feature classes in a single pass, completing in minutes on a standard workstation. Applied to the Human Pangenome Reference Consortium Release 2 assemblies (3), KaryoScope identifies the SST1 macrosatellite as the recurrent sequence at Robertsonian translocation fusion points (4, 5), delivers the first pangenome-wide census of D4Z4 macrosatellite structural diversity at the 4q and 10q subtelomeres relevant to facioscapulohumeral muscular dystrophy (6), and reveals previously uncharacterised centromere structural polymorphism, including chromosome-specific satellite loss and megabase-scale rearrangement validated by fluorescence in situ hybridization. A pre-built KaryoScope database for the human genome is distributed alongside the tool, and additional databases can be built for any reference genome or annotation source. Together, these capabilities bring the most variable regions of the genome within reach for comparative, clinical, and pangenome-scale analysis. KaryoScope is available at https://github.com/barthel-lab/KaryoScope.
Ganz, M.; Norgaard, M.; Pernet, C.; Matheson, G. J.; Galassi, A.; Ceballos, E. G.; Wighton, P.; Bilgel, M.; Eierud, C.; Gonzalez-Escamilla, G.; Buckholtz, J.; Blair, R.; Markiewicz, C. J.; Hardcastle, N.; Greve, D. N.; Thomas, A. G.; Poldrack, R. A.; Calhoun, V. D.; Innis, R. B.; Knudsen, G. M.
Molecular neuroimaging with positron emission tomography (PET) and single-photon emission computed tomography (SPECT) enables quantification of specific molecular targets in the living brain. Despite its scientific impact, molecular neuroimaging research has historically faced challenges due to high costs, small sample sizes, laboratory-specific analysis pipelines, and limited large-scale data sharing. These factors have hindered reproducibility and the broader reuse of valuable PET datasets. The OpenNeuroPET initiative was established to address these barriers by developing standards, infrastructure, and open-source tools for organizing, sharing, and analyzing molecular neuroimaging data. Through collaborations across Europe and North America, OpenNeuroPET has supported the PET extension of the Brain Imaging Data Structure (PET-BIDS), providing a standardized framework for PET datasets and metadata. Building on PET-BIDS, tools such as PET2BIDS, ezBIDS, and BIDSCoin facilitate data conversion and curation. In parallel, OpenNeuro now hosts PET-BIDS datasets for open sharing, while complementary platforms such as PublicnEUro enable GDPR-compliant controlled access. Emerging open-source workflows and BIDS applications further support automated, reproducible PET preprocessing and quantitative analysis, promoting harmonized processing across centers. Together, these developments mark an important step toward an open molecular neuroimaging ecosystem in which datasets, software, and workflows can be transparently shared, reused, and scaled for collaborative research.
Cook, P. R.; Marenduzzo, D.; Valei, Z.
Existing databases of interphase chromosome conformations typically store three-dimensional coordinates of genomic segments. However, since interphase chromatin is highly dynamic, such databases are dominated by transient configurations and unstructured regions, whose positions vary continuously between cells and over time, unlike folded proteins such as globin, which adopt similar structures in every cell. These drawbacks motivated the creation of a database based on the strion (a portmanteau of a string capturing structure and function). A strion concisely describes the structure and activity of all transcription units in one cell, by retaining only functionally relevant positional information. Sets of strions describing structures in different cells sampled at different times are compiled into a super-strion. Then, 46 super-strions summarise the range of structure and activity of a human cell type, including information on all transcription units, how often each co-fires and co-clusters with others in transcription factories/hubs, enhancer interactomes and small-world expression networks.
Doneva, S. E.; Ellendorff, T. R.; Schneider, G.; Held, L.; von Wyl, V.; Simpson, I.; Sick, B.; Ineichen, B. V.
Background: Large-scale estimates of animal-to-human drug translation and the study characteristics associated with successful translation remain limited. The expanding preclinical literature also challenges manual evidence synthesis. We developed a natural language processing (NLP) pipeline to structure and link preclinical and clinical evidence at scale. Methods: In this retrospective meta-research study, we analysed more than 500,000 neuroscience-related animal drug studies from PubMed and linked them to clinical trial and regulatory approval data. NLP methods extracted drug, disease, and experimental design characteristics from abstracts and full texts. Translation was defined as progression to completed phase III/IV trials or regulatory approval. Logistic regression assessed associations between preclinical study characteristics and successful translation. Findings: Among 291,624 drug entities identified in animal studies, 6·7% entered clinical development and 3·1% reached phase III/IV trials or regulatory approval. At the drug-disease level, 4·4% entered clinical development and 1·9% achieved translation. Restricting analyses to successfully linked ontology entities increased estimates to 11·3% and 4·1%, respectively. Male-only animal studies predominated, whereas reporting of randomisation, blinding, and sample size calculations remained limited. Testing across multiple species and reporting blinding were associated with higher odds of successful translation. Interpretation: Only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and blinding were associated with improved translational success. NLP-based evidence synthesis may support scalable evaluation of translational research and identification of potentially modifiable research practices.
Funding: Swiss National Science Foundation, UZH Digital Entrepreneurship Fellowship, Universities Federation for Animal Welfare. Research in context. Evidence before this study: We searched the literature for studies quantifying large-scale animal-to-human translation and factors associated with successful translation. Existing work was mainly limited to specific diseases, interventions, or manually curated datasets, and large-scale linkage of animal and clinical evidence remained limited. Added value of this study: We developed a natural language processing pipeline linking more than 500,000 animal studies to clinical trial and regulatory approval data. The study provides large-scale estimates of translation and identifies experimental characteristics associated with successful translation. Implications of all the available evidence: The findings suggest that only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and reporting of blinding were associated with improved translation. Automated evidence synthesis may support more systematic evaluation of translational research practices.
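The abstract reports associations from logistic regression without giving the model itself. As a rough illustration only: for a single binary study characteristic (for example, whether blinding was reported), the univariable logistic regression coefficient exponentiates to the odds ratio of the corresponding 2x2 table. The sketch below computes that odds ratio with a 95% Wald interval. The function name `odds_ratio` and all counts are hypothetical and are not taken from the study, whose models may well have included several covariates.

```python
import math


def odds_ratio(exposed_success, exposed_fail, unexposed_success, unexposed_fail):
    """Odds ratio and 95% Wald interval from a 2x2 table.

    For one binary predictor this matches exp(beta) from a
    univariable logistic regression fitted to the same counts.
    """
    a, b, c, d = exposed_success, exposed_fail, unexposed_success, unexposed_fail
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the usual Wald approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper


# Hypothetical counts, for illustration only (NOT data from the study):
# drugs whose animal studies reported blinding vs. those that did not,
# split by whether the drug later achieved translation.
estimate, lower, upper = odds_ratio(30, 970, 20, 980)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An odds ratio above 1 with an interval excluding 1 would correspond to "higher odds of successful translation" in the abstract's wording; adjusting for other study characteristics requires a full multivariable fit.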
Bellaiche, A.; Choudhary, P.; Nair, S.; Harrus, D.; Yu, C. W.-H.; Tanweer, S. A.; Evans, G. L.; Lo, S. W.; Martin, M.; Fleming, J. R.; Velankar, S.
Structure Integration with Function, Taxonomy and Sequences (SIFTS) provides residue-level mappings between UniProt Knowledgebase sequences and Protein Data Bank structures and has historically been generated through internal Protein Data Bank in Europe (PDBe) pipelines. Here, PDBe-SIFTS is presented as a fully open-source, locally deployable implementation of this mapping framework. The pipeline combines fast, scalable sequence search using MMseqs2, an improved bounded scoring scheme for ranking candidate mappings, and residue-level mapping refinement based on backbone connectivity. PDBe-SIFTS is distributed as a Python package with command-line tools for 1) building a sequence search database, 2) identifying the best sequence-structure match, 3) one-to-one mapping at the residue level, and 4) generating SIFTS annotations in PDBx/mmCIF format. Benchmarking on the complete Protein Data Bank archive showed that MMseqs2 reduced archive-scale UniProtKB searches from hours with BLASTP to minutes, approximately 22-36 times faster, while curated mappings were recovered at top rank in 93.1% of cases. The remaining discrepancies mainly involved biologically ambiguous cases such as highly conserved proteins, chimeric constructs, or closely related orthologs. These results show that PDBe-SIFTS enables fast mapping, improving structural coherence in residue-level alignments while delivering the most up-to-date and accurate mappings, comparable to expert curation. Tool: https://github.com/PDBeurope/SIFTS Quick start notebook with example: https://github.com/PDBeurope/SIFTS/tree/master/notebooks Broader audience statement: Matching protein sequences to their three-dimensional structures, and mapping annotations across both, is essential for understanding protein function, interactions, and molecular mechanisms. This integrated view enables richer interpretation of biological data and underpins advances in drug discovery, disease research, and protein engineering.
PDBe-SIFTS provides an open and functional framework for structure-sequence mapping, allowing researchers and databases to run, inspect, and extend these mappings locally, while benefiting from faster searches, transparent scoring, and structurally informed residue-level alignments.
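To make the idea of a one-to-one residue-level mapping concrete, the following minimal sketch derives residue index pairs from two already aligned, gapped sequences. It is a generic illustration, not the PDBe-SIFTS implementation: the function name `residue_mapping`, its parameters, and the toy sequences are assumptions, and it omits the scoring and backbone-connectivity refinement described in the abstract.

```python
def residue_mapping(aligned_uniprot, aligned_pdb, uniprot_start=1, pdb_start=1):
    """Return (uniprot_index, pdb_index) pairs for aligned columns.

    Both inputs are gapped strings of equal length, with '-' marking a gap.
    Only columns where both sequences have a residue produce a pair,
    so each residue appears in at most one pair.
    """
    if len(aligned_uniprot) != len(aligned_pdb):
        raise ValueError("aligned sequences must have the same length")
    pairs = []
    u, p = uniprot_start, pdb_start
    for res_u, res_p in zip(aligned_uniprot, aligned_pdb):
        if res_u != "-" and res_p != "-":
            pairs.append((u, p))
        # Advance each counter only when that sequence has a residue here.
        if res_u != "-":
            u += 1
        if res_p != "-":
            p += 1
    return pairs


# Toy alignment (hypothetical sequences): one residue is missing from the
# structure and one residue is absent from the UniProt sequence.
print(residue_mapping("MKT-AY", "MK-LAY"))
```

In practice the start offsets matter because a structure often covers only a segment of the full UniProt sequence.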
Wolff, K.; de Oliveira, J. A. V. S.; Fuerstenberg, L.; Hagedorn, M.; Garz, B.; Borchert, M.; Pucker, B.
Background: Urtica dioica, also known as stinging nettle, is a widespread plant that can indicate high nitrogen availability in the soil. It is probably best known for the pain caused by touching it. U. dioica is also recognized as a medicinal plant, with reports claiming applicability against numerous diseases. Results: A highly continuous genome sequence was constructed based on nanopore long-read sequencing data. The total assembly size is 1.1 Gbp with an N50 of 40.7 Mbp. RNA-seq data and hints from other species were integrated to produce a high-quality annotation of the protein-encoding genes. This genomic resource enabled the identification of genes involved in flavonoid biosynthesis. A particular focus was on anthocyanin biosynthesis genes, as these are crucial for the response to high light and nitrogen deprivation stress, which is revealed by reddening of the leaves. Conclusion: This genomic resource provides the basis for future studies unraveling the biosynthesis pathways underlying various medically important compounds produced by stinging nettles.
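The N50 quoted above is a standard contiguity statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the calculation follows; the function name `n50` and the toy lengths are illustrative assumptions, not values from the nettle assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy lengths (hypothetical): total 290, so half is 145.
# 80 + 70 = 150 reaches the halfway point, giving an N50 of 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 for the same total size means the assembly is split into fewer, longer pieces.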