Database — Latest Matching Preprints

1

Ignet 2.0 and Vignet: An Ontology-Driven Web Platform for Biomedical Gene Interaction Discovery and Visualization

Asaduzzaman, S.; Bansal, B.; Combs, P.; Zhang, J.; Rehana, H.; McGregor, B.; He, Y.; Hur, J.

2026-06-06 bioinformatics 10.64898/2026.06.02.729682 medRxiv

Top 0.1%

39.5%

Background: The expansion of biomedical literature demands systematic ontology-guided discovery of gene interactions, vaccine mechanisms, drug associations, and adverse events. Existing platforms such as STRING, DisGeNET, and PubTator fall short of providing a unified, freely accessible system that integrates ontology-based semantic interaction classification, vaccine-focused heterogeneous network construction, and artificial intelligence (AI)-assisted evidence retrieval. Results: Ignet 2.0 and Vignet are freely accessible dual-platform systems that combine PubMed literature mining with BioBERT-based interaction scoring for millions of gene-gene co-occurrence pairs, and that integrate three biomedical ontologies and one curated drug resource: the Interaction Network Ontology (INO), Vaccine Ontology (VO), Human Disease Ontology (HDO), and DrugBank. Ignet 2.0 supports gene interaction discovery, gene set enrichment retrieval of BioBERT-scored GenePair evidence, and AI-assisted summarization through BioSummarAI. Vignet extends these features with VO-guided Vaccine Exploration, VacPair interaction scoring, and the creation of vaccine, gene, drug, and disease networks in VacNet. A public Representational State Transfer Application Programming Interface (REST API) and Model Context Protocol (MCP) endpoint enable real-time integration, fostering trust in biomedical knowledge discovery. Conclusion: Ignet 2.0 and Vignet are scalable, ontology-guided biomedical knowledge platforms that facilitate evidence-based gene interaction analysis, vaccine-focused semantic exploration, and AI-assisted knowledge discovery. Their real-time PubMed data integration ensures up-to-date insights; however, users should consider validation processes and potential lags in incorporating the latest experimental data, which may affect the reliability of immediate data. Availability: Ignet 2.0: https://ignet.org/ignet; Vignet: https://ignet.org/vignet/

2

IID-KG: An ontology-aligned literature-derived knowledge graph for infectious and immune-mediated diseases

PAN, F.; Zhang, Y.; Wang, J.; Liu, M.-C.; Sui, X.; Yue, H.; Zhang, J.

2026-05-26 bioinformatics 10.64898/2026.05.21.727015 medRxiv

Top 0.1%

22.7%

Infectious and immune-mediated diseases (IIDs) represent a broad and rapidly expanding biomedical literature domain in which scalable evidence extraction, disease ontology refinement, and interpretable knowledge integration are essential for biomedical discovery. We constructed an IID-specific biomedical knowledge graph (IID KG) from PubMed abstracts and PMC full-text articles by integrating nested named entity recognition, ontology-guided identifier assignment, full-text relation extraction, and relation-resolution strategies. A gold-standard corpus of 500 PubMed abstracts and 8 PMC full-text articles was manually annotated for nested biomedical entities across six entity types. The resulting models were applied to 30,128,068 PubMed abstracts and 1,385,500 IID-related PMC full-text articles. A unified IID ontology was developed from 411,341 disease terms using hierarchical text classification, large language model-based refinement, ontology cross-referencing, and expert review, yielding 179,657 confirmed MeSH mappings. The final IID KG contains approximately 1,837,513 unique entities and 16,295,390 unique relations across eight relation types. The resource was released publicly together with repurposing workflows, supporting ontology-aligned literature mining, disease mechanism analysis, and drug-repurposing hypothesis generation for IID research.

3

Pulmonary Hypertension Engine for Linked Experiments (PHELEX): a platform for the re-analysis of public transcriptomic data related to pulmonary hypertension in both animal models, and humans.

Nandani, T.; Ott, B. P.; Balaratnam, P.; Archer, S. L.; Durbin, J.; Hindmarch, C. C. T.

2026-05-01 genomics 10.64898/2026.04.28.721394 medRxiv

Top 0.1%

22.3%

Pulmonary hypertension (PH) is a vasculopathy that results in elevated mean pulmonary arterial pressures over 20 mmHg. Despite significant advances in research, PH still has a high mortality rate, and there is currently no cure for the disease. As with all biomedical fields, PH researchers have embraced the power of next-generation technologies such as microarrays and RNA sequencing. Most of these data can be found on public repositories, which is usually a requirement for publication. While these repositories are rich sources of data, they require intermediate to advanced bioinformatics skills to access, download, and make these data useful. Here we present Pulmonary Hypertension Engine for Linked Experiments (PHELEX), which represents a comprehensive catalogue of all RNA sequencing data related to PH that is currently available on the Gene Expression Omnibus (GEO), hosted by the US National Center for Biotechnology Information (NCBI). We identified 2,278 bulk RNA sequencing samples from human, mouse and rat, and built a searchable tool based on the metadata that is associated with each sample. PHELEX is a functional tool that allows selected studies to be highlighted and parsed through Confidence, an analysis tool we have created, which will model the data based on user-defined classifiers, perform differential gene expression and pathway analysis, and present these data using standard graphics and text-file results. PHELEX also allows PH researchers to cross-cut between discrete studies, facilitating de novo understanding of these data. As a robust searchable repository of genomic data, we hope that PHELEX will accelerate PH innovation and discovery by allowing researchers to mine existing genomic data and thus better understand the molecular signatures that underpin PH.

4

Developing a Specialized Dravet Syndrome Ontology for Rare Disease Informatics and AI Applications

Golnari, P.; Prantzalos, K.; Upadhyaya, D. P.; Buchhalter, J.; Sahoo, S. S.

2026-07-04 neurology 10.64898/2026.07.01.26357055 medRxiv

Top 0.1%

18.7%

Dravet syndrome (DS) is a severe developmental and epileptic encephalopathy whose clinical and research representation requires integration of heterogeneous knowledge spanning seizures, development, behavior, SUDEP/autonomic risk, genetics, comorbidities, electrophysiology, pharmacology, and drug responsiveness. We report the development of a DS-focused ontology created by expert-guided specialization of a previously published epilepsy ontology. Scope expansion was defined through a scientific advisory board, structured review meetings, and iterative ontology curation in OWL. The resulting resource reorganized DS content across nine major domains and expanded the publicly released ontology from the pre-extension baseline to the current BioPortal version. Beyond structural growth, the ontology was assessed through expert-guided curation and downstream task-based reuse, including two published ontology-enabled LLM studies and an ongoing ontology-derived DS knowledge graph and AI assistant platform. These results suggest that disease-focused ontology specialization can provide durable infrastructure for DS data harmonization, knowledge representation, and AI-enabled translational informatics.

5

CellExLink: End-to-end cell-type recognition and normalization in biomedical text

Nabijiang, A.; Shahriyari, L.

2026-05-29 bioinformatics 10.64898/2026.05.26.728013 medRxiv

Top 0.1%

18.4%

Since cells are the main components of many biological and biomedical studies, cell-type extraction is an important task in biomedical text mining. However, current biomedical text-mining systems either do not explicitly support cell-type extraction, provide limited support for Cell Ontology normalization, or show limited performance in end-to-end cell-type extraction. These limitations can affect downstream tasks that depend on reliable cell-type information. Here, we present CellExLink, an end-to-end biomedical natural language processing pipeline designed specifically for cell-type recognition and Cell Ontology normalization in biomedical text. The pipeline is designed to improve extraction accuracy and practical usability in literature-mining workflows, while accounting for computational efficiency in its recognition and normalization design. We evaluate CellExLink across heterogeneous biomedical corpora and compare it with established and recent biomedical text-mining tools. The results show that CellExLink provides reliable cell-type recognition, Cell Ontology normalization, and end-to-end extraction across these corpora. By addressing the need for reliable end-to-end cell-type recognition and Cell Ontology normalization, CellExLink can support downstream tasks such as curation, search, relation extraction, and knowledge graph construction. Author summary: Cell types are central to biomedical research, but biomedical papers often use different names, abbreviations, and synonyms for the same cell type. This variation makes it difficult for automated processes to collect and compare cell-type information across papers. Reliable automated extraction is important because literature mining requires consistent cell-type identification before evidence from different studies can be searched, integrated, or reused.
Existing off-the-shelf biomedical text-mining tools provide useful functionality, but their ability to support cell-type extraction remains limited and inconsistent. To address this gap, we developed CellExLink, a pipeline that finds cell-type entities in biomedical text and links them to standard Cell Ontology identifiers. We evaluated the pipeline on several biomedical corpora and compared it with existing tools that support cell-type extraction. Across these evaluations, CellExLink showed clear accuracy gains in both detecting cell-type entities and assigning correct standard identifiers. Together, these gains make CellExLink a powerful tool for extracting reliable standardized cell-type information from large collections of papers, supporting literature curation, relation extraction, knowledge graph construction, and studies of cell-type-specific roles in diseases, drug responses, and biological pathways.

6

An extensible laboratory information management system for data harmonization across research centers: The ICTS-Dashboard

King, C. H.; De Dios, I.; Barrick, R.; Berger, S.; Almalvez, M.; Auriga, L.; Delot, E. C.; Xiao, C.; LoTempio, J.; Vilain, E.

2026-06-02 health informatics 10.64898/2026.05.31.26354439 medRxiv

Top 0.1%

12.6%

Background: Collaborative research programs increasingly require infrastructure capable of integrating heterogeneous participant, sample, and experimental data while meeting evolving research needs. Existing tools, including clinical EHRs, REDCap, generic research information management systems, and bespoke database builds, were not designed to operationalize project-specific data models. The ICTS-Dashboard, from the Institute for Clinical and Translational Science (ICTS) at the University of California, Irvine (UCI), fills this need by providing a general-purpose research information management system. Methods: We describe the ICTS-Dashboard, built as an open-source, schema-driven platform in which database structure, server-side validation, representational state transfer application programming interfaces (REST APIs), web-based forms, and reproducible exports are all generated from a single versioned JavaScript Object Notation (JSON) Schema set. The backend is implemented in Django, Django REST Framework, and PostgreSQL; the frontend in React. We instantiate the platform with the Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Data Model and extend it with two case studies: a locally developed biobank table for biospecimen logistics, and an embedded adaptation of the RAG-HPO retrieval-augmented phenotype curation tool. Results: The ICTS-Dashboard deployed at the UCI-GREGoR site supports 37 schema-derived tables and 250 documented API endpoints. It holds metadata for 2,558 participants, 1,237 families, 5,517 biobank entries, 2,466 sequenced biospecimens, and 289 genetic findings, and supports quarterly external data submissions regenerated directly from the database. The biobank extension adds entities the consortium does not standardize while preserving foreign-key linkage to rare disease records; the RAG-HPO module adds curator-mediated phenotype normalization against 19,389 indexed HPO terms.
Both were integrated without modifying the GREGoR data model. Conclusion: A version-controlled, machine-readable data model can serve not only as a data sharing standard but as the operational backbone of a research program when paired with schema-governed tooling. The Dashboard's architecture is not intrinsic to a data model or to rare disease; any collaborative research program with a structured, versioned model can adopt the same pattern to reduce implementation overhead and improve reproducibility, harmonization, and findable, accessible, interoperable, and reusable (FAIR)-aligned accessibility.

7

cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature

Campbell, J.; Lain, A. D.; Simpson, T. I.

2026-05-19 bioinformatics 10.64898/2026.05.16.725623 medRxiv

Top 0.1%

12.3%

cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It utilises programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain text files, standardising them for downstream biomedical text mining. During the retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of retrieved full-texts, we used ScispaCy to infer the similarity of paired documents from 44,264 open-access PubMed Central files and the files retrieved from cadmus, resulting in an average cosine similarity score of 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts over abstracts, resulting in better access to the depth of the biomedical information available. Availability and implementation: cadmus is a freely available package for non-commercial research at https://github.com/biomedicalinformaticsgroup/cadmus and released under the MIT License.
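
The fidelity check in this abstract rests on cosine similarity between paired document vectors. The following is a minimal sketch of that metric in plain Python, assuming each document has already been embedded as a numeric vector. The function name and the two vectors are hypothetical illustrations; they are not ScispaCy output and not cadmus code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up embeddings standing in for the same article obtained from two sources.
reference_vec = [0.20, 0.50, 0.10, 0.70]
retrieved_vec = [0.21, 0.48, 0.12, 0.69]

score = cosine_similarity(reference_vec, retrieved_vec)
print(round(score, 3))
```

A score close to 1 means the two vectors point in nearly the same direction, which is the sense in which an average of 0.98 indicates that retrieved and reference texts are nearly identical in content.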

8

geneXplore: An Interactive Browser for X Chromosome-Wide Association Study Results

Cook, N.; Boulais-Richard, J.; Zeng, Y.; Yang, C.; Budde, J.; Taliun, D.; Gagliano Taliun, S. A.; Cruchaga, C.; Belloy, M. E.

2026-07-14 neurology 10.64898/2026.07.14.26357489 medRxiv

Top 0.1%

11.2%

Summary: The X chromosome comprises approximately 5% of the human genome and encodes over 800 protein-coding genes, many of which exhibit sex-differentiated expression patterns due to escape from X chromosome inactivation (XCI) mechanisms. Despite its relevance to sex differences in complex traits, the X chromosome is routinely excluded from genome-wide association studies due to analytical challenges, and when analyzed, the impact of escape from XCI or sex is limitedly explored. No dedicated, publicly accessible browser for X chromosome-wide association study (XWAS) summary statistics currently exists, creating a barrier to systematic investigation of X-linked contributions to human traits. Here, we present geneXplore, an interactive web browser based on the PheWeb2 implementation, tailored for XWAS summary statistics across 1,944 phenotypes while distinguishing random XCI (rXCI), escape from XCI (eXCI), and sex-stratified analyses. Users can explore results via interactive plots (Manhattan and Miami, PheWAS and LocusZoom), searchable tables and access to cross-database lookup, with full summary statistics available for download. Availability and Implementation: geneXplore is freely available at https://genexplore.wustl.edu/ with no registration required and will be maintained for a minimum of two years following publication. Source code is available at https://github.com/Belloy-Lab/geneXplore_XWAS_Browser under an MIT license.

9

RD-OMICS: An Integrative Multi-Omics Data Inventory in Rare Diseases

Sun, S.; Wang, H.; Mathe, E. A.; Zhu, Q.

2026-07-03 bioinformatics 10.64898/2026.06.29.735296 medRxiv

Top 0.1%

8.9%

Rare diseases (RD) impact over 30 million individuals in the United States, yet fewer than 5% of the identified conditions have FDA-approved treatments. Progress in RD research is hindered by small patient cohorts, biological heterogeneity, and fragmented, inconsistently annotated publicly available omics data, which limit integrative analysis and translational discovery. Here, we present RD-OMICS, a data inventory with integrated and structured RD omics data from Gene Expression Omnibus (GEO), in the form of a knowledge graph. We developed a metadata harmonization pipeline that combines rule-based mapping and large language model (LLM)-assisted semantic categorization. The graph-based data model was defined to integrate different types of data including disease conditions, experiments, samples, platforms, projects, and publications into a centralized inventory graph. In this preliminary study, 11,049 GEO series for 126 rare diseases were processed and integrated into RD-OMICS, which includes 375,930 individual biospecimen samples, 1,578 sequencing and array platforms, and 10,938 biological projects. Case studies demonstrate the use of RD-OMICS in supporting rare disease research, omics cohort construction, and transcriptome-based drug repurposing for amyotrophic lateral sclerosis (ALS). RD-OMICS provides a scalable foundation for transforming fragmented omics data into a structured, harmonized and interoperable resource, facilitating therapeutic development and other translational discoveries in rare diseases.

10

Revised Adaptive Immune Receptor Data in the Immune Epitope Database

Scheffer, L.; Richardson, E. M.; Vita, R.; Zarebski, L.; Blazeska, N.; Wheeler, D. K.; Cantrell, J. R.; Deleuran, S. N.; Lees, W. D.; Christley, S.; Corrie, B.; Cowell, L. G.; Sette, A.; Peters, B.

2026-06-06 bioinformatics 10.64898/2026.06.03.728549 medRxiv

Top 0.1%

8.2%

The Immune Epitope Database (IEDB, iedb.org) is a freely available resource that catalogs experimentally defined immune epitopes and, if available, the immune receptors that recognize them. Currently, the IEDB records ~185,000 T cell receptors and ~5,000 B cell receptors/antibodies with experimentally verified epitope specificity. Because these receptor data were manually curated from ~3,300 references spanning decades, nomenclature inconsistencies present challenges for computational analyses and user queries. To support integrated analysis of the entire dataset, we revised the IEDB receptor data standardization and validation pipeline to flag and correct inaccuracies. Anomalous receptors from over 800 studies were flagged for re-curation. The updated receptor dataset shows greater conformity through consistent gene nomenclature formatting and harmonized CDR sequence delimitation. Taking advantage of the increased receptor data consistency, the IEDB web interface was expanded to include receptor search features directly on the homepage, support V/J gene and species options in the refined receptor search, and allow direct data export in the Adaptive Immune Receptor Repertoire (AIRR) format. We anticipate that the improved receptor data quality will simplify bioinformatics analyses and facilitate integration of IEDB data into cross-repository data resources, such as the AIRR Knowledge Commons.

11

An AI-Powered Trisomy 21 Research Assistant

NANDI, S.; Sundararajan, Z.; Subirana-Granes, M.; Espinosa, J. M.; Pividori, M.; Sullivan, K. D.; Galbraith, M. D.; Costello, J.

2026-06-11 bioinformatics 10.64898/2026.06.08.730893 medRxiv

Top 0.1%

7.2%

Down syndrome, caused by trisomy 21, increases the risk of diverse co-occurring conditions. With more than 34,000 related publications indexed in PubMed as of early 2026, keeping pace with this expanding literature is challenging. While general-purpose large language models are widely used for information retrieval, they often rely on broad training data rather than specific evidence. Retrieval-augmented generation (RAG) improves rigor and reliability of responses by linking model outputs to source texts. In research, source texts are peer-reviewed articles. Standard implementations treat all manuscript sections equally, allowing background text to rank as highly as experimental results. To focus model outputs on experimentally supported responses, we developed the T21 Research Assistant, a section-aware RAG system that prioritizes Results sections to ground responses in primary experimental evidence. The system draws exclusively from 1,789 open-access Down syndrome publications from PubMed Central, including 327 NIH INCLUDE-funded studies, and uses a multistage pipeline for query validation, retrieval, reranking, synthesis, and citation verification. Built on NVIDIA Nemotron models, it generates structured, cited responses. Evaluation using expert-curated questions demonstrated strong performance, achieving a BERTScore F1 of 0.712 and recall of 0.758, comparable to or exceeding leading proprietary and open-source models. T21 Research Assistant is available at: https://bioinformatics.cuanschutz.edu/t21-res-assi/
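
The section-aware idea in this abstract can be illustrated with a small reranking sketch: each retrieved chunk keeps its retriever score, which is then scaled by a weight for the manuscript section it came from, so that Results text outranks background text of similar relevance. Everything below is an assumption made for illustration. The abstract does not give the weighting scheme, and the `Chunk` type, `rerank` function, weights, scores, and placeholder texts are hypothetical rather than the T21 Research Assistant's implementation.

```python
from dataclasses import dataclass

# Hypothetical multipliers. The abstract states that Results sections are
# prioritized but does not say how, so these values are placeholders.
SECTION_WEIGHTS = {
    "results": 1.5,
    "methods": 1.0,
    "discussion": 0.9,
    "introduction": 0.6,
}


@dataclass
class Chunk:
    text: str
    section: str       # manuscript section the chunk was taken from
    similarity: float  # retriever score in [0, 1], assumed precomputed


def rerank(chunks, default_weight=0.8):
    """Sort chunks by similarity scaled by a per-section weight, best first."""
    def weighted(chunk):
        weight = SECTION_WEIGHTS.get(chunk.section.lower(), default_weight)
        return chunk.similarity * weight

    return sorted(chunks, key=weighted, reverse=True)


# Placeholder chunks with made-up scores.
chunks = [
    Chunk("background sentence", "Introduction", 0.82),
    Chunk("experimental finding", "Results", 0.74),
    Chunk("protocol description", "Methods", 0.70),
]

ranked = rerank(chunks)
print([c.section for c in ranked])
```

In this toy input the Introduction chunk has the highest raw similarity, yet the Results chunk ranks first after weighting, which is the behavior the abstract contrasts with standard implementations that treat all sections equally.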

12

HeartBioPortal 3.0: an integrated cardiovascular genomics knowledge environment for molecular, clinical and population-scale interpretation

Vand, K.; Badia, N.; Khomtchouk, B.; Janga, S. C.

2026-07-01 cardiovascular medicine 10.64898/2026.06.28.26356792 medRxiv

Top 0.1%

7.2%

Cardiovascular genomics is producing rapidly expanding genetic, molecular, phenotypic and clinical data, yet relevant evidence remains fragmented across resources and difficult to translate into actionable biological and ultimately translational knowledge. HeartBioPortal (HBP) is a browser-based cardiovascular knowledge environment that was developed to address this problem by organizing omics, variant, phenotype and clinical evidence centered around gene queries. Here we describe HBP 3.0, a major update that expands both the data architecture and interpretive interface. This update introduces DataHub, a reproducible data-engineering layer for source ingestion, standardization, variant-centered aggregation, provenance tracking and compact serving artifacts. The release integrates cardiovascular clinical practice guideline context through a graph-backed clinical knowledge layer; incorporates cardiovascular summary statistics from the Million Veteran Program and public aggregate resources; expands source-preserving population frequency, variant annotation and structural-variant; and adds gene profile, drug-discovery and protein-context layers. HBP 3.0 incorporates 594.3 million allele-frequency observations across 18.1 million rsIDs, 3.04 million exon-enriched structural-variant records, 66.9 thousand protein isoforms with 3.26 million non-exon protein feature annotations, 17,128 gene-drug records, and a clinical guideline knowledge graph with 42,895 entities and 106,304 relationships. The redesigned gene dossier view combines phenotype filtering, annotation composition, persistent selected-detail panels and exportable chart data in one workflow. HBP 3.0 is designed to help cardiovascular and eventually cardiometabolic researchers move from a genetic or genomic signal to biological knowledge and potentially clinical and therapeutic context while preserving source provenance and interpretive boundaries. Database URL: https://www.heartbioportal.com/

13

Deterministic retrieval recovers biomedical associations lost by language models

Halder, A.; Singh, M.; Kesarwani, R.; Mathew, B.; Bhattacharya, N.; Chikhaliya, O.; Motwani, D.; Peela, S. C. M.; Samanta, S.; Muddemmanavar, P.; Farooq, M.; Ahuja, G.; Sengupta, D.

2026-04-29 bioinformatics 10.64898/2026.04.25.720782 medRxiv

Top 0.1%

7.2%

Large language model (LLM)-based retrieval systems miss biomedical associations through output truncation, synonym mismatch and run-to-run variability, but the magnitude of this loss remains unclear. We present BioChirp, an open-source framework that uses LLMs for query interpretation and candidate filtering, combining multi-source consensus entity resolution with deterministic graph-based retrieval. Across four major biomedical databases, BioChirp recovered more associations with higher reproducibility than conventional LLM-based retrieval approaches.

14

CpG Atlas: A centralized multi-layer database and AI interface for DNA methylation research

Armstrong, J. F.; Wahi, S.; Borrus, D.; Sehgal, R.; Rizvi, S.; Zhang, S.; Jacques, M.; Eynon, N.; van Dijk, D.; Higgins-Chen, A.

2026-06-03 bioinformatics 10.64898/2026.05.30.729020 medRxiv

Top 0.1%

6.8%

DNA methylation research has vastly expanded over the past decade, producing a wealth of epigenome-wide association studies, biomarker algorithms such as epigenetic clocks, technical performance analyses, and functional annotations for CpG sites. However, these resources remain fragmented across dozens of databases and supplementary files within manuscripts, forcing researchers to spend time and effort on data cleaning and integration prior to meaningful analyses. No single resource currently unifies this information into a centralized, easy-to-query framework. Here, we present CpG Atlas, a curated relational database that integrates 18 distinct annotation layers encompassing over 1.2 million CpG sites across all four generations of Illumina methylation arrays (HM450K, EPIC v1, EPIC v2, and MSA). Built on a snowflake schema with a canonical probe identifier hub implemented in SQL, CpG Atlas consolidates over 800,000 CpG-trait associations, results from Mendelian randomization analyses, CpG membership across 81 epigenetic clocks, array manifest information, and probe reliability data. It further includes specialized layers such as solo-WCGW, CoRSIVs, PRC2 binding, transposon and retroelement annotations, tissue-specific differentially methylated positions across 17 tissues, and hallmarks of aging and cancer. To maximize utility and ease of use, the database is paired with an interactive web tool and a natural language-to-SQL query interface, enabling users to quickly perform complex multi-dimensional queries. Detailed documentation about every data source and table is also provided, facilitating the identification and interpretation of relevant studies. We demonstrate the utility of CpG Atlas through two case studies: a systematic enrichment analysis revealing distinct functional signatures across 16 epigenetic clocks, and an iterative biomarker discovery workflow for IBD that leverages cross-layer integration. 
Because it is readily scalable simply by adding or updating tables in the database, CpG Atlas provides a continuously evolving and extensible infrastructure for the epigenetics community that supports collaborative research, interpretable biomarker development, and integrative analyses across the growing landscape of epigenetic data.
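
The kind of cross-layer query that a canonical probe identifier hub makes possible can be sketched with an in-memory SQLite database. This is a simplified illustration, not the CpG Atlas schema: the table names, column names, probe identifiers, trait labels, and clock labels are all hypothetical, and only two annotation layers are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One hub table of probe identifiers, plus two annotation layers keyed to it.
# All names and rows are made up for illustration.
conn.executescript("""
CREATE TABLE probe (probe_id TEXT PRIMARY KEY);
CREATE TABLE trait_association (
    probe_id TEXT REFERENCES probe(probe_id),
    trait    TEXT
);
CREATE TABLE clock_membership (
    probe_id TEXT REFERENCES probe(probe_id),
    clock    TEXT
);
INSERT INTO probe VALUES ('cg_demo_1'), ('cg_demo_2'), ('cg_demo_3');
INSERT INTO trait_association VALUES
    ('cg_demo_1', 'trait_A'), ('cg_demo_2', 'trait_A'), ('cg_demo_3', 'trait_B');
INSERT INTO clock_membership VALUES
    ('cg_demo_1', 'clock_X'), ('cg_demo_3', 'clock_X');
""")

# Cross-layer question: which probes are associated with trait_A
# and are also members of clock_X?
rows = conn.execute(
    """
    SELECT p.probe_id
    FROM probe AS p
    JOIN trait_association AS t ON t.probe_id = p.probe_id
    JOIN clock_membership  AS c ON c.probe_id = p.probe_id
    WHERE t.trait = ? AND c.clock = ?
    ORDER BY p.probe_id
    """,
    ("trait_A", "clock_X"),
).fetchall()
print(rows)
```

Because every layer joins through the same hub key, adding a new annotation layer means adding one more table that references `probe`, which matches the abstract's point that the database scales by adding or updating tables.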

15

trAIt: Species-by-Trait Data Retrieval using Large Language Models

Balaji, S.; Martinson, K. A.; Schellenberger, J. S.; Koley, J.; Inman, C. M.; Hofmann, H. A.; Young, R. L.; Harpak, A.

2026-06-24 bioinformatics 10.64898/2026.06.19.732660 medRxiv

Top 0.1%

6.8%

Biological research often requires information about species traits. Manual literature collation can be time-consuming and miss parts of the literature. To address this gap, we developed trAIt, a publicly available software for the retrieval of characteristics of species from scientific literature catalogued in the Europe PubMed Central (PubMed) database. trAIt provides a graphical user interface (GUI) in which users specify species and characteristics of interest. Leveraging a large language model (LLM), trAIt retrieves relevant papers, combines their content through a consensus-based summarization model, and outputs a species-by-characteristic table. For a case study involving frog species, trAIt recovered 47.1% of trait-species combinations in 2.75 hours, while an expert curator independently recovered 62.4% over months. The consensus-based summarization substantially aids accuracy compared to single-source extraction. Across three case studies of vertebrate taxa, an expert confirmed the accuracy of 70.9% of trait-species entries recovered by trAIt. We observed considerable variation across taxa in trAIt's accuracy, which is possibly due to heterogeneity in open-access literature availability and inconsistencies in species and trait terminology. In sum, our analysis suggests that LLM-based tools can accelerate biological data synthesis but should be used to support domain experts' research, rather than replace their judgment.

16

Building an Interoperable Rare Disease Multi-omic Resource: The GREGoR Data Model and Dataset

Heavner, B. D.; Wheeler, M. M.; Bengtsson, J. D.; Carvalho, C. M. B.; Cheung, W. A.; Conomos, M. P.; Delot, E. C.; DiTroia, S.; Ganesh, V. S.; Gogarten, S. M.; Grochowski, C. M.; Jhangiani, S. N.; King, C. H.; LeMaster, C.; Marvin, C. T.; Marwaha, S.; Miller, D. E.; O'Donnell-Luria, A.; Pais, L.; Patterson, K.; Qi, G.; Richardson, M.; Smail, C.; Stilp, A. M.; Tong, C. C.; Ungar, R. A.; Weisburd, B.; Bamshad, M. J.; Bernstein, J. A.; Eichler, E. E.; Gibbs, R. A.; Lupski, J. R.; May, S. J.; Montgomery, S. B.; Pastinen, T.; Posey, J.; Rehm, H. L.; Shojaie, A.; Talkowski, M. E.; Vilain, E.; Wei, C

2026-05-19 genomics 10.64898/2026.05.15.725546 medRxiv

Top 0.1%

6.5%

Rare disease research and diagnosis rely on the integration of genomic and phenotypic data generated across diverse clinical sites; however, the absence of widely adopted standards for representing genomic data and associated metadata has limited data interoperability, reuse, and cross-study analysis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was established to investigate challenging rare disease cases and evaluate emerging multi-omic technologies for clinical translation. To support coordinated data integration across distributed research sites, we developed a common Consortium Data Model in partnership with domain experts to standardize the capture of participant-, family-, phenotype- and assay-level metadata, with a particular emphasis on using a modular architecture to support linking of multiple data versions from multiple omic technologies to a single individual and attribution of a genetic finding to the specific technology used for its initial discovery. Adoption of the GREGoR Data Model has enabled continued generation and public release of a harmonized, analysis-ready Consortium Dataset. The most recent release includes phenotypic, family and multi-omic data from 12,292 participants in 5,029 families. Other rare disease data sharing efforts are beginning to adopt this data model which will facilitate cross consortium analyses and empower rare disease research. This work demonstrates that a collaborative, flexible, and scalable data model can enable large-scale rare disease research, facilitate cross-center data harmonization, and enable data interoperability.

17

SPECTER-Based Semantic Triage of Biomedical Literature for Systematic Reviews in Mutational Signature Analysis

Bituin, R. C.; Bokani, A.

2026-07-09 bioinformatics 10.64898/2026.07.06.736558 medRxiv

Top 0.1%

5.5%

Systematic reviews in computational biology require screening large heterogeneous bibliographic sets, especially when topics span computational methods, cancer genomics and statistical modelling. This paper presents a reproducible semantic triage pipeline that combines SPECTER scientific-document embeddings, research-question similarity, proposal-summary similarity and domain keyword coverage to rank candidate studies for systematic review screening. The pipeline was evaluated on 2,231 Covidence records, including 120 final included studies (prevalence = 5.38%), against keyword-only, TF-IDF, BM25, MiniLM, PubMedBERT and SPECTER-only baselines. SPECTER-hybrid achieved the highest average precision (AP = 0.546), recovered 50% of included studies after screening 4.48% of records, and produced an 11.16-fold enrichment over prevalence. Ablation analysis showed that semantic-keyword combinations consistently outperformed single-signal variants. These findings suggest that citation-informed hybrid ranking can support literature triage while retaining human reviewers as final decision-makers.
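The reported figures can be checked with a little arithmetic. The sketch below assumes that fold enrichment means recall divided by the fraction of records screened, which is an interpretation rather than something the abstract states, and it includes a generic average-precision helper for a ranked list of include/exclude labels. The function names are illustrative and not taken from the paper.

```python
def average_precision(ranked_labels):
    """Average precision for a ranked list of 0/1 labels (1 = included study)."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each relevant position
    return total / hits if hits else 0.0


def fold_enrichment(recall, fraction_screened):
    """Assumed definition: recall achieved per unit fraction of records screened."""
    return recall / fraction_screened


# Figures quoted in the abstract
prevalence = 120 / 2231                      # about 5.38% of records were included
enrichment = fold_enrichment(0.50, 0.0448)   # 50% recall after screening 4.48%

print(f"prevalence = {prevalence:.4f}")
print(f"enrichment = {enrichment:.2f}")

# Toy ranking, not data from the paper: relevant items at ranks 1 and 3
toy_ap = average_precision([1, 0, 1, 0])
print(f"toy AP = {toy_ap:.4f}")
```

Under this reading, the quoted prevalence, screening fraction, and enrichment are mutually consistent.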

18

HESTA: a curated and reusable database for the human early organogenesis spatiotemporal transcriptome atlas

Xu, Z.; Wang, W.; Li, Y.; Zhang, Y.; Chen, J.; Du, W.; Yang, T.

2026-05-29 bioinformatics 10.64898/2026.05.28.728391 medRxiv

Top 0.1%

5.5%

Background: Human organogenesis is orchestrated by precise spatiotemporal gene expression. Mapping these dynamic processes requires transcriptomic data that preserve native anatomical context across continuous developmental stages. Results: We present a spatiotemporal transcriptome database of human embryogenesis, profiling 77 sagittal sections from 13 euploid embryos (CS12-CS23) using Stereo-seq, yielding 14,744,703 bin50 spots. The atlas annotates 50 organs and maps 198 molecularly distinct substructures, complemented by 607,093 snRNA-seq cells. The database features a Spatial Exploration module for locating sections and visualizing spatial distributions of organs and substructures, and an Organ Atlas module for visualizing gene expression, regulon activities, and pathway enrichment at the single-organ level across embryos. Conclusions: This database provides an interactive resource to access spatial gene expression, substructures, and regulatory networks across 50 developing human organs, supporting further research into the mechanisms of human organogenesis.

19

An integrated resource for systems-level analysis of aging hallmarks and associated genes

Tiwari, R.; Balaji, M.; Chivukula, N.; Sil, P.; Samal, A.

2026-06-02 bioinformatics 10.64898/2026.05.29.728838 medRxiv

Top 0.1%

5.4%

Aging is a complex biological process involving progressive cellular dysfunction, tissue decline, and increased susceptibility to multiple chronic diseases. A systemic view of aging through its established hallmarks provides a structured framework to understand this complexity and drive therapeutic discovery. To this end, we present AgingHallmarksDB, an interactive web platform that enables systems-level analysis of hallmark-associated gene sets. Aging-related genes were first curated from seven established resources, and those present in at least two of these resources were considered consensus aging-related genes. Using functional annotations derived from GO, KEGG, and Reactome, a total of 3111 genes were mapped to the 11 aging hallmarks, of which 2593 were supported by additional experimental or manually curated evidence, with 1089 of these forming the consensus set. Further, AgingHallmarksDB supplements gene annotations with tissue or cell type class specificity, exosomal profiles, and regulatory interactions. The platform allows users to interactively perform systems-level hallmark enrichment analysis across multiple condition-associated gene sets, while integrating functional annotations and complex regulatory interactions to elucidate mechanistic hallmark-gene associations. The utility of the resource was explored through hallmark enrichment and network proximity analysis of gene sets corresponding to 11 chronic age-related diseases and a PM2.5-associated skin transcriptome, to examine relationships between aging hallmarks and disease mechanisms or environmental aging-related signatures. Overall, AgingHallmarksDB will support longevity research by enabling aging-hallmark-centered analysis, and the resource is accessible at https://cb.imsc.res.in/aginghallmarksdb/.
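The consensus rule described here (a gene counts as consensus if it appears in at least two of the curated resources) can be sketched as a simple count over gene sets. The resource names and gene symbols below are placeholders, not the actual resources or genes used by AgingHallmarksDB.

```python
from collections import Counter


def consensus_genes(resources, min_resources=2):
    """Return genes that occur in at least `min_resources` of the given gene sets."""
    counts = Counter(
        gene
        for genes in resources.values()
        for gene in set(genes)  # count each gene once per resource
    )
    return {gene for gene, n in counts.items() if n >= min_resources}


# Placeholder inputs for illustration only
resources = {
    "resource_1": {"GENE_A", "GENE_B"},
    "resource_2": {"GENE_B", "GENE_C"},
    "resource_3": {"GENE_A", "GENE_B", "GENE_D"},
}

consensus = consensus_genes(resources)
print(sorted(consensus))
```

Raising `min_resources` tightens the consensus set at the cost of dropping genes reported by fewer sources.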

20

The role of long-range transcriptional regulation in interpretation of non-coding variants associated with human disease

Mandic, K.; Hrsak, D.; Uljanic, F.; Lenhard, B.; Baresic, A.

2026-06-17 genomics 10.64898/2026.06.15.731051 medRxiv

Top 0.1%

5.0%

Genome-wide association studies (GWAS) are key tools for discovering associations between single nucleotide polymorphisms (SNPs) and phenotypic traits and have been successfully applied to many diseases and disorders. However, a major challenge is to find the gene affected by a non-coding SNP, especially if the gene is distal in terms of genomic distance. In this study, we present a novel approach, named targPred, which utilises genomic regulatory blocks (GRBs) to infer a connection between a given SNP or locus and the target gene located in the same GRB, in a more robust and generalisable manner. We found that many disease traits, such as cancer and psychiatric disease, have a propensity for long-range regulation. Furthermore, we showcase a childhood obesity locus that is connected to the distal BDNF gene. Finally, we propose a new web-based service based on enhancer-promoter association, to facilitate finding the causal genes for a wide array of traits and conditions.
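The core idea of linking a SNP to genes that share its genomic regulatory block can be illustrated with a simple interval lookup. This is a minimal sketch under stated assumptions, not the targPred method itself: the function name, the overlap rule, and all coordinates and gene names are hypothetical, and the actual approach presumably applies further criteria to select a single target gene.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def candidate_targets(snp_chrom: str, snp_pos: int,
                      grbs: List[Interval], genes: List[Interval]) -> List[str]:
    """List genes overlapping any GRB that contains the SNP position."""
    targets = []
    for grb in grbs:
        if grb.chrom != snp_chrom or not (grb.start <= snp_pos <= grb.end):
            continue  # SNP is not inside this block
        for gene in genes:
            overlaps = (gene.chrom == grb.chrom
                        and gene.start <= grb.end
                        and gene.end >= grb.start)
            if overlaps:
                targets.append(gene.name)
    return targets


# Toy coordinates for illustration only
grbs = [Interval("GRB_1", "chr1", 1_000_000, 3_000_000)]
genes = [
    Interval("GENE_NEAR", "chr1", 1_200_000, 1_250_000),
    Interval("GENE_DISTAL", "chr1", 2_800_000, 2_900_000),
    Interval("GENE_OUTSIDE", "chr1", 3_500_000, 3_600_000),
]

hits = candidate_targets("chr1", 1_210_000, grbs, genes)
print(hits)
```

In this toy case the distal gene is returned alongside the nearest one because both sit in the same block, which is the kind of long-range candidate a nearest-gene rule would miss.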