Database — Latest Matching Preprints

1

Developing a Specialized Dravet Syndrome Ontology for Rare Disease Informatics and AI Applications

Golnari, P.; Prantzalos, K.; Upadhyaya, D. P.; Buchhalter, J.; Sahoo, S. S.

2026-07-04 neurology 10.64898/2026.07.01.26357055 medRxiv

Top 0.1%

18.7%

Dravet syndrome (DS) is a severe developmental and epileptic encephalopathy whose clinical and research representation requires integration of heterogeneous knowledge spanning seizures, development, behavior, SUDEP/autonomic risk, genetics, comorbidities, electrophysiology, pharmacology, and drug responsiveness. We report the development of a DS-focused ontology created by expert-guided specialization of a previously published epilepsy ontology. Scope expansion was defined through a scientific advisory board, structured review meetings, and iterative ontology curation in OWL. The resulting resource reorganized DS content across nine major domains and expanded the publicly released ontology from the pre-extension baseline to the current BioPortal version. Beyond structural growth, the ontology was assessed through expert-guided curation and downstream task-based reuse, including two published ontology-enabled LLM studies and an ongoing ontology-derived DS knowledge graph and AI assistant platform. These results suggest that disease-focused ontology specialization can provide durable infrastructure for DS data harmonization, knowledge representation, and AI-enabled translational informatics.

2

geneXplore: An Interactive Browser for X Chromosome-Wide Association Study Results

Cook, N.; Boulais-Richard, J.; Zeng, Y.; Yang, C.; Budde, J.; Taliun, D.; Gagliano Taliun, S. A.; Cruchaga, C.; Belloy, M. E.

2026-07-14 neurology 10.64898/2026.07.14.26357489 medRxiv

Top 0.1%

11.2%

Summary: The X chromosome comprises approximately 5% of the human genome and encodes over 800 protein-coding genes, many of which exhibit sex-differentiated expression patterns due to escape from X chromosome inactivation (XCI) mechanisms. Despite its relevance to sex differences in complex traits, the X chromosome is routinely excluded from genome-wide association studies due to analytical challenges, and when it is analyzed, the impact of escape from XCI or of sex is rarely explored. No dedicated, publicly accessible browser for X chromosome-wide association study (XWAS) summary statistics currently exists, creating a barrier to systematic investigation of X-linked contributions to human traits. Here, we present geneXplore, an interactive web browser based on the PheWeb2 implementation, tailored for XWAS summary statistics across 1,944 phenotypes while distinguishing random XCI (rXCI), escape from XCI (eXCI), and sex-stratified analyses. Users can explore results via interactive plots (Manhattan and Miami, PheWAS and LocusZoom), searchable tables and cross-database lookups, with full summary statistics available for download. Availability and Implementation: geneXplore is freely available at https://genexplore.wustl.edu/ with no registration required and will be maintained for a minimum of two years following publication. Source code is available at https://github.com/Belloy-Lab/geneXplore_XWAS_Browser under an MIT license.

3

RD-OMICS: An Integrative Multi-Omics Data Inventory in Rare Diseases

Sun, S.; Wang, H.; Mathe, E. A.; Zhu, Q.

2026-07-03 bioinformatics 10.64898/2026.06.29.735296 medRxiv

Top 0.1%

8.9%

Rare diseases (RD) impact over 30 million individuals in the United States, yet fewer than 5% of the identified conditions have FDA-approved treatments. Progress in RD research is hindered by small patient cohorts, biological heterogeneity, and fragmented, inconsistently annotated public omics data, which limit integrative analysis and translational discovery. Here, we present RD-OMICS, a data inventory with integrated and structured RD omics data from Gene Expression Omnibus (GEO), in the form of a knowledge graph. We developed a metadata harmonization pipeline that combines rule-based mapping and large language model (LLM)-assisted semantic categorization. The graph-based data model was defined to integrate different types of data including disease conditions, experiments, samples, platforms, projects, and publications into a centralized inventory graph. In this preliminary study, 11,049 GEO series for 126 rare diseases were processed and integrated into RD-OMICS, which includes 375,930 individual biospecimen samples, 1,578 sequencing and array platforms, and 10,938 biological projects. Case studies demonstrate the use of RD-OMICS in supporting rare disease research, omics cohort construction, and transcriptome-based drug repurposing for amyotrophic lateral sclerosis (ALS). RD-OMICS provides a scalable foundation for transforming fragmented omics data into a structured, harmonized and interoperable resource, facilitating therapeutic development and other translational discoveries in rare diseases.

4

HeartBioPortal 3.0: an integrated cardiovascular genomics knowledge environment for molecular, clinical and population-scale interpretation

Vand, K.; Badia, N.; Khomtchouk, B.; Janga, S. C.

2026-07-01 cardiovascular medicine 10.64898/2026.06.28.26356792 medRxiv

Top 0.1%

7.2%

Cardiovascular genomics is producing rapidly expanding genetic, molecular, phenotypic and clinical data, yet relevant evidence remains fragmented across resources and difficult to translate into actionable biological and ultimately translational knowledge. HeartBioPortal (HBP) is a browser-based cardiovascular knowledge environment that was developed to address this problem by organizing omics, variant, phenotype and clinical evidence centered around gene queries. Here we describe HBP 3.0, a major update that expands both the data architecture and interpretive interface. This update introduces DataHub, a reproducible data-engineering layer for source ingestion, standardization, variant-centered aggregation, provenance tracking and compact serving artifacts. The release integrates cardiovascular clinical practice guideline context through a graph-backed clinical knowledge layer; incorporates cardiovascular summary statistics from the Million Veteran Program and public aggregate resources; expands source-preserving population frequency, variant annotation and structural-variant; and adds gene profile, drug-discovery and protein-context layers. HBP 3.0 incorporates 594.3 million allele-frequency observations across 18.1 million rsIDs, 3.04 million exon-enriched structural-variant records, 66.9 thousand protein isoforms with 3.26 million non-exon protein feature annotations, 17,128 gene-drug records, and a clinical guideline knowledge graph with 42,895 entities and 106,304 relationships. The redesigned gene dossier view combines phenotype filtering, annotation composition, persistent selected-detail panels and exportable chart data in one workflow. HBP 3.0 is designed to help cardiovascular and eventually cardiometabolic researchers move from a genetic or genomic signal to biological knowledge and potentially clinical and therapeutic context while preserving source provenance and interpretive boundaries. Database URL: https://www.heartbioportal.com/

5

trAIt: Species-by-Trait Data Retrieval using Large Language Models

Balaji, S.; Martinson, K. A.; Schellenberger, J. S.; Koley, J.; Inman, C. M.; Hofmann, H. A.; Young, R. L.; Harpak, A.

2026-06-24 bioinformatics 10.64898/2026.06.19.732660 medRxiv

Top 0.1%

6.8%

Biological research often requires information about species traits. Manual literature collation can be time-consuming and miss parts of the literature. To address this gap, we developed trAIt, a publicly available software tool for the retrieval of characteristics of species from scientific literature catalogued in the Europe PubMed Central (PubMed) database. trAIt provides a graphical user interface (GUI) in which users specify species and characteristics of interest. Leveraging a large language model (LLM), trAIt retrieves relevant papers, combines their content through a consensus-based summarization model, and outputs a species-by-characteristic table. For a case study involving frog species, trAIt recovered 47.1% of trait-species combinations in 2.75 hours, while an expert curator independently recovered 62.4% over months. The consensus-based summarization substantially improves accuracy compared to single-source extraction. Across three case studies of vertebrate taxa, an expert confirmed the accuracy of 70.9% of trait-species entries recovered by trAIt. We observed considerable variation across taxa in trAIt's accuracy, which is possibly due to heterogeneity in open-access literature availability and inconsistencies in species and trait terminology. In sum, our analysis suggests that LLM-based tools can accelerate biological data synthesis but should be used to support domain experts' research, rather than replace their judgment.

6

SPECTER-Based Semantic Triage of Biomedical Literature for Systematic Reviews in Mutational Signature Analysis

Bituin, R. C.; Bokani, A.

2026-07-09 bioinformatics 10.64898/2026.07.06.736558 medRxiv

Top 0.1%

5.5%

Systematic reviews in computational biology require screening large heterogeneous bibliographic sets, especially when topics span computational methods, cancer genomics and statistical modelling. This paper presents a reproducible semantic triage pipeline that combines SPECTER scientific-document embeddings, research-question similarity, proposal-summary similarity and domain keyword coverage to rank candidate studies for systematic review screening. The pipeline was evaluated on 2,231 Covidence records, including 120 final included studies (prevalence = 5.38%), against keyword-only, TF-IDF, BM25, MiniLM, PubMedBERT and SPECTER-only baselines. SPECTER-hybrid achieved the highest average precision (AP = 0.546), recovered 50% of included studies after screening 4.48% of records, and produced an 11.16-fold enrichment over prevalence. Ablation analysis showed that semantic-keyword combinations consistently outperformed single-signal variants. These findings suggest that citation-informed hybrid ranking can support literature triage while retaining human reviewers as final decision-makers.
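The prevalence and enrichment figures in this abstract can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes that fold enrichment is defined as the share of included studies recovered divided by the share of records screened, which is one common definition and not necessarily the one the authors used. The function names are hypothetical.

```python
# Figures quoted in the abstract above.
total_records = 2231          # Covidence records screened by the pipeline
included_studies = 120        # studies in the final included set
recall_at_cutoff = 0.50       # share of included studies recovered
fraction_screened = 0.0448    # share of records screened at that point


def prevalence(included: int, total: int) -> float:
    """Share of all records that were finally included."""
    return included / total


def enrichment(recall: float, screened: float) -> float:
    """Fold enrichment, assumed here to be recall / fraction screened."""
    return recall / screened


prev = prevalence(included_studies, total_records)
fold = enrichment(recall_at_cutoff, fraction_screened)

print(f"prevalence = {prev:.2%}")
print(f"enrichment = {fold:.2f}-fold")
```

Under this assumed definition the numbers agree with the reported prevalence of 5.38% and the 11.16-fold enrichment.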

7

K9HeartCircDB: A circRNA Atlas of Tachypacing-Induced Canine Dilated Cardiomyopathy

Chinmaya, C.; Sinha, T.; Nisini, N.; Wang, T.; Natarajaseenivasan, S.; Berretta, R.; Rai, A.; Panda, A.; Elrod, J.; Kishore, R.; Houser, S.; Recchia, F.; Garikipati, V.

2026-06-22 Systems Biology 10.64898/2026.06.16.732655 medRxiv

Top 0.2%

3.6%

Cardiovascular disease (CVD) remains a leading cause of death worldwide. Dilated cardiomyopathy (DCM), a major cause of heart failure (HF), exhibits ventricular dilation, impaired systolic/diastolic function, arrhythmias, and adverse cardiac remodeling. While genetic causes of DCM have been extensively studied, non-genetic and acquired forms of DCM-like HF are less well characterized, especially with respect to non-coding RNA regulation. Circular RNAs (circRNAs) are stable, covalently closed non-coding RNAs that regulate cellular function by sequestering miRNAs or RNA-binding proteins, or through translation. Their role in canine HF that recapitulates features of non-genetic DCM remains largely unexplored. To address this, we developed K9HeartCircDB (https://www.k9heartcircdb.com/), a publicly accessible database that catalogs circRNAs expressed in canine left ventricular (LV) tissues under tachypacing-induced HF, a model of non-genetic DCM-like disease, and healthy control conditions. The online interface enables users to query and explore circRNAs based on exon composition, predicted miRNA binding sites, protein-coding potential, siRNA targets, and primer design for experimental validation. By providing an integrated and user-friendly platform for canine heart circRNA exploration, K9HeartCircDB offers a valuable resource to facilitate mechanistic studies and advance translational studies on non-genetic DCM-like disease.

8

UKBAnalytica: an integrated R package for scalable phenotyping and reproducible epidemiological analysis within the UK Biobank Research Analysis Platform

He, N.; Mo, K.; Yu, G.; He, F.

2026-06-22 epidemiology 10.64898/2026.06.19.26356057 medRxiv

Top 0.2%

3.5%

UK Biobank provides longitudinal health-related data for approximately 500,000 participants, and its Research Analysis Platform (RAP) has shifted large-scale analyses toward secure cloud-based computation. However, many existing tools address only specific steps of the analytical workflow, leaving a need for an integrated framework that connects multi-source disease phenotyping, survival-ready cohort construction, and downstream analysis on the RAP. Here, we present UKBAnalytica, an extensible R package for scalable phenotyping and integrated analysis of UK Biobank data within the RAP environment. It currently includes 52 predefined baseline variables and a built-in library of 331 curated disease definitions. These definitions are based on multiple UK Biobank data sources, including ICD-10, ICD-9, self-reported conditions, death registry records, algorithmically defined outcomes, and OPCS-4 procedure codes. UKBAnalytica distinguishes prevalent and incident cases, constructs follow-up time, generates analysis-ready survival datasets, and summarizes participant flow. Beyond phenotype construction, UKBAnalytica provides integrated modules for epidemiological analysis, omics analysis, and machine-learning-based modeling and interpretation. By linking endpoint definition with downstream modeling under a consistent data structure, UKBAnalytica reduces repetitive scripting and improves analytical transparency. Furthermore, we demonstrate the package's practical utility through a case study on chronic obstructive pulmonary disease (COPD) proteomics. The findings align closely with previously reported conclusions, underscoring the robustness and reliability of our analytical framework. This phenotype-centered framework complements existing UK Biobank tools and facilitates reproducible RAP-based biomedical research. UKBAnalytica is freely available at https://github.com/Hinna0818/UKBAnalytica.
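The distinction between prevalent and incident cases, and the construction of follow-up time, can be sketched as follows. This is a generic illustration in Python and does not show the UKBAnalytica API (the package itself is written in R). The `Participant` class, the `classify` function, and the rule that a diagnosis on or before baseline counts as prevalent are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class Participant:
    baseline: date                    # date of baseline assessment
    first_diagnosis: Optional[date]   # earliest diagnosis across sources, if any
    death: Optional[date]             # date of death, if any


def classify(p: Participant, censor: date) -> Tuple[str, Optional[float]]:
    """Return (status, follow_up_years); follow-up is None for prevalent cases."""
    # Assumed rule: a diagnosis on or before baseline is a prevalent case.
    if p.first_diagnosis is not None and p.first_diagnosis <= p.baseline:
        return "prevalent", None

    # Follow-up ends at death or administrative censoring, whichever is first.
    end_of_follow_up = censor if p.death is None else min(p.death, censor)

    # A diagnosis inside the follow-up window is an incident event.
    is_incident = (
        p.first_diagnosis is not None and p.first_diagnosis <= end_of_follow_up
    )
    end = p.first_diagnosis if is_incident else end_of_follow_up
    years = (end - p.baseline).days / 365.25
    return ("incident" if is_incident else "censored"), years


censor_date = date(2020, 1, 1)  # hypothetical administrative censoring date
print(classify(Participant(date(2010, 1, 1), date(2009, 6, 1), None), censor_date))
print(classify(Participant(date(2010, 1, 1), date(2015, 1, 1), None), censor_date))
print(classify(Participant(date(2010, 1, 1), None, date(2012, 1, 1)), censor_date))
```

Prevalent cases are returned without follow-up time because they are usually excluded from time-to-event analyses of incident disease; a real pipeline would also need source-specific date handling.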

9

Multiscale harmonization and semantic integration of biomedical data enable biological insights through immersive exploration

Bueckle, A.; Zhu, C.; Wong, A. Y. H.; Enninful, A.; Miao, Y.; Farzad, N.; Pedersen, M.; Mattison, C.; Sloan, N.; Mares, J.; Xing, C.; Herr, B. W.; Khare, J.; Kumar, Y. R.; Parekh, K.; Chavan, S.; Luby, P.; Patel, U.; Hickey, J. W.; Bader, G. D.; Phatnani, H.; Menon, V.; Fan, R.; Sorger, P.; Snyder, M.; Boerner, K.

2026-07-11 bioinformatics 10.64898/2026.07.07.737090 medRxiv

Top 0.3%

2.8%

The Human Reference Atlas (HRA) enables multiscale data exploration and visualization. We present "HRA: Powers of Ten," a virtual reality (VR) application for integrating, harmonizing, and visualizing data within the HRA Organ Gallery. It enables immersive navigation from a whole-body view of 81 organs to datasets across 5 organs, 5 assay types, and 4 spatial scales using a Multiscale Elevator System. The application, data, and code are available open-source.

10

AptCancerDB: A Curated Knowledgebase and Translational Discovery Platform for Anticancer Aptamers

Bajiya, N.; Singh, S.; Raghava, G. P. S.

2026-07-09 cancer biology 10.64898/2026.07.02.735999 medRxiv

Top 0.3%

2.4%

Aptamers are emerging as important molecular recognition ligands in oncology, playing significant roles in cancer diagnostics, targeted therapies, drug delivery systems, and molecular imaging. Numerous aptamers have advanced to clinical trials, indicating their potential for real-world applications; however, existing databases fail to capture that. To bridge this critical gap, we developed AptCancerDB (https://webs.iiitd.edu.in/raghava/aptcancerdb/), a comprehensive, manually curated database of experimentally verified anticancer aptamers. The current release contains 1,941 entries collected from studies published between 2000 and 2025, covering 29 cancer types, approximately 200 cancer cell lines, and direct links to 22 clinical trials. Each entry is annotated with sequence information, target details, cancer type, cell line, SELEX methodology, affinity determination data, chemical modifications, and biological activities. The dataset is dominated by 82.7% ssDNA, reflecting its superior stability and ease of synthesis, while only 16.6% is ssRNA and appears primarily in studies targeting complex intracellular or protein-protein interactions. To facilitate structural analysis, predicted secondary structures, dot-bracket notations, specific structural elements, and minimum free energy values were also included. AptCancerDB integrates a MySQL backend with an ArcadeDB/OpenCypher-based Knowledge Graph, enabling exploration of relationships among aptamers, targets, cancer types, cell lines, and functional applications. The platform provides advanced search and browsing facilities, BLASTn-based similarity searching, and GC Calculator. Built on a modern, responsive frontend (React/TypeScript/Tailwind CSS), the platform includes a REST API for data retrieval. 
By integrating fragmented experimental data into a unified cancer-focused resource, AptCancerDB serves as a valuable resource for comparative analysis, aptamer discovery, and the development of next-generation aptamer-based diagnostics and therapeutics.
Highlights:
- Curated knowledge base of experimentally validated anticancer aptamers.
- AptCancerDB contains therapeutic, tumor-homing and cell-penetrating aptamers.
- Summarizes clinical progress and translational trends in anticancer aptamer research.
- Supports rational aptamer design using molecular, functional, and clinical annotations.
- Disease-focused resource for cancer diagnosis, therapy, and drug delivery.
Teaser: AptCancerDB maintains experimentally validated anticancer aptamers relevant to diagnosis, drug delivery, and therapy.

11

PhenoXtract: combining Large Language Model and Knowledge Graph embedding to extract phenotypes from clinical descriptions

Berardelli, S.; BRIERE, G.; Loire, B.; De Paoli, F.; Gazzo, A. M.; Limongelli, I.; Magni, P.; Zucca, S.; Baudot, A.

2026-06-26 genomics 10.64898/2026.06.22.733382 medRxiv

Top 0.3%

2.4%

Motivation: Standardized phenotypic descriptions are essential for accurate diagnosis, yet clinicians and researchers face challenges in manually extracting and mapping phenotypes from scientific literature or patient clinical records to the Human Phenotype Ontology. Recent advances in deep learning offer new opportunities for automation. We developed PhenoXtract, a novel phenotype extraction approach that combines Large Language Models and Knowledge Graph embedding. PhenoXtract is a multistep pipeline that takes clinical descriptions as input, extracts candidate phenotype entities using large language models, and maps them to terms from an enriched version of the Human Phenotype Ontology, processed as a knowledge graph. Results: Evaluation against expert-curated ground-truth datasets shows a recall of 0.70 and precision of 0.85 for PhenoXtract, demonstrating concordance with manually extracted phenotypes, with a computation time of 10-20 seconds for each text analyzed. Moreover, PhenoXtract surpasses rule-based and deep learning-based state-of-the-art tools in two out of the three ground-truth datasets evaluated. These results suggest that hybrid approaches combining Large Language Models and Knowledge Graph embeddings represent a promising direction for automated clinical phenotyping at scale.

12

VirProtRAG: Literature-grounded viral protein function annotation with retrieval-augmented generation

Guan, J.; Shang, J.; Peng, C.; Sun, Y.

2026-07-04 bioinformatics 10.64898/2026.07.03.736267 medRxiv

Top 0.3%

2.4%

Viruses play indispensable roles in ecosystems and human health, yet deciphering their molecular functions remains challenging. Many viral protein annotations are incomplete or poorly characterized. Existing tools typically predict functional categories without linking to verifiable evidence, hindering the credibility of functional interpretation. Here, we present VirProtRAG, a viral protein function annotation framework that integrates information retrieval with evidence-grounded knowledge generation. It introduces three task-adapted components: a hybrid retrieval module combining keyword-based and semantic dense retrieval to maximize literature coverage, synonym-expanded and rank-aware retrieval with reciprocal rank fusion for improved search effectiveness, and literature quality and evidence-oriented re-ranking to enhance reliability and interpretability. Results show that the hybrid retrieval strategy performed best, with quality and evidence features further enhancing re-ranking. Compared with direct LLM prompting without retrieved literature, VirProtRAG consistently improves generation performance, underscoring the critical role of external knowledge. Finally, we built a searchable database comprising all 17,484 reviewed Swiss-Prot viral proteins, supporting both sequence- and text-based queries. VirProtRAG introduced 32.53% non-overlapping function annotations beyond existing expert curation, and independently supported 56.34% of sequence-inferred function points with retrieved literature. Case studies further demonstrate its capability to augment and refine the characterization of previously unannotated or poorly understood viral proteins.
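Reciprocal rank fusion, which the abstract names as the way keyword-based and dense retrieval results are combined, can be sketched in a few lines. This is the generic textbook form of the technique, not VirProtRAG's implementation: the smoothing constant `k = 60` is the commonly used default and an assumption here, and the document identifiers are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; each list adds 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by identifier for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Placeholder result lists from two hypothetical retrievers.
keyword_hits = ["pmid_a", "pmid_b", "pmid_c"]
dense_hits = ["pmid_b", "pmid_d", "pmid_a"]

fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
print(fused)
```

Documents ranked well by both retrievers rise to the top, while documents found by only one retriever are still kept, which suits a setting where literature coverage matters.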

13

Healthcare Big Data Platform for Linking National Databases in Korea: System Development and Research Applications

Kim, Y.; Lee, Y.; Jeong, J.

2026-07-13 health informatics 10.64898/2026.07.09.26357705 medRxiv

Top 0.3%

2.4%

Public healthcare databases in South Korea have been distributed across disparate government agencies, requiring researchers to navigate multiple, separate institutional approval processes for data linkage. To address this systemic inefficiency, the Healthcare Big Data Linkage Platform (HCDL), jointly administered by NECA and KHIS, was established to integrate 13 databases from 10 public institutions through a Trusted Third Party (TTP)-based linkage methodology and a centralized one-stop review process. Of 311 projects submitted between 2022 and 2025, 190 (61.1%) were approved, with annual applications increasing 2.4-fold over the study period. The average number of databases per project exceeded three, reflecting a surging demand for integrated clinical data. Nationwide healthcare data from HIRA and NHIS were the most frequently requested databases (93.2% and 82.1% of approved projects, respectively), and co-occurrence pattern analysis further confirmed that both formed the core of the research ecosystem in combination with vital status, lifestyle, and cancer diagnosis data. By consolidating multi-institutional review and enabling equitable data access, the HCDL has emerged as a core infrastructure for data-driven and precision medicine research in South Korea.

14

Bamsnap-LRS: an automated batch visualization tool for long-read sequencing alignments

Chen, W.; Yang, C.; Qiu, L.; Hu, J.; Zhou, Y.

2026-06-25 bioinformatics 10.64898/2026.06.21.733121 medRxiv

Top 0.3%

2.4%

Summary: Long-read sequencing (LRS) has become essential for genome assembly, structural variation (SV) detection, haplotype phasing and transcript isoform characterization. However, these applications often require manual inspection of read alignments for validation. Existing visualization tools are either interactive genome browsers that are difficult to scale to large datasets or batch-oriented tools that are not optimized for the unique alignment patterns of long-read data. We developed Bamsnap-LRS, an automated command-line tool for high-throughput LRS alignment visualization. It supports long-read-specific features, phased SNP inspection, and publication-ready batch figure generation within a unified framework for genomic, transcriptomic, and haplotype-aware analyses. Availability and Implementation: All code and examples are freely available at https://github.com/comery/Bamsnap-LRS.

15

High resolution Streptococcus pyogenes core genome MLST and LIN coding scheme for outbreak detection

Ryan, Y.; Jolley, K. A.; Hearn, H.; Parfitt, K. M.; Platt, S.; Lamagni, T.; Moganeradj, K.

2026-07-11 bioinformatics 10.64898/2026.07.10.737715 medRxiv

Top 0.4%

2.1%

Streptococcus pyogenes is a globally important pathogen responsible for at least 500,000 deaths a year, causing significant burden on healthcare systems. It is the causative agent of ailments ranging from impetigo and strep throat to septicaemia and necrotizing fasciitis. Assessment of genetic relatedness for the detection of outbreaks within communities or healthcare facilities is vital, alongside epidemiological data, in decreasing the propagation of S. pyogenes within these settings. As the volume of isolates being sequenced increases year on year, more scalable and sharable methodologies of assessing genetic relatedness are required by reference laboratories and for international collaboration. LIN codes, applied to core genome MLST (cgMLST), represent a method which is extensible to large-scale whole genome sequencing (WGS) while still being sufficiently sensitive to detect outbreak clusters. Here we present a novel cgMLST and LIN code scheme, hosted by PubMLST, enabling international collaboration and global tracking of variants, that is highly scalable and usable for all. The schemes are available at https://pubmlst.org/organisms/streptococcus-pyogenes.
Data Summary: Genome sequences and metadata are available at https://pubmlst.org/organisms/streptococcus-pyogenes. PubMLST and ENA accessions and metadata can additionally be found in the supplementary data. Raw reads for UKHSA sequences are available in ENA study PRJEB115996.
Impact Statement: Streptococcus pyogenes is a globally relevant pathogen capable of causing invasive and non-invasive disease across a multitude of settings. Assessment of genetic relatedness is an increasingly important aspect of managing outbreaks, requiring solutions that are scalable, high resolution and comparable across laboratories. Here we present a high resolution core genome multi locus sequence typing (MLST) and associated life identification number (LIN) code scheme. The schemes were developed using a combination of 4,916 UKHSA and 2,391 publicly available S. pyogenes isolates in order to cover a wide range of EMM types both within the UK and globally. These new schemes enable high resolution typing of S. pyogenes isolates, suitable for analyses ranging from lineages to genomic epidemiology in outbreak detection and management. Both cgMLST and LIN code schemes are available on PubMLST as an open access resource for the public health and academic communities and can enable both intra laboratory and global coordination.

16

Thematic Shifts in Early-High-Impact Cancer Genomics and Diagnostics Research: A Bibliometric and Semantic Analysis

Su, Z.; Li, T.

2026-07-09 bioinformatics 10.64898/2026.07.04.736459 medRxiv

Top 0.4%

2.1%

Cancer genomics and diagnostics is a rapidly evolving field in which identifying which topics attract early citation prominence can inform laboratory investment, clinical translation, and research strategy. We developed a bibliometric framework to identify and characterize the most influential recent publications in this domain across two consecutive annual cohorts. Using a mathematically exact threshold-expansion algorithm, we ranked over 10,000 OpenAlex-indexed research articles per cohort by 18-month post-publication citation count. Large language model (LLM)-based topical relevance filtering yielded 50 substantively on-topic papers per cohort (100 total). LLM-based concept extraction and a two-stage, embedding-guided normalization pipeline produced 1,853 canonical concepts organized into 103 parent themes, enabling structured cross-cohort comparison of paper-level concept prevalence. The most cited papers in both cohorts were large-scale genomic infrastructure resources rather than single-disease mechanistic studies. Between consecutive cohorts, normalized frequencies increased most for whole-genome sequencing, tumor microenvironment biology, molecular biomarkers, and cancer pharmacotherapy, while liquid biopsy-related themes showed the largest declines. These findings indicate that early citation impact in cancer genomics is shifting toward integrative, population-scale, and microenvironment-aware research, and demonstrate that LLM-augmented citation ranking provides a replicable, semantically enriched lens for monitoring thematic evolution in precision oncology. A web interface for exploring the results is available at https://pri.pepkio.com/.

17

Aggregating data to accelerate personalized therapy in heart failure (ADAPT-HF)

Roeder, C.; Goerg, C.; Talebi, A.; Stevens, L. M.; Scholtens, D. M.; Rasmussen-Torvik, L. P.; Alagna, L. M.; Shah, S. J.; Hall, J. L.; Das, A. K.; Jhund, P. S.; Kao, D. P.

2026-07-16 health informatics 10.64898/2026.07.13.26357501 medRxiv

Top 0.4%

2.1%

Background: Increased public access to data from disparate sources provides opportunities to study and validate predictive and subphenotype models in heterogeneous disease conditions using aggregated individual patient data. Robust, explicit, and transparent harmonization of data elements is critical to ensure interpretability, reproducibility, and generalizability of secondary and retrospective analyses. Methods & Results: We designed and implemented ADAPT (Aggregating Data to Accelerate Personalized Therapy), a scalable framework using multiple software packages (R, SQL, BigQuery) that enables rapid, explicit harmonization of structured data elements from randomized trials and observational studies using a standard spreadsheet interface. User-specified criteria are applied to primary study data to produce harmonized longitudinal datasets comprised of demographics, medical history, quantitative observations, repeated measures, and clinical outcomes. We demonstrate this functionality using 26 clinical studies found in the National Heart, Lung, and Blood Institute BioLINCC resource. We illustrate the scalability of ADAPT to the order of billions of datapoints using administrative clinical data in a cloud-computing platform. We also present examples of collaborators using ADAPT for independent harmonization tasks for secondary analyses and democratization of publicly available data. Conclusion: ADAPT is a disease-agnostic, extensible, and scalable platform to support robust, transparent harmonization of structured research data using interfaces accessible to a variety of researchers regardless of programming ability. It extends FAIR principles beyond research data to also represent harmonization analyses by improving Findability of harmonization decisions, Accessibility of methods to other stakeholders, Interoperability with independent analyses and datasets, and Reusability through efficient implementation in a variety of analysis environments.

18

Large-scale automated detection reveals pervasive sex imbalance in biomedical research

Valtadoros, L. E.; Hicks, P.; Yuan, H.; Ahmadian, M.; Johnson, K. A.; Krishnan, A.

2026-07-14 genomics 10.64898/2026.07.13.738332 medRxiv

Top 0.4%

2.0%

Sex is a critical biological variable that impacts disease risk, progression, and treatment response across virtually every organ system. However, decades of biomedical research have relied primarily on male study subjects, leaving large gaps in our understanding of female-specific disease biology. Quantifying the extent of this imbalance across thousands of disease areas and millions of publicly available biological samples has remained computationally intractable. Here, we present a multimodal computational framework that infers the biological sex of ~230,000 publicly available human transcriptome samples and links inferred sex labels to disease terms extracted from ~9,000 associated study records and ~5,000 publication abstracts to quantify sex imbalance at scale. Applying this approach revealed that the majority of disease terms with the largest research-derived sex imbalance are skewed toward male representation, including areas with no known biological justification for that imbalance. After adjusting for global sex-specific disease prevalence to isolate biologically unjustified imbalance, up to 58% of all disease terms showed male-leaning association. Diseases including glioblastoma, cirrhosis, idiopathic pulmonary fibrosis, and schizophrenia emerged as critically understudied in females despite affecting both sexes comparably. These findings provide a principled, data-driven basis for prioritizing compensatory research efforts and offer a reusable framework for ongoing monitoring of sex representation in the biomedical literature.
Highlights:
- Skewed male and female study subject representation in biomedical research is the result of decades of studies conducted without adequate female representation.
- We developed an automated, multimodal framework to estimate the sex imbalance across thousands of disease terms using metadata from ~230,000 transcriptomics samples and their associated ~9,000 studies and ~5,000 publications.
- Our approach identifies non-sex-specific disease research areas that have been studied using an unbalanced sex demographic. These areas need compensatory and balanced studies to understand sex differences.

19

Pharmacological Stratification of Public Bioactivity Databases: A Reusable, OECD-Anchored Curation and Benchmarking Framework Demonstrated for Opioid Receptors

Nael, M.; Alakonda, L.; Ghosh, A.; Ward, S. J.; Liu-Chen, L.-Y.; Rajadhyaksha, A. M.; Abou-Gharbia, M.; Elokely, K. M.

2026-06-24 bioinformatics 10.64898/2026.06.18.732083 medRxiv

Top 0.4%

1.8%


Public bioactivity databases are heterogeneous not only in measurement type, where binding affinities and functional potencies are reported on different scales, but in pharmacology: the same compound and target can carry agonist, antagonist, or inhibitor records measured through binding displacement, cAMP, β-arrestin, or [35S]GTPγS readouts that quantify different biological events. Pooling these records produces models whose output is detached from any coherent pharmacological claim. Prior work has standardized bioactivity at scale and quantified the noise from mixing measurement types, but pharmacological mechanism and assay-readout class have not been treated as a primary axis of large-scale curation. This study presents an auditable, OECD-anchored framework that stratifies public records by action type and assay readout before modeling, converting heterogeneous data into externally validated, interpretable QSAR tasks that compose with existing standardization resources rather than replacing them. The framework is demonstrated on the four opioid receptors (MOR, DOR, KOR, and nociceptin/orphanin FQ, NOP). Four public sources were reconciled into 72,148 merged records and 50,977 curated measurements spanning 19,585 compounds, each carrying auditable attributes for source agreement, endpoint meaning, pharmacology class, assay readout, and trust tier. Receptor-level binding tasks formed a compact benchmark with strong locked external performance, including KOR pK (R² = 0.79, n = 798) and DOR pK (R² = 0.77, n = 736). Pharmacology- and readout-resolved functional endpoints yielded externally validated strata that pooled labels would obscure, including a MOR antagonist functional-inhibition endpoint (R² = 0.86, n = 110) and agonist potency endpoints for DOR, KOR, and MOR (R² up to 0.81). Comparison against a fully pooled baseline shows that pooled models either match stratified models on coherent endpoints or reach a deceptively high R² on functional-IC50 endpoints by training predominantly on binding-displacement records, so the pooled number predicts affinity rather than functional activity. SHAP attribution indicates that binding and functional potency encode partially distinct structure-activity signals. The dataset contract, not model performance alone, defines the validity and scope of a QSAR claim, and stratification is a precondition for a functional model to support a defensible claim. Curation logic, derived tables, frozen data, and reproducibility artifacts are released.

20

EpidBot: A Natural Language Platform for Generalized Epidemic Intelligence

Braga, J. S.; Coelho, F. C.; Laiate, B.

2026-06-22 public and global health 10.64898/2026.06.18.26355985 medRxiv

Top 0.4%

1.7%


Public health professionals have access to more data than ever before. Yet answering a relatively simple epidemiological question often requires navigating multiple databases, formats, software tools, and reporting systems. As a result, valuable data often remain locked behind technical barriers, making it harder for public health professionals to turn information into decisions. We developed EpidBot to simplify this process. EpidBot is a platform that allows users to retrieve, analyze, visualize, model, and report epidemiological data through natural language interaction. By connecting multiple public health data sources within a single environment, the platform enables users to conduct analyses that would traditionally require several independent tools and specialized technical skills. Rather than functioning solely as a search interface, EpidBot supports complete analytical workflows. Users can explore surveillance data, compare trends across locations and time periods, generate maps and visualizations, construct epidemiological models, and produce structured technical reports while maintaining full visibility of data sources and analytical procedures. To show what this looks like in practice, we present representative use cases, including the automatic generation of a mathematical model for Ebola virus disease in the Democratic Republic of the Congo. From a single user request, EpidBot assembled evidence from published sources, generated and calibrated a compartmental transmission model, identified key transmission drivers, evaluated intervention scenarios, and produced a technical report with quantitative findings and policy-relevant recommendations. EpidBot shows how natural language interaction can reduce the technical barriers that often separate public health professionals from the analyses they need to perform. By bringing data access, analysis, modeling, visualization, and reporting into a single environment, the platform helps transform information into evidence while preserving transparency and reproducibility.