
Database

Oxford University Press (OUP)

Preprints posted in the last 90 days, ranked by how well they match Database's content profile, based on 51 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
Detecting Manuscripts Related to Computable Phenotypes Using a Transformer-based Language Model

Chae, J.; Heise, D. A.; Connatser, K.; Honerlaw, J.; Maripuri, M.; Ho, Y.-L.; Fontin, F.; Tanukonda, V.; Cho, K.

2026-03-16 bioinformatics 10.64898/2026.03.12.711165 medRxiv
Top 0.1%
32.8%

Objective: The demand for a comprehensive phenomics library, which requires identifying computable phenotype definitions and associated metadata from an ever-expanding biomedical literature, presents a significant, labor-intensive, and unscalable challenge. To address this, we introduce a transformer-based language model specifically designed for identifying biomedical texts containing computable phenotypes and pilot its use in the Centralized Interactive Phenomics Resource (CIPHER) platform. Materials and Methods: We fine-tuned a BioBERT model using a labeled dataset of 396 manuscripts. The model incorporates our novel sliding-window approach to effectively overcome token-length limitations, thereby enabling accurate classification of full-length manuscripts. For scalable deployment and continuous refinement, we developed a cohesive framework that integrates a web-based user interface, a control server, and a classification module. Results: The staged approach for model development yielded a final model with 95% accuracy. The web-based user interface was deployed on the CIPHER platform and enables user feedback for model retraining. Discussion: We developed a model and user interface which are currently in use by data curators to identify computable phenotype definitions from the literature. Conclusion: Through this system, users can submit literature, assess classification results, and provide feedback directly influencing future model training, thereby offering an efficient and adaptive solution for accelerating phenotype-driven literature curation.
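The sliding-window idea mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the window size, stride, or aggregation rule, so the 512/256 defaults, the function names `sliding_windows` and `aggregate_scores`, and the mean-probability aggregation below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def sliding_windows(token_ids: List[int], window: int = 512, stride: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping windows.

    `window` and `stride` are hypothetical defaults; the abstract does not state them.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
        start += stride
    return chunks


def aggregate_scores(window_probs: List[float]) -> float:
    """Combine per-window positive-class probabilities into one document score.

    Averaging is an assumed aggregation rule, chosen only for simplicity.
    """
    if not window_probs:
        raise ValueError("need at least one window probability")
    return sum(window_probs) / len(window_probs)
```

In a real pipeline each window would be passed to the fine-tuned classifier, and the per-window probabilities would then be combined into a single manuscript-level decision.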

2
Application of large language models to the annotation of cell lines and mouse strains in genomics data

Rogic, S.; Mancarci, B. O.; Xu, B.; Xiao, A.; Yang, C.; Pavlidis, P.

2026-03-07 bioinformatics 10.64898/2026.03.05.709906 medRxiv
Top 0.1%
28.2%

Accurate, consistent and comprehensive metadata are essential for the reuse of functional genomics data deposited in repositories such as the Gene Expression Omnibus (GEO); however, achieving this often requires careful manual curation that is time-consuming, costly and prone to errors. In this paper, we evaluate the performance of Large Language Models (LLMs), specifically OpenAI's GPT-4o, as an assistive tool for entity-to-ontology annotation of two commonly encountered descriptors in transcriptomic experiments: mouse strains and cell lines. Using over 9,000 manually curated experiments from the Gemma database and over 5,000 associated journal articles, we assess the model's ability to identify relevant free-text entries and map them to appropriate ontology terms. Using zero-shot prompting and retrieval-augmented generation (RAG) to incorporate domain-specific ontology knowledge, GPT-4o correctly annotated 77% of mouse strain and 59% of cell line experiments, and uncovered manual curation errors in Gemma for over 200 experiments. GPT-4o substantially outperformed a regular expression-based string-matching method, which correctly annotated only 6% of mouse strain experiments due to low precision. Model errors often arose from typographical mistakes or inconsistent naming in the GEO record or publication, and resembled those made by human curators. Along with annotations, our approach requests that the model output supporting context and quotes from the sources. These were typically accurate and enabled rapid curator verification. These findings suggest that LLMs are not ready to fully replace manual curators, but can already effectively support them. A human-in-the-loop workflow, in which LLM annotations are provided to human curators for validation, may improve the efficiency and quality of large-scale biomedical metadata curation.

3
Relation Extraction for Diet, Non-Communicable Disease and Biomarker Associations (RECoDe): A CoDiet study

Choi, D.; Gu, Y.; Zong, K.; Lain, A. D.; Zaikis, D.; Rowlands, T.; Rei, M.; CoDiet Consortium; Beck, T.; Posma, J. M.

2026-03-05 bioinformatics 10.64898/2026.03.03.709244 medRxiv
Top 0.1%
18.9%

Diet plays a critical role in human health, with growing evidence linking dietary habits to disease outcomes. However, extracting structured dietary knowledge from biomedical literature remains challenging due to the lack of dedicated relation extraction datasets. To address this gap, we introduce RECoDe, a novel relation extraction (RE) dataset designed specifically for diet, disease, and related biomedical entities. RECoDe captures a diverse set of relation types, including a broad spectrum of positive association patterns and explicit negative examples, with over 5,000 human-annotated instances validated by up to five independent annotators. Furthermore, we benchmark various natural language processing (NLP) RE models, including BERT-based architectures and enhanced prompting techniques with locally deployed large language models (LLMs) to improve classification performance on underrepresented relation types. The best-performing model, gpt-oss-20B, a local LLM, achieved an F1-score of 64% for multi-class classification and 92% for binary classification using a hierarchical prompting strategy with a separate reflection step built in. To demonstrate the practical utility of RECoDe, we introduce the Contextual Co-occurrence Summarisation (Co-CoS) framework, which aggregates sentence-level relation extractions into document-level summaries and further integrates evidence across multiple documents. Co-CoS produces effect estimates consistent with established dietary knowledge, demonstrating its validity as a general framework for systematic evidence synthesis. Availability: The code, models, and data will be made freely available upon acceptance.

4
User-driven development and evaluation of an agentic framework for analysis of large pathway diagrams

Corradi, M.; Djidrovski, I.; Ladeira, L.; Staumont, B.; Verhoeven, A.; Sanz Serrano, J.; Rougny, A.; Vaez, A.; Hemedan, A.; Mazein, A.; Niarakis, A.; de Carvalho e Silva, A.; Auffray, C.; Wilighagen, E.; Kuchovska, E.; Schreiber, F.; Balaur, I.; Calzone, L.; Matthews, L.; Veschini, L.; Gillespie, M. E.; Kutmon, M.; Koenig, M.; van Welzen, M.; Hiroi, N.; Lopata, O.; Klemmer, P.; Overall, R.; Hofer, T.; Satagopam, V.; Schneider, R.; Teunis, M.; Geris, L.; Ostaszewski, M.

2026-03-12 bioinformatics 10.64898/2026.03.10.710813 medRxiv
Top 0.1%
18.2%

As biomedical knowledge keeps growing, resources storing available information multiply and grow in size and complexity. Such resources can be in the format of molecular interaction maps, which represent cellular and molecular processes under normal or pathological conditions. However, these maps can be complex and hard to navigate, especially to novice users. Large Language Models (LLMs), particularly in the form of agentic frameworks, have emerged as a promising technology to support this exploration. In this article, we describe a user-driven process of prototyping, development, and user testing of Llemy, an LLM-based system for exploring these molecular interaction maps. By involving domain experts from the very first prototyping in the form of a hackathon and collecting both fine-grained and general feedback on more refined versions, we were able to evaluate the perceived utility and quality of the developed system, in particular for summarising maps and pathways, as well as prioritise the development of future features. We recommend continued user-driven development and benchmarking to keep the community engaged. This will also facilitate the transition towards open-weight LLMs to support the needs of the open research environment in an ever-changing technology landscape.

5
Pulmonary Hypertension Engine for Linked Experiments (PHELEX): a platform for the re-analysis of public transcriptomic data related to pulmonary hypertension in both animal models, and humans.

Nandani, T.; Ott, B. P.; Balaratnam, P.; Archer, S. L.; Durbin, J.; Hindmarch, C. C. T.

2026-05-01 genomics 10.64898/2026.04.28.721394 medRxiv
Top 0.1%
10.2%

Pulmonary hypertension (PH) is a vasculopathy that results in elevated mean pulmonary arterial pressures over 20 mmHg. Despite significant advances in research, PH still has a high mortality rate, and there is currently no cure for the disease. As with all biomedical fields, PH researchers have embraced the power of next generation technologies such as microarrays and RNA sequencing. Most of these data can be found on public repositories, which is usually a requirement for publication. While these repositories are rich sources of data, they require intermediate to advanced bioinformatics skills to access, download, and make these data useful. Here we present Pulmonary Hypertension Engine for Linked Experiments (PHELEX), which represents a comprehensive catalogue of all RNA sequencing data related to PH that is currently available on the Gene Expression Omnibus (GEO), hosted by the US National Center for Biotechnology Information (NCBI). We identified 2,278 bulk RNA sequencing samples from human, mouse and rat, and built a searchable tool based on the metadata that is associated with each sample. PHELEX is a functional tool that allows selected studies to be highlighted, and parsed through Confidence, an analysis tool we have created, which will model the data based on user-defined classifiers, perform differential gene expression and pathway analysis, and present these data using standard graphics, and text-file results. PHELEX also allows PH researchers to cross-cut between discrete studies, facilitating de novo understanding of these data. As a robust searchable repository of genomic data, we hope that PHELEX will accelerate PH innovation and discovery, by allowing researchers to mine existing genomic data and thus better understand the molecular signatures that underpin PH.

6
CAPRINI-M: An AI-curated Cardiac-Specific Atlas of Protein Interactions in Mice

Gjerga, E.; Wiesenbach, P.; Goerner, C.-A.; Zhang, Y.; Pelz, K.; List, M.; Dieterich, C.

2026-03-09 bioinformatics 10.64898/2026.03.06.710104 medRxiv
Top 0.1%
7.5%

Motivation: Protein-protein interactions are fundamental to cardiovascular disease biology, but the corresponding knowledge is dispersed across the literature and heterogeneous databases, making systematic curation time-consuming. Moreover, many existing PPI resources may be biased and lack detailed information on structural interaction interfaces or associated thermodynamic parameters. Results: We present CAPRINI-M (CArdiac PRotein INteractions In Mice), a web-based tool hosting an AI-curated atlas of cardiac protein interactions. We mined 9,105 cardiobiology manuscripts and used open-source LLMs (LLaMA-3.3 70B) to extract 11,189 protein-protein interactions. We then used AlphaFold3 to infer interaction interfaces, estimate thermodynamic properties related to complex stability, and predict the likelihood that each protein pair forms a complex. In our benchmarking analysis, CAPRINI-M showed stronger performance than the comparator PPI resources tested here. Predicted interaction favourability also agreed with published experimental evidence, with lower predicted Gibbs free energy associated with experimentally preferred binding partners. Overall, CAPRINI-M provides a more comprehensive, mechanistically annotated view of cardiovascular disease-relevant protein-protein interactions by integrating literature evidence with structural, interface-level, and stability-related information. Availability: The CAPRINI-M web application is available at https://shiny.dieterichlab.org/app/caprinim. The source code used in this study is linked in the manuscript's Availability section.

7
Building an Interoperable Rare Disease Multi-omic Resource: The GREGoR Data Model and Dataset

Heavner, B. D.; Wheeler, M. M.; Bengtsson, J. D.; Carvalho, C. M. B.; Cheung, W. A.; Conomos, M. P.; Delot, E. C.; DiTroia, S.; Ganesh, V. S.; Gogarten, S. M.; Grochowski, C. M.; Jhangiani, S. N.; King, C. H.; LeMaster, C.; Marvin, C. T.; Marwaha, S.; Miller, D. E.; O'Donnell-Luria, A.; Pais, L.; Patterson, K.; Qi, G.; Richardson, M.; Smail, C.; Stilp, A. M.; Tong, C. C.; Ungar, R. A.; Weisburd, B.; Bamshad, M. J.; Bernstein, J. A.; Eichler, E. E.; Gibbs, R. A.; Lupski, J. R.; May, S. J.; Montgomery, S. B.; Pastinen, T.; Posey, J.; Rehm, H. L.; Shojaie, A.; Talkowski, M. E.; Vilain, E.; Wei, C

2026-05-19 genomics 10.64898/2026.05.15.725546 medRxiv
Top 0.1%
6.7%

Rare disease research and diagnosis rely on the integration of genomic and phenotypic data generated across diverse clinical sites; however, the absence of widely adopted standards for representing genomic data and associated metadata has limited data interoperability, reuse, and cross-study analysis. The Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium was established to investigate challenging rare disease cases and evaluate emerging multi-omic technologies for clinical translation. To support coordinated data integration across distributed research sites, we developed a common Consortium Data Model in partnership with domain experts to standardize the capture of participant-, family-, phenotype- and assay-level metadata, with a particular emphasis on using a modular architecture to support linking of multiple data versions from multiple omic technologies to a single individual and attribution of a genetic finding to the specific technology used for its initial discovery. Adoption of the GREGoR Data Model has enabled continued generation and public release of a harmonized, analysis-ready Consortium Dataset. The most recent release includes phenotypic, family and multi-omic data from 12,292 participants in 5,029 families. Other rare disease data sharing efforts are beginning to adopt this data model, which will facilitate cross-consortium analyses and empower rare disease research. This work demonstrates that a collaborative, flexible, and scalable data model can enable large-scale rare disease research, facilitate cross-center data harmonization, and enable data interoperability.

8
Deterministic retrieval recovers biomedical associations lost by language models

Halder, A.; Singh, M.; Kesarwani, R.; Mathew, B.; Bhattacharya, N.; Chikhaliya, O.; Motwani, D.; Peela, S. C. M.; Samanta, S.; Muddemmanavar, P.; Farooq, M.; Ahuja, G.; Sengupta, D.

2026-04-29 bioinformatics 10.64898/2026.04.25.720782 medRxiv
Top 0.1%
6.2%

Large language model (LLM)-based retrieval systems miss biomedical associations through output truncation, synonym mismatch and run-to-run variability, but the magnitude of this loss remains unclear. We present BioChirp, an open-source framework that uses LLMs for query interpretation and candidate filtering, combining multi-source consensus entity resolution with deterministic graph-based retrieval. Across four major biomedical databases, BioChirp recovered more associations with higher reproducibility than conventional LLM-based retrieval approaches.

9
From expansion to consolidation: two decades of Gene Ontology evolution

Pitarch, B.; Pazos, F.; Chagoyen, M.

2026-03-06 bioinformatics 10.64898/2026.03.04.709507 medRxiv
Top 0.1%
4.8%

The Gene Ontology (GO) is a long-standing, community-maintained knowledge resource that underpins the functional annotation of gene products across numerous biological databases. Released regularly, GO and its associated annotations form a large, continuously evolving dataset whose temporal dynamics have direct consequences for data reuse, versioning, and reproducibility. Because analytical results derived from GO are inherently tied to specific ontology and annotation releases, a systematic understanding of how GO changes over time is essential for transparent interpretation and long-term reuse of GO-based analyses. Here, we present a comprehensive temporal characterization of the Gene Ontology and its annotations spanning 21 years of publicly available releases. Treating successive ontology and annotation versions as longitudinal research data, we quantify changes in ontology structure, term composition, relationships, and annotation content across time and across three representative annotation resources. Our analysis reveals sustained growth of GO over its lifetime, accompanied by marked structural reorganization, particularly affecting high-level, general ontology terms. Notably, across multiple structural and annotation metrics, we identify a transition toward increased stability beginning around 2017, consistent with a maturation phase of the resource. This work provides a reference framework for researchers who rely on GO releases for data integration, benchmarking, and reproducible functional analysis.

10
cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature

Campbell, J.; Lain, A. D.; Simpson, T. I.

2026-05-19 bioinformatics 10.64898/2026.05.16.725623 medRxiv
Top 0.1%
4.4%

cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It utilises programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain text files, standardising them for downstream biomedical text mining. During the retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of retrieved full-texts, we used ScispaCy to infer the similarity of paired documents from 44,264 open-access PubMed Central files and the files retrieved from cadmus, resulting in an average cosine similarity score of 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts over abstracts, resulting in better access to the depth of the biomedical information available. Availability and implementation: cadmus is a freely available package for non-commercial research at https://github.com/biomedicalinformaticsgroup/cadmus and released under the MIT License.
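The fidelity check described in this abstract averages a cosine similarity over paired documents. The sketch below shows that calculation only in outline. It swaps in plain term-count vectors instead of the ScispaCy document vectors the authors report using, and the function names `cosine_similarity` and `mean_pairwise_similarity` are hypothetical, so the scores it produces are not comparable to the reported 0.98.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using whitespace term counts.

    Term counts are a stand-in for embedding vectors (an assumption for illustration).
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    # An empty document has no direction, so treat its similarity as 0.
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average the similarity over (reference_text, retrieved_text) pairs."""
    scores = [cosine_similarity(ref, got) for ref, got in pairs]
    if not scores:
        raise ValueError("need at least one document pair")
    return sum(scores) / len(scores)
```

Each pair would hold the reference full text and the text retrieved for the same article, so a mean close to 1 indicates that retrieval preserved the content.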

11
A telomere-to-telomere (T2T) pig genome assembly reveals Y chromosome diversity and structural variations of Wuzhishan pigs

Ren, Y.; Wang, F.; Li, X.; Liu, G.; Sun, R.; Zheng, X.; Zhang, Y.; Lin, R.; Lu, X.; Chen, L.; Xin, W.; Fei, Y.; Chao, Z.

2026-04-27 genomics 10.64898/2026.04.23.720499 medRxiv
Top 0.1%
4.0%

Background: Wuzhishan (WZS) pigs are native to Hainan Province of China, and serve as both important agricultural resources and biomedical models. Although the published WZS pig genome (T2T-pig1.0) achieved telomere-to-telomere (T2T) completeness, substantial genetic diversity still exists within the same pig breed, so another WZS pig genome, named WZS-T2T, was assembled in this study. Results: Multiple types of sequencing data were used to assemble the genome, yielding a ~2.68 Gb telomere-to-telomere genome with an N50 length of ~142.87 Mb and 23,100 annotated protein-coding genes. Compared to T2T-pig1.0, WZS-T2T had higher QV and BUSCO values and a longer Y chromosome (ChrY). ChrY of the two WZS pigs shared 11 genes, including the sex differentiation-related genes SHOX, PRKX, and DDX3X, and SRY; however, the energy metabolism gene SLC25A4 and the macrophage-related receptor gene CSF2RA on ChrY were specific to WZS-T2T. An inversion SV of ~33.86 Mb on chromosome 10 was identified between the two WZS pigs, and three lines of evidence were presented to support the accuracy of the sequence orientation in WZS-T2T. The genetic diversity was consistent with LD decay speed in population different analysis. WZS pigs exhibited higher genetic diversity than the other four pig populations examined in this study (Tunchang pigs, Yuxi black pigs, Large White pigs, and Duroc pigs), and presented slower LD decay than those four breeds. Conclusions: Therefore, WZS-T2T provided a higher-quality assembly, and offers potential advantages for both agricultural production and biomedical targets for WZS pigs.

12
Clarified an rDNA Gene Unit Pattern with (CTTT)n and (CT)n Microsatellites Aggregation Ahead of and Behind the Gene in Human Genome

Shen, J.; Tang, S.; Xia, Y.; Qin, J.; Xu, H.; Tan, Z.

2026-03-24 genetics 10.64898/2026.03.22.713381 medRxiv
Top 0.2%
3.7%

Background: Conventional models of human ribosomal DNA (rDNA) array organization have historically depended on transcription-centric boundaries, partitioning the unit into a ~13 kb rDNA transcription region and a monolithic ~31 kb intergenic spacer (IGS). While our previous identification of Duplication Segment Units (DSUs) mapped these arrays based on an intuitive analysis of the microsatellite density landscape of the complete reference human genome, our present deep mining of this landscape has revealed a more accurate rDNA Gene Unit Pattern. Methods & Results: In this study, we conducted a deep mining analysis of our previously established microsatellite density landscape of the T2T-CHM13 assembly, focusing specifically on nucleolar organizing regions (NORs). We suggest a more accurate rDNA Gene Unit Pattern containing a (CTTT)n microsatellite aggregation ahead of the rDNA gene and a (CT)n microsatellite aggregation behind the gene, rather than a pattern featuring an IGS region inserted between two rDNA genes. Conclusions: A correct rDNA gene pattern of the human genome probably includes a (CTTT)n microsatellite aggregation ahead of the gene and a (CT)n microsatellite aggregation behind it, which possibly constitute cis- and trans-regulating regions; the (CTTT)n and (CT)n microsatellite aggregations may provide two different local stable DNA structures for regulatory protein binding.

13
Multi-Agent Orchestration for Knowledge Extraction and Retrieval: AI Expert System for GPCRs

Spieser, J. C.; Kogan, P.; Yang, J.; Meller, J.; Patra, K.; Shamsaei, B.

2026-04-14 bioinformatics 10.64898/2026.04.10.696782 medRxiv
Top 0.2%
3.5%

We present GPCR-Nexus, an AI-driven platform for integrated exploration of G protein-coupled receptor (GPCR) biology that unifies structured databases with unstructured scientific literature. The system combines a GPCR-ligand knowledge graph with vector-based semantic retrieval to enable comprehensive, up-to-date information access. Central to GPCR-Nexus is a multi-agent architecture in which specialized components coordinate query planning, evidence retrieval, validation, and synthesis. This design ensures that generated responses are grounded in verifiable sources while maintaining coherence across heterogeneous data modalities. By jointly leveraging curated databases and primary literature, GPCR-Nexus enables context-aware reasoning over molecular interactions, functional mechanisms, and disease associations. The platform produces citation-backed outputs with traceable evidence, addressing limitations of conventional database queries and standalone language models. We detail the system architecture, data integration strategy, and agent orchestration framework, and demonstrate its utility through representative query scenarios. GPCR-Nexus provides a scalable approach to combining structured and unstructured biomedical knowledge using agent-based AI, offering improved accuracy, interpretability, and coverage. This work establishes a foundation for trustworthy, AI-assisted knowledge synthesis in GPCR research and drug discovery.

14
Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools

Muneeb, M.; Ascher, D.; Myung, Y.; Feng, S.; Henschel, A.

2026-03-09 bioinformatics 10.64898/2026.03.06.710126 medRxiv
Top 0.2%
3.5%

Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and in precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype is passed to PLINK for quality control, after which it is transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality control measures for the test data and the genome-wide association study summary statistics file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data was passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk score tools with 675 different clumping and pruning parameters. Machine learning performed best for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. The results give us valuable insights into which techniques tend to perform better for certain phenotypes compared to more traditional polygenic risk score tools.

15
TFBSpedia: a comprehensive human and mouse transcription factor binding sites database

Li, S.; Chou, E.; Wang, K.; Boyle, A. P.; Sartor, M. A.

2026-03-06 bioinformatics 10.64898/2026.03.04.709638 medRxiv
Top 0.2%
3.5%

Mapping the genomic locations and patterns of transcription factor binding sites (TFBS) is essential for understanding gene regulation and advancing treatments for diseases driven by DNA modifications, including epigenetic changes and sequence variants. Although several TFBS databases exist, no study has systematically benchmarked these databases across different sequencing technologies and computational algorithms. In this study, we addressed this gap by constructing a TFBS database that integrates all available ENCODE cell line ATAC-seq and Cistrome Data Browser ChIP-seq datasets, comprising 11.3 million human and 1.87 million mouse TFBS. We also integrated previously published TFBS resources (Factorbook, Unibind, RegulomeDB, and ENCODE_footprint) and found each contains a substantial fraction of unique TFBS predictions, highlighting significant discrepancies among existing resources. To assess the accuracy of the combined TFBS regions, we assembled ten independent genomic annotation datasets for evaluation and found that TFBS regions predicted by multiple databases are more likely to represent true and biologically meaningful binding sites. For each predicted TFBS region, we define two scores: the confidence score reflects prediction reliability, while the importance score represents biological functional relevance. Finally, we introduce TFBSpedia, a lightweight and efficient search engine that enables rapid retrieval of TFBS regions and comprehensive annotation information across the integrated databases.

16
Identifying genes associated with phenotypes using machine and deep learning

Muneeb, M.; Ascher, D.

2026-03-07 bioinformatics 10.64898/2026.03.05.709665 medRxiv
Top 0.2%
3.5%

Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.

17
MTB-KB: A Curated Knowledgebase of Mycobacterium tuberculosis Related Studies

Li, P.; Li, C.; Zhu, R.; Sun, W.; Zhou, H.; Fan, Z.; Yue, L.; Zhang, S.; Jiang, X.; Luo, Q.; Han, J.; Huang, H.; Shen, A.; Bahetibieke, T.; Wang, J.; Zhang, W.; Wen, H.; Niu, H.; Bu, C.; Zhang, Z.; Xiao, J.; Gao, R.; Chen, F.

2026-04-10 bioinformatics 10.64898/2026.04.07.716833 medRxiv
Top 0.2%
3.1%

Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), has regained its position as the world's leading killer among infectious diseases. Despite extensive research progress across epidemiology, diagnosis, drug development, treatment regimens, vaccines, drug resistance, virulence factors, and immune mechanisms, MTB-related knowledge remains fragmented across thousands of publications, limiting its effective use. To address this gap, we present MTB-KB, a literature-curated knowledgebase that systematically integrates high-impact findings from eight major sections of TB research. The current release contains 75,170 associations from 1,246 publications, covering 18,439 entities standardized using authoritative databases and WHO-endorsed classifications. A central feature is the interactive knowledge graph, which links cross-section associations to reveal and infer MTB-host interactions, treatment strategies, and vaccine development opportunities. MTB-KB also provides a user-friendly interface with browsing, advanced search, and statistical visualization. Overall, by consolidating dispersed MTB knowledge into a structured and accessible platform, MTB-KB provides a valuable resource for researchers, clinicians, and policymakers, supporting both basic and clinical TB research, enabling evidence-based TB prevention, diagnosis, and treatment, and contributing to global elimination efforts. MTB-KB is accessible at https://ngdc.cncb.ac.cn/mtbkb/.

18
nSIGHT™: A Data Discovery Platform for Visualization, Integration and Retrospective Analysis of Multimodal Clinical Research Data

Zia, M. K.; Plessinger, B.; Eng, K. H.; Flierl, A.; Wilbert, M.; Jans, K.; Whalen, P.; Mullin, S.; Ohm, J.; Singh, A. K.; Farrugia, M.; Morrison, C.; Darlak, C. J.; Seshadri, M.

2026-03-11 oncology 10.64898/2026.03.10.26347202 medRxiv
Top 0.2%
2.8%

The lack of interoperability among clinical and research data systems poses a significant barrier to cancer researchers interested in evaluating novel mechanistic hypotheses or translating innovative treatment strategies from the laboratory to the clinic. To address this gap in knowledge, we developed an innovative, web-based, data discovery, visualization and analysis tool (nSight) that allows researchers to quickly and easily query clinical/research data and construct de-identified cancer cohorts. Guiding principles for development of the tool were focused on ease of use, intuitiveness, self-service, and presentation of structured but de-identified data to the end user. nSight provides users with information on patient demographics, disease histology, diagnostic procedures and therapeutic interventions, timeline of disease progression/recurrence, along with available molecular profiling/sequencing data and indicators of participation in epidemiologic or lifestyle studies for specific cancer patient cohorts. The platform also allows users to obtain summary statistics based on demographic, histologic and clinical factors as well as perform basic survival analysis using Kaplan-Meier curves between specific patient cohorts. nSight is an intuitive, user-friendly tool that enables visualization, integration and analysis of multimodal clinical and research data without placing high technical demands or time constraints on researchers. The platform is designed for research feasibility assessment, cohort development, and retrospective data discovery, which in turn should help investigators identify potential study populations and explore novel hypotheses.

19
Input design for unsupervised cross-national branded food database alignment using large language models

Nakagawa, S.; Yamamoto, A.

2026-05-25 nutrition 10.64898/2026.05.23.26353945 medRxiv
Top 0.2%
2.8%

Cross-national alignment of branded food databases is essential for international nutritional epidemiology but lacks standardized methods. Existing approaches, including food ontologies, domain-specific fine-tuned language models, and manual expert mapping, either require substantial infrastructure or do not scale to thousands of items. We propose an unsupervised evaluation framework for large language model (LLM)-based food database alignment that requires no ground-truth labels. Using the Japan Branded Food Database (JBFD; 9,519 items, 71 mid-level categories) and USDA FoodData Central (448 categories) as a case study, we introduce two complementary metrics: weighted centroid distance (nutritional proximity between matched category pairs) and dominant category share (structural consistency of category-level assignments). We then conducted a systematic ablation study across eight input conditions (A-H), varying combinations of product name, nutrient profile, and semantic category label. Results showed that nutrient-only inputs yielded poor structural consistency despite low centroid distances, while semantic category labels achieved the highest dominant category share (89.3%) but introduced circularity due to their LLM-derived origin. Among circularity-free conditions, product name combined with minimal nutrient information (energy, protein, salt; condition E) achieved the best balance of centroid distance (0.471) and dominant category share (65.8%). Model comparison across Claude Haiku, Sonnet, and Opus confirmed that NO_MATCH rates were consistent across model sizes (12-14%), suggesting that prompt design contributes more to alignment quality than model scale. These findings provide practical guidance for input design in LLM-based food database alignment without ground-truth annotation.

20
Evo 2 Predicts Cardiomyopathy-Associated Variants and Elucidates Their Underlying Mechanisms

Kurozumi, A.; Otsuka, N.; Masamichi, I.; Kawakami, T.; Isagawa, T.; Kodera, S.; Takeda, N.

2026-05-17 genomics 10.64898/2026.05.15.725304 medRxiv
Top 0.2%
2.7%

Background: Although advances in next-generation sequencing have accelerated the identification of genetic variants in cardiomyopathy, interpreting variants of uncertain significance (VUS) remains a clinical challenge. Evo 2 is a high-resolution genomic artificial intelligence model capable of predicting pathogenicity across large sequence contexts and enabling mechanistic interpretation; however, its application in cardiovascular genetics is limited. Here, we evaluated the utility of Evo 2 for assessing the pathogenicity and underlying mechanisms of cardiomyopathy-associated variants. Methods: We used Evo 2 to predict the pathogenicity of single-nucleotide variants in cardiomyopathy-related genes listed on ClinVar. We assessed the ability of the model to identify characteristic structural features in both coding and noncoding regions using internal representations such as embeddings, and to infer the molecular mechanisms of variants within these regions. Results: Evo 2 demonstrated high predictive accuracy for pathogenicity, achieving an AUROC of 0.983 and an AUPRC of 0.915. Notably, sparse autoencoders (SAEs) from embeddings identified features corresponding to higher-order structural features, including coiled-coil and actin-binding domains characteristic of cardiomyopathy-related proteins, and accurately detected mutations known to disrupt these domains. The model recognized the binding motif of the cardiac-enriched transcription factor TBX5 with SAEs and accurately predicted a single-nucleotide polymorphism affecting TBX5 binding affinity after supervised fine-tuning. Conclusions: Evo 2 demonstrated strong performance for both predicting pathogenicity and extracting biological features of cardiomyopathy-associated variants. It may represent a powerful emerging tool for evaluating VUS in cardiovascular medicine.