Database
Oxford University Press (OUP)
Preprints posted in the last 90 days, ranked by how well they match Database's content profile, based on 51 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Chae, J.; Heise, D. A.; Connatser, K.; Honerlaw, J.; Maripuri, M.; Ho, Y.-L.; Fontin, F.; Tanukonda, V.; Cho, K.
Objective: The demand for a comprehensive phenomics library, which requires identifying computable phenotype definitions and associated metadata from an ever-expanding biomedical literature, presents a significant, labor-intensive, and hard-to-scale challenge. To address this, we introduce a transformer-based language model specifically designed to identify biomedical texts containing computable phenotypes and pilot its use in the Centralized Interactive Phenomics Resource (CIPHER) platform. Materials and Methods: We fine-tuned a BioBERT model using a labeled dataset of 396 manuscripts. The model incorporates our novel sliding-window approach to overcome token-length limitations, thereby enabling accurate classification of full-length manuscripts. For scalable deployment and continuous refinement, we developed a cohesive framework that integrates a web-based user interface, a control server, and a classification module. Results: The staged approach to model development yielded a final model with 95% accuracy. The web-based user interface was deployed on the CIPHER platform and enables user feedback for model retraining. Discussion: We developed a model and user interface that are currently in use by data curators to identify computable phenotype definitions from the literature. Conclusion: Through this system, users can submit literature, assess classification results, and provide feedback that directly influences future model training, offering an efficient and adaptive solution for accelerating phenotype-driven literature curation.
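The sliding-window idea in this abstract can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the window/stride sizes and the max-score aggregation rule are assumptions.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split a token sequence into overlapping windows so that a
    length-limited classifier (e.g. a BERT-style model) can still
    score a full-length manuscript."""
    if len(tokens) <= window:
        return [tokens]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks

def classify_document(tokens, score_chunk, threshold=0.5):
    """Aggregate per-window scores into one document-level label; here we
    take the maximum, so one strongly positive window flags the whole
    manuscript (the aggregation rule is an assumption)."""
    scores = [score_chunk(c) for c in sliding_windows(tokens)]
    return max(scores) >= threshold, max(scores)
```

With a 1,000-token document and the defaults above, three overlapping windows cover the full text, so a positive signal near the end of the manuscript is not lost to truncation.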
Rogic, S.; Mancarci, B. O.; Xu, B.; Xiao, A.; Yang, C.; Pavlidis, P.
Accurate, consistent, and comprehensive metadata are essential for the reuse of functional genomics data deposited in repositories such as the Gene Expression Omnibus (GEO); however, achieving this often requires careful manual curation that is time-consuming, costly, and prone to errors. In this paper, we evaluate the performance of Large Language Models (LLMs), specifically OpenAI's GPT-4o, as an assistive tool for entity-to-ontology annotation of two commonly encountered descriptors in transcriptomic experiments: mouse strains and cell lines. Using over 9,000 manually curated experiments from the Gemma database and over 5,000 associated journal articles, we assess the model's ability to identify relevant free-text entries and map them to appropriate ontology terms. Using zero-shot prompting and retrieval-augmented generation (RAG) to incorporate domain-specific ontology knowledge, GPT-4o correctly annotated 77% of mouse strain and 59% of cell line experiments, and uncovered manual curation errors in Gemma for over 200 experiments. GPT-4o substantially outperformed a regular-expression-based string-matching method, which correctly annotated only 6% of mouse strain experiments due to low precision. Model errors often arose from typographical mistakes or inconsistent naming in the GEO record or publication, and resembled those made by human curators. Along with annotations, our approach requests that the model output supporting context and quotes from the sources. These were typically accurate and enabled rapid curator verification. These findings suggest that LLMs are not ready to fully replace manual curators, but can already effectively support them. A human-in-the-loop workflow, in which LLM annotations are provided to human curators for validation, may improve the efficiency and quality of large-scale biomedical metadata curation.
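The typographical and naming inconsistencies this abstract describes are exactly what a normalization layer in front of an ontology lookup addresses. The sketch below is illustrative only; the term IDs, labels, and synonym table are made up and do not come from any real ontology release.

```python
import re

# Hypothetical fragment of a strain vocabulary: term ID -> label and synonyms.
ONTOLOGY = {
    "TERM:0001": {"label": "C57BL/6J", "synonyms": {"B6", "C57BL6J", "C57BL/6 J"}},
    "TERM:0002": {"label": "BALB/c", "synonyms": {"BALBC", "BALB/cJ"}},
}

def normalize(name):
    """Collapse case, whitespace, slashes, and hyphens so near-miss
    spellings (a common source of curation errors) compare equal."""
    return re.sub(r"[\s/\-]", "", name).upper()

def map_to_ontology(free_text):
    """Return the term ID whose label or synonym matches the normalized
    free-text entry, or None when nothing matches."""
    key = normalize(free_text)
    for term_id, entry in ONTOLOGY.items():
        candidates = {entry["label"], *entry["synonyms"]}
        if key in {normalize(c) for c in candidates}:
            return term_id
    return None
```

A plain regex matcher comparing raw strings would miss `c57bl/6j` against `C57BL/6J`; normalization plus a synonym table recovers such cases, while genuinely unknown entries still fall through to a human curator.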
Choi, D.; Gu, Y.; Zong, K.; Lain, A. D.; Zaikis, D.; Rowlands, T.; Rei, M.; CoDiet Consortium; Beck, T.; Posma, J. M.
Diet plays a critical role in human health, with growing evidence linking dietary habits to disease outcomes. However, extracting structured dietary knowledge from biomedical literature remains challenging due to the lack of dedicated relation extraction datasets. To address this gap, we introduce RECoDe, a novel relation extraction (RE) dataset designed specifically for diet, disease, and related biomedical entities. RECoDe captures a diverse set of relation types, including a broad spectrum of positive association patterns and explicit negative examples, with over 5,000 human-annotated instances validated by up to five independent annotators. Furthermore, we benchmark various natural language processing (NLP) RE models, including BERT-based architectures and enhanced prompting techniques with locally deployed large language models (LLMs), to improve classification performance on underrepresented relation types. The best-performing model, gpt-oss-20B, a local LLM, achieved an F1-score of 64% for multi-class classification and 92% for binary classification using a hierarchical prompting strategy with a separate reflection step built in. To demonstrate the practical utility of RECoDe, we introduce the Contextual Co-occurrence Summarisation (Co-CoS) framework, which aggregates sentence-level relation extractions into document-level summaries and further integrates evidence across multiple documents. Co-CoS produces effect estimates consistent with established dietary knowledge, demonstrating its validity as a general framework for systematic evidence synthesis. Availability: The code, models, and data will be made freely available upon acceptance.
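The hierarchical prompting strategy with a reflection step can be pictured as a small control flow, independent of any specific model. The sketch below is a schematic reconstruction: the three callables stand in for LLM prompts, and the 0.6 confidence cutoff for triggering reflection is an invented parameter.

```python
def hierarchical_classify(sentence, binary_model, multiclass_model, reflect):
    """Two-stage relation classification: first decide whether any
    relation is present at all (binary), then assign the relation type
    (multi-class), with a reflection pass to double-check low-confidence
    labels. Each callable stands in for one LLM prompt."""
    if not binary_model(sentence):
        return "no_relation"
    label, confidence = multiclass_model(sentence)
    if confidence < 0.6:  # low confidence triggers the reflection step
        label = reflect(sentence, label)
    return label
```

Splitting the task this way lets the cheap binary stage filter out most sentences, so the harder multi-class decision, and the even costlier reflection prompt, only run on candidates that plausibly contain a relation.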
Corradi, M.; Djidrovski, I.; Ladeira, L.; Staumont, B.; Verhoeven, A.; Sanz Serrano, J.; Rougny, A.; Vaez, A.; Hemedan, A.; Mazein, A.; Niarakis, A.; de Carvalho e Silva, A.; Auffray, C.; Wilighagen, E.; Kuchovska, E.; Schreiber, F.; Balaur, I.; Calzone, L.; Matthews, L.; Veschini, L.; Gillespie, M. E.; Kutmon, M.; Koenig, M.; van Welzen, M.; Hiroi, N.; Lopata, O.; Klemmer, P.; Overall, R.; Hofer, T.; Satagopam, V.; Schneider, R.; Teunis, M.; Geris, L.; Ostaszewski, M.
As biomedical knowledge keeps growing, the resources storing available information multiply and grow in size and complexity. Such resources can take the form of molecular interaction maps, which represent cellular and molecular processes under normal or pathological conditions. However, these maps can be complex and hard to navigate, especially for novice users. Large Language Models (LLMs), particularly in the form of agentic frameworks, have emerged as a promising technology to support this exploration. In this article, we describe a user-driven process of prototyping, development, and user testing of Llemy, an LLM-based system for exploring these molecular interaction maps. By involving domain experts from the very first prototype, in the form of a hackathon, and collecting both fine-grained and general feedback on more refined versions, we were able to evaluate the perceived utility and quality of the developed system, in particular for summarising maps and pathways, as well as to prioritise the development of future features. We recommend continued user-driven development and benchmarking to keep the community engaged. This will also facilitate the transition towards open-weight LLMs to support the needs of the open research environment in an ever-changing technology landscape.
Schimmelpfennig, L. E.; Cannon, M.; Cody, Q.; McMichael, J.; Coffman, A.; Kiwala, S.; Krysiak, K. J.; Wagner, A. H.; Griffith, M.; Griffith, O. L.
Summary: The druggable genome encompasses the genes that are known or predicted to interact with drugs. The Drug-Gene Interaction Database (DGIdb) provides an integrated resource for discovering and contextualizing these interactions, supporting a broad range of research and clinical applications. DGIdb is currently accessed through structured web interfaces and API calls, requiring users to translate natural-language questions into database-specific query patterns. To allow for the use of DGIdb through natural language, we developed the DGIdb Model Context Protocol (MCP) server, which gives large language models (LLMs) access to up-to-date information through the DGIdb API. We demonstrate that the MCP server greatly enhances an LLM's ability to answer questions requiring accurate, up-to-date biomedical knowledge drawn from structured external resources. Availability and implementation: The DGIdb MCP server is detailed at https://github.com/griffithlab/dgidb-mcp-server and includes instructions for accessing the server through the Claude desktop app.
Cannon, M. J.; Bratulin, A.; Stevenson, J. S.; Perry, K.; Coffman, A.; Kiwala, S.; Schimmelpfennig, L.; Costello, H.; McMichael, J. F.; Griffith, M.; Griffith, O. L.; Wagner, A. H.
IMPORTANCE: The Drug-Gene Interaction Database (DGIdb) has a long history of driving hypothesis generation for biomedical research through the careful curation of drug-gene interaction data from primary and secondary sources with supporting literature. Recent advances in large language model (LLM) and artificial intelligence (AI) technologies have enabled new paradigms for knowledge extraction and biocuration. The accelerating growth of biomedical literature presents a significant challenge for maintaining up-to-date interaction data. With more than 38 million citations indexed in PubMed alone, new strategies must evolve to identify and incorporate new interaction data into DGIdb. OBJECTIVE: Identify new cost-effective AI curation strategies for incorporating new drug-gene interactions into DGIdb. METHODS: We present a methodology that leverages deterministic natural language processing techniques, existing harmonization frameworks, and AI-assisted curation to systematically narrow the literature space and identify new drug-gene interactions from published studies for inclusion in DGIdb. RESULTS: We demonstrate the use of lemmatization to prioritize a set of 100 abstracts dense in interaction-related words for downstream AI curation. From this set of abstracts, we identified 137 drug-gene interactions via an AI curation task, with 121 (88.3%) of these interactions being completely novel to DGIdb. A human expert reviewed this interaction set and validated 134 of 137 (97.8%) interactions based on the text provided. CONCLUSION: Taken together, our results highlight a promising, cost-effective method of ingesting new interactions into DGIdb.
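The lemmatization-based prioritization step can be illustrated with a toy ranking over abstracts. The suffix-stripping "lemmatizer" and interaction-word list below are deliberately crude stand-ins; the actual pipeline would use a proper NLP lemmatizer and a curated vocabulary.

```python
def crude_lemma(word):
    """A deliberately simple suffix stripper; a real pipeline would use a
    proper lemmatizer. Only 'ing', 'ed', and 's' endings are handled."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

# Illustrative interaction vocabulary, reduced to lemma form.
INTERACTION_LEMMAS = {"inhibit", "activate", "bind", "target"}

def rank_abstracts(abstracts, top_n=3):
    """Rank abstracts by how many interaction-lemma hits they contain,
    mimicking the idea of prioritizing literature for AI curation."""
    def score(text):
        return sum(1 for tok in text.replace(",", " ").split()
                   if crude_lemma(tok) in INTERACTION_LEMMAS)
    return sorted(abstracts, key=score, reverse=True)[:top_n]
```

Lemmatizing first means `inhibits`, `binding`, and `targeted` all count toward the same vocabulary entries, so the ranking is not fooled by surface-form variation.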
Gjerga, E.; Wiesenbach, P.; Goerner, C.-A.; Zhang, Y.; Pelz, K.; List, M.; Dieterich, C.
Motivation: Protein-protein interactions are fundamental to cardiovascular disease biology, but the corresponding knowledge is dispersed across the literature and heterogeneous databases, making systematic curation time-consuming. Moreover, many existing PPI resources may be biased and lack detailed information on structural interaction interfaces or associated thermodynamic parameters. Results: We present CAPRINI-M (CArdiac PRotein INteractions In Mice), a web-based tool hosting an AI-curated atlas of cardiac protein interactions. We mined 9,105 cardiobiology manuscripts and used open-source LLMs (LLaMA-3.3 70B) to extract 11,189 protein-protein interactions. We then used AlphaFold3 to infer interaction interfaces, estimate thermodynamic properties related to complex stability, and predict the likelihood that each protein pair forms a complex. In our benchmarking analysis, CAPRINI-M showed stronger performance than the comparator PPI resources tested here. Predicted interaction favourability also agreed with published experimental evidence, with lower predicted Gibbs free energy associated with experimentally preferred binding partners. Overall, CAPRINI-M provides a more comprehensive, mechanistically annotated view of cardiovascular disease-relevant protein-protein interactions by integrating literature evidence with structural, interface-level, and stability-related information. Availability: The CAPRINI-M web application is available at https://shiny.dieterichlab.org/app/caprinim. The source code used in this study is linked in the manuscript's Availability section.
Benetti, E.; Scicolone, G.; Tajwar, M.; Masciullo, C.; Bucci, G.; Riba, M.
The OMOP Common Data Model (OMOP CDM), in which observational health data are organized and stored, is a broadly accepted data standard that supports clinical research by facilitating federated study protocols. For cancer studies, there is a growing need to incorporate cancer genomics data in a standardized way. Starting from a brief overview of the basic features of the OMOP CDM, we imagine a path of increasing complexity for including known biomarker genomic data coming from pathology reports or clinical laboratory findings, towards storing thousands of known and unknown variants coming from genome sequencing data. Data should be stored using standardized identifiers, including those defined by the Global Alliance for Genomics and Health (GA4GH). We propose a scalable strategy for storing genomic variants in increasingly complex scenarios and present KOIOS-VRS, a pipeline that automates the conversion of VCF files into OMOP-compatible format.
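The first step of any VCF-to-CDM conversion is splitting each VCF data line into per-allele variant records with a stable identifier. The sketch below is a minimal illustration of that step only; the `chrom:pos:ref>alt` identifier format is a placeholder, not the GA4GH VRS digest that the actual pipeline would compute.

```python
def vcf_line_to_records(line):
    """Parse one tab-separated VCF data line into one record per ALT
    allele, each with a simple normalized identifier string
    (illustrative format, not a real GA4GH VRS identifier)."""
    chrom, pos, _vcf_id, ref, alts = line.rstrip("\n").split("\t")[:5]
    return [
        {
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "variant_id": f"{chrom}:{pos}:{ref}>{alt}",
        }
        for alt in alts.split(",")
    ]
```

Multi-allelic sites (comma-separated ALT values) become separate records, which matches the one-row-per-variant shape that relational CDM tables expect.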
Pitarch, B.; Pazos, F.; Chagoyen, M.
The Gene Ontology (GO) is a long-standing, community-maintained knowledge resource that underpins the functional annotation of gene products across numerous biological databases. Released regularly, GO and its associated annotations form a large, continuously evolving dataset whose temporal dynamics have direct consequences for data reuse, versioning, and reproducibility. Because analytical results derived from GO are inherently tied to specific ontology and annotation releases, a systematic understanding of how GO changes over time is essential for transparent interpretation and long-term reuse of GO-based analyses. Here, we present a comprehensive temporal characterization of the Gene Ontology and its annotations spanning 21 years of publicly available releases. Treating successive ontology and annotation versions as longitudinal research data, we quantify changes in ontology structure, term composition, relationships, and annotation content across time and across three representative annotation resources. Our analysis reveals sustained growth of GO over its lifetime, accompanied by marked structural reorganization, particularly affecting high-level, general ontology terms. Notably, across multiple structural and annotation metrics, we identify a transition toward increased stability beginning around 2017, consistent with a maturation phase of the resource. This work provides a reference framework for researchers who rely on GO releases for data integration, benchmarking, and reproducible functional analysis.
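The "transition toward increased stability" described above can be operationalized as finding the first release from which year-over-year change stays below some tolerance. The sketch below is a schematic of that idea with invented numbers, not the authors' metrics or data.

```python
def first_stable_year(term_counts, threshold=0.02):
    """Given {year: term_count} for successive releases, return the first
    year from which the relative year-over-year change stays below the
    threshold -- a crude proxy for the 'maturation' transition."""
    years = sorted(term_counts)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        changes[curr] = abs(term_counts[curr] - term_counts[prev]) / term_counts[prev]
    # Scan for the earliest year whose change, and all later changes, pass.
    for i, year in enumerate(years[1:]):
        if all(changes[y] < threshold for y in years[i + 1:]):
            return year
    return None
```

Requiring *all* subsequent changes to stay under the threshold (rather than just one quiet year) distinguishes a genuine maturation phase from a single slow release.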
Liu, H.; Shi, K.; Li, A.; Li, X.; Chu, J.; Xue, Y.; Cen, S.; Wang, Y.; Zhang, T.
Objective: To address the inefficiency, subjectivity, and high expertise barrier of traditional epidemiological causal inference, this study designed, developed, and validated an AI-powered agent (EpiCausalX Agent) to automate the end-to-end workflow. It integrates cross-database literature retrieval, intelligent causal reasoning, and Directed Acyclic Graph (DAG) visualization to provide a reliable, accessible tool for researchers. Materials and Methods: Built on the LangChain 1.0 framework with a layered design (Agent/Tool/Storage/Utility Layers), the agent uses the DeepSeek V3.2 LLM and the ReAct paradigm for dynamic task orchestration. Four specialized tools were integrated: multi-database retrieval across 7 databases, causal inference based on Hill's criteria and DAG logic, automated DAG drawing using NetworkX and Matplotlib, and clinical standard query. Performance was validated via unit tests, workflow verification, and usability testing. Results: The agent achieved full-process automation. It efficiently retrieves and synthesizes literature, automatically identifies confounders and mediators, and generates standardized interactive DAGs. It produces evidence-based, traceable conclusions aligned with established epidemiological knowledge. Its user-friendly natural language interface enables seamless use by non-technical researchers, allowing them to initiate tasks quickly without operational confusion. The agent is publicly available as a WeChat Mini Program for easy access. Conclusion: EpiCausalX Agent advances intelligent, automated epidemiological research. By integrating domain expertise with AI agent technology, it overcomes limitations of manual methods and general LLMs to provide a specialized, verifiable, efficient solution. It has broad applications in observational research, clinical study design, and education, enhancing productivity and lowering barriers to rigorous causal analysis.
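Before a causal graph extracted from literature can be drawn or reasoned over, it must actually be acyclic. The dependency-free sketch below shows the standard check (Kahn's topological sort) such a tool needs; it is a generic illustration, not the agent's NetworkX-based implementation.

```python
from collections import defaultdict, deque

def is_dag(edges):
    """Kahn's algorithm: a causal graph is a valid DAG iff a topological
    order consumes every node. Edges are (cause, effect) pairs."""
    indegree = defaultdict(int)
    adjacency = defaultdict(list)
    nodes = set()
    for u, v in edges:
        adjacency[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    queue = deque(n for n in nodes if indegree[n] == 0)
    seen = 0
    while queue:
        node = queue.popleft()
        seen += 1
        for nxt in adjacency[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return seen == len(nodes)  # a leftover node implies a cycle
```

An LLM proposing edges one at a time can produce accidental cycles (A causes B, B causes A via a paraphrase), so validating the edge set before rendering the DAG is a cheap safeguard.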
Zandigohar, M.; Dai, Y.
Motivation: Prioritization of transcription factor (TF)-target relationships predicted by computational models for experimental validation often requires biologists to manually inspect heterogeneous and context-dependent evidence scattered across the biomedical literature. Large Language Models (LLMs) offer a promising solution to streamline this task. However, their reliance on general-purpose knowledge may lead to hallucinations and inaccurate interpretations. Results: We present RAGulate, a retrieval-augmented generation (RAG) framework for literature-grounded assessment of transcriptional regulation. RAGulate leverages CollecTRI, an external regulatory knowledge base, and integrates alias-aware query expansion, sparse and dense retrieval, maximum-marginal-relevance re-ranking, and LLM-based classification of predictions within a modular pipeline. Using a balanced TF-target-context benchmark from the same resource, we evaluate retrieval, classification, and evidence faithfulness. While CollecTRI provides TF-target links with supporting PubMed Identifiers (PMIDs), RAGulate infers the context of each interaction from the retrieved literature. Results show that alias normalization markedly improves retrieval recall, while hybrid retrieval, which merges lexical and embedding-based candidates, achieves the highest evidence recovery across all cut-offs. Conditioning LLMs on retrieved documents consistently improves AUROC and AUPR for classifying whether a TF-target interaction is supported in the specified context, compared with direct prompting. RAGulate reduces hallucinations and improves PMID-level citation correctness, producing explanations that faithfully reflect the supporting literature. RAGulate represents a knowledge-based AI tool that partners with biologists to accelerate the process of TF-target prioritization for experimental validation and foster hypothesis generation.
Availability and implementation: The software and tutorials are available at github.com/YDaiLab/RAGulate.
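The maximum-marginal-relevance re-ranking step mentioned in this abstract is a standard algorithm, sketched generically below over plain vectors. This is not RAGulate's code; the trade-off weight `lam` and the toy embeddings are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_rerank(query_vec, doc_vecs, k=2, lam=0.7):
    """Maximal marginal relevance: greedily pick documents similar to the
    query (weight lam) but dissimilar to documents already selected
    (weight 1 - lam). Returns the indices of the k selected documents."""
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a low `lam`, a near-duplicate of an already-selected abstract is skipped in favour of a less similar but novel one; with `lam=1.0` the ranking degenerates to pure relevance.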
Tao, L.; Shi, S.; Zhu, R.; Liu, Z.; Yang, B.; Liu, L.; Chen, W.; Long, Q.; Jiao, N.; Zhang, G.; Xu, P.; Wu, D.
IBDkb (Inflammatory Bowel Disease Knowledge Base; https://www.biosino.org/ibdkb) is a freely accessible, integrated web-based platform that systematically curates and harmonizes multi-source data related to inflammatory bowel disease (IBD). To address the fragmentation and therapeutic gaps in existing specialized resources, IBDkb establishes a unified framework featuring advanced full-text search, interactive visualizations, cross-module knowledge graphs, and AI-powered utilities for real-time literature retrieval, trend analysis, text/PDF interpretation, and domain-specific conversational assistance. The platform currently integrates 98,453 research articles, 3,390 clinical trials, 200 investigational drugs, 200,606 bioactive compounds, 103 therapeutic targets, 77 experimental models, 12 pathogenesis summaries, and 15 treatment strategies. These integrated tools facilitate efficient exploration of complex associations among drugs, targets, trials, and mechanisms, thereby accelerating hypothesis generation and translational research in IBD. The platform is openly available without registration and supports data downloads. A case study on structure-aware drug comparison further demonstrates its utility in facilitating cross-disease drug repositioning hypotheses.
Shen, J.; Tang, S.; Xia, Y.; Qin, J.; Xu, H.; Tan, Z.
Background: Conventional models of human ribosomal DNA (rDNA) array organization have historically depended on transcription-centric boundaries, partitioning the unit into a ~13 kb rDNA transcription region and a monolithic ~31 kb intergenic spacer (IGS). While our previous identification of Duplication Segment Units (DSUs) mapped these arrays based on an intuitive analysis of the microsatellite density landscape of the complete reference human genome, our present deep mining of this landscape has revealed a more accurate rDNA Gene Unit Pattern. Methods & Results: In this study, we conducted a deep mining analysis of our previously established microsatellite density landscape of the T2T-CHM13 assembly, focusing specifically on nucleolar organizing regions (NORs). We suggest a more accurate rDNA Gene Unit Pattern containing a (CTTT)n microsatellite aggregation ahead of the rDNA gene and a (CT)n microsatellite aggregation behind the gene, rather than a pattern featuring an IGS region inserted between two rDNA genes. Conclusions: A correct rDNA gene pattern of the human genome probably includes a (CTTT)n microsatellite aggregation ahead of the gene and a (CT)n microsatellite aggregation behind it, which possibly constitute cis- and trans-regulating regions; the (CTTT)n and (CT)n microsatellite aggregations may provide two different local stable DNA structures for regulatory protein binding.
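Locating (CTTT)n and (CT)n aggregations in a sequence reduces to finding tandem repeats of a motif. The snippet below is a minimal, generic repeat finder over a toy sequence, not the density-landscape analysis used in the study.

```python
import re

def find_repeat_runs(sequence, motif, min_units=3):
    """Locate tandem (motif)n runs -- e.g. (CTTT)n or (CT)n -- and return
    (start, end, unit_count) for each run of at least min_units copies."""
    pattern = re.compile(f"(?:{motif}){{{min_units},}}")
    return [(m.start(), m.end(), (m.end() - m.start()) // len(motif))
            for m in pattern.finditer(sequence)]
```

Note that a (CTTT)n run contributes no hits to a (CT)n search with a consecutive-unit requirement, since each `CT` is followed by `TT`; the two aggregation types are therefore cleanly separable.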
Muneeb, M.; Ascher, D.; Myung, Y.; Feng, S.; Henschel, A.
Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and in precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype are passed to PLINK for quality control, after which they are transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality-controlled test data and the genome-wide association study summary statistics file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data were passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk score tools with 675 different clumping and pruning parameter combinations. Machine learning outperformed polygenic risk score tools for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. These results give us valuable insights into which techniques tend to perform better for certain phenotypes compared with more traditional polygenic risk score tools.
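The two core operations this abstract contrasts can be sketched in a few lines: p-value thresholding for SNP selection and a bare-bones additive polygenic risk score. This is an illustrative reduction under invented data; real tools (e.g. PLINK-based pipelines) add clumping, pruning, and score standardization.

```python
def select_snps(gwas_stats, p_threshold=5e-8):
    """Keep SNPs whose association p-value passes the threshold -- the
    feature-selection step applied before training the ML models."""
    return [snp for snp, stats in gwas_stats.items() if stats["p"] <= p_threshold]

def polygenic_risk_score(genotypes, gwas_stats, p_threshold=5e-8):
    """A minimal additive PRS: sum of GWAS effect sizes weighted by allele
    dosage (0, 1, or 2) over the selected SNPs."""
    return sum(gwas_stats[snp]["beta"] * genotypes.get(snp, 0)
               for snp in select_snps(gwas_stats, p_threshold))
```

The same `select_snps` output can feed either path: as the feature set for an ML classifier, or as the weight set for the additive score above.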
Li, S.; Chou, E.; Wang, K.; Boyle, A. P.; Sartor, M. A.
Mapping the genomic locations and patterns of transcription factor binding sites (TFBS) is essential for understanding gene regulation and advancing treatments for diseases driven by DNA modifications, including epigenetic changes and sequence variants. Although several TFBS databases exist, no study has systematically benchmarked these databases across different sequencing technologies and computational algorithms. In this study, we addressed this gap by constructing a TFBS database that integrates all available ENCODE cell line ATAC-seq and Cistrome Data Browser ChIP-seq datasets, comprising 11.3 million human and 1.87 million mouse TFBS. We also integrated previously published TFBS resources (Factorbook, Unibind, RegulomeDB, and ENCODE_footprint) and found each contains a substantial fraction of unique TFBS predictions, highlighting significant discrepancies among existing resources. To assess the accuracy of the combined TFBS regions, we assembled ten independent genomic annotation datasets for evaluation and found that TFBS regions predicted by multiple databases are more likely to represent true and biologically meaningful binding sites. For each predicted TFBS region, we define two scores: the confidence score reflects prediction reliability, while the importance score represents biological functional relevance. Finally, we introduce TFBSpedia, a lightweight and efficient search engine that enables rapid retrieval of TFBS regions and comprehensive annotation information across the integrated databases.
Muneeb, M.; Ascher, D.
Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.
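The abstract does not spell out how the gene identification ratio (GIR) is computed; under one plausible reading, the fraction of GWAS Catalog genes for a phenotype that are recovered by the ML/DL-selected SNPs, it could be sketched as follows (definition assumed, data invented):

```python
def gene_identification_ratio(identified_genes, catalog_genes):
    """Fraction of catalog genes for a phenotype that appear among the
    genes implicated by the ML/DL-selected SNPs. This interpretation of
    the GIR is an assumption, not the paper's stated definition."""
    catalog = set(catalog_genes)
    if not catalog:
        return 0.0
    return len(set(identified_genes) & catalog) / len(catalog)
```

Under this reading, a mean GIR of 0.84 would mean that, on average, 84% of a phenotype's catalog genes are recovered by the selected SNPs.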
Jia, G.-S.; Suo, F.; Noly, A.; Fort, P.; Liang, Y.; Li, W.; Zhang, W.-C.; Li, H.-L.; Du, X.-M.; Zhang, F.-Y.; Du, T.-Y.; Hua, Y.; Bai, F.-Y.; Wang, Q.-M.; Brysch-Herzberg, M.; Helmlinger, D.; Du, L.-L.
The fission yeast Schizosaccharomyces pombe is a prominent model organism widely used to investigate fundamental cellular mechanisms. In addition to S. pombe, the genus Schizosaccharomyces includes six other species: S. octosporus, S. japonicus, S. cryophilus, S. osmophilus, S. lindneri, and S. versatilis. These fission yeast species share a common ancestor from which the genus diversified over more than 200 million years. This extensive evolutionary divergence provides opportunities for comparative genomics. Here, we present the Schizosaccharomyces orthogroup (SOG) resource, a web platform developed from our high-quality genome assemblies, gene annotations, and orthology assignments. Most fission yeast genes are assigned to one of over 5,000 orthogroups. The platform enables users to visualize orthogroup sequence alignments and phylogenetic trees, retrieve coding and flanking sequences, and explore the conservation of local synteny. This resource will benefit researchers focusing on individual genes as well as those investigating gene evolution at broader scales. It is freely accessible at https://www.sogweb.org.
Take away:
- The SOG resource covers all known species of Schizosaccharomyces.
- The platform is built on high-quality genome assemblies and annotations.
- Most genes are assigned to one of over 5,000 orthogroups.
- Users can view and explore alignments, phylogenetic trees, and local synteny.
- This free resource aids functional and evolutionary research.
Park, J.-S.; Ha, S.; Lee, Y.; Kang, Y. J.
Motivation: Gene regulatory networks provide fundamental insights into plant biology, yet extracting structured interaction data from scientific literature remains a significant bottleneck. Traditional manual curation cannot scale to meet the demands of modern research, while automated text mining approaches struggle with the complexity of gene nomenclature and relationship classification. Large language models offer promising capabilities for information extraction, but integrated platforms combining LLM extraction with community validation for plant regulatory databases remain scarce. Results: We developed GeneReL, an integrated platform combining LLM-based extraction with community-driven curation for gene regulatory networks in Arabidopsis thaliana. The system employs a tiered pipeline using Claude Haiku 4.5 for screening, Claude Sonnet 4 for extraction, and Claude Opus 4 for verification, along with a novel five-step gene normalization pipeline incorporating paper-text search and LLM-based disambiguation with UniProt annotations. The database contains 13,710 curated interactions across 51 relationship types, with 90.2% classified as high confidence based on linguistic certainty markers in source text. Comparison with IntAct reveals 86.8% of interactions are unique to our literature-derived database, demonstrating complementary coverage to existing resources. The web platform provides card-based browsing with voting capabilities, interactive network visualization using Cytoscape.js with locus-ID-based node consolidation, and administrative interfaces for curator review of ambiguous gene mappings. Availability and Implementation: GeneReL is freely accessible at https://generel.newgenes.me. Contact: kangyangjae@gnu.ac.kr
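A tiered gene-normalization pipeline like the one this abstract describes can be pictured as a cascade of increasingly expensive lookups. The sketch below is schematic: the mapping tables, tier names, and the LLM-fallback callable are all illustrative stand-ins, not GeneReL's five-step pipeline.

```python
def normalize_gene(mention, symbol_to_locus, alias_to_locus, llm_disambiguate=None):
    """Tiered normalization: exact symbol match first, then alias lookup,
    then an optional LLM fallback for ambiguous aliases. Returns a
    (locus_id, tier) pair; tier records which step resolved the mention."""
    key = mention.strip().upper()
    if key in symbol_to_locus:                      # tier 1: exact symbol
        return symbol_to_locus[key], "exact"
    if key in alias_to_locus:                       # tier 2: alias table
        candidates = alias_to_locus[key]
        if len(candidates) == 1:
            return candidates[0], "alias"
        if llm_disambiguate is not None:            # tier 3: LLM fallback
            return llm_disambiguate(mention, candidates), "disambiguated"
    return None, "unmapped"
```

Ordering the tiers by cost means the expensive disambiguation step only fires for genuinely ambiguous aliases, mirroring the screening/extraction/verification split the platform uses at the model level.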
Li, P.; Li, C.; Zhu, R.; Sun, W.; Zhou, H.; Fan, Z.; Yue, L.; Zhang, S.; Jiang, X.; Luo, Q.; Han, J.; Huang, H.; Shen, A.; Bahetibieke, T.; Wang, J.; Zhang, W.; Wen, H.; Niu, H.; Bu, C.; Zhang, Z.; Xiao, J.; Gao, R.; Chen, F.
Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), has regained its position as the world's leading killer among infectious diseases. Despite extensive research progress across epidemiology, diagnosis, drug development, treatment regimens, vaccines, drug resistance, virulence factors, and immune mechanisms, MTB-related knowledge remains fragmented across thousands of publications, limiting its effective use. To address this gap, we present MTB-KB, a literature-curated knowledgebase that systematically integrates high-impact findings from eight major sections of TB research. The current release contains 75,170 associations from 1,246 publications, covering 18,439 entities standardized using authoritative databases and WHO-endorsed classifications. A central feature is the interactive knowledge graph, which links cross-section associations to reveal and infer MTB-host interactions, treatment strategies, and vaccine development opportunities. MTB-KB also provides a user-friendly interface with browsing, advanced search, and statistical visualization. Overall, by consolidating dispersed MTB knowledge into a structured and accessible platform, MTB-KB provides a valuable resource for researchers, clinicians, and policymakers, supporting both basic and clinical TB research, enabling evidence-based TB prevention, diagnosis, and treatment, and contributing to global elimination efforts. MTB-KB is accessible at https://ngdc.cncb.ac.cn/mtbkb/.
Walter, J.; Kuenne, C.; Knoppik, N.; Goymann, P.; Looso, M.
Scientific research relies on the transparent dissemination of data and its associated interpretations. This task encompasses accessibility of raw data and its metadata, details concerning experimental design, and the parameters and tools employed for data interpretation. Producing and handling these data is an ongoing challenge that extends beyond publication into individual facilities, institutes, and research groups, a practice often termed Research Data Management (RDM). RDM is foundational to scientific discovery and innovation, and its goals can be summarized as Findability, Accessibility, Interoperability, and Reusability (FAIR). Although the majority of peer-reviewed journals require the deposition of raw data in public repositories in alignment with FAIR principles, metadata frequently lacks full standardization. This critical gap in data management practices hinders effective utilization of research findings and complicates the sharing of scientific knowledge. Here we present a flexible design for a machine-readable metadata format to store experimental metadata, along with an implementation of a generalized tool named FRED. It enables i) dialog-based creation of metadata files, ii) structured semantic validation, iii) logical search, iv) an external programming interface (API), and v) a standalone web front end. The tool is intended to be used by non-computational scientists as well as specialized facilities, and can be seamlessly integrated into existing RDM infrastructure.
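The structured semantic validation step such a metadata tool performs can be sketched as checking a record against a schema of required fields and controlled vocabularies. The schema format and field names below are invented for illustration and are not FRED's actual metadata format.

```python
def validate_metadata(record, schema):
    """Validate a metadata record against a simple schema: required
    fields must be present, and values must come from the field's
    controlled vocabulary when one is given. Returns a list of
    human-readable errors; an empty list means the record is valid."""
    errors = []
    for field, rules in schema.items():
        if field not in record:
            if rules.get("required", False):
                errors.append(f"missing required field: {field}")
            continue
        allowed = rules.get("allowed")
        if allowed and record[field] not in allowed:
            errors.append(f"{field}: '{record[field]}' not in {sorted(allowed)}")
    return errors
```

Returning a full error list, rather than failing on the first problem, suits a dialog-based workflow where the user corrects all flagged fields in one pass.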