Database
Oxford University Press (OUP)
All preprints, ranked by how well they match Database's content profile, based on 51 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Lain, A. D.; Go, S.; Mahmud, A.; Rajendra, S.; Cano San Jose, A.; Loupasaki, K.; Theodoridis, G.; Bizkarguenaga Uribiarte, M.; Gu, Y.; Deda, O.; Conde, R. D. A.; Embade, N.; de Diego Rodriguez, A.; Burguera, N.; Rossiou, D.; Gil Redondo, R.; Gallou, D.; Tueros, I.; Velmurugan, R.; Gkanali, V.; Caro Burgos, M.; Pousinis, P.; Alektoridis, G.; Arranz, S.; Nikolopoulos, N.; Yan, X.; Fernandez Carrion, R.; Rowlands, T.; Choi, D.; Rei, M.; Cave-Ayland, C.; D'Alessandro, A.; the CoDiet Consortium; Beck, T.; Posma, J. M.
We present here four biomedical, multi-entity corpora that can be used as benchmarks for named-entity recognition (NER), targeted at literature on metabolic syndrome. The CoDiet-Gold corpus (348,413 annotations) contains 500 redistributable full-text publications, each independently annotated by two human experts, with disagreements fully adjudicated by a third expert. The CoDiet-Electrum corpus (2,349,499 annotations) contains 3,688 publications annotated using the entities from CoDiet-Gold. Finally, for the same 3,688 documents, two fully machine-annotated corpora, CoDiet-Bronze (2,399,647 annotations) and CoDiet-Silver (1,868,422 annotations), were created using existing NER algorithms. These corpora contain categories (organisms, diseases, genes, proteins, metabolites) that add depth to existing corpora, as well as new categories not covered by any other corpus (food, dietary methods, sample types, computational methods, study methodology, population characteristics, data types, and microbiome).
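Dual-annotation workflows like the one behind CoDiet-Gold are typically quantified with an inter-annotator agreement statistic before adjudication. A minimal sketch (not from the paper) computing Cohen's kappa over two annotators' labels for the same spans, with hypothetical entity labels:

```python
# Illustrative sketch: Cohen's kappa for two annotators' entity labels on
# aligned spans, the usual agreement statistic behind dual-annotation and
# third-expert adjudication workflows. Labels below are made up.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["metabolite", "disease", "gene", "food", "disease",
     "gene", "organism", "food", "metabolite", "disease"]
b = ["metabolite", "disease", "gene", "food", "gene",
     "gene", "organism", "food", "metabolite", "organism"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.75 for this toy example
```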
Andrade, V. D. T.; Ruas, P.; Couto, F. M.
Biomedical literature is the main means of communication by which researchers share their findings. Since biomedical literature is a large collection of text expressed in natural language, text mining tools that automatically extract information from those texts are of utmost importance. The problem is that most state-of-the-art tools were not developed to deal with languages other than English, which is especially critical in the biomedical domain, where a significant share of health-related texts is written in the authors' native language. To address this issue, this work presents a deep learning NERL (Named Entity Recognition and Linking) system and a parallel corpus for the Spanish and Portuguese languages, focused on the oncological domain. Both the system and the corpus are available at https://github.com/lasigeBioTM/ICERL_system-ICR_Corpus.
Krithara, A.; Nentidis, A.; Bougiatiotis, K.; Paliouras, G.
The BioASQ question answering (QA) benchmark dataset contains questions in English, along with gold-standard (reference) answers and related material. The dataset has been designed to reflect real information needs of biomedical experts and is therefore more realistic and challenging than most existing datasets. Furthermore, unlike most previous QA benchmarks that contain only exact answers, the BioASQ-QA dataset also includes ideal answers (in effect, summaries), which are particularly useful for research on multi-document summarization. The dataset combines structured and unstructured data. The material linked with each question comprises documents and snippets, which are useful for information retrieval and passage retrieval experiments, as well as concepts that are useful in concept-to-text natural language generation. Researchers working on paraphrasing and textual entailment can also measure the degree to which their methods improve the performance of biomedical QA systems. Last but not least, the dataset is continuously extended as the BioASQ challenge runs and new data are generated.
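To give a feel for the structure described above, here is a minimal sketch of iterating over BioASQ-style QA records; the field names approximate the released JSON and may differ by challenge edition, and the filename is a placeholder:

```python
# Hedged sketch of reading a BioASQ-style training file: each question has
# a type, an ideal (summary) answer, an exact answer for non-summary types,
# and linked documents/snippets for retrieval experiments.
import json

with open("BioASQ-training.json") as fh:   # hypothetical local filename
    data = json.load(fh)

for q in data["questions"]:
    print(q["type"], "|", q["body"][:60])
    if q["type"] != "summary":             # factoid/list/yesno carry exact answers
        print("  exact:", q.get("exact_answer"))
    print("  linked documents:", len(q.get("documents", [])))
    print("  snippets:", len(q.get("snippets", [])))
```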
Seep, L.; Grein, S.; Splichalova, I.; Ran, D.; Mikhael, M.; Hildebrand, S.; Lauterbach, M.; Hiller, K.; Ribeiro, D. J. S.; Sieckmann, K.; Kardinal, R.; Huang, H.; Yu, J.; Kallabis, S.; Behrens, J.; Till, A.; Peeva, V.; Strohmeyer, A.; Bruder, J.; Blum, T.; Soriano-Arroquia, A.; Tischer, D.; Kuellmer, K.; Li, Y.; Beyer, M.; Gellner, A.-K.; Fromme, T.; Wackerhage, H.; Klingenspor, M.; Fenske, W. K.; Scheja, L.; Meissner, F.; Schlitzer, A.; Mass, E.; Wachten, D.; Latz, E.; Pfeifer, A.; Hasenauer, J.
Datasets consist of measurement data and metadata. Metadata provides context, essential for understanding and (re-)using data. Various metadata standards exist for different methods, systems, and contexts, but relevant information arises at different stages across the data lifecycle and is often defined and standardized only at the publication stage, which can lead to data loss and increased workload. In this study, we developed the Metadatasheet, a metadata standard based on interviews with members of two biomedical consortia and a systematic screening of data repositories. It aligns with the data lifecycle, allowing synchronous metadata recording within Microsoft Excel, a widespread data-recording software. Additionally, we provide an implementation, the Metadata Workbook, that offers user-friendly features like automation, dynamic adaptation, metadata integrity checks, and export options for various metadata standards. By design, and due to its extensive documentation, the proposed metadata standard simplifies the recording and structuring of metadata for biomedical scientists, promoting practicality and convenience in data management. This framework can accelerate scientific progress by enhancing collaboration and knowledge transfer throughout the intermediate steps of data creation.
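A hedged sketch of the kind of metadata integrity check the Metadata Workbook automates; the required-field list, sheet layout, and filename here are hypothetical, not the published Metadatasheet standard:

```python
# Check an Excel metadata sheet for required columns and incomplete rows.
import pandas as pd

REQUIRED = ["sample_id", "organism", "tissue", "assay", "operator", "date"]

sheet = pd.read_excel("metadata.xlsx", sheet_name="samples")  # hypothetical file
missing_cols = [c for c in REQUIRED if c not in sheet.columns]
if missing_cols:
    raise ValueError(f"missing required metadata fields: {missing_cols}")

# Flag rows with empty required cells so they can be fixed before submission.
incomplete = sheet[sheet[REQUIRED].isna().any(axis=1)]
print(f"{len(incomplete)} of {len(sheet)} samples have incomplete metadata")
```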
Chae, J.; Heise, D. A.; Connatser, K.; Honerlaw, J.; Maripuri, M.; Ho, Y.-L.; Fontin, F.; Tanukonda, V.; Cho, K.
Objective: The demand for a comprehensive phenomics library, which requires identifying computable phenotype definitions and associated metadata from an ever-expanding biomedical literature, presents a significant, labor-intensive, and unscalable challenge. To address this, we introduce a transformer-based language model specifically designed for identifying biomedical texts containing computable phenotypes, and piloted its use in the Centralized Interactive Phenomics Resource (CIPHER) platform. Materials and Methods: We fine-tuned a BioBERT model using a labeled dataset of 396 manuscripts. The model incorporates our novel sliding-window approach to effectively overcome token-length limitations, thereby enabling accurate classification of full-length manuscripts. For scalable deployment and continuous refinement, we developed a cohesive framework that integrates a web-based user interface, a control server, and a classification module. Results: The staged approach to model development yielded a final model with 95% accuracy. The web-based user interface was deployed on the CIPHER platform and enables user feedback for model retraining. Discussion: We developed a model and user interface which are currently in use by data curators to identify computable phenotype definitions from the literature. Conclusion: Through this system, users can submit literature, assess classification results, and provide feedback that directly influences future model training, thereby offering an efficient and adaptive solution for accelerating phenotype-driven literature curation.
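The sliding-window idea is standard enough to sketch: split a full-length manuscript into overlapping fixed-size token windows, classify each window, and pool the window scores into one document label. This is an illustrative sketch, not the authors' exact code; it assumes a fine-tuned classification head on the public BioBERT checkpoint:

```python
# Sliding-window document classification with an encoder capped at 512 tokens.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "dmis-lab/biobert-base-cased-v1.1"   # base checkpoint; fine-tuned head assumed
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.eval()

def classify_document(text: str, stride: int = 128) -> float:
    # Tokenize into overlapping 512-token windows via return_overflowing_tokens.
    enc = tok(text, max_length=512, stride=stride, truncation=True,
              return_overflowing_tokens=True, padding="max_length",
              return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"]).logits
    # Max-pool the per-window probability of the positive class.
    return torch.softmax(logits, dim=-1)[:, 1].max().item()
```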
Huang, Y.-N.; Rajesh, A.; Ayyala, R.; Sarkar, A.; Guo, R.; Ling, E.; Nakashidze, I.; Wong, M. Y.; Hu, J.; Nosov, A.; Chang, Y.; Abedalthagafi, M. S.; Mangul, S.
Recent advances in high-throughput sequencing technologies have made it possible to collect and share a massive amount of omics data, along with its associated metadata. Enhancing metadata availability is critical to ensure data reusability and reproducibility and to facilitate novel biomedical discoveries through effective data reuse. Yet incomplete metadata accompanying public omics data may hinder reproducibility and reusability by reducing sample interpretability and limiting secondary analyses. In this study, we performed a comprehensive assessment of the completeness of metadata shared in scientific publications and/or public repositories by analyzing over 253 studies encompassing more than 164,000 samples, including both human and non-human mammalian studies. We observed that studies often omit over a quarter of important phenotypes, with an average of only 74.8% of phenotypes shared either in the text of the publication or in the corresponding repository. Notably, public repositories alone contained 62% of the metadata, surpassing the textual content of publications by 3.5%. Only 11.5% of studies shared all phenotypes, while 37.9% shared less than 40% of the phenotypes. Studies involving non-human samples were more likely to share metadata than studies involving human samples. We observed similar results on an extended dataset spanning 2.1 million samples across over 61,000 studies from the Gene Expression Omnibus repository. The limited availability of metadata reported in our study emphasizes the necessity for improved metadata-sharing practices and standardized reporting. Finally, we discuss the numerous benefits of improving the availability and quality of metadata for the scientific community and beyond, supporting data-driven decision-making and policy development in the field of biomedical research. This work provides a scalable framework for evaluating metadata availability and may help guide future policy and infrastructure development.
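A worked sketch of the completeness metric described above: for each study, the share of expected phenotypes found in either the paper text or the repository record. The phenotype names and sets below are made up for illustration:

```python
# Per-study metadata completeness: fraction of expected phenotypes found
# in the publication text, the repository record, or either.
def completeness(expected, in_paper, in_repo):
    found = {p for p in expected if p in in_paper or p in in_repo}
    return len(found) / len(expected)

expected = {"age", "sex", "tissue", "disease_status",
            "smoking", "bmi", "ancestry", "medication"}
in_paper = {"age", "sex", "tissue", "disease_status"}
in_repo  = {"age", "sex", "tissue", "smoking", "bmi"}

print(f"shared anywhere:  {completeness(expected, in_paper, in_repo):.1%}")  # 75.0%
print(f"repository alone: {len(in_repo & expected) / len(expected):.1%}")    # 62.5%
```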
Rogic, S.; Mancarci, B. O.; Xu, B.; Xiao, A.; Yang, C.; Pavlidis, P.
Accurate, consistent, and comprehensive metadata are essential for the reuse of functional genomics data deposited in repositories such as the Gene Expression Omnibus (GEO); however, achieving this often requires careful manual curation that is time-consuming, costly, and prone to errors. In this paper, we evaluate the performance of Large Language Models (LLMs), specifically OpenAI's GPT-4o, as an assistive tool for entity-to-ontology annotation of two commonly encountered descriptors in transcriptomic experiments: mouse strains and cell lines. Using over 9,000 manually curated experiments from the Gemma database and over 5,000 associated journal articles, we assess the model's ability to identify relevant free-text entries and map them to appropriate ontology terms. Using zero-shot prompting and retrieval-augmented generation (RAG) to incorporate domain-specific ontology knowledge, GPT-4o correctly annotated 77% of mouse strain and 59% of cell line experiments, and uncovered manual curation errors in Gemma for over 200 experiments. GPT-4o substantially outperformed a regular-expression-based string-matching method, which correctly annotated only 6% of mouse strain experiments due to low precision. Model errors often arose from typographical mistakes or inconsistent naming in the GEO record or publication, and resembled those made by human curators. Along with annotations, our approach requests that the model output supporting context and quotes from the sources; these were typically accurate and enabled rapid curator verification. These findings suggest that LLMs are not ready to fully replace manual curators, but can already effectively support them. A human-in-the-loop workflow, in which LLM annotations are provided to human curators for validation, may improve the efficiency and quality of large-scale biomedical metadata curation.
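A hedged sketch of the retrieval-augmented annotation step: retrieve candidate ontology terms for a free-text descriptor, then ask GPT-4o to pick one and quote its evidence. The retrieval method, prompt wording, and example terms are illustrative, not the paper's exact pipeline:

```python
# RAG-style entity-to-ontology annotation with the OpenAI chat API.
from openai import OpenAI

client = OpenAI()

def annotate(descriptor: str, candidate_terms: list[str]) -> str:
    """candidate_terms: ontology labels/IDs retrieved by e.g. embedding similarity."""
    prompt = (
        "Map the free-text mouse strain below to exactly one ontology term "
        "from the candidates, and quote the source text supporting the choice.\n"
        f"Free text: {descriptor}\n"
        "Candidates:\n" + "\n".join(f"- {t}" for t in candidate_terms)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# e.g. annotate("B6 mice", ["C57BL/6 (hypothetical ID)", "BALB/c (hypothetical ID)"])
```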
Alghamdi, D.; Dooley, D.; Samman, M.; Hsiao, W.
Background: With improvements in high-throughput sequencing technologies and the constant generation of large biomedical datasets, biobanks increasingly take on the role of managing and delivering not just specimens but also data. However, reusing data from different biobanks is challenged by incompatible data representations. Contextual data describing biobank digital resources often contain unstructured textual information incompatible with computational processes such as automated data discovery and integration. Therefore, a consistent and comprehensive contextual data framework is needed to increase discovery, reusability, and integrability across data sources. Methods: Based on available genomics standards (e.g., Minimum Information About a Microarray Experiment (MIAME)), the College of American Pathologists (CAP) laboratory accreditation requirements, and the Open Biological and Biomedical Ontologies Foundry principles, we developed the Next Generation Biobanking Ontology (NGBO). In addition, we created new terms and reused concepts from the Ontology for Biomedical Investigations (OBI) and the Ontology for Biobanking (OBIB) to build NGBO. Results: The Next Generation Biobanking Ontology (https://www.ebi.ac.uk/ols4/ontologies/ngbo) is an open application ontology representing omics contextual data, licensed under the Apache License 2.0. The ontology focuses on capturing information about three main activities: wet-bench analysis used to generate omics data, bioinformatics analysis used to process and interpret data, and data management. In this paper, we demonstrate the use of NGBO to add semantic statements to real-life use cases and to query data previously stored in unstructured textual format.
Chothani, S.; Ruiz-Orera, J.; Tierney, J. A. S.; Clauwaert, J.; Deutsch, E. W.; Alba, M. M.; Aspden, J. L.; Baranov, P. V.; Bazzini, A. A.; Bruford, E. A.; Brunet, M. A.; Cardon, T.; Carvunis, A.-R.; Casola, C.; Choudhary, J. S.; Dean, K.; Faridi, P.; Fierro-Monti, I.; Fournier, I.; Frankish, A.; Gerstein, M.; Hubner, N.; Jiang, Y.; Kellis, M.; Kok, L. W.; Martinez, T. F.; Menschaert, G.; Ni, P.; Orchard, S.; Roucou, X.; Rozowsky, J.; Salzet, M.; Siragusa, M.; Slavoff, S.; Swirski, M. I.; Valen, E.; Vizcaino, J. A.; Wacholder, A.; Wu, W.; Xie, Z.; Yang, Y. T.; Moritz, R. L.; Mudge, J.; van Hee
Non-canonical (i.e., unannotated) open reading frames (ncORFs) have until recently been omitted from reference genome annotations, despite evidence of their translation, limiting their incorporation into biomedical research. To address this, in 2022 we initiated the TransCODE consortium and built the first community-driven consensus catalog of human ncORFs, which was openly distributed to the research community via Ensembl-GENCODE. While this catalog represented a starting point for reference ncORF annotation, major technical and scientific issues remained; in particular, the initial catalog had no standardized framework for judging the evidence of translation for individual ncORFs. Here, we present an expanded and refined catalog of the human reference annotation of ncORFs. By incorporating more datasets and by lifting constraints on ORF length and start codon, we define a comprehensive set of 28,359 ncORFs, nearly four times the size of the previous catalog. Furthermore, to aid users who wish to work with ncORFs showing the strongest and most reproducible signals of translation, we use a data-driven framework (translation signature scores) to assess the accumulated evidence for any individual ncORF. Using this approach, we derive a high-quality subset of 7,888 ncORFs with translation evidence on par with canonical protein-coding genes, which we refer to as the Primary set; this set can serve as a reliable reference for downstream analyses and validation. Overall, this update reflects continual community-driven efforts to make ncORFs accessible and actionable to the broader research public, and further iterations of the catalog will continue to expand and refine this resource.
Sternberg, P. W.; the Alliance of Genome Resources Consortium
The Alliance of Genome Resources (Alliance) is an extensible coalition of knowledgebases focused on the genetics and genomics of intensively studied model organisms. The Alliance is organized as individual knowledge centers with strong connections to their research communities and a centralized software infrastructure, discussed here. Model organisms currently represented in the Alliance are budding yeast, C. elegans, Drosophila, zebrafish, frog, laboratory mouse, and laboratory rat, together with the Gene Ontology Consortium. The project is in a rapid development phase to harmonize knowledge, store it, analyze it, and present it to the community through a web portal, direct downloads, and APIs. Here we focus on developments over the last two years. Specifically, we added and enhanced tools for browsing the genome (JBrowse), downloading sequences, mining complex data (AllianceMine), visualizing pathways, full-text searching of the literature (Textpresso), and sequence similarity searching (SequenceServer). We enhanced existing interactive data tables and added an interactive table of paralogs to complement our representation of orthology. To support individual model organism communities, we implemented species-specific "landing pages" and will soon add disease-specific portals; in addition, we support a common community forum implemented in Discourse. We describe our progress towards a central persistent database to support curation, the data modeling that underpins harmonization, and progress towards a state-of-the-art literature curation system with integrated Artificial Intelligence and Machine Learning (AI/ML).
Pan, Y.; Manuel, W.; Abeysinghe, R.; Zheng, J.; Davydov, A.; Yang, Q.; Lin, A. Y.; Cui, L.; He, Y.
Background: With many vaccines developed and in use, it is critical to standardize vaccine information. The OHDSI OMOP Common Data Model (CDM), widely used to support EHR data integration and analysis, leverages CVX, RxNorm, and RxNorm Extension codes to standardize vaccine-related records. However, these terminologies lack robust semantic relations, making vaccine classification ineffective in the OMOP CDM. To address this issue, our OHDSI Vaccine Vocabulary Working Group proposes to use the Vaccine Ontology (VO) to map these standards and build up its own semantic relations. As a first study of this work, we performed the mapping and alignment of the Vaccine Administered (CVX) codes with the VO using a combination of semi-automatic and manual mapping methods. Results: A total of 273 CVX terms were first collected and classified. A high-level VO design pattern and an exact one-to-one mapping strategy were developed to guide the CVX-to-VO term mapping. To facilitate the manual mapping and harmonization process, we also developed and evaluated three semi-automated mapping approaches utilizing lexical and semantic information of vaccine concepts to map CVX to VO. These approaches suggested candidate VO mappings for CVX terms and also flagged CVX terms that were unmappable to VO and required new term additions to VO. The application of the best approach to the 2022-10-05 release of VO achieved an accuracy of 85.55% for its suggestions. The suggestions made by the semi-automated approaches were taken into account to further enhance the mappings, which led to our eventual mapping of all CVX terms to the latest version of VO. We also proposed the inclusion of a passive vaccine branch in VO, which covers 24 immunoglobulins and antitoxins from CVX as passive vaccines. A specific CVX-VO OWL file was developed and added to the VO GitHub repository. Use-case queries were developed to demonstrate its support for computer-assisted queries of vaccine groups based on CVX-VO hierarchies. Conclusion: All CVX terms were mapped to the VO using our combined semi-automatic and manual mapping methods. The mapped results enhanced semantic vaccine classification, providing a basis for further OMOP vaccine classification and EHR data analysis.
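An illustrative lexical-matching pass of the kind used to suggest CVX-to-VO candidates for curator review; the term lists below are toy data, not real VO labels or IRIs:

```python
# Suggest the closest ontology label for each code description by string
# similarity; a curator then accepts, corrects, or flags it as unmappable.
import difflib

cvx_terms = ["influenza, seasonal, injectable", "rabies, intramuscular injection"]
vo_labels = ["seasonal influenza vaccine", "rabies vaccine", "measles vaccine"]

for cvx in cvx_terms:
    best = difflib.get_close_matches(cvx, vo_labels, n=1, cutoff=0.0)[0]
    score = difflib.SequenceMatcher(None, cvx, best).ratio()
    print(f"{cvx!r} -> {best!r} (similarity {score:.2f})")
```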
Joshi, B. P.; Bakrola, V. D.; Shah, P.; Krishnamurthy, R.
The recent pandemic caused by the novel coronavirus (nCoV-2019) from Wuhan, China created a large-scale public health emergency. It demands novel research on vaccines to fight the pandemic, repurposing of existing drugs, and phylogenetic analysis to identify the virus's origin and determine its similarity to other known viruses. A preliminary task for the research community is to analyze the wide variety of existing related research articles, which is very time-consuming in situations where each minute counts for saving hundreds of human lives. Entirely manual processing lowers the efficiency of mining this information even further. We have developed a complete automatic literature-mining system that delivers efficient and fast mining from existing biomedical literature databases. With the help of modern deep learning algorithms, our system also delivers summaries of important research articles, enabling easy and fast comprehension of critical papers. The system currently scans nearly 146,115,136 English words from 29,315 research articles in no more than 1.5 seconds with multiple search keywords. Our article presents the criticality of literature mining, especially in pandemic situations, together with the implementation and online deployment of the system.
Olayo-Alarcon, R.; Morales-Soto, L.; Velazquez-Ramirez, D. A.; Munguia-Reyes, A.; Balderas-Martinez, Y. I.; Mendez-Cruz, C. F.; Collado-Vides, J.
Motivation: The genetic mechanisms involved in human diseases are fundamental to biomedical research. Several databases with curated associations between genes and diseases have emerged in the last decades; however, due to the demanding and time-consuming nature of manual literature curation, they still lack large amounts of information. Current automatic approaches extract associations by considering each abstract or sentence independently, which can lead to contradictions between individual cases. There is therefore a need for automatic strategies that provide a literature consensus of gene-disease associations and are not prone to making contradictory predictions. Results: Here we present GeDex, an effective and freely available automatic approach to extract consensus gene-disease associations from biomedical literature, based on a predictive model trained with four simple features. As far as we know, it is the only system that reports a single consensus prediction from multiple sentences supporting the same association. We tested our approach on the curated fraction of DisGeNET (F-score 0.77) and validated it on a manually curated dataset, obtaining competitive performance compared with pre-existing methods (F-score 0.74). In addition, we effectively recovered associations from an article collection on chronic pulmonary diseases and discovered that a large proportion are not reported in current databases. Our results demonstrate that GeDex, despite its simplicity, is a competitive tool that can successfully assist the curation of existing databases. Availability: GeDex is available at https://bitbucket.org/laigen/gedex/src/master/ and can be used as a Docker image (https://hub.docker.com/r/laigen/gedex). Contact: cmendezc@ccg.unam.mx. Supplementary information: Supplementary material is available at bioRxiv online.
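A minimal sketch of the consensus idea: collapse many sentence-level predictions for the same gene-disease pair into one non-contradictory call. A majority vote is shown here for clarity; GeDex itself trains a predictive model over simple features, and the pairs below are toy data:

```python
# Aggregate sentence-level relation predictions into one consensus per pair.
from collections import Counter, defaultdict

sentence_predictions = [  # (gene, disease, predicted_label) -- toy data
    ("TNF", "pulmonary fibrosis", "associated"),
    ("TNF", "pulmonary fibrosis", "associated"),
    ("TNF", "pulmonary fibrosis", "not_associated"),
]

votes = defaultdict(Counter)
for gene, disease, label in sentence_predictions:
    votes[(gene, disease)][label] += 1

for pair, counts in votes.items():
    label, n = counts.most_common(1)[0]
    print(pair, "->", label, f"({n}/{sum(counts.values())} sentences)")
```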
Wang, Y.; Ye, M.; Zhang, F.; Freeman, Z. T.; Yu, H.; Ye, X.; He, Y.
To fully understand COVID-19, it is critical to identify and analyze all the possible hosts of SARS-CoV-2 (the pathogen of COVID-19) and compare them with the hosts of other human coronaviruses. In this study, we collected, annotated, and performed taxonomical and ontological analysis of all the reported and verified hosts for all human coronaviruses, including SARS-CoV, MERS-CoV, SARS-CoV-2, and four others that cause the common cold. A total of 37 natural hosts and 19 laboratory animal hosts of human coronaviruses were identified based on experimental or clinical evidence. Our taxonomical and ontological analysis found that all the verified susceptible natural and laboratory animals belong to therian mammals. Specifically, these 37 natural therian hosts include one wild marsupial mammal (i.e., Didelphis virginiana) and 36 Eutheria mammals (a.k.a. placental mammals). The 19 laboratory animal hosts are also classified as placental mammals. While several non-therian animals (including snakes, houseflies, and zebrafish) were reported as likely SARS-CoV-2 hosts, our analysis excluded them due to the lack of convincing evidence. Genetically modified mouse models with the human angiotensin-converting enzyme 2 (ACE2) or dipeptidyl peptidase-4 (DPP4) protein were more susceptible to virulent human coronaviruses, with clear symptoms. Coronaviruses often became more virulent and adaptive in mouse hosts after a series of viral passages in mice. To support knowledge standardization and analysis, we have also represented the annotated host knowledge in the Coronavirus Infectious Disease Ontology (CIDO) and provided ways to automatically query the knowledge.
Cano, M. A.; Tsueng, G.; Zhou, X.; Hughes, L. D.; Mullen, J. L.; Xin, J.; Su, A. I.; Wu, C.
Background: Biomedical researchers are strongly encouraged to make their research outputs more Findable, Accessible, Interoperable, and Reusable (FAIR). While many biomedical research outputs are more readily accessible through open data efforts, finding relevant outputs remains a significant challenge. Schema.org is a metadata vocabulary standardization project that enables web content creators to make their content more FAIR. Leveraging schema.org could benefit biomedical research resource providers, but it can be challenging to apply schema.org standards to biomedical research outputs. We created an online browser-based tool that empowers researchers and repository developers to utilize schema.org or other biomedical schema projects. Results: Our browser-based tool includes features that address many of the barriers to schema.org compliance: the ability to easily browse for relevant schema.org classes, to extend and customize a class to be more suitable for biomedical research outputs, to create data validation that ensures adherence of a research output to a customized class, and to register a custom class in our schema registry so that others can search and reuse it. We demonstrate the use of our tool with the creation of the Outbreak.info schema, a large multi-class schema for harmonizing various COVID-19-related resources. Conclusions: We have created a browser-based tool to empower biomedical research resource providers to leverage schema.org classes to make their research outputs more FAIR.
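A hedged sketch of validating a schema.org-style Dataset record against a customized class definition; the required properties and the JSON Schema encoding below are an assumption for illustration, not the Outbreak.info schema itself:

```python
# Validate a schema.org-style record against a customized class profile.
from jsonschema import validate

dataset_class = {  # a customized 'Dataset' profile expressed as JSON Schema
    "type": "object",
    "required": ["@type", "name", "description", "url"],
    "properties": {
        "@type": {"const": "Dataset"},
        "name": {"type": "string"},
        "description": {"type": "string"},
        "url": {"type": "string", "format": "uri"},
    },
}

record = {
    "@type": "Dataset",
    "name": "Example COVID-19 case counts",
    "description": "Toy record used only to illustrate validation.",
    "url": "https://example.org/dataset/1",
}
validate(instance=record, schema=dataset_class)  # raises ValidationError on failure
print("record conforms to the customized class")
```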
Rahimian, K.; Mahmanzar, M.; Mahdavi, B.; Arefian, E.; Kuehu, D. L.; Deng, Y.
Coronavirus disease 2019 (COVID-19) is a highly pathogenic viral infection caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), resulting in the global pandemic of 2020. The lack of therapeutic and preventive approaches, including drugs and vaccines, quickly posed significant threats to world health. A comprehensive understanding of the evolution and natural selection of SARS-CoV-2 with respect to host interaction and symptoms at the phenotype level could inform candidate strategies for the fight against this virus. SARS-CoV-2 Mutation (SARS2Mutant, http://sars2mutant.com/) is a database that provides comprehensive analysis results based on tens of thousands of high-coverage, high-quality SARS-CoV-2 complete protein sequences. The database is designed to let users search amino acid substitution mutations using three different strategies: by gene name, by geographical zone, or by comparative analysis. For each strategy, five data types are available to the user: mutated sample frequencies, heat maps of mutated amino acid positions, timeline trends of mutation survival and natural selection, and charts of changed amino acids and their frequencies. Because new virus protein sequence samples are published daily, all sequences in the database are reanalyzed and updated monthly to reflect the latest trends. SARS2Mutant provides current, regularly updated analyses of mutation patterns and conserved regions, which are helpful for developing and designing targeted vaccines, primers, and drug discovery.
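A toy sketch of the substitution-frequency computation behind such a database: count amino acid changes at each position relative to a reference. The sequences below are truncated, made-up strings, not real SARS-CoV-2 proteins:

```python
# Count amino acid substitutions per position against a reference sequence.
from collections import Counter, defaultdict

reference = "MFVFLVLLPLVS"                       # truncated toy reference
samples = ["MFVFLVLLPLVS", "MFVFLALLPLVS", "MFVFLALLPLVF"]

subs = defaultdict(Counter)
for seq in samples:
    for pos, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
        if aa != ref_aa:
            subs[pos][f"{ref_aa}{pos}{aa}"] += 1

for pos in sorted(subs):
    print(pos, dict(subs[pos]))   # e.g. 6: {'V6A': 2}, 12: {'S12F': 1}
```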
Tian, S.; Zhang, J.
The BioCreative VII Track 5 calls for participants to tackle the multi-label classification task of automated topic annotation of COVID-19 literature. In our participation, we evaluated several deep learning models built on PubMedBERT, a pre-trained language model, with different strategies for addressing the challenges of the task. Specifically, multi-instance learning was used to deal with the large variation in article lengths, and a focal loss function was used to address the imbalance in the distribution of different topics. We found that the ensemble model performed best among all the models we tested. Test results of our submissions showed that our approach achieved satisfactory performance, with an F1 score of 0.9247, significantly better than the baseline model (F1 score: 0.8678) and the mean of all submissions (F1 score: 0.8931).
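Focal loss down-weights easy examples so rare topics contribute more to the gradient. A sketch of the standard multi-label (sigmoid) formulation; the gamma value and shapes are illustrative, not necessarily the authors' settings:

```python
# Focal loss for multi-label topic classification in PyTorch.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0):
    """logits, targets: (batch, n_topics); targets are 0/1 floats."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                 # probability assigned to the true label
    return ((1 - p_t) ** gamma * bce).mean()

logits = torch.randn(4, 7)                # 4 articles, 7 COVID-19 topics
targets = torch.randint(0, 2, (4, 7)).float()
print(focal_loss(logits, targets))
```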
Choi, D.; Gu, Y.; Zong, K.; Lain, A. D.; Zaikis, D.; Rowlands, T.; Rei, M.; the CoDiet Consortium; Beck, T.; Posma, J. M.
Diet plays a critical role in human health, with growing evidence linking dietary habits to disease outcomes. However, extracting structured dietary knowledge from biomedical literature remains challenging due to the lack of dedicated relation extraction datasets. To address this gap, we introduce RECoDe, a novel relation extraction (RE) dataset designed specifically for diet, disease, and related biomedical entities. RECoDe captures a diverse set of relation types, including a broad spectrum of positive association patterns and explicit negative examples, with over 5,000 human-annotated instances validated by up to five independent annotators. Furthermore, we benchmark various natural language processing (NLP) RE models, including BERT-based architectures and enhanced prompting techniques with locally deployed large language models (LLMs), to improve classification performance on underrepresented relation types. The best-performing model, gpt-oss-20B, a local LLM, achieved an F1 score of 64% for multi-class classification and 92% for binary classification using a hierarchical prompting strategy with a separate, built-in reflection step. To demonstrate the practical utility of RECoDe, we introduce the Contextual Co-occurrence Summarisation (Co-CoS) framework, which aggregates sentence-level relation extractions into document-level summaries and further integrates evidence across multiple documents. Co-CoS produces effect estimates consistent with established dietary knowledge, demonstrating its validity as a general framework for systematic evidence synthesis. Availability: The code, models, and data will be made freely available upon acceptance.
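A minimal sketch of hierarchical prompting with a reflection step as described above: first a binary relation/no-relation call, then a multi-class call, then a self-check. `ask` stands in for any local LLM endpoint (e.g., gpt-oss-20B served locally); the prompts and label set are illustrative, not the paper's:

```python
# Two-stage relation extraction with a verification (reflection) pass.
def extract_relation(sentence: str, e1: str, e2: str, ask) -> str:
    # Stage 1: binary gate -- is there any relation at all?
    if "yes" not in ask(
        f"Does this sentence state a relation between '{e1}' and '{e2}'? "
        f"Answer yes or no.\n{sentence}"
    ).lower():
        return "no_relation"
    # Stage 2: multi-class label among the fine-grained relation types.
    label = ask(
        f"Classify the relation between '{e1}' and '{e2}' in:\n{sentence}\n"
        "Choose one: positive_association, negative_association, no_effect."
    ).strip()
    # Reflection: ask the model to verify its own label before accepting it.
    verdict = ask(
        f"You labelled the relation as '{label}'. Re-read the sentence and "
        f"answer yes if the label is supported, otherwise no.\n{sentence}"
    )
    return label if "yes" in verdict.lower() else "uncertain"
```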
Freedman, H. G.; Williams, H.; Miller, M.; Birtwell, D.; Stoeckert, C. J.
Standardizing clinical information in a common data model is important for promoting interoperability and facilitating high-quality research. Semantic Web technologies such as the Resource Description Framework (RDF) can be utilized to their full potential when a clinical data model accurately reflects the reality of the clinical situation it describes. To this end, the Open Biomedical Ontologies Foundry provides a set of ontologies that conform to the principles of realism and can be used to create a realism-based clinical data model. However, the challenge of programmatically defining such a model and loading data from disparate sources into it has not been addressed by pre-existing software solutions. The PennTURBO Semantic Engine is a tool developed at the University of Pennsylvania that works in conjunction with data aggregation software to transform source-specific RDF data into a source-independent, realism-based data model. This system sources classes from an application ontology and specifically defines how instances of those classes may relate to each other. Additionally, the system defines and executes RDF data transformations by launching dynamically generated SPARQL update statements. The Semantic Engine was designed as a generalizable RDF data standardization tool and is able to work with various data models and incoming data sources. Its human-readable configuration files can easily be shared between institutions, providing the basis for collaboration on a standard realism-based clinical data model.
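A hedged sketch of the dynamically generated SPARQL UPDATE pattern: rewrite source-specific triples into instances of an ontology class. The predicates, IRIs, and class choice below are placeholders for illustration, not the PennTURBO data model:

```python
# Generate and run a SPARQL UPDATE that standardizes source-specific RDF.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/source/")

g = Graph()
g.add((EX.row1, EX.diagnosisText, Literal("type 2 diabetes")))

g.update("""
    PREFIX ex:  <http://example.org/source/>
    PREFIX obo: <http://purl.obolibrary.org/obo/>
    INSERT { ?row a obo:OGMS_0000073 .   # hypothetical 'diagnosis' class IRI
             ?row ex:standardized true . }
    WHERE  { ?row ex:diagnosisText ?txt . }
""")
print(g.serialize(format="turtle"))
```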
Yu, W.; Gwinn, M.; Khoury, M. J.
Summary: We developed a new online database that contains the most up-to-date published scientific literature, online news and reports, and CDC and National Institutes of Health (NIH) resources. The tool captures emerging discoveries and applications of genomics, molecular, and other precision medicine and precision public health tools in the investigation and control of coronavirus diseases, including COVID-19, MERS-CoV, and SARS. Availability: The Coronavirus Disease Portal (CDP) can be freely accessed via https://phgkb.cdc.gov/PHGKB/coVInfoStartPage.action. Contact: wyu@cdc.gov