Patterns
Elsevier BV
Preprints posted in the last 30 days, ranked by how well they match Patterns's content profile, based on 70 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit.
Arabi, S.; Hutchins, B. I.
Early identification of promising drug research topics is challenging yet crucial for the scientific community to accelerate the development of novel therapeutics. In this work, we leverage large-scale public data from the biomedical literature to extract predictive features to identify promising therapeutic research topics at an early stage. We divide the global citation graph of biomedical literature into a time series of research topics and extract topic features based on citation activity, publication content, and measurable flocking of scientists into novel research topics. Based on these features, our machine learning model identifies research topics that in the future yield Food and Drug Administration (FDA)-approved drugs years before approval (F1-score of 0.84). 80% of target drugs are predicted in advance, with 65% predicted 8 or more years before approval. This predates the start of phase 2 clinical trials in the vast majority of positive predictions. These results show this approach can efficiently flag research topics generating approved drugs several years prior to approval using public data that would have been contemporaneous at the time of prediction. Thus, reliable forecasting can be accomplished with a high-level view of the publication and citation behavior of scientists, without depending on clinical trial data that may only be deposited with a significant lag. This demonstrates that it is possible to detect early signals of future FDA-approved therapies even without any specialized information about these applied research efforts.
Teaser: Large-scale data analysis can use the full set of scientific citations to predict which areas of research will yield new FDA-approved drugs, years in advance.
Jovanovic, M.; Weidener, L. S.; Brkic, M.; Ulgac, E.; Meduri, A.
Drug-induced inhibition of the hERG potassium channel is the leading cause of cardiac safety-related drug attrition, but the Comprehensive in Vitro Proarrhythmia Assay (CiPA) framework requires activity data on multiple cardiac ion channels to assess proarrhythmic risk. We present CardioSafe, a three-branch multi-task neural network with cross-attention fusion that integrates chemical fingerprints, ChemBERTa embeddings, and predicted L1000 transcriptomic features to predict blocker status and potency for hERG, Nav1.5, and Cav1.2, with an exploratory IKs head. CardioSafe was trained on the largest publicly reported multi-channel cardiac ion channel dataset, combining ChEMBL 36 with the hERGCentral database (331127 hERG, 3160 Nav1.5, 1138 Cav1.2, and 115 IKs compounds), curated under a pharmacology-aware policy that retains censored measurements and inhibition-percentage votes. Under Tanimoto-similarity-controlled splits, CardioSafe outperforms the leading published comparators (CToxPred2 and CardioGenAI) on the data-rich hERG head; on the smaller Nav1.5 and Cav1.2 heads the standard evaluation is statistically inconclusive. A reverse-leak audit revealed that 22% of Nav1.5 and 21% of Cav1.2 test compounds were present in the published comparators' training data (92% as exact compound matches); after removing these contaminated compounds, CardioSafe's lead on Nav1.5 and Cav1.2 also reaches statistical significance, demonstrating that prior cross-publication benchmarks for these channels were inflated by training-data overlap.
Scientific contribution: We present the first multi-task neural network jointly predicting blocker activity for the three primary CiPA cardiac ion channels (hERG, Nav1.5, Cav1.2) within a single architecture. We introduce a reverse-leak audit methodology that reveals systematic test-set contamination in cross-publication cardiac safety benchmarks, establishing a stricter evaluation protocol. We provide an empirical test of predicted L1000 transcriptomic features as auxiliary input for cardiac ion channel prediction and document a well-characterized negative result.
Graphical abstract: CardioSafe encodes each query SMILES with three branches (chemical fingerprints + descriptors, pretrained ChemBERTa, and predicted L1000 transcriptomic signatures), fuses them via a cross-attention block with four learnable per-channel query tokens, and emits binary blocker calls plus pChEMBL regression for hERG, Nav1.5, Cav1.2, and (exploratory) IKs.
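The reverse-leak audit described in the CardioSafe abstract can be pictured as an overlap check between one model's held-out test compounds and another model's training set. The following is a minimal sketch of the exact-match part only, not the authors' implementation; the function name `leak_audit` and the toy compound identifiers are hypothetical, and a real audit would presumably compare canonicalized structures and also look for near-duplicates.

```python
def leak_audit(test_ids, comparator_train_ids):
    """Split test compounds into leaked/clean by exact identifier match.

    Returns (leaked, clean, leak_fraction). Identifiers are compared after
    stripping whitespace; no chemical canonicalization is attempted here.
    """
    train = {s.strip() for s in comparator_train_ids}
    leaked = [s for s in test_ids if s.strip() in train]
    clean = [s for s in test_ids if s.strip() not in train]
    frac = len(leaked) / len(test_ids) if test_ids else 0.0
    return leaked, clean, frac


# Toy identifiers (hypothetical placeholders, not real compounds).
test_set = ["CPD-A", "CPD-B", "CPD-C", "CPD-D", "CPD-E"]
comparator_train = ["CPD-B", "CPD-X", "CPD-E "]

leaked, clean, frac = leak_audit(test_set, comparator_train)
print(f"leaked={leaked} clean={clean} fraction={frac:.2f}")
```

Re-scoring every model on `clean` alone is what separates genuine generalization from memorized test compounds.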
Qin, Y.; Peng, Y.; Chen, Q.; Chen, J.; Ren, P.; Deng, H.; Wang, D.; Liu, X.; Ou, Z.; Deng, Z.; Shi, X.
Spatial transcriptomic studies of infectious diseases still rely on fragmented data analysis processes. Here, we developed STID, a standardized framework for spatial transcriptomic analysis of infectious diseases that leverages the Seurat ecosystem and incorporates Python-based modules. STID provides an extensible infection-specific data structure and supports a full suite of analyses, such as pathogen background correction, infection-associated spot and niche identification, single-sample niche characterization, and multi-sample comparative and temporal analyses. Moreover, STID is broadly applicable to spatial transcriptomic data from infectious diseases caused by bacteria, viruses, and parasites, and enables systematic characterization of the structural features, cellular composition, molecular functions, and host-pathogen interactions within pathogen-infected and/or host-responsive niches. Overall, STID provides an accessible, reproducible, and extensible framework for analyzing infection-associated spatial transcriptomic data and for dissecting host-pathogen interactions in their native spatial microenvironments.
Motivation: Spatial transcriptomics technologies have emerged as powerful approaches for dissecting the structural and functional features of spatial microenvironments. However, the current general-purpose tools remain fundamentally inadequate for resolving the spatial heterogeneity of infectious disease samples, where the intricacies of host-pathogen interactions render spatial microenvironments both challenging to dissect and largely inaccessible. Tools tailored to infectious diseases are critically lacking, including those for reducing pathogen-derived background noise, identifying and isolating infection-associated spots or niches, dissecting host-pathogen interactions, and supporting systematic multi-sample analyses. We therefore developed STID, a unified framework that integrates standardized workflows and addresses the analytical bottlenecks in spatial transcriptomic analysis of infectious diseases.
Highlights:
STID standardizes spatial transcriptomic analysis in infectious diseases
STID improves pathogen-infected spot detection by correcting pathogen background
STID distinguishes pathogen-infected and host-responsive niches
STID supports multi-sample comparative and temporal analyses of niches
Yin, Q.; Chen, L.
Programmed cell death (PCD) encompasses multiple regulated processes whose dysregulation shapes cancer fitness, yet current computational studies largely use known PCD genes for prognosis rather than discovering regulators. We developed xNNPCD, an interpretable neural-network framework that links CRISPR-Cas9 perturbation signatures from CMap to gene dependency profiles from DepMap. The model constrains hidden neurons to five PCD pathways and iteratively refines a prior gene-pathway mask matrix derived from GO, KEGG, and Reactome using pathway-neuron ablation. This converts binary gene-pathway relationships into continuous-valued associations and improves dependency prediction over random forests, a standard fully connected multi-layer perceptron, and its own non-iterative variant. The learned matrix recovers annotated death regulators and nominates candidate regulators, including RPL23A, HSPA5, SNRPA1, SLC6A2, and ASAH1; combined with dependency scores, it further separates pathway coupling from regulatory direction. Transferring the refined relationship matrix and learned weights to compound-induced perturbation data enables in silico drug screening, identifying BRD-K19103580 and decitabine as targeted therapeutic agents for apoptosis and ferroptosis, respectively. The pathway-resolved drug profiles can facilitate the rational design of combination therapies targeting complementary PCD pathways to overcome single-pathway resistance. Overall, xNNPCD offers a generalizable, interpretable approach for mapping the regulatory landscape and elucidating the molecular processes of PCD in cancer.
Ren, H.-C.; Gu, Y.-X.
Pharmacokinetic analysis has spent half a century compressing drug concentration-time curves into scalar summaries--AUC, Cmax, clearance--discarding the shape information that encodes mechanistic fingerprints of the underlying physiology. We introduce Topological Pharmacokinetics (TPK), a framework that reads the shape of pharmacokinetic trajectories directly from data without prior commitment to a compartmental model. TPK uses delay embedding to reconstruct the pharmacokinetic attractor from the concentration-time curve, and persistent homology to extract its topological invariants--connected components and loops--as a Pharmacokinetic Topological Invariant (PTI) vector. We validate TPK across three levels: linear systems (negative control), nonlinear saturable elimination (detection of the N_PTP +1 rule and a nonlinear diagnostic triad), and endogenous circadian rhythms (contrastive detection of rhythmic interference via Dev specificity and Decouple Collapse). The PTI vector provides a model-agnostic shape fingerprint that, in simulation, demonstrates the diagnostic potential of shape-based analysis; validation on experimental data is required to assess whether this potential generalizes to real pharmacokinetic data. All findings are demonstrated as proof of concept on simulated data; validation on experimentally measured concentration-time curves is the essential next step.
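The delay-embedding step named in the TPK abstract can be sketched in a few lines. This is a generic time-delay embedding under assumed settings, not the authors' pipeline: the function name `delay_embed`, the embedding dimension, the lag, and the one-compartment test curve with its parameters are all illustrative choices, and the persistent-homology step that would follow is not shown.

```python
import math


def delay_embed(series, dim, tau):
    """Map a scalar series x to vectors (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    n_vectors = len(series) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for this dim and tau")
    return [
        tuple(series[i + k * tau] for k in range(dim))
        for i in range(n_vectors)
    ]


# Simulated mono-exponential decay C(t) = C0 * exp(-k t); C0 and k are made up.
times = [0.5 * i for i in range(20)]
conc = [10.0 * math.exp(-0.3 * t) for t in times]

# Each point of the reconstructed trajectory is a 3-vector of lagged samples.
points = delay_embed(conc, dim=3, tau=2)
print(len(points), points[0])
```

The resulting point cloud is the object on which topological summaries such as connected components and loops would then be computed.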
Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance - not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 - 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
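To make the fold-dependence problem concrete, the sketch below contrasts the naive paired t-statistic with a well-known correction, Nadeau and Bengio's corrected resampled t-test, which inflates the variance by a term depending on the test/train size ratio. This is not the SHARP test, whose details the abstract does not give; it is one example of an existing dependence-aware test that rests on an assumed correlation structure. The per-fold differences and sample sizes are made up.

```python
import math
from statistics import mean, variance


def paired_t_statistics(diffs, n_train, n_test):
    """Return (naive_t, corrected_t) for per-fold performance differences.

    naive:     variance of the mean taken as s^2 / J (assumes independent folds)
    corrected: variance taken as (1/J + n_test/n_train) * s^2 (Nadeau-Bengio)
    """
    j = len(diffs)
    d_bar = mean(diffs)
    s2 = variance(diffs)  # sample variance with J - 1 in the denominator
    naive = d_bar / math.sqrt(s2 / j)
    corrected = d_bar / math.sqrt((1.0 / j + n_test / n_train) * s2)
    return naive, corrected


# Hypothetical accuracy differences (model A minus model B) over 10 folds.
diffs = [0.02, 0.01, 0.03, 0.015, 0.025, 0.02, 0.01, 0.03, 0.02, 0.02]
naive, corrected = paired_t_statistics(diffs, n_train=900, n_test=100)
print(f"naive t = {naive:.2f}, corrected t = {corrected:.2f}")
```

The corrected statistic is always smaller in magnitude than the naive one, which is the direction needed to curb inflated false positive rates.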
Vindas Yassine, Y. E.; Bornet, A.; Abbas, M.; Geissbuehler, D.; Rodrigues-Jr, J. F.; Teodoro, D.
Transmissible hospital-acquired infections (HAIs) arise from complex, time-varying interactions among patients, healthcare workers, and clinical environments. Although data-driven approaches like graph neural networks (GNNs) effectively model these contacts, they often function as black boxes that overlook established epidemiological principles, limiting interpretability and clinical trust. Inspired by physics-informed neural networks, we propose an epidemiology-informed GNN (EIGNN) framework for patient-level state transition prediction in dynamic hospital settings, integrating mechanistic epidemiological models into GNNs in a principled manner. Patient-level risk factors learned from dynamic contact networks are jointly leveraged to infer latent epidemiological states, predict state transitions across multiple horizons, and estimate key epidemiological parameters, including transmission and recovery rates. We evaluate the approach on a real-world hospital-onset COVID-19 cohort and two public datasets simulating viral and bacterial HAIs. Across multiple architectures and horizons, EIGNNs achieve AUC-ROC up to 98.46% while providing interpretable, mechanistically consistent insights, offering a transparent tool for infection prevention and control.
Ayyalasomayajula, V. S. R. K.; Senders, M. L.; Wolterink, J. M.; Yeung, K. K.
Peripheral artery disease (PAD) is a complex vascular disorder characterized by heterogeneous molecular mechanisms and incomplete functional annotation, limiting systematic biomarker discovery. Network-based learning approaches provide a powerful framework for disease gene prioritization; however, most existing methods produce overconfident predictions without explicitly accounting for model uncertainty or structural novelty. Here, we present an uncertainty-aware framework for PAD biomarker discovery that integrates unsupervised graph representation learning, positive-unlabeled (PU) classification, ensemble prediction, and mechanistic explainability. Node embeddings were learned using multiple unsupervised graph neural network (GNN) objectives and combined with heterogeneous classifiers to generate ensemble-averaged probability estimates and epistemic uncertainty. By jointly modeling predictive confidence and embedding-space novelty, we stratified candidates into high-confidence rediscoveries and structurally novel hypotheses under explicit uncertainty control. Across eight embedding objectives and five classifiers, ensemble aggregation produced stable, well-calibrated predictions and enabled prioritization of 100 candidate PAD-associated proteins. Probability-heavy candidates clustered tightly with known PAD proteins and were enriched for established vascular and hemostatic pathways, including extracellular matrix organization, integrin signaling, coagulation, and fibrinolysis. In contrast, novelty-heavy candidates occupied distinct embedding-space regions and partitioned into multiple coherent clusters enriched for upstream regulatory and signaling processes, including G protein-coupled receptor, ephrin receptor, kinase-driven, and NF-κB-associated pathways. Five-fold cross-validated comparison with established PU learning baselines demonstrated consistent improvement across all evaluation metrics (AUC 0.916 ± 0.019 vs. 0.821 ± 0.030 for the best baseline), and external validity was confirmed by significant enrichment of top candidates for related cardiovascular disease annotations (5.7x above background). Together, these results demonstrate that integrating uncertainty, novelty, and explainability enables calibrated and biologically grounded biomarker prioritization, with broad applicability to PAD and other complex diseases.
Author summary: Peripheral artery disease affects millions of people worldwide but remains underdiagnosed, partly because we lack reliable molecular markers to detect it early. In this study, we developed a computational framework that uses protein interaction network data to predict which proteins may be involved in PAD, even when we only know a small number of confirmed disease-associated proteins. Our approach combines graph neural network embeddings with a machine learning technique called positive-unlabeled learning, which is specifically designed for situations where you have confirmed positives but no confirmed negatives. We also quantify how confident the model is in each prediction and identify candidates that are genuinely novel compared to what is already known. Tested against established methods, our framework consistently found more known disease proteins in cross-validated evaluation. The candidates we identified map to biologically coherent pathways relevant to vascular disease, and our top predictions are enriched for proteins associated with related cardiovascular conditions, providing external validation. This work provides a principled and transparent approach to biomarker discovery that could be applied to other complex diseases with limited molecular annotations.
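One common way to obtain an ensemble-averaged probability together with an uncertainty estimate is to take the mean and the spread of the member predictions for each candidate. The sketch below assumes that reading; the abstract does not say how the authors define epistemic uncertainty, so using the standard deviation across members is an assumption, and the function name `ensemble_summary` and the probabilities are illustrative.

```python
from statistics import mean, pstdev


def ensemble_summary(prob_matrix):
    """Summarize member predictions per candidate.

    prob_matrix[m][i] is the probability that ensemble member m assigns to
    candidate i. Returns a list of (mean_probability, disagreement) pairs,
    where disagreement is the population standard deviation across members.
    """
    n_candidates = len(prob_matrix[0])
    summary = []
    for i in range(n_candidates):
        column = [row[i] for row in prob_matrix]
        summary.append((mean(column), pstdev(column)))
    return summary


# Three hypothetical ensemble members scoring three hypothetical proteins.
probs = [
    [0.90, 0.50, 0.10],
    [0.92, 0.20, 0.12],
    [0.88, 0.80, 0.08],
]
summary = ensemble_summary(probs)
for idx, (p, u) in enumerate(summary):
    print(f"candidate {idx}: mean={p:.3f} disagreement={u:.3f}")
```

Here the second candidate has a middling mean but high disagreement, the kind of case an uncertainty-aware ranking would treat differently from a confident prediction.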
Loecker, J.; Pujara, N.; Bryant, W.; Puniya, B. L.; Packrisamy, P.; Hamed, A.; Helikar, T.
Constraint-based metabolic modeling is a powerful way to study the mechanistic basis of cellular states and disease, but effective use demands substantial computational expertise and careful coordination of multi-step analyses. We developed MechAInistic to lower this barrier, enabling researchers to ask complex biological questions in natural language. MechAInistic is a multi-agent system harnessing large language models organized around an Architect-Reviewer pattern that converts a natural-language question into an executable, model-grounded workflow and produces a structured report. It supports pathway comparison, perturbation analysis, drug-target exploration, and literature interpretation across healthy and disease paired states. We evaluated MechAInistic's therapeutic hypothesis generation using two immune-cell use cases. For rheumatoid arthritis/healthy Naive B models, it identified mitochondrial metabolic rewiring and nominated Devimistat/CPI-613 as an investigational OGDH-centered hypothesis. In CD4+ Th17 multiple sclerosis/healthy models, the workflow identified NADP-dependent isocitrate dehydrogenase as the optimal target and proposed Ivosidenib as an FDA-approved repurposing candidate.
White, D.; Uzun, A.
Cancer incidence varies substantially across geographic regions and demographic groups, yet translating large-scale surveillance datasets into accessible, interpretable visualizations remains a challenge for researchers and public health professionals without computational expertise. We developed OncoContour, an interactive web-based platform that enables geographic visualization and demographic analysis of cancer incidence data through a browser-accessible interface. To demonstrate its capabilities, we analyzed publicly available cancer incidence data from the United States Cancer Statistics database via CDC WONDER, covering five major cancer types across four northeastern U.S. metropolitan statistical areas from 2017 through 2022, supplemented by demographic data from the U.S. Census Bureau American Community Survey. OncoContour integrates population distribution heatmaps, per-capita cancer incidence heatmaps, interactive multi-city temporal trend charts, structured cancer data tables, and demographic visualizations covering race, ethnicity, age, and sex distributions into a single dynamically generated HTML report. The platform is implemented in Python using Flask, Folium, Plotly, and Matplotlib, and is containerized using Docker for reproducible local deployment. Across all four metropolitan areas, breast and prostate cancers accounted for the highest incidence counts over the study period, while a decline in reported cases observed in 2020 is consistent with documented disruptions to cancer screening during the COVID-19 pandemic. By integrating geospatial mapping, temporal analysis, and demographic visualization within a unified, no-code interface, OncoContour aims to support cancer surveillance, epidemiological investigation, and targeted public health planning. OncoContour is freely available at https://github.com/alperuzun/oncocontour_docker.
Sangüesa Recalde, M.; De Andrea, C. E.; Ariz, M.
Multiplexed imaging technologies enable the simultaneous measurement of dozens of protein markers while preserving context, providing a high-resolution view of tissue organization schemes. However, extracting meaningful insights from these high-dimensional datasets--particularly in hyperplex settings (>20 markers)--remains a major computational challenge, especially in the absence of annotated data. Here, we present UMITIC (Unsupervised Analysis of Multiplex Images via TIssue Characterization), a modular and unsupervised computational framework for the joint characterization of cell phenotypes and tissue neighborhoods from multiplex imaging data. UMITIC integrates three components: (i) CellCut, a strategy that combines nuclear and cytoplasmic predictions to improve the delineation capabilities of the framework; (ii) CellMap, a contrastive learning approach that generates low-dimensional representations of single-cell image crops that are enriched with morphological features; and (iii) TissueNet, a graph neural network that models spatial cell-cell interactions to identify tissue neighborhoods. We evaluated UMITIC across four datasets of increasing complexity to assess its robustness, scalability and biological relevance. With respect to a 7-plex human tonsil dataset, the framework identified canonical immune cell populations and reconstructed well-established anatomical regions. When applied to a 43-plex tonsil image, UMITIC preserved these tissue-level structures while enabling a finer cell subtype stratification process driven by increased marker dimensionality. We further validated our method on a 58-plex colorectal cancer cohort, where UMITIC was able to recover previously reported immune composition differences and spatial organization variations between patient groups with different prognoses. 
Finally, when an expert-annotated mass cytometry imaging dataset concerning human lung tissue was used, UMITIC achieved higher agreement with the reference tissue annotations than the existing approaches did, demonstrating improved lung microanatomy reconstruction accuracy. Together, these results show that UMITIC enables consistent and interpretable analyses of both cellular phenotypes and tissue architectures across diverse multiplex and hyperplex imaging datasets without the need for manual annotations.
Author summary: Understanding how cells are organized within tissues is fundamental to deciphering diseases, yet analyzing tissue imaging data remains a major challenge. The recently developed imaging technologies enable the visualization of dozens of proteins in a single tissue section, revealing unprecedented cell identity and spatial organization details. However, extracting meaningful biological insights requires extensive manual annotation work performed by expert pathologists, limiting the scalability. Here, we present a fully automated computational framework that characterizes tissue architectures in an unsupervised manner at two complementary levels: it identifies cell types based on their protein expressions and morphologies and maps how those cells are organized into spatially coherent tissue structures, and it does so without requiring any manual annotations. Our approach is modular and interpretable at the cell level. We validated our framework across four independent datasets with panels consisting of 7 to 58 simultaneous protein markers, including healthy human tissue and a colorectal cancer cohort in which patients with distinct immune profiles were analyzed. Remarkably, UMITIC improved upon the performance of existing methods across both qualitative and quantitative assessments. 
These results suggest that our framework provides objective, interpretable and reproducible image processing tools for conducting tissue analyses in both research and clinical settings.
Jung, K. J.; Qiu, J.; Cho, S.; McDonough, E.; Chadwick, C.; Ghose, S.; West, R. B.; Brooks, J. D.; Ginty, F.; Machiraju, R.; Mallick, P.
Accurate prognostic assessment of prostate cancer (PCa) requires an integrated understanding of tissue morphology (encompassing cell structure, glandular architecture, and tissue organization) and the immune environment. We present Prostate-TriMod, a novel tri-modal histology dataset designed to integrate high-resolution visual morphology with spatial tissue maps, immune infiltration patterns, and clinical outcomes. This dataset, generated from the Cell DIVE multiplexed imaging platform, consists of three synchronized modalities: (1) multiscale virtual H&E tiles (224px, 256px, 512px, and 2040px) providing visual morphological context, (2) spatial tissue maps identifying cancerous/non-cancerous epithelial cells, stroma and immune cell populations (via TOPAZ and CAT models), and (3) text captions generated from single-cell data and patterns. The dataset includes comprehensive clinical annotations, including Grade Groups and biochemical recurrence (BCR) status. By providing high-fidelity alignment between visual features, spatial tissue maps, and textual descriptions, Prostate-TriMod empowers the development of advanced multimodal AI frameworks. We expect this resource to support reuse in multimodal representation learning, spatial analysis, and benchmarking studies that link histology morphology and immune context to clinical outcomes in prostate cancer.
Devlin, L. M.; Nguyen, P. H.; Cuthbert, R.; Doan, P. N.; Tran, V. H.; Zhang, Z.; Murchie, A. K.; Bamford, C. G. G.; Dick, J. T. A.; Morgan, E. R.; Mai, T. S.
Reliable early warning of infectious disease outbreaks remains a major challenge for surveillance systems, particularly for vector-borne pathogens whose transmission depends on interactions among hosts, vectors, and climate-sensitive environmental conditions. Data-driven forecasting offers a promising approach for predicting outbreak risk using surveillance and environmental data. This study develops a logit-weighted ensemble (LWE), a machine-learning framework that predicts outbreak occurrence 1-6 months ahead at the administrative unit-month scale using routinely available outbreak notifications and gridded climate data. Bluetongue virus (BTV), an arbovirus of ruminants transmitted by Culicoides biting midges, provides a well-characterised system in which transmission is strongly shaped by climate, making it a useful system for applying and testing this approach. The framework is evaluated using surveillance data collected between 2005 and 2024 from France, Greece, and Italy, selected for their long-running and high-quality outbreak surveillance records. Across all three countries, the LWE achieved the strongest and most stable predictive performance under a recall-focused evaluation that prioritises correctly identifying outbreak months. It outperformed or matched 14 benchmark models, with differences becoming more pronounced at longer lead times (month +3 onward), when predictions are more uncertain and outbreaks are relatively rare. Predictability varied across countries, with the highest performance in Greece, strong performance in France, and lower, more variable performance in Italy, reflecting differences in how consistently outbreaks occur and spread across regions. Overall, the results demonstrate that horizon-aware, climate-informed forecasting can reliably identify months and locations at elevated risk of outbreak occurrence up to six months in advance, supporting surveillance planning and preparedness across heterogeneous European settings. 
The ensemble framework provides a robust and portable strategy for outbreak prediction using routinely collected surveillance and environmental data.
Author summary: Predicting infectious disease outbreaks before they occur remains a major challenge, particularly for diseases influenced by environmental conditions. In this study, we focus on bluetongue, a viral disease of livestock transmitted by biting midges, where transmission is strongly affected by climate and seasonal patterns. We develop a method that uses routinely collected outbreak reports and climate data to estimate where and when outbreaks are more likely to occur, up to six months in advance. We apply this approach across three European countries with a history of bluetongue outbreaks. We find that combining climate information with recent outbreak patterns can provide useful early signals of increased risk. Predictions are most accurate at shorter timeframes, but longer-range forecasts can still support planning and preparedness. Because our approach uses widely available data, it could be applied in other regions or to similar environmentally driven diseases. However, it does not include factors such as vaccination, animal movement, or detailed information on vector populations, which may also influence how outbreaks develop.
Chen, Y.; Giuliano, V.; Dacillo, I.; Lin, W.; Yan, Y.; Luo, P.
Accurate prioritization of T-cell receptor (TCR)-epitope interactions and identification of tumor-reactive T cells are important but difficult steps in immunotherapy-oriented bioinformatics workflows. Existing methods typically address these tasks separately and either model TCR-epitope pairs as independent observations or rely primarily on transcriptomic signatures. In this study, we present TRACE (TCR-epitope pRioritization And T-Cell idEntification), a graph-based computational workflow that unifies both applications within a single heterogeneous graph framework. The protocol represents TCRs, epitopes, and T cells as typed nodes connected by similarity and association edges, and combines pretrained sequence embeddings with edge-aware graph attention, Laplacian positional encoding, and bidirectional cross-domain attention. Applied to the IEDB and VDJdb benchmarks, TRACE achieved AUROC/AUPR values of 0.937/0.922 and 0.992/0.990, respectively, outperforming five state-of-the-art algorithms. In addition, on a single-cell RNA-seq dataset, the workflow achieved an AUROC of 0.984 and an AUPR of 0.984, substantially exceeding transcriptomic signature-based baselines for tumor-reactive T-cell identification. Ablation analysis showed that Laplacian positional encoding provided the largest performance gain, particularly in sparse graph settings. These results suggest that heterogeneous graph modeling can serve as a practical protocol for integrating receptor sequence, antigen context, and cellular phenotype in computational immunology.
Wang, W.-T.; Zhou, M.; Tong, J.; Lin, M.-J.; Ke, A.; Wei, M.; Xu, Z.; Tai, H.; Parvathaneni, A.; Hill, K. T.; Cohen, S. R.; Petukhova, L.; Chiu, E. S.; Wang, F.; Lu, C. P.; Su, C.
Complex human diseases exhibit substantial clinical heterogeneity driven by poorly understood molecular mechanisms, while many also lack sufficient molecular and omics data for mechanistic investigation, hindering therapeutic development. We introduce PiMInfer, a phenotype-to-mechanism framework that combines deep phenotypic characterizations derived from widely available real-world clinical data with a biomedical knowledge graph approach to resolve disease clinical heterogeneity into phenotype-informed molecular modules, thereby accelerating therapeutic target discovery. We applied PiMInfer to investigate Hidradenitis Suppurativa (HS), an autoimmune skin disease with poorly understood pathogenesis and limited treatment options. PiMInfer identified a coherent, phenotype-informed HS gene module (PiHSM) and functional endotypes, which were validated using multimodal evidence. In silico drug repurposing using PiHSM prioritized Carfilzomib, targeting the immunoproteasome subunit PSMB9, essential for MHC Class I antigen presentation. Preclinical testing using human patient lesional skin explants confirmed its anti-inflammatory activity and demonstrated a significant downregulation of IFN-γ, IL-17, and mTOR signaling pathways within the HS lesional microenvironment through single-cell RNA sequencing. PiHSM-based network predictions further suggest potentially enhanced efficacy from combining Carfilzomib with approved HS agents. Collectively, PiMInfer provides a scalable framework that bridges real-world phenome-wide comorbid associations to mechanism-anchored therapeutic discovery, enabling a paradigm shift in precision medicine approaches for complex diseases with limited molecular characterization and in need of better therapeutic strategies.
Vomo-Donfack, K. L.; Bousquet, G.; Falgarone, G.; Ginot, G.; Morilla, I.
Whole-genome sequencing comprehensively captures coding, non-coding and structural variation in families with suspected inherited disorders, yet its clinical utility remains constrained by an interpretation bottleneck: selecting a handful of relevant variants from millions of candidates. Current rule-based pipelines, anchored in ACMG/AMP criteria, excel at identifying highly penetrant Mendelian alleles but frequently miss variants of low-to-moderate penetrance, non-coding alterations and germline-somatic interactions. Here we introduce PolyCLIP-T, a topology-guided multimodal framework that transforms variant selection from a classification problem into a geometric discovery task. By contrastively aligning DNA-sequence embeddings with functional annotations, PolyCLIP-T constructs a unified latent space in which the displacement between reference and alternate embeddings quantifies the molecular perturbation induced by each variant. Persistent homology then identifies stable topological components - coherent variant groups shared among affected relatives - that transcend single-variant scoring logic. Applied to six families with multi-morbid cancer, autoimmune and cardiovascular disease, PolyCLIP-T recovered non-coding and structural candidates overlooked by conventional pipelines and revealed pleiotropic networks spanning disease categories. This approach provides an interpretable, scalable solution for genome-first investigations of disorders driven by polygenic architectures that evade single-variant analysis. The framework was developed and benchmarked on deeply characterised familial cohorts selected for transgenerational multimorbidity; validation in larger, independent populations will be essential to establish its generalisability. An interactive web tool is freely available at https://www.polyclip-t.uma.es/.
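The idea of scoring a variant by the displacement between its reference and alternate embeddings can be sketched as follows. This is a minimal illustration only: the abstract does not specify the distance measure, so the Euclidean norm used here is an assumption, and the function names and three-dimensional toy vectors are hypothetical stand-ins for real PolyCLIP-T embeddings.

```python
import math
from typing import List, Sequence


def displacement(ref: Sequence[float], alt: Sequence[float]) -> List[float]:
    """Element-wise difference alt - ref between two embeddings of equal length."""
    if len(ref) != len(alt):
        raise ValueError("embeddings must have the same dimension")
    return [a - r for r, a in zip(ref, alt)]


def perturbation_score(ref: Sequence[float], alt: Sequence[float]) -> float:
    """Euclidean length of the displacement vector (assumed metric)."""
    return math.sqrt(sum(d * d for d in displacement(ref, alt)))


# Hypothetical embeddings of a reference allele and its alternate allele.
ref_embedding = [0.0, 1.0, 2.0]
alt_embedding = [3.0, 5.0, 2.0]
toy_score = perturbation_score(ref_embedding, alt_embedding)  # sqrt(3^2 + 4^2 + 0^2) = 5.0
```

A larger score means the variant moves the sequence further in the latent space; a variant whose two embeddings coincide scores zero.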
Nekrutenko, A.
Agentic tools -- software environments where a large language model plans, calls external tools, executes code, and iterates with minimal human intervention -- will run a substantial share of routine biomedical data analysis within the next few years. However, per-call inference cost on frontier models is the bottleneck and can add up quickly. Here, we tested whether a free, locally-runnable open-weight model could take over the repetitive execution steps at frontier accuracy. We used Claude Opus to author plans of increasing detail for per-sample variant calling, and ran six 2026-release open-weight implementer LLMs against those plans on a set of desktop GPUs. qwen3.6:27b reproduced frontier accuracy on every plan and matched Opus cell-for-cell on a 36-cell error-injection matrix. A sub-$2,000 Jetson or Apple Mac Mini sufficed for the implementer side. The open-weight model landscape evolves on the order of months, so the specific implementer recommended here will be superseded; we provide the plans, harness, scoring code, and per-cell artifacts at https://github.com/nekrut/LLM-eval-paper as a framework for re-evaluating future models.
Cao, X.; Shi, D.; Du, Z.; Zhou, J.; Wang, Z.; Liu, Z.; Wang, Q.
Carbapenem-resistant Gram-negative bacteria (CRGNB) infections remain difficult to manage because treatment decisions must balance heterogeneous patient risk, limited antibiotic options, potential toxicity and emerging resistance. Clinical care in this setting requires not only single-endpoint risk prediction, but also decision-support frameworks that can jointly enable prognosis assessment, result interpretation, and individualized treatment comparison. Here we present Dr.BUG, an interactive clinical AI agent for personalized decision support in CRGNB infection. Dr.BUG integrates stable feature-set selection, multi-task prognostic modelling, interpretability analysis and model-based simulation of antibiotic regimen recommendation into a unified workflow. Using a development cohort, a temporally independent validation cohort, and external cohorts from the MIMIC-IV dataset, we developed and validated models for four clinically relevant tasks: clinical efficacy, survival outcome, polymyxin resistance and treatment duration. Model inputs were derived primarily from routinely available and relatively low-cost clinical variables, supporting translational feasibility. Across the major tasks, selected-feature models matched or exceeded the performance of their full-feature counterparts while using fewer variables, as reflected in 82.0% of optimized-metric comparisons in the development cohort, and remained robust in both temporal and external validation. Dr.BUG further provided both population-level and patient-level interpretability and generated individualized rankings of candidate antibiotic regimens. In the retrospective analysis of non-survivors, clinician review suggested that regimens recommended by Dr.BUG might be associated with higher predicted survival probabilities. These findings support a broader role for clinical AI in complex drug-resistant infections, extending its utility from offline risk prediction to interpretable, deployable, and personalized decision support.
Root, B.; Longest, A.; Grace, T.; Tran, M.; Northrop, B.; Donohue, A.; Said, A.; Guertin, S.
Novel infectious diseases, predominantly originating from non-human animals, pose a significant threat to global public health and economic stability. Avian influenza virus presents an especially significant challenge due to its high mortality rates and spillover capability into new host species. Recent H5N1 spillover events into poultry and cattle resulted in a massive economic burden and increased human health risk. Traditional methods of disease surveillance rely on reactive case detection and pathogen characterization, providing insufficient lead time for effective intervention. Computational tools that allow efficient and proactive prediction of zoonotic potential are critical in mitigation of influenza outbreaks and identification of strains with human spillover risk. Models that predict influenza virus subtype or host already exist; however, the complexity of spillover events, including the non-binary nature of zoonotic potential, limits the capabilities of these models. In the approach reported here, rich protein language model embeddings were generated from ESM-2 for each protein in influenza virus strains and used to predict the protein host tropism probabilities across nine animal families. The protein host tropism model achieved weighted precision and recall scores of 0.95 and 0.95, respectively. We then constructed a zoonotic risk prediction model using the outputs from the protein host tropism prediction model to classify the strains into six classifications: avian, mammal, human, avian-to-human zoonotic, avian-to-mammal zoonotic, or mammal-to-human zoonotic. The average weighted precision and recall scores for this model were 0.90 and 0.90, respectively. This framework advances the prediction of influenza zoonotic risk by being agnostic to influenza subtype, incorporating non-human mammals and mammal zoonotic spillover classifications, and using the full influenza proteome to capture the complexity of spillover dynamics.
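For readers unfamiliar with the metric, weighted precision and recall for a multi-class classifier can be computed as sketched below. The abstract does not state the averaging convention, so this sketch assumes the common support-weighted form, in which each class's score is weighted by its share of the true labels. The function name and the toy labels (three of the six classes named above) are illustrative assumptions, not the authors' data or code.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def weighted_precision_recall(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float]:
    """Support-weighted average of per-class precision and recall."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = 0.0
    recall = 0.0
    for label, n_true in support.items():
        weight = n_true / total
        # A class that is never predicted gets precision 0 by convention here.
        class_precision = correct[label] / predicted[label] if predicted[label] else 0.0
        class_recall = correct[label] / n_true
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall


# Hypothetical true and predicted classes for six strains.
toy_true = ["avian", "avian", "avian", "human", "human", "mammal"]
toy_pred = ["avian", "avian", "human", "human", "human", "avian"]
toy_precision, toy_recall = weighted_precision_recall(toy_true, toy_pred)
```

Under this convention the weighted recall equals overall accuracy (4 of 6 correct in the toy data), whereas the weighted precision (5/9 here) is pulled down by the class that is never predicted.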
Garcia-Valiente, R.; Triantafyllou, C.; van Schaik, B.; Jongejan, A.; Pollastro, S.; Anang, D. C.; Guikema, J. E.; de Vries, N.; Hoefsloot, H. C.; van Kampen, A. H. C.
High-throughput sequencing of B-cell and T-cell immune receptor repertoires provides unprecedented insight into adaptive immune responses. The data produced are structured by clonal relationships and somatic mutation signatures, and yield extremely rich information in sequence-derived features, including physicochemical properties and compositional patterns. However, integrated analysis across datasets, conditions, and time points remains challenging. Current analytical tools typically focus only on certain features within individual repertoires, without enabling integrated, multivariable comparisons across datasets, conditions, and time points to address their diversity and variability. Here we present AbSolution, a user-friendly and flexible interactive application for comprehensive exploration of immune repertoires and their sequence-based properties. AbSolution enables multiscale analysis of thousands of sequence-derived features across receptor regions, while accounting for V(D)J usage, clonal composition and experimental groupings. We demonstrate its utility by identifying distinct sequence-based profiles associated with dominant (highly abundant) and non-dominant B-cell clones in peripheral blood BCR repertoires from patients with idiopathic inflammatory myopathies, and with antigen-responsive T-cell populations over time in a longitudinal in vitro antigen-stimulation dataset. Through interactive, interlinked visualizations, statistical feature selection and multi-sample comparisons, AbSolution facilitates integrated feature profiling that supports the interpretation of immune selection processes and enables systematic analysis of complex repertoire datasets.