Patterns
Elsevier BV
Preprints posted in the last 30 days, ranked by how well they match Patterns's content profile, based on 70 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit.
Tan, S.; Tian, Z.
The rapid advancement of AI research automation systems (including AI Scientist, data-to-paper, and Agent Laboratory) has demonstrated the potential for autonomous scientific discovery. However, existing benchmarks for evaluating these systems focus predominantly on fundamental sciences (machine learning, physics, chemistry), overlooking the unique challenges of medical clinical research: complex survey designs, inferential statistics with confounding control, adherence to reporting standards (STROBE, CONSORT), and the requirement for clinically actionable interpretation. We present MedResearchBench, the first benchmark specifically designed to evaluate AI systems on medical clinical research tasks. MedResearchBench comprises 16 tasks spanning 7 clinical domains (cardiovascular, oncology, mental health, metabolic, respiratory, neurology, infectious disease), built on publicly available datasets (the National Health and Nutrition Examination Survey [NHANES] and the Surveillance, Epidemiology, and End Results [SEER] program) with ground truth from 16 high-quality published papers (IF range: 2.3-51.0). Each task is evaluated along 6 medical-specific dimensions: statistical methodology, results accuracy, visualization quality, clinical interpretation, confounding sensitivity, and reporting compliance. We describe the benchmark design rationale, task construction methodology, paper selection criteria with anti-paper-mill filtering, and a detailed analysis of task characteristics including methodological diversity, evaluation dimension coverage, and difficulty stratification. To demonstrate benchmark executability, we evaluate an agentic data2paper pipeline on 3 pilot tasks spanning all three difficulty tiers, achieving scores of 72/100 (Tier 1, Cardio_000), 69/100 (Tier 2, Mental_000), and 75/100 (Tier 3, Metabolic_002), with a mean score of 72/100 (B-level).
Survey-weighted methodology was correctly implemented across all tasks; primary limitations included covariate incompleteness and reference group misspecification. MedResearchBench addresses a critical gap in AI research evaluation and provides a standardized, community-extensible platform for assessing whether AI systems can conduct clinically sound, publication-quality medical research. All task materials are publicly available at https://github.com/TerryFYL/MedResearchBench.
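The survey-weighted methodology the benchmark checks for reduces, at its core, to using NHANES-style sampling weights rather than treating respondents as a simple random sample. A minimal sketch of a weighted prevalence estimate (the toy outcome and weight values below are hypothetical, not from the paper):

```python
import numpy as np

def weighted_prevalence(y, w):
    """Survey-weighted prevalence estimate: sum(w_i * y_i) / sum(w_i).

    y : 0/1 outcome indicators
    w : NHANES-style sampling weights (illustrative values here;
        a full analysis would also account for strata and PSUs)
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * y) / np.sum(w))

# Toy cohort: unweighted prevalence is 0.5, but the weights shift it.
y = [1, 1, 0, 0]
w = [3.0, 1.0, 1.0, 1.0]
print(weighted_prevalence(y, w))  # 4/6 ≈ 0.667
```

Ignoring the weights here would report 0.5 and misstate population prevalence, which is exactly the class of error the statistical-methodology dimension penalizes.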
Lu, H.-E.; Koivisto, D.; Lou, Y.; Zeng, Z.; Yu, T.; Wang, J.; Meng, X.; Nowikow, C.; Wilson, R.; Kumbhare, D.; Pu, J.
Deep learning has transformed medical image and video analysis, but it usually requires large, well-annotated datasets. In many clinical domains, especially when testing novel mechanistic hypotheses, such retrospective datasets are hard to obtain since acquiring adequate cohorts is time-intensive, costly, and operationally difficult. This creates a critical translational gap: scientifically compelling early stage ideas may remain untested due to lack of sufficient sample size to support conventional deep learning pipelines. Developing data-efficient strategies for evaluating new hypotheses within small prospective cohorts is therefore essential to de-risk innovation before large-scale validation. Myofascial Pain Syndrome (MPS) exemplifies this challenge, as quantitative ultrasound imaging biomarkers for MPS remain underexplored. We investigated whether MPS in the upper trapezius can be detected from full B-mode ultrasound videos in a small prospective cohort (11 controls, 13 patients). Videos were automatically preprocessed and resampled using a sliding window strategy to expand training samples (404 clips). We developed a self-supervised Video Diffusion Encoder (VDE) to learn spatiotemporal representations without relying on extensive labeled data and compared it with transfer-learning-based ResNet, VideoMAE, and SimCLR. Using subject-level stratified four-fold cross-validation, the VDE outperformed transfer learning baselines and achieved performance comparable to SimCLR, with subject-level AUC of 0.79 and accuracy of 0.86, and no significant differences between latent-only and combined trigger point analyses. These results demonstrate that self-supervised diffusion learning can support robust, data-efficient deep learning in small prospective studies, enabling early feasibility testing of innovative ultrasound biomarkers before large-scale clinical trials.
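The sliding-window resampling used to expand 24 videos into 404 clips can be sketched as overlapping fixed-length windows over each video's frames (the clip length and stride below are illustrative; the paper's exact parameters are not stated in the abstract):

```python
def sliding_window_clips(n_frames, clip_len, stride):
    """Return (start, end) frame index pairs for overlapping clips.

    A minimal sketch of the clip-expansion idea: each window becomes
    one training sample, multiplying the effective dataset size.
    """
    starts = range(0, n_frames - clip_len + 1, stride)
    return [(s, s + clip_len) for s in starts]

# e.g. a 100-frame video, 32-frame clips, stride 16 -> 5 clips
clips = sliding_window_clips(100, 32, 16)
print(len(clips), clips[0], clips[-1])  # 5 (0, 32) (64, 96)
```

Because clips from the same subject are highly correlated, the subject-level (not clip-level) cross-validation described above is what keeps the evaluation honest.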
Ramirez, A.; Thomas, N.; Calabrese, D. R.; Greenland, J. R.; Meyer, A. S.
Cell-cell communication (CCC) mediates coordinated cellular activities that vary dynamically across time, location, and biological context. While various tools exist to infer CCC, they typically aggregate data according to pre-defined cell types, obscuring critical single-cell heterogeneity. Furthermore, because signaling pathways and cell populations operate in a coordinated manner, an integrative analytical approach is essential. To address these challenges, we developed CCC-RISE, an extension of the tensor-based method Reduction and Insight in Single-cell Exploration (RISE). CCC-RISE identifies integrative patterns of single-cell variation by deconvolving communication into interpretable modules defined by unique sender cells, receiver cells, ligands, and condition associations. We applied this framework to a COVID-19 cohort with varying disease severity and a lung transplant cohort with acute allograft dysfunction. In both contexts, CCC-RISE successfully identified disease-relevant communication programs and traced them to specific cellular subpopulations, often crossing conventional cell-type boundaries. This approach offers a robust pipeline enabling the identification of disease-relevant signaling subpopulations that are invisible to aggregate methods.
Highlights
- CCC-RISE enables integrative analysis of cell-cell communication across multiple conditions at single-cell resolution
- CCC-RISE deconvolves signaling patterns into modules defined by their sender cells, receiver cells, LR pairs, and experimental conditions/samples
- Analysis at single-cell resolution uncovers signaling activity within and across conventional cell types
Jackson, N. J.; Yan, C.; Caro-Vega, Y.; Paredes, F.; Ismerio Moreira, R.; Cadet, S.; Varela, D.; Cesar, C.; Duda, S. N.; Shepherd, B. E.; Malin, B. A.
Digital health technologies, including machine learning (ML), are transforming infectious disease management; however, ML models for HIV care have been limited by data sharing restrictions that prevent multi-site collaboration. Federated Learning (FL) offers a privacy-preserving solution, enabling cross-site model training without sharing patient-level data. We evaluated FL for developing clinical prediction models using data from 22,234 people living with HIV (PLWH) across six sites in five countries within the Caribbean, Central, and South America network for HIV epidemiology (CCASAnet). Across four prediction tasks (1-year mortality, 3-year mortality, tuberculosis incidence, and AIDS-defining cancer incidence), FL algorithms achieved near-centralized performance while substantially outperforming site-specific models. Performance gains varied across sites, driven by both site size and between-site heterogeneity. Local fine-tuning often improved FL performance, though benefits were task dependent. These findings support FL as a scalable, privacy-preserving infrastructure for multi-site ML in international HIV research.
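The core aggregation step behind cross-site training of this kind is FedAvg: each site trains locally, then the server averages parameters weighted by local sample counts. A minimal sketch (the paper may use additional FL variants; the two-site numbers below are hypothetical):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: average client parameter vectors weighted
    by local sample counts, so no patient-level data leaves a site."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    return stacked.T @ (sizes / sizes.sum())

# Two sites: the larger site (300 patients) dominates the global model.
global_w = fedavg([[1.0, 0.0], [0.0, 1.0]], [300, 100])
print(global_w)  # [0.75 0.25]
```

The size-weighted average is also why, as the abstract notes, gains vary with site size: small sites pull the global model less but benefit more from it.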
Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.
Interpreting machine learning models typically relies on feature attribution methods that quantify the contribution of individual variables to model predictions. However, it remains unclear whether attribution magnitude reflects the true functional importance of features for model performance. Here, we present a unified interpretability framework integrating permutation-based attribution, feature ablation, and stability under perturbation across multiple feature spaces. Using nested cross-validation and permutation-based null diagnostics, we systematically evaluate the relationship between attribution magnitude and functional dependence in clinical and biomarker-based prediction models. Attribution magnitude is frequently misaligned with functional importance, with weak to strong negative correlations observed across feature spaces (Spearman ρ ranging from -0.374 to -0.917). Features with high attribution often have limited impact on model performance when removed, whereas features with low attribution can be essential for maintaining predictive accuracy. These discrepancies define distinct classes of interpretability failure, including attribution excess and latent dependence. Interpretability further depends on feature space composition, and stable, functionally relevant features are not necessarily those with the highest attribution scores. By integrating attribution, functional impact, and stability into a composite Feature Reliability Score, we identify features that remain informative across perturbations and analytical contexts. These findings indicate that interpretability does not arise from attribution magnitude alone but is better characterized by stability under perturbation. This framework provides a basis for more robust model interpretation and highlights limitations of attribution-centric approaches in high-dimensional and correlated data settings.
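The attribution-versus-ablation mismatch described above is easy to reproduce with correlated features. In this toy sketch (a deliberately extreme case I constructed, not the paper's data), a duplicated feature receives substantial permutation attribution, yet ablating it costs nothing because the model can refit on its copy:

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def fit_predict(X, y, Xeval):
    # Minimum-norm least squares handles the rank-deficient design.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return Xeval @ coef

# Feature 1 is an exact copy of feature 0, so credit is split between them.
x = rng.normal(size=200)
X = np.column_stack([x, x])
y = x.copy()
base = r2(y, fit_predict(X, y, X))

# Permutation "attribution" for feature 1: fit once, shuffle it, measure drop.
Xp = X.copy()
Xp[:, 1] = rng.permutation(Xp[:, 1])
perm_imp = base - r2(y, fit_predict(X, y, Xp))

# Ablation importance for feature 1: refit without it.
X0 = X[:, [0]]
abl_imp = base - r2(y, fit_predict(X0, y, X0))

print(f"permutation: {perm_imp:.3f}, ablation: {abl_imp:.6f}")
```

Here permutation importance is large (roughly 0.5) while ablation importance is essentially zero, an instance of the "attribution excess" failure class the abstract names.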
van Geest, G.; Thomas-Lopez, D.; Feitzinger, A. A.; Weissgold, L. A.; Halabi, S.; Cuesta, I.; Hjerde, E.; Gurwitz, K. T.; Arora, N.; Neves, A.; Palagi, P. M.; Williams, J. J.
Background: Datasets related to infectious diseases are essential for public health decision-making, yet their reuse remains limited by persistent barriers to data sharing and integration. Achieving data that are Findable, Accessible, Interoperable, and Reusable (FAIR) is widely recognized as essential for accelerating scientific discovery and enabling coordinated responses to emerging threats, but the needs of the global pathogen data community have not been systematically characterized. Aim: This study, conducted by the Pathogen Data Network (PDN), aims to identify infrastructural and educational priorities among stakeholders working with infectious disease-related data in order to guide community-responsive support for data sharing and interoperability. Methods: A cross-sectional stakeholder survey was disseminated to a well-defined expert population within PDN networks and via open professional channels. A total of 136 responses from researchers, healthcare professionals, bioinformaticians, and educators were analyzed descriptively to identify prioritized barriers, training needs, and preferred support mechanisms. Results: Respondents consistently identified structural constraints as the primary impediments to effective data use, including limited funding (74%), data-aggregation challenges (68%), and a shortage of skilled personnel (52%). Respondents identified bioinformatics for infectious disease research (68%) as the highest priority for training, followed by guidance on using the integrated pathogen data and tools portal provided by the PDN, the Pathogens Portal (51%). The Pathogens Portal was also ranked as the most essential PDN resource (72%). Preferred training formats included virtual short courses (68%) and webinars (66%). Notably, while researchers emphasized technical subjects like machine learning, educators prioritized foundational case studies.
Conclusion: These findings provide an evidence-based diagnostic of community needs and suggest that barriers to FAIR pathogen data are predominantly systemic rather than purely technological. The survey framework and openly available dataset offer a reusable template for assessing needs in other communities and regions. By aligning training, infrastructure development, and outreach with empirically identified priorities, organizations supporting infectious disease research can strengthen the interoperability and reuse of data and establish a benchmark for future community-driven improvements.
Pavlovic, M.; Wurtzen, C.; Kanduri, C.; Mamica, M.; Scheffer, L.; Lund-Andersen, C.; Gubatan, J. M.; Ullmann, T.; Greiff, V.; Sandve, G. K.
Machine learning (ML) enables adaptive immune receptor repertoire (AIRR) analyses for biomarker identification and therapeutic development. With the majority of AIRR data partially or imperfectly labeled, unsupervised ML is essential for motif discovery, biologically meaningful clustering, and generation of novel receptor sequences. However, no unified framework for unsupervised ML exists in the AIRR field, hindering the assessment of model robustness and generalizability. Here, we present an immuneML release advancing unsupervised ML in the AIRR field through unified clustering workflows, interpretable generative modeling, integration with protein language model embeddings, dimensionality reduction, and visualization. We demonstrate immuneML's utility in three use cases: (i) benchmarking generative models for epitope-specific sequence generation, assessing specificity and novelty, (ii) systematic evaluation of clustering approaches on experimental receptor sequences against biological properties, such as epitope specificity and MHC, and (iii) unsupervised analysis of an experimental AIRR dataset to examine potential confounding, a practice widespread in related fields but unexplored in AIRR analyses.
Pinero, S. L.; Li, X.; Lee, S. H.; Liu, L.; Li, J.; Le, T. D.
Long COVID affects millions of people worldwide, yet no disease-modifying treatment has been approved, and existing interventions have shown only modest and inconsistent benefits. A key reason for this limited progress is that current computational drug repurposing pipelines do not match well with the clinical reality of Long COVID. These patients often have persistent, multisystemic symptoms and may already be taking multiple medications, making treatment safety a primary concern. However, most repurposing workflows still treat safety as a downstream filter and rely on disease-associated targets rather than causal drivers. They also assume that the findings of one analysis would generalize across the diverse presentations of Long COVID. We introduce SPLIT, a safety-first repurposing framework that addresses these limitations. SPLIT prioritizes safety at the start of the candidate evaluation, integrates complementary causal inference strategies to identify likely driver genes, and uses a counterfactual substitution design to compare drugs within specific cohort contexts. When applied to cognitive and respiratory Long COVID cohorts, SPLIT revealed three main findings. First, drugs with similar predicted efficacy could have very different predicted safety profiles. Second, the drugs flagged as unfavorable were often different between the two cohorts, showing that drug prioritization is phenotype-specific. Third, SPLIT flagged 18 drugs currently under active investigation in Long COVID trials as having unfavorable predicted profiles. SPLIT provides a practical framework to identify safer, more context-appropriate candidates earlier in the process, supporting more targeted and better-tolerated treatment strategies for Long COVID.
Van De Vijver, E.; Dewitte, K.; Van Alboom, A.; Christophe, A.; Van Vlierberghe, H.; Van Troys, M.
Three-dimensional microtumour models such as spheroids are increasingly used in cancer research as they better capture tumour architecture, growth and invasion than conventional two-dimensional cultures. However, robust and accessible tools for quantitative analysis remain limited. Here we present SImBA-SiQuAl, an integrated open-source workflow for high-throughput quantitative phenotyping of 3D spheroids and organoids. The pipeline combines SImBA, an automated image-analysis framework for performant quality-controlled image segmentation and multi-feature extraction from spheroid assays, with SiQuAl, a downstream analysis platform that automatically performs comprehensive statistical and multivariate analyses to reveal phenotypic differences between experimental conditions. In a first case study, SImBA-SiQuAl resolves intrinsic invasion phenotypes between cancer cell lines. In a second case study, the workflow quantifies both uniform and heterogeneous responses in a spheroid drug screening assay. Together, SImBA-SiQuAl provides a new, timely tool for high-throughput, high-content microtumour phenomics in cancer research.
Motivation: 3D microtumour assays such as spheroids and organoids are increasingly used in preclinical research. These assays generate rich phenotypic imaging data, but quantitative automated analysis remains a major bottleneck. This limits reproducibility, scalability, and broad adoption for large-scale, high-content phenomics studies, and also means that biologically relevant heterogeneous phenotypic responses in, e.g., perturbation studies may not be comprehensively addressed. SImBA-SiQuAl was developed to address this gap by providing an open-source, integrated workflow offering solutions for both image processing and downstream analysis. Together, this enables in-depth quantitative analysis of 3D microtumour phenotypes across experimental settings.
Highlights
- SImBA-SiQuAl provides a complete end-to-end workflow for high-throughput, high-content, quantitative 3D microtumour analysis, from quality-controlled image segmentation to statistical, multivariate and cluster-based biological interpretation.
- SImBA-SiQuAl is broadly applicable across multiple 3D systems and assay types.
- We demonstrate the workflow can capture biologically meaningful heterogeneity and treatment response at scale, supporting robust and unbiased analysis.
- By combining accessibility, flexibility and analytical depth, SImBA-SiQuAl addresses a key unmet need for accessible advanced open-source tools in 3D preclinical research.
Hornak, G.; Heinolainen, A.; Solyomvari, K.; Silen, S.; Renkonen, R.; Koskinen, M.
Selecting an effective treatment relies on accurately anticipating a patient's response to alternative interventions. However, forecasting longitudinal clinical trajectories remains difficult because electronic health records contain heterogeneous, irregularly sampled data over extended time periods. These issues are especially relevant for laboratory measurements, which are central for diagnostics, assessment of therapeutic responses, and tracking disease progression in routine clinical practice. However, existing deep learning methods for counterfactual prediction usually assume regularly sampled data, an assumption incompatible with the irregular, heterogeneous data-generation processes of real-world clinical practice. Here we present the Time-Aware G-Transformer, which integrates causal G-computation with time-aware attention to predict counterfactual outcomes on irregular data. By explicitly conditioning on the timing of future observations and encoding measurement patterns, the model captures temporal dynamics that previous methods overlook. Evaluated on synthetic tumor growth data and on 90,753 cancer patient trajectories from an academic medical center, our approach demonstrates superior long-horizon (> 1 day) prediction accuracy and uncertainty calibration compared to state-of-the-art baselines. These results demonstrate that embedding temporal relations directly into the attention mechanism enables robust integration of patient history data for evaluating potential treatment strategies in personalized medicine.
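One simple way to embed temporal relations into attention, in the spirit of the time-aware mechanism described above, is to penalize attention logits by the gap between observation timestamps. This is a minimal sketch of the general idea, not the paper's actual parameterization (which is more elaborate); the decay rate and inputs are illustrative:

```python
import numpy as np

def time_aware_attention(q, k, v, times, decay=0.1):
    """Scaled dot-product attention with an additive time-gap penalty.

    Events far apart in time attend to each other less, so irregular
    sampling is handled explicitly rather than by position index.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    gaps = np.abs(times[:, None] - times[None, :])  # pairwise gaps in days
    logits = logits - decay * gaps
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))                      # 5 observations, dim 8
times = np.array([0.0, 1.0, 2.0, 10.0, 30.0])    # irregular sampling
out, w = time_aware_attention(x, x, x, times)
print(out.shape, np.allclose(w.sum(axis=1), 1.0))
```

With position-index attention, the gap between days 2 and 10 would look identical to the gap between days 0 and 1; the timestamp penalty is what restores that distinction.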
Xu, M.; Yan, J.; Feng, R.; Cai, Q.; Zhang, P.; Zhao, C.; He, C.; Wei, Z.; Li, J.; Lin, S.; Dong, H.; Jin, R.; Hou, T.; Liu, Q.; Zhang, Z.
Day-to-day research discussions in group chats often generate hypotheses, analysis requests, and interpretation decisions, yet executing those analyses still requires researchers to leave the conversation and rely on fragmented local tools, databases, visualization software, and literature search engines. In this work, we present BioClaw, a human-bot research collaboration ecosystem that converts natural-language requests in group conversations into tool-grounded analyses executed within isolated Docker containers. Deployed across 8 messaging platforms, BioClaw turns each group chat into a persistent execution workspace. Its design combines multi-channel orchestration, per-group state and workspace management, and isolated containerized execution for reliable shared use over long-lived conversations. To support practical research workflows, BioClaw combines containerized execution with 31 preinstalled biomedical tools and 95+ skills. The application of BioClaw spans various biomedical domains (e.g., genomics, clinics, structural biology) and data modalities (e.g., sequencing data, EHR data, protein structure data). These results establish the viability of embedding executable, tool-rich agent workflows within shared digital workspaces, positioning group chats as a transformative paradigm for collaborative scientific discovery and innovation.
Curtin, A.; Merriman, E.; Curtin, P.
Recurrence Quantification Analysis (RQA) is a powerful phenomenological method for characterizing dynamical systems from sequential empirical data, but it is fundamentally limited to continuous signals. Symbolic RQA (sRQA) extends this framework to discrete state sequences, enabling the analysis of both inherently discrete systems and continuous systems where state-based dynamics and motifs are of interest. Despite its promise, accessible and unified software support for sRQA has remained limited. Here we introduce the sRQA package, an open-source R library that consolidates discretization and symbolization, data visualization, and computation of recurrence and cross-recurrence metrics into a single accessible toolset. We validated the method using simulated data with known dynamical properties, confirming that sRQA metrics behaved as theoretically expected. We then demonstrated the utility of sRQA across three real-world applications. First, we applied sRQA to ECG recordings, showing that symbolic recurrence metrics reliably distinguished atrial fibrillation from normal sinus rhythm, with an XGBoost classifier achieving 92% accuracy and an AUC of 0.97. Second, we applied sRQA to fMRI BOLD time series from the dorsal attention network, finding that symbolic and cross-recurrence metrics differentiated movie-viewing from resting-state conditions, revealing greater regularity and inter-subnetwork coordination during task engagement. Third, we applied sRQA to intrinsically symbolized sequences of pauses in speech, identifying valence-specific differences in pause dynamics between truthful and deceptive statements, as well as sex differences in pause structure during negatively-valenced speech. Together, these results demonstrate that sRQA provides a flexible and sensitive framework for characterizing discrete and discretized dynamical systems across biological and behavioral domains. 
Author Summary: Many biological and behavioral systems are best understood as sequences of discrete states rather than smooth, continuous processes. Consider a heartbeat that shifts between rhythms, a brain that transitions between activity patterns, or a speaker who pauses and resumes in ways that carry meaning. Standard methods for analyzing the dynamics of such systems were not designed with this kind of data in mind. Here, we introduce the sRQA package, an open-source software library that makes it straightforward to apply symbolic recurrence analysis to both discrete and continuous data. We demonstrate the library across four examples: simulated data with known properties, cardiac recordings distinguishing atrial fibrillation from normal heart rhythm, brain imaging data capturing differences between rest and task engagement, and speech recordings where pause patterns differ between truthful and deceptive statements. In each case, sRQA revealed meaningful structure in the data that would be difficult to detect with conventional tools. We hope this library will make symbolic recurrence analysis more accessible to researchers across the biological and behavioral sciences.
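Two of the core recurrence metrics mentioned above have compact definitions for a symbolic sequence: recurrence rate (fraction of state pairs that match) and determinism (fraction of recurrent points lying on diagonal lines of a minimum length). A minimal sketch in Python rather than R (the sRQA package itself is an R library and exposes many more metrics; this convention includes the main diagonal, which some RQA variants exclude):

```python
import numpy as np

def srqa_metrics(seq, lmin=2):
    """Recurrence rate and determinism for a symbolic sequence.

    R[i, j] = 1 iff seq[i] == seq[j]; determinism counts recurrent
    points that sit on diagonal runs of length >= lmin.
    """
    s = np.asarray(seq)
    n = len(s)
    R = (s[:, None] == s[None, :]).astype(int)
    rec_points = R.sum()
    rr = rec_points / (n * n)
    diag_points = 0
    for k in range(-(n - 1), n):
        run = 0
        for v in list(np.diagonal(R, offset=k)) + [0]:  # sentinel flushes run
            if v:
                run += 1
            else:
                if run >= lmin:
                    diag_points += run
                run = 0
    return rr, diag_points / rec_points

# A perfectly periodic symbol sequence: half the pairs recur, and every
# recurrent point lies on a long diagonal, so determinism is 1.
rr, det = srqa_metrics(list("ABABABAB"))
print(round(rr, 3), round(det, 3))  # 0.5 1.0
```

A noisy sequence with the same symbol frequencies would keep a similar recurrence rate but a much lower determinism, which is why the pair of metrics discriminates rhythm classes like atrial fibrillation versus sinus rhythm.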
Jurgens, J. A.; Bueckle, A.; Vora, J.; Maurya, M. R.; Mohseni Ahooyi, T.; Zheng, E.; Stear, B.; Wang, D.; Ree, C.; Ramachandran, S.; Nekrutenko, A.; Brandes, M.; Thaker, S.; Katz, D. H.; Munoz-Torres, M. C.; Diamant, I.; Chun, H.-J. E.; Simmons, J. A.; Tasian, S. K.; Jenkins, S. L.; Evangelista, J. E.; Dodia, H.; Saha, S.; Lindquist, M. A.; Gajjala, V.; Nemarich, C.; Zhen, J.; Ross, K. E.; Byrd, A. I.; Shilin, A.; Metzger, V. T.; Bologa, C. G.; Srinivasan, S.; Jang, D.; Kumar, P.; Taub, L. D.; Levanto, M. P.; Petrosyan, V.; Anandakrishnan, M.; Kim, M.; Clarke, D. J. B.; Ivich, A.; Crichton, D.
The NIH Common Fund Data Ecosystem (CFDE) integrates data resources from 18 NIH Common Fund programs for discovery and integrative analysis. These programs generate valuable but heterogeneous datasets that can be difficult to discover, access, and reuse. CFDE aims to provide a collaborative, community-built infrastructure that links and enriches Common Fund programs. We describe the evolution, structure, and core technologies of CFDE, including practical approaches that support submission, integration, visualization, and public release of multimodal data. Training programs and workforce initiatives lower barriers to adoption. CFDE has devised solutions to critical issues facing cross-program initiatives, including data scale and heterogeneity, dataset integration, and long-term sustainability. We demonstrate the utility of linking Common Fund resources through integrative tools and cross-dataset queries to yield insights that would otherwise be infeasible. Collectively, CFDE shows that a standards-driven, federated approach enhances and unifies cross-disciplinary resources, fostering collaboration and data-driven discovery.
Sanjaya, P.; Pitkänen, E.
Motivation: Deep neural networks have proven effective in classifying tumour types using next-generation sequencing data. However, developing transferable models that work across heterogeneous operating environments remains challenging due to differences in cohort compositions and data generation protocols, privacy concerns, and limited computational capabilities. Results: We introduce muat, a transformer-based software for tumour classification using somatic variant data from whole-genome (WGS) and whole-exome sequencing (WES). Building on previously developed MuAt and MuAt2 models, we distribute the software via Docker containers and Bioconda for deployment in high-performance computing (HPC) systems and Secure Processing Environments (SPEs). Using a downloadable MuAt checkpoint, we reproduce the performance reported in the original study on whole genome (PCAWG; 89% accuracy in histological tumour typing) and exome sequencing data (TCGA; 64% accuracy). Cross-cohort evaluation in the Genomics England SPE achieved 81% accuracy without retraining and 89% following fine-tuning. As a demonstration of the software's adaptability, we also deployed muat within the iCAN Digital Precision Cancer Medicine Flagship's SPE and integrated it into a Nextflow-managed workflow. Availability and implementation: muat is available through conda (www.anaconda.org/bioconda/muat) and GitHub (https://github.com/primasanjaya/muat), under the Apache 2.0 License. Contact: prima.sanjaya@helsinki.fi, esa.pitkanen@helsinki.fi; website: mlbiomed.net
Ngo, A.; Guindon, S.; Pedergnana, V.
Understanding how genetic variation in pathogens influences clinical phenotypes observed in infected hosts is a fundamental challenge in evolutionary genomics and public health. Phenotypic traits such as infection severity are often non-randomly distributed within the pathogen's phylogeny, suggesting the existence of evolutionary determinants but also violating the independence assumption underlying classical genome-wide association studies and potentially leading to inflated false positive rates. We present MutaPhy, a phylogeny-based method aimed at detecting correlations between a binary host phenotype and the corresponding pathogen genome by directly utilizing the hierarchical structure of phylogenetic trees. MutaPhy encompasses three different scales: (i) a subtree scale, on which relevant clades over-representing the phenotype of interest are detected using permutation-based tests; (ii) a tree scale, which agglomerates local signals into a global association statistic; and (iii) a site scale, whereby candidate mutational events on branches leading to significant clades are examined using ancestral sequence reconstruction. We evaluate the statistical behavior and detection performance of MutaPhy using simulations under diverse evolutionary scenarios. We also compare this tool to several existing phylogenetic association methods. As illustrative applications, we apply MutaPhy to dengue virus and hepatitis C virus datasets associated with clinical phenotypes in human hosts. Our results highlight the ability of the proposed approach to detect viral lineages associated with over-represented phenotypes while revealing limited evidence for robust mutation-level associations in these particular datasets. Altogether, MutaPhy provides a framework for guiding genotype-phenotype association analyses by leveraging phylogenetic structure, thereby reducing false positive findings and improving the interpretability of association signals.
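The subtree-scale permutation test described in (i) can be sketched without any tree machinery: shuffle phenotype labels across tips and ask how often a clade's phenotype fraction is at least as extreme as observed. The clade membership and counts below are made up for illustration; MutaPhy's actual test statistic and multiple-testing handling may differ:

```python
import numpy as np

def clade_enrichment_pvalue(phenotype, clade_mask, n_perm=10000, seed=0):
    """Permutation p-value for over-representation of a binary phenotype
    inside one clade, using label shuffling across all tips."""
    rng = np.random.default_rng(seed)
    phenotype = np.asarray(phenotype)
    observed = phenotype[clade_mask].mean()
    perm_stats = np.array([
        rng.permutation(phenotype)[clade_mask].mean() for _ in range(n_perm)
    ])
    # +1 correction keeps the p-value strictly positive.
    return (1 + np.sum(perm_stats >= observed)) / (1 + n_perm)

# 40 tips; a 10-tip clade carries 9/10 severe cases vs 5/30 outside it.
pheno = np.array([1] * 9 + [0] * 1 + [1] * 5 + [0] * 25)
clade = np.zeros(40, dtype=bool)
clade[:10] = True
p = clade_enrichment_pvalue(pheno, clade)
print(p < 0.01)  # True: strong clade-level enrichment
```

Shuffling labels over tips preserves the overall phenotype frequency while breaking the tree structure, which is exactly the null the independence-violation argument above calls for.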
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Background. Federated learning (FL) enables collaborative model training across institutions without sharing patient-level data. However, standard FL algorithms such as FedAvg degrade under non-independently and non-identically distributed (non-IID) data, a prevalent condition when patient demographics, scanner hardware, and disease prevalence differ across hospital sites. Objective. We propose iPS-MFFL (Individualized Per-Site Meta-Federated Feature Learning), a federated framework with a hierarchical local-model architecture that addresses non-IID heterogeneity through (1) a shared feature extractor, (2) multiple weak-learner classification heads that can be trained with heterogeneous training objectives to promote complementary decision boundaries, (3) independent per-learner server aggregation so that each weak learner's parameters are averaged only with its counterparts at other clients, and (4) a lightweight meta-model, itself federated, that adaptively stacks the weak-learner outputs. Methods. We evaluate on the Brain Tumor MRI Classification dataset (7,200 images; 4 classes: glioma, meningioma, pituitary tumor, no tumor) partitioned across K = 5 simulated hospital sites using Dirichlet non-IID sampling (alpha = 0.3). Four baselines are compared: Local-only training, FedAvg, FedProx, and Freeze-FT. All experiments are repeated over three random seeds (13, 42, 2025) and evaluated using paired t-tests, Cohen's d effect sizes, and post-hoc power analysis.
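The Dirichlet non-IID partitioning used to simulate the K = 5 hospital sites can be sketched by drawing per-class client proportions from a Dirichlet(alpha) distribution; small alpha (0.3, as above) makes class mixes highly skewed across sites. The label array below is synthetic; the seed and helper are illustrative, not the paper's code:

```python
import numpy as np

def dirichlet_partition(labels, n_clients=5, alpha=0.3, seed=42):
    """Split sample indices across clients with per-class Dirichlet
    proportions -- the standard simulation of non-IID sites."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

labels = np.repeat([0, 1, 2, 3], 50)   # 4 balanced classes, 200 samples
parts = dirichlet_partition(labels)
sizes = [len(pt) for pt in parts]
print(sizes, sum(sizes))               # skewed sizes; total is 200
```

As alpha grows, the draws approach uniform proportions and the partition becomes IID, which is why alpha is the single knob papers report for heterogeneity severity.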
Xu, Y.; Li, Y.; Yuan, Y.; Yu, C.; Zang, Z.
While single-cell foundation models (SCFMs) have shown promise across various downstream tasks, their generalization performance in label-scarce settings remains a critical bottleneck. The absence of systematic benchmarks for these low-resource scenarios hinders their translation to real-world biomedical research. To bridge this gap, we present CellBench-LS, a comprehensive framework designed to rigorously evaluate SCFMs' generalization under low-supervision conditions. This framework employs a stratified evaluation protocol to systematically compare traditional methods and foundation models. We evaluate their zero-shot representational abilities on cell clustering and batch correction tasks, and apply lightweight fine-tuning of task-specific heads for predictive tasks, such as cell-type annotation, expression reconstruction, and perturbation prediction. Experimental results demonstrate a biologically stratified landscape, with foundation models showing distinct advantages in tasks critically reliant on cell-type recognition, while traditional methods remain competitive in those requiring precise quantification of gene expression patterns. CellBench-LS provides critical guidance for developing more biologically grounded and generalizable computational approaches in single-cell analysis.
Sen, B.; Ulusoy, E.; Darcan, M.; Ergun, M.; Lobentanzer, S.; Rifaioglu, A. S.; Turei, D.; Saez-Rodriguez, J.; Dogan, T.
Biomedical discovery is hindered by fragmented, modality-specific repositories and uneven metadata, limiting integrative analysis, accessibility, and reproducibility. To address these challenges, we present CROssBARv2, a provenance-rich biomedical data-and-knowledge integration platform that unifies heterogeneous sources into a maintainable, scalable system. By consolidating diverse data types into an extensive knowledge graph enriched with standardised ontologies, rich metadata, and deep learning-based vector embeddings, CROssBARv2 alleviates the need for researchers to navigate multiple siloed databases and can facilitate downstream tasks, including predictive modelling and mechanistic reasoning, enabling applications such as drug repurposing and protein function prediction. The platform offers interactive graph exploration and embedding-based semantic search with CROssBAR-LLM, an intuitive natural language question-answering system that grounds large language model (LLM) outputs in the underlying knowledge graph to mitigate hallucinations. We assess CROssBARv2 through (i) multiple use-case analyses to test biological coherence and relational validity; (ii) knowledge-augmented biomedical question-answering benchmarks comparing CROssBAR-LLM against generalist LLMs; and (iii) a deep learning-based predictive modelling experiment for protein function prediction leveraging the heterogeneous structure of CROssBARv2. Collectively, CROssBARv2 provides a scalable, AI-ready, and user-friendly foundation that facilitates hypothesis generation, knowledge discovery, and translational research.
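The embedding-based semantic search described for CROssBARv2 typically ranks knowledge-graph nodes by cosine similarity between a query embedding and precomputed node embeddings. The snippet below is a generic sketch of that retrieval step, assuming a simple in-memory dict of node vectors; it is not the platform's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def semantic_search(query_emb, node_embs, top_k=3):
    """Rank knowledge-graph nodes by cosine similarity to the query."""
    ranked = sorted(node_embs.items(),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

In a grounded question-answering setup, the top-ranked nodes (and their edges) would be passed to the LLM as context, which is one common way to keep generated answers anchored to the graph.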
Sabharwal, A.; Patel, M. S.; Carrano, A.; Rotman, M.; Wierson, W.; Ekker, S. C.
The deployment of large language models (LLMs) for science carries an intrinsic risk: hallucinated citations, fabricated drug approvals or clinical trials, and unsupported experimental outcomes. Here we describe the testing and deployment of a novel systematic, multi-layer approach called the Validation as a System (VaaS) pipeline, iteratively developed during the construction of an open-source, living Rare Disease Database (RDD). We report lessons learned and production results from 225 carefully annotated rare disease gene curations and a prospective 100-gene collection (99 net new), together representing over 3,000 verified citations. After three iterations of directed refinement, the net functional hallucination rate approached zero. We validated the pipeline using three complementary benchmarks: (1) VaaS-RIKER2, a 640-run prospective ablation study (4 conditions x 4 temperatures x 40 genes) plus 117 open-weight model runs on dedicated GPU hardware: unguided LLM output produced 95.9% Type II hallucination (wrong-topic citations of real, existing papers that nonetheless do not support the claim attached to them), while the full VaaS protocol achieved 0.0% Type I and 6.5% Type II, a >14-fold reduction; live PMID verification alone (C3) eliminated both error types entirely (0.0%/0.0%); (2) an independent L3 citation audit of Wave 3 (179 PMIDs, 99.4% valid, 0 Type I errors); and (3) the MedHallu clinical hallucination benchmark, on which the VaaS protocol achieved F1 = 0.9853 on the hard tier (cases where all benchmark ensemble models were fooled), compared to the published GPT-4o baseline of F1 = 0.811 (Pandit et al., 2025). Three independent open-weight models (llama3.2, qwen2.5:14b, mistral:7b) showed 81-87% Type II rates under unguided conditions, confirming that wrong-topic citation hallucination is structural and model-agnostic.
In contrast, the corresponding VaaS rate was zero (n = 508 verified citations; 160 runs, C4 full protocol) under the same conditions. Human validation of ≥50 entries confirmed zero Type I errors and less than 0.5% Type II errors in the manual curation test. The VaaS pipeline operated at less than ~$1 per comprehensive gene review, demonstrating that citation-integrity standards in AI-assisted biomedical synthesis are achievable at production scale. To the authors' knowledge, the VaaS approach represents the lowest measured hallucination rate of any system for science to date and is set to further accelerate the use of AI and AI agents for advancing research.
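The Type I / Type II distinction above can be made concrete with a toy audit function: Type I means the cited PMID does not exist at all, Type II means the paper exists but does not support the attached claim. This is our own illustrative sketch against a local lookup table; the actual VaaS pipeline performs live PMID verification against PubMed, which is not reproduced here, and all names and data are hypothetical.

```python
def audit_citations(claims, pmid_index):
    """Classify each (pmid, claimed_topic) pair as 'ok', 'type_i'
    (PMID not found), or 'type_ii' (real paper, wrong topic)."""
    report = {"ok": 0, "type_i": 0, "type_ii": 0}
    for pmid, claimed_topic in claims:
        record = pmid_index.get(pmid)
        if record is None:
            report["type_i"] += 1          # fabricated citation
        elif claimed_topic not in record["topics"]:
            report["type_ii"] += 1         # real paper, unsupported claim
        else:
            report["ok"] += 1
    return report
```

The abstract's finding that live PMID verification alone eliminated both error types corresponds to the first branch being checked against the real literature rather than a static index.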
Xu, M.; Chen, J.; Zhang, Z.
Large language models have enabled a new class of scientific software in the form of AI agents that can execute research workflows across bioinformatics, drug discovery, and related domains. Among these systems, OpenClaw introduced a skill-based design that allows workflows to be expressed as structured Markdown files, lowering the barrier to contribution and enabling rapid ecosystem growth. However, this growth has led to fragmentation. Projects are distributed across independent repositories, skills vary widely in quality, naming is inconsistent, and there is no unified way to discover or compare available tools. In this work, we construct the first curated dataset of the OpenClaw scientific ecosystem. The dataset includes 91 projects organized by functional role and 2,230 skills spanning 34 scientific categories. Based on this dataset, we perform a systematic analysis of the structure, distribution, and emerging patterns of scientific agent development. To make this ecosystem accessible in practice, we further build Claw4Science, a public platform (https://claw4science.org) on top of our dataset. The platform organizes projects and aggregates distributed skill repositories into a unified interface, with a focus on bioinformatics and scientific workflows, providing a practical entry point for navigating the ecosystem. Our results show that the OpenClaw ecosystem reflects a shift from isolated systems to a more modular and shareable model of scientific computation. At the same time, challenges in evaluation, reproducibility, and governance remain open. We argue that our dataset provides a foundation for future benchmark development and standardized infrastructure for scientific AI agents.
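Aggregating distributed skill repositories into a unified interface, as Claw4Science does, runs directly into the naming inconsistency the abstract describes. A minimal sketch of that merge step, with hypothetical repository and skill names of our own, shows why collision detection matters.

```python
def aggregate_skills(repos):
    """Merge skill names from independent repositories into one index,
    recording cross-repo name collisions (a symptom of fragmentation)."""
    index, collisions = {}, []
    for repo, skills in repos.items():
        for name in skills:
            if name in index and index[name] != repo:
                collisions.append((name, index[name], repo))
            else:
                index.setdefault(name, repo)
    return index, collisions
```

A real platform would resolve collisions with namespacing or quality-based ranking rather than first-come-first-kept, but the tension between a unified index and independently named repositories is the same.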