
Patterns

Elsevier BV

Preprints posted in the last 7 days, ranked by how well they match Patterns's content profile, based on 70 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit.

1
Interpretability as stability under perturbation reveals systematic inconsistencies in feature attribution

Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.

2026-04-22 health informatics 10.64898/2026.04.20.26351354 medRxiv
Top 0.1%
7.0%

Interpreting machine learning models typically relies on feature attribution methods that quantify the contribution of individual variables to model predictions. However, it remains unclear whether attribution magnitude reflects the true functional importance of features for model performance. Here, we present a unified interpretability framework integrating permutation-based attribution, feature ablation, and stability under perturbation across multiple feature spaces. Using nested cross-validation and permutation-based null diagnostics, we systematically evaluate the relationship between attribution magnitude and functional dependence in clinical and biomarker-based prediction models. Attribution magnitude is frequently misaligned with functional importance, with weak to strong negative correlations observed across feature spaces (Spearman ρ ranging from -0.374 to -0.917). Features with high attribution often have limited impact on model performance when removed, whereas features with low attribution can be essential for maintaining predictive accuracy. These discrepancies define distinct classes of interpretability failure, including attribution excess and latent dependence. Interpretability further depends on feature space composition, and stable, functionally relevant features are not necessarily those with the highest attribution scores. By integrating attribution, functional impact, and stability into a composite Feature Reliability Score, we identify features that remain informative across perturbations and analytical contexts. These findings indicate that interpretability does not arise from attribution magnitude alone but is better characterized by stability under perturbation. This framework provides a basis for more robust model interpretation and highlights limitations of attribution-centric approaches in high-dimensional and correlated data settings.
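The mismatch this abstract describes can be illustrated generically: permutation importance shuffles a column in place, while ablation importance refits the model without that column, and the two rankings need not agree. A minimal sketch on synthetic data (model, data, and helper names are all hypothetical illustrations, not the authors' framework):

```python
# Contrast permutation importance with drop-column (ablation) importance
# on a toy classifier; synthetic data, illustrative only.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
full_score = model.score(X_te, y_te)

# Permutation importance: held-out score drop when one column is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=20,
                              random_state=0).importances_mean

def ablation_importance(j):
    """Refit without column j; importance = drop in held-out accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.delete(X_tr, j, axis=1), y_tr)
    return full_score - clf.score(np.delete(X_te, j, axis=1), y_te)

ablation = np.array([ablation_importance(j) for j in range(X.shape[1])])

# Agreement between the two importance rankings; the abstract reports
# that this correlation can be strongly negative in practice.
rho, _ = spearmanr(perm, ablation)
```

Spearman's ρ between the two importance vectors then quantifies their (dis)agreement, as in the correlations the abstract reports.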

2
MutaPhy: A clade-based framework to detect genotype-phenotype associations on phylogenetic trees

Ngo, A.; Guindon, S.; Pedergnana, V.

2026-04-21 evolutionary biology 10.64898/2026.04.19.719535 medRxiv
Top 0.2%
3.6%

Understanding how genetic variation in pathogens influences clinical phenotypes observed in infected hosts is a fundamental challenge in evolutionary genomics and public health. Phenotypic traits such as infection severity are often non-randomly distributed within the pathogen's phylogeny, suggesting the existence of evolutionary determinants but also violating the independence assumption underlying classical genome-wide association studies and potentially leading to inflated false positive rates. We present MutaPhy, a phylogeny-based method aimed at detecting correlations between a binary host phenotype and the corresponding pathogen genome by directly utilizing the hierarchical structure of phylogenetic trees. MutaPhy encompasses three different scales: (i) a subtree scale, on which relevant clades over-representing the phenotype of interest are detected using permutation-based tests; (ii) a tree scale, which agglomerates local signals into a global association statistic; and (iii) a site scale, whereby candidate mutational events on branches leading to significant clades are examined using ancestral sequence reconstruction. We evaluate the statistical behavior and detection performance of MutaPhy using simulations under diverse evolutionary scenarios. We also compare this tool to several existing phylogenetic association methods. As illustrative applications, we apply MutaPhy to dengue virus and hepatitis C virus datasets associated with clinical phenotypes in human hosts. Our results highlight the ability of the proposed approach to detect viral lineages associated with over-represented phenotypes while revealing limited evidence for robust mutation-level associations in these particular datasets. Altogether, MutaPhy provides a framework for guiding genotype-phenotype association analyses by leveraging phylogenetic structure, thereby reducing false positive findings and improving the interpretability of association signals.
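The clade-scale permutation test described in (i) can be sketched generically: shuffle phenotype labels across tips and compare a clade's observed phenotype fraction with the permuted distribution. Toy data and a flat tip list, not the MutaPhy implementation:

```python
# Permutation test for phenotype over-representation in one clade.
# All tip data are invented for illustration.
import random

random.seed(0)
phenotypes = [1] * 8 + [0] * 32     # binary host phenotype across 40 tips
clade = list(range(10))             # tip indices forming the candidate clade

def clade_fraction(labels, tips):
    """Fraction of phenotype-positive tips inside the clade."""
    return sum(labels[i] for i in tips) / len(tips)

observed = clade_fraction(phenotypes, clade)

# Null distribution: shuffle phenotype labels over all tips.
n_perm, exceed = 10000, 0
for _ in range(n_perm):
    shuffled = phenotypes[:]
    random.shuffle(shuffled)
    if clade_fraction(shuffled, clade) >= observed:
        exceed += 1

p_value = (exceed + 1) / (n_perm + 1)   # permutation p-value with pseudocount
```

A small p-value flags the clade as over-representing the phenotype relative to a random scattering of labels on the tree.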

3
Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images

Sun, S.; Cai, C. X.; Fan, R.; You, S.; Tran, D.; Rao, P. K.; Suchard, M. A.; Wang, Y.; Lee, C. S.; Lee, A. Y.; Zhang, L.

2026-04-24 health informatics 10.64898/2026.04.23.26351616 medRxiv
Top 0.3%
2.9%

Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, with OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45-0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54-0.65]), followed by EHR+EyeCLIP (0.57 [0.52-0.62]) and EHR+VisionFM (0.56 [0.51-0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan-Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. 
Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
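The C-index used throughout this abstract is a rank statistic over comparable patient pairs: the fraction of pairs whose predicted risks are ordered consistently with their observed event times. A minimal self-contained sketch (toy values, not the study's pipeline):

```python
# Harrell's concordance index for right-censored survival data,
# implemented from the definition; illustrative only.
import itertools

def c_index(times, events, scores):
    """Fraction of comparable pairs whose risk scores are correctly ordered.

    times  : observed time to event or censoring
    events : 1 if the event was observed, 0 if censored
    scores : predicted risk (higher = event expected sooner)
    """
    concordant, ties, comparable = 0, 0, 0
    for i, j in itertools.combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue                       # skip tied times for simplicity
        first = i if times[i] < times[j] else j
        other = j if first == i else i
        if not events[first]:
            continue                       # earlier time censored: not comparable
        comparable += 1
        if scores[first] > scores[other]:
            concordant += 1
        elif scores[first] == scores[other]:
            ties += 1
    return (concordant + 0.5 * ties) / comparable

# Perfectly ordered risks give 1.0; perfectly anti-ordered give 0.0,
# and 0.5 corresponds to no discrimination, as in the EHR-only model above.
perfect = c_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1])
```

A C-index of 0.5, as reported for the EHR-only model, is the value expected from random risk ordering.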

4
Generalizing intensive care AI across time scales in resource-limited settings

Devadiga, A.; Singh, P.; Sankar, J.; Lodha, R.; Sethi, T.

2026-04-24 health informatics 10.64898/2026.04.23.26351588 medRxiv
Top 0.4%
2.7%

Temporal resolution of physiological monitoring in intensive care varies widely across healthcare systems. Artificial intelligence models typically assume a uniform, fixed sampling frequency, which limits their generalizability, especially to resource-limited settings. Here, we propose a novel resolution-transfer task for physiological time series and ask whether models trained on high-resolution data can generalize to a low data-density setting without the need to retrain them. SafeICU, a novel longitudinal pediatric intensive care dataset spanning ten years from a tertiary care hospital in India, was used to test this hypothesis. Self-supervised transformer models were trained on 144,271 patient-hours of high-resolution physiological signals from 984 pediatric ICU stays to learn representations of heart rate, respiratory rate, oxygen saturation, and arterial blood pressure. Transfer of this model to low-resolution data established robust performance at clinically relevant lower-frequency intervals, consistently outperforming models trained directly at coarser resolutions. Further, these representations generalized across patient populations, maintaining performance when evaluated on adult intensive care cohorts from the MIMIC-III and eICU databases without retraining. In a downstream task of early shock prediction, models achieved strong discrimination in the pediatric cohort (area under the receiver operating characteristic curve (AUROC) 0.87; area under the precision-recall curve (AUPRC) 0.92) and retained stable performance across monitoring intervals from 10 to 60 minutes (AUROC 0.78-0.88). Together, these results demonstrate that physiological representations learned from high-resolution data enable time-scale-robust and transferable AI for intensive care.
The publicly released SafeICU dataset, comprising longitudinal vital signs, laboratory measurements, treatment records, microbiology, and admission and discharge records, provides a foundation for developing and deploying generalizable clinical AI in resource-limited settings.
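The resolution-transfer setup evaluated here can be mimicked by deriving coarser-resolution views of a single high-resolution series. A pandas sketch on synthetic minute-level heart-rate data (hypothetical values, not the SafeICU pipeline):

```python
# Down-sample a high-resolution vital-sign series to the coarser
# monitoring intervals named in the abstract; synthetic data only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2026-01-01", periods=24 * 60, freq="min")  # 1 day, 1-min
hr = pd.Series(80 + rng.normal(0, 5, len(idx)), index=idx, name="heart_rate")

# Coarser views at clinically common intervals (10, 30, 60 minutes).
views = {m: hr.resample(f"{m}min").mean() for m in (10, 30, 60)}
```

A model trained on the minute-level series would then be evaluated on each coarser view without retraining, which is the transfer the abstract tests.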

5
Wavelet analysis reveals non-stationary cardiovascular rhythms associated with delirium and deep sedation in ICU patients

Sreekanth, J.; Salgado-Baez, E.; Edel, A.; Gruenewald, E.; Piper, S. K.; Spies, C.; Balzer, F.; Boie, S. D.

2026-04-23 health informatics 10.64898/2026.04.22.26351455 medRxiv
Top 0.6%
2.1%

Routine ICU data offers valuable insights into daily physiological rhythms. While traditional methods assume these cycles maintain fixed periods and amplitudes, their inherent variability requires dynamic estimation of instantaneous trends. The wavelet transform effectively resolves circadian oscillations, especially for frequently measured vital parameters. We present novel extensions to Continuous Wavelet Transform (CWT) power spectral analysis to better detect and segment subtle temporal patterns. Using this approach, we uncover hidden circadian patterns in cardiovascular vitals such as Heart Rate (HR) and Mean Blood Pressure (MBP) measured over five days in a retrospective cohort of 855 ICU patients. By quantifying non-stationary rhythms, we identified diurnal and semi-diurnal oscillations varying in period and power according to delirium and deep sedation. Notably, HR exhibits a clear diurnal and semi-diurnal rhythm when delirium is absent. Overall, our framework supports the CWT as a powerful tool for analyzing complex physiological signals, particularly vital signs. Crucially, our findings suggest that cardiovascular rhythm disruption can be associated with ICU-related delirium and deep sedation.
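Wavelet power at a candidate period, the core quantity behind such CWT analyses, can be sketched with a Morlet-like kernel. A synthetic hourly signal with a pure 24-hour rhythm (illustrative only, not the clinical pipeline or the authors' extensions):

```python
# Wavelet power at diurnal vs semi-diurnal periods for a synthetic
# hourly signal; a Morlet-like kernel implemented from scratch.
import numpy as np

t = np.arange(10 * 24)                 # ten days of hourly samples
signal = np.sin(2 * np.pi * t / 24)    # pure 24-hour (diurnal) rhythm

def morlet_power(x, period, width=6.0):
    """Mean squared magnitude of convolution with a Morlet-like wavelet."""
    s = width * period / (2 * np.pi)               # Gaussian scale for period
    tau = np.arange(-int(4 * s), int(4 * s) + 1)
    wavelet = (np.exp(1j * 2 * np.pi * tau / period)
               * np.exp(-tau**2 / (2 * s**2)))
    wavelet /= np.sqrt(np.sum(np.abs(wavelet) ** 2))   # unit energy
    conv = np.convolve(x, wavelet, mode="same")
    return np.mean(np.abs(conv) ** 2)

p24 = morlet_power(signal, 24)   # power at the diurnal period
p12 = morlet_power(signal, 12)   # power at the semi-diurnal period
```

For the purely diurnal input, power at the 24-hour period dominates the semi-diurnal one; in real vitals both bands carry signal, which is what the abstract quantifies.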

6
MedSAM2-CXR: A Box-Latent Framework for Chest X-ray Classification and Report Generation

Hakata, Y.; Oikawa, M.; Fujisawa, S.

2026-04-22 health informatics 10.64898/2026.04.20.26351338 medRxiv
Top 0.6%
1.9%

Who is affected: In Japan, approximately 100 million chest radiographs (CXRs) are acquired annually, while only about 7,000 board-certified diagnostic radiologists practice nationwide (Japan Radiological Society workforce statistics; OECD Health Statistics, most recent available year). This implies an average workload exceeding 10,000 imaging studies per radiologist per year if all CXRs were attributed to board-certified diagnostic radiologists (an upper-bound estimate, because in practice many CXRs are primarily read by non-radiologist physicians). In settings such as night shifts, weekends, remote islands, and regional care networks, non-radiologist physicians frequently act as primary readers. Despite strong demand for AI assistance, existing systems are typically limited by one of three shortcomings -- poor cross-institutional generalization, limited interpretability, or inability to generate draft reports -- and consequently see limited clinical deployment.
What we built: We propose a Box-Latent Trinity that embeds each image as a hyperrectangle parameterized by a center c and a radius r, rather than as a single point in a latent space. We further introduce BL-TTA (Box-Latent Test-Time Augmentation), which approximately closes the train-inference gap (exact in the N → ∞ limit; N = 8 suffices in practice) by averaging predictions over samples drawn from within the latent box at inference time. Both components are implemented on top of the frozen MedSAM2 medical imaging foundation model. A single box representation simultaneously supports three functions: (A) theoretically grounded source selection, (B) device-invariant augmentation, and (C) case-based retrieval-augmented generation (RAG). Each prediction is accompanied by retrieved similar prior cases, a calibrated confidence estimate, and clinical-guideline references.
How well it performs: On the Open-i CXR corpus (2,954 image-report pairs) under a patient-level 80/10/10 split and 5-seed reproducibility, the full system B5 achieves macro area under the receiver-operating-characteristic curve (macro-AUROC) 0.639 (best-seed test; 5-seed mean 0.626, Table 2; absolute +0.015 over the strongest same-backbone baseline, Merlin-style 0.624), elementwise accuracy 0.753 (absolute +0.072 over Merlin-style 0.681 -- equivalent to approximately 7 fewer label-level errors per 100 (label, image) predictions across 14 finding labels, not per 100 images), and report label-F1 0.435 (absolute +0.086, relative +25% over the strongest same-backbone report-generation baseline, Bootstrapping-style 0.349). Under simulated pixel-space device-shift intensities up to twice the training distribution, AUROC degrades by only 0.014. Brier score (macro) is 0.061; Cohen's κ between two independent rule-based label extractors is 0.702 (substantial agreement); the box radius yields an out-of-distribution (OOD) detection AUROC of 0.595; and the framework provides four structural explainable-AI (XAI) outputs -- retrieved similar cases, confidence tier, per-axis uncertainty, and visual saliency -- which we jointly quantify in a single CXR study, a combination that, to our knowledge, has not been reported previously.
Path to deployment: Because the complete experiment can be reproduced in under two hours on a consumer-grade GPU (NVIDIA RTX 4060, 8 GB VRAM), the framework can run on compute resources already available at typical healthcare institutions. The approach thus supports the practical delivery of evidence-grounded diagnostic support to night shifts, remote-island care, and secondary readings in health checkups -- settings in which a board-certified radiologist is not locally available.
One-sentence summary: Reproducible end-to-end in under two hours on a single consumer-grade GPU, the proposed framework outperforms the strongest same-backbone medical-AI baselines on three principal metrics, maintains accuracy under simulated device shifts, and automatically drafts evidence-grounded radiology reports, offering a reproducible and compute-efficient direction toward reducing the reading burden of Japanese radiologists, subject to external validation.
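The BL-TTA averaging step described in this abstract reduces, in the simplest view, to sampling the latent box and averaging a head's predictions. A toy sketch with a hypothetical linear classifier head (dimensions, parameters, and the head itself are invented stand-ins, not the MedSAM2-CXR code):

```python
# Box-latent test-time averaging: sample N points uniformly from
# [c - r, c + r] and average a toy sigmoid head's outputs.
import numpy as np

rng = np.random.default_rng(0)
d = 16                        # latent dimensionality (illustrative)
c = rng.normal(size=d)        # box center for one image
r = 0.1 * np.ones(d)          # per-axis box radius

w, b = rng.normal(size=d), 0.0          # hypothetical linear head

def predict(z):
    """Sigmoid probability from the toy linear head."""
    return 1.0 / (1.0 + np.exp(-(z @ w + b)))

N = 8                                    # N = 8 suffices per the abstract
samples = rng.uniform(c - r, c + r, size=(N, d))
p_tta = predict(samples).mean()          # box-averaged prediction
p_point = predict(c)                     # single-point prediction for contrast
```

The abstract's claim is that this averaging becomes exact as N → ∞; the sketch only shows the mechanics, not that convergence.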

7
Evaluating MaxEnt Modeling Strategies for Predicting Suitable Habitats of Invasive Insects Under Climate Change Scenarios

Chouhan, P.; Zavala-Romero, O.; Haseeb, M.

2026-04-21 ecology 10.64898/2026.04.18.719331 medRxiv
Top 0.7%
1.9%

Invasive insect species pose serious threats to agriculture and ecosystems, with their spread increasingly accelerated by global trade and climate change. To support prevention and mitigation efforts, it is essential to map the regions where these pests can survive and thrive. Here, we apply MaxEnt, a leading species distribution modeling framework, to estimate current (2020) and future (2040-2060) suitable habitats for five major invasive insects across the contiguous United States: brown marmorated stink bug, corn earworm, spongy moth, root weevil, and spotted lanternfly. To account for an uncertain climatic future, these projections are generated under four shared socioeconomic pathways, which reflect a range of plausible climate change scenarios. Beyond forecasting distributions, we examine several key modeling decisions, especially those often overlooked in practice. In particular, we find that background sampling strategies play a critical role in model calibration and that a hybrid sampling approach with a moderate buffer bias provides better predictive accuracy. We also show that permutation importance scores, commonly used to rank environmental variables, are highly sensitive to small changes in the background data and should be interpreted with caution. Finally, to bridge the gap between ecological modeling and applied machine learning, we provide a self-contained, math-focused background to MaxEnt aimed at practitioners outside of traditional ecological fields. Overall, this work delivers reproducible modeling workflows and critical insights into building robust, transparent, and ecologically meaningful MaxEnt models for climate-informed species distribution analysis.

8
MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics 10.64898/2026.04.14.26350711 medRxiv
Top 0.8%
1.7%

MedSafe-Dx (v0) introduces a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Eleven models were evaluated, revealing a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
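A composite pass rate penalizing hard failure modes, in the spirit of the Safety Pass Rate described above, might be computed like this (field names and cases are hypothetical illustrations, not the MedSafe-Dx schema):

```python
# Composite "safety pass" scoring: a case passes only if none of the
# three hard failure modes occurred. Cases are invented for illustration.
cases = [
    {"missed_escalation": False, "overconfident_wrong": False, "unsafe_reassurance": False},
    {"missed_escalation": True,  "overconfident_wrong": False, "unsafe_reassurance": False},
    {"missed_escalation": False, "overconfident_wrong": False, "unsafe_reassurance": False},
    {"missed_escalation": False, "overconfident_wrong": True,  "unsafe_reassurance": False},
]

def safety_pass_rate(results):
    """Fraction of cases exhibiting none of the hard failure modes."""
    passed = sum(1 for r in results if not any(r.values()))
    return passed / len(results)

rate = safety_pass_rate(cases)   # 2 of 4 toy cases pass
```

Because a single failure mode fails the whole case, this metric is deliberately stricter than per-dimension accuracy, which is how high recall and low safety can diverge.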

9
Benchmarking Generative Large Language Models for de novo Antibody Design and Agentic Evaluation

Hossain, D.; Abir, F. A.; Zhang, S.; Chen, J. Y.

2026-04-21 bioinformatics 10.64898/2026.04.18.716776 medRxiv
Top 0.9%
1.7%

Despite major advances in computational antibody engineering, no systematic comparison of modern open-source LLM backbone families for antibody sequence generation exists, nor is it known whether architectural differences matter at compact model scales. In this study, five compact transformer variants inspired by prominent open-source LLM families (Llama-4, Gemma-3, DeepSeek-V3, Mistral 7B, and NVIDIA Nemotron-3) were customized and trained from scratch for de novo VH single-domain antibody (sdAb) design. All five models were pretrained from scratch on 15 million sequences from the Observed Antibody Space (OAS) database. Pretraining yielded uniformly high generative fidelity across architectures: sequence diversity 0.507-0.516 (CV=0.8%), uniqueness approaching 1.0, and novelty 0.925-0.977 (CV=2.2%). The models were subsequently fine-tuned on disease-stratified repertoires spanning SARS-CoV-2 (n=4,688), HIV (n=430), HER2 (n=22,778), and Ebola virus (n=2,868). Structural assessment of top-ranked candidates of those case studies via AlphaFold-2, Boltz-2, RoseTTAFold-2, and ESMFold produced mean pLDDT scores of 92.88±1.54 to 93.77±2.16, with no statistically significant inter-model differences (Kruskal-Wallis H=2.06, p>0.05; N=100), indicating no statistically detectable difference across architectures at this compressed scale in a single-seed experiment and suggesting that generative capacity in this parameter regime is primarily determined by training data and model scale rather than family-specific design elements.
Computational docking yielded predicted binding free energies of -36.34 to -65.60 kcal/mol; independent biological rigor validation through IMGT-defined CDR-H3 extraction, BLASTp novelty assessment, and NetMHCIIpan 4.3 MHC-II immunogenicity profiling collectively confirmed antigen-binding loop novelty (CDR-H3 identity 0-29% to closest database hits), germline-consistent humanness (77-90% VH germline content), and immunogenically silent antigen-binding surfaces with no strong MHC-II binders detected across CDR regions in any candidate. We further introduce a proof-of-concept agentic evaluation pipeline leveraging the Model Context Protocol (MCP) with Claude Sonnet 4.6, enabling automated structural profiling and candidate prioritization across disease targets.
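The uniqueness and novelty statistics quoted in this abstract have simple set-based definitions: uniqueness is the fraction of generated sequences that are distinct, and novelty is the fraction of distinct sequences absent from the training set. A toy sketch (invented sequences, not the study's evaluation code):

```python
# Set-based uniqueness and novelty metrics for generated sequences.
# The sequences below are arbitrary toy strings, not real antibodies.
generated = ["QVQLVE", "EVQLLE", "QVQLVE", "DVQLVQ"]   # one duplicate
training = {"EVQLLE", "QVKLEE"}                         # toy training set

unique = set(generated)
uniqueness = len(unique) / len(generated)        # 3 distinct of 4 generated
novelty = len(unique - training) / len(unique)   # 2 of 3 unseen in training
```

Real evaluations typically apply these at scale and pair them with diversity measures, since a model can be novel yet collapse to a narrow sequence family.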

10
Dissecting the coordinated progression of cell states in spatial transcriptomics with CoPro

Miao, Z.; Qu, Y.; Huang, S.; Laux, L.; Peters, S.; Aristel, A.; Zhang, Z.; Niedernhofer, L. J.; McMahon, A.; Kim, J.; Zhang, N.

2026-04-21 bioinformatics 10.64898/2026.04.17.719309 medRxiv
Top 1.0%
1.7%

Spatial transcriptomics enables the study of how cells coordinate their molecular states within tissue, providing insight into both normal function and disease processes. A key challenge is to identify gene expression programs that vary continuously across space and are coordinated between cell types. We present CoPro, a computational framework for detecting the spatially coordinated progression of cellular states. CoPro can operate in both supervised and unsupervised modes to identify gene programs that co-vary within or between cell types, and to disentangle multiple overlapping spatial patterns. CoPro can be applied to single-cell-level spatial transcriptomics datasets, including MERFISH, SeqFISH+, Xenium, and histology-imputed transcriptomic data. We demonstrate the utility of CoPro with data collected from colon, brain, liver, and kidney tissues. In the colon, CoPro separates epithelial differentiation along the crypt axis from spatially localized inflammatory signals. In the aging liver, it identifies multiple aging-associated cellular programs superimposed on anatomical zonation. In the brain, the flexible kernel design enables the decoupling of the gene expression gradient along the dorsal-ventral and medial-lateral axes. In the kidney, CoPro identifies tubule-vasculature coordination that is essential in nephron function. These results demonstrate CoPro's utility for analyzing spatial coordination of gene expression in complex tissues and disentangling overlapping biological processes, such as anatomical organization and disease-associated variation.

11
Benchmarking single-cell foundation models for real-world RNA-seq data integration

Han, S.; Sztanka-Toth, T.; Senel, E.; Elnaggar, A.; Patel, J.; Mansi, T.; Smirnov, D.; Greshock, J.; Javidi, A.

2026-04-21 bioinformatics 10.64898/2026.04.17.719314 medRxiv
Top 1.0%
1.7%

Single-cell foundation models enable reusable representations and streamlined analysis workflows, yet rigorous evaluation of their performance and robustness in real-world pharmaceutical settings remains underexplored. Here, we benchmarked leading single-cell foundation models (scGPT; scGPT_CP, a continually pretrained checkpoint of scGPT; scFoundation; scMulan; CellFM) against established baseline methods (scVI; Harmony) for data integration using over 1.5 million cells from clinical and preclinical samples. Performance was assessed using well-established and complementary metrics for technical correction and biological structure preservation. We further introduced robustness-oriented rankings to summarize metric trade-offs and quantify performance consistency across datasets and evaluation settings. Our findings show that fine-tuning improved technical correction performance; among the foundation models, fine-tuned scGPT_CP performed best. However, the baseline scVI was the top overall performer, ranking first by our multi-metric Leximax ranking and achieving the highest Pareto Front-1 hit. Collectively, our study provides practical insights for adapting foundation models to real-world drug design and development.
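A leximax ranking of the kind referenced here compares methods by their sorted-worst metric vectors: the winner maximizes its worst metric, then its second-worst, and so on. A sketch with hypothetical scores (method names from the abstract, numbers invented, not the benchmark's results):

```python
# Leximax-style robustness ranking over per-method metric vectors.
# Scores are invented for illustration.
methods = {
    "scVI":     [0.80, 0.75, 0.82],
    "scGPT_CP": [0.85, 0.60, 0.90],
    "Harmony":  [0.78, 0.74, 0.79],
}

def leximax_key(scores):
    """Sort metrics ascending so the worst value is compared first."""
    return sorted(scores)

# Descending lexicographic order on worst-first vectors.
ranking = sorted(methods, key=lambda m: leximax_key(methods[m]), reverse=True)
```

In this toy example scVI wins despite scGPT_CP having the single best metric, because leximax rewards a high floor over a high peak, which is the robustness trade-off the abstract emphasizes.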

12
Data Resource Profile: EST-Health-30

Reisberg, S.; Oja, M.; Mooses, K.; Tamm, S.; Sild, A.; Talvik, H.-A.; Laur, S.; Kolde, R.; Vilo, J.

2026-04-24 epidemiology 10.64898/2026.04.21.26351087 medRxiv
Top 1%
1.5%

Background: The increasing availability of routinely collected health data offers new opportunities for population-level research, yet access to comprehensive, linked, and standardised datasets remains limited. We describe EST-Health-30, a large-scale, population-representative health data resource from Estonia. Methods: EST-Health-30 comprises a random 30% sample of the Estonian population (~500,000 individuals), with longitudinal data from 2012 to 2024 and annual updates planned through 2026. Individual-level records are linked across five nationwide databases, including electronic health records, health insurance claims, prescription data, cancer registry, and cause of death records. A privacy-preserving hashing approach ensures consistent cohort inclusion over time while maintaining pseudonymisation. All data are harmonised to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (version 5.4) using international standard vocabularies. Data quality was assessed using established OMOP-based validation frameworks. Results: The dataset contains rich multimodal information on diagnoses, procedures, laboratory measurements, prescriptions, free-text clinical notes, healthcare utilisation, and costs, with high population coverage and longitudinal depth. Data quality assessment showed high completeness and consistency, with 99.2% of applicable checks passing. The age-sex distribution closely reflects the national population, supporting representativeness, though coverage is marginally below the target 30% (29.2%), primarily attributable to recent immigrants without health system contact. The dataset enables construction of detailed clinical cohorts, analysis of disease trajectories, and evaluation of healthcare utilisation and outcomes across the life course. Conclusions: EST-Health-30 is a comprehensive, standardised, and population-representative real-world data resource that supports epidemiological, clinical, and methodological research. 
Its alignment with the OMOP CDM facilitates reproducible analytics and participation in international federated research networks, while secure access infrastructure ensures compliance with data protection regulations.

13
SpaFlow depicts the dynamics of ligand-receptor interaction in spatial transcriptomics data

Chen, H.; Wang, X.; Sun, Y.; Vanegas, N. D. P.; Rodriguez, J.; Ghobashi, A.; Ma, A.; Mora, A. L.; Rojas, M.; Ma, Q.

2026-04-21 bioinformatics 10.64898/2026.04.17.719264 medRxiv
Top 1%
1.5%

Spatial transcriptomics (ST) enables the study of cell-cell communication in native tissue context, but current methods for ligand-receptor interaction (LRI) inference generally rely on static, distance-based assumptions. Here we present SpaFlow, a reaction-diffusion framework that models ligand diffusion, binding, dissociation, production and degradation to infer spatially resolved LRI activity and hotspots from ST data. Across paired 10x Visium and CosMx metastatic renal cell carcinoma datasets, SpaFlow outperformed existing methods in recovering spatially coherent LRIs, with inferred LRI activity showing stronger association with downstream signaling. In hepatocellular carcinoma after neoadjuvant immunotherapy, SpaFlow identified CXCL12-CXCR4 hotspots enriched at immune-rich tumor boundaries in responders. In aging mouse heart, SpaFlow resolved niche-specific pro-fibrotic and senescence-associated signaling, highlighting Postn-Itgav/Itgb5 as an additional pro-fibrotic axis and Angptl2-Pirb as a candidate mediator of inter-niche senescence-related communication. In human idiopathic pulmonary fibrosis lung, SpaFlow localized CXCL12-CXCR4 signaling between adventitial fibroblasts and CD4 T cells, CD8 T cells, and B cells in the fibrotic surrounding regions. Together, SpaFlow provides a physically informed framework for quantifying spatially constrained cell-cell communication and mechanistically interpreting signaling patterns in complex tissues.
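The reaction-diffusion idea behind such a framework can be illustrated in one dimension: a ligand produced at a boundary diffuses through tissue and degrades, producing a spatial profile that decays with distance from the source. Parameters below are arbitrary illustrations, not SpaFlow's model terms or fitted values:

```python
# Explicit finite-difference sketch of 1D ligand diffusion with
# first-order degradation and a constant boundary source.
import numpy as np

n, D, k, dt = 50, 1.0, 0.1, 0.1   # grid points, diffusion, decay, timestep
u = np.zeros(n)                    # ligand concentration along a tissue strip

for _ in range(20000):             # iterate toward steady state
    lap = np.zeros(n)
    lap[1:-1] = u[2:] - 2.0 * u[1:-1] + u[:-2]   # discrete Laplacian
    u += dt * (D * lap - k * u)    # diffusion minus degradation
    u[0] = 1.0                     # constant ligand source at the left edge
```

At steady state the profile decays roughly exponentially with length scale sqrt(D/k), which is the sense in which diffusion-based models relax the fixed-distance assumptions criticized in the abstract.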

14
Development of an original algorithm to characterize serological antibody response that improve infectious diseases surveillance

Razafimahatratra, S. L.; Rasoloharimanana, L. T.; Andriamaro, T. M.; Ranaivomanana, P.; Schoenhals, M.

2026-04-24 epidemiology 10.64898/2026.04.16.26350925 medRxiv
Top 1%
1.3%

Interpreting serological data remains challenging, particularly in low-prevalence or cross-reactive contexts, where antibody responses often show substantial overlap between exposed and unexposed individuals and may depart from normal distributional assumptions. Conventional cutoff-based approaches often yield inconsistent or biased estimates of seroprevalence. Here, we present a decisional framework based on finite mixture models (FMMs) that enhances the robustness and interpretability of serological analyses. Beyond simply applying mixture models, our framework integrates multiple methodological innovations: (i) systematic comparison of Gaussian and skew-normal mixture models to accommodate asymmetric antibody distributions; (ii) rigorous model selection using the Cramér-von Mises test (p > 0.01) combined with a parsimonious score (APS) to prioritize models with well-separated clusters; and (iii) hierarchical clustering of posterior probabilities to collapse latent components into biologically meaningful seronegative and seropositive groups. Applied to chikungunya virus (CHIKV) data from Bangladesh, the framework produced prevalence estimates consistent with ROC-based methods while probabilistically identifying borderline cases. Validation on SARS-CoV-2 and dengue datasets further demonstrated its generalizability: for SARS-CoV-2, the approach identified up to five latent clusters with high sensitivity (up to 100%) and specificity (up to 100%), enabling discrimination by disease severity. For dengue, it revealed interpretable subgrouping consistent with background exposure and subclinical infection, despite limited confirmed cases. By integrating distributional flexibility, robust goodness-of-fit testing, and biologically guided cluster consolidation, this decisional FMM framework provides a reproducible and scalable method for serological interpretation across pathogens and epidemiological settings, addressing key limitations of threshold-based classification.
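The core finite-mixture step, before the skew-normal and APS refinements the authors add, can be sketched with a plain two-component Gaussian mixture on synthetic log-titres (illustrative data and sklearn's EM, not the authors' framework):

```python
# Two-component Gaussian mixture on synthetic antibody titres:
# posterior probability of the higher-mean component acts as a
# probabilistic seropositivity call. Data are simulated.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 0.5, 300)    # simulated seronegative titres
pos = rng.normal(2.5, 0.7, 100)    # simulated seropositive titres
titres = np.concatenate([neg, pos]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(titres)
pos_comp = int(np.argmax(gmm.means_.ravel()))        # higher-mean component
posterior = gmm.predict_proba(titres)[:, pos_comp]   # P(seropositive | titre)
seroprevalence = posterior.mean()                    # ~0.25 for this mix
```

Unlike a hard cutoff, the posterior assigns borderline titres intermediate probabilities, which is the probabilistic handling of borderline cases the abstract highlights.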

15
Mechanistic learning to predict and understand minimal residual disease

Marzban, S.; Robertson-Tessi, M.; West, J.

2026-04-21 cancer biology 10.64898/2026.04.16.718968 medRxiv
Top 1%
1.2%

Mechanistic modeling has long been used as a tool to describe the dynamics of biological systems, especially cancer in response to treatment. Its key advantage lies in the interpretability of relationships between input parameters and outcomes of interest. In contrast, machine learning techniques offer strong prediction performance, especially for the high-dimensional datasets that are common in oncology. Here, we employ a Mechanistic Learning framework that combines the advantages of both approaches by training machine learning models on mechanistic parameters inferred from clinical patient data. The mechanistic model (a Markov chain model) contains sixteen parameters that describe the rate of cell fate transitions that occur in patients with B-cell precursor acute lymphoblastic leukemia. The machine learning model (a ridge logistic regression) is trained on these parameters to predict two clinically relevant features: BCR::ABL1 fusion gene status (positive or negative) and minimal residual disease status (positive or negative) post-induction chemotherapy. Model training is done in an iterative fashion to assess which (and how many) parameters are critical to maintain high predictive performance. Using machine learning models trained on the clinical flow-cytometry data, we find that the stem-like cell state alone is the most predictive feature for both BCR::ABL1-positive and MRD-positive disease, with combination scores (defined as the average of accuracy, balanced accuracy, and area under the curve) of 0.80 and 0.67, respectively. By comparison, mechanistic learning achieves comparable or improved combination scores for BCR::ABL1-positive and MRD-positive disease, with scores of 0.81 and 0.71, respectively, using only de-differentiation for BCR::ABL1 and primitive-state persistence together with differentiation-directed exit for MRD. Thus, the mechanistic-learning approach not only preserves predictive performance, but also provides a biological hypothesis for why stemness is predictive of these clinically relevant outcomes.
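The two-stage idea (mechanistic parameters in, ridge logistic regression out) can be sketched as below. The three parameter names and the synthetic data are hypothetical stand-ins for the paper's sixteen Markov-chain parameters, and this gradient-descent fit is a sketch, not the authors' pipeline.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_logistic(X, y, lam=0.1, lr=0.5, epochs=500):
    """Full-batch gradient descent on L2-penalised logistic loss.
    X: rows of mechanistic parameters, e.g. (hypothetically)
    [stem_like_rate, dedifferentiation_rate, differentiation_exit_rate]."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wi for wi in w]  # gradient of the ridge penalty
        gb = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                gw[j] += err * xi[j] / n
            gb += err / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b
```

Inspecting the fitted weights is the "which parameters matter" step: if the stemness-style parameter dominates, the model's predictions hinge on it, mirroring the paper's iterative feature-pruning logic.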

16
OpenEvidence errs on the safe side in a structured test of triage recommendations

Jia, E.; Omar, M.; Barash, Y.; Brook, O. R.; Ahmed, M.; Kruskal, J. B.; Gorenshtein, A.; Klang, E.

2026-04-24 health informatics 10.64898/2026.04.23.26351526 medRxiv
Top 2%
1.2%

Ramaswamy et al. recently reported in Nature Medicine that ChatGPT Health, a consumer-facing health AI tool, undertriaged 51.6% of true emergencies. It was also susceptible to social anchoring in a structured stress test of triage recommendations. We applied the same vignette-based benchmark to OpenEvidence, a widely used physician-facing AI platform for clinical decision support. The benchmark included 960 prompts across 21 clinical domains (Supplementary Table S3). OpenEvidence undertriaged 12.5% of emergencies, a four-fold reduction relative to ChatGPT Health. It also showed no anchoring effect. Its errors skewed in a safer direction, including 68.0% overtriage of Home presentations. In 65 of 960 responses (6.8%), it declined to assign a triage level. These refusals occurred only in symptom-only prompts and never in urgent or emergency cases. Performance improved when objective clinical data were provided. Under the same benchmark, a widely used physician-facing system showed a different safety profile from a consumer-facing one. This suggests that who a health AI is built for can shape how it fails.
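The benchmark's scoring can be illustrated with a small counter over ordinal triage levels. The level names and example cases below are assumptions for the sketch, not the published rubric.

```python
# Ordinal triage levels, least to most acute (hypothetical labels).
LEVELS = {"home": 0, "primary care": 1, "urgent care": 2, "emergency": 3}

def score_triage(cases):
    """cases: list of (true_level, predicted_level); None marks a refusal."""
    counts = {"match": 0, "undertriage": 0, "overtriage": 0, "refusal": 0}
    for true, pred in cases:
        if pred is None:
            counts["refusal"] += 1
        elif LEVELS[pred] < LEVELS[true]:
            counts["undertriage"] += 1
        elif LEVELS[pred] > LEVELS[true]:
            counts["overtriage"] += 1
        else:
            counts["match"] += 1
    return counts

def undertriage_rate(cases, level="emergency"):
    """Share of true cases at `level` that were assigned a lower level."""
    subset = [(t, p) for t, p in cases if t == level]
    if not subset:
        return 0.0
    return score_triage(subset)["undertriage"] / len(subset)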

17
diagFDR: Verifiable False Discovery Rate Reporting in Proteomics via Scope, Calibration, and Stability Diagnostics

Chion, M.; Godmer, A.; Douche, T.; Matondo, M.; Giai Gianetto, Q.

2026-04-20 bioinformatics 10.64898/2026.04.16.718468 medRxiv
Top 2%
1.1%

In mass spectrometry-based proteomics, false discovery rate (FDR) control underpins the credibility of peptide and protein identifications. In contemporary workflows, including multi-run Data Independent Acquisition (DIA), deep learning-assisted scoring, library-free searches, and extensive post-processing, the statement "1% FDR" has become increasingly ambiguous, potentially referring to different statistical entities, multiple-testing scopes, and null models. We propose a standardized framework requiring explicit specification of three complementary properties: "scope", meaning which statistical universe is controlled; "calibration", meaning whether confidence measures behave consistently with their intended interpretation on the reported unit; and "stability", meaning whether acceptance thresholds and resulting identification lists remain robust to perturbations. Building on routine target/decoy outputs, we introduce pipeline-agnostic diagnostics that audit internal coherence of scores, q-values, and posterior error probabilities, quantify tail support and cutoff fragility, and test plausibility of target-decoy assumptions. We further complement internal checks with external validation via entrapment, which measures empirical false positives on known-absent sequences. We highlight a "granularity paradox": as scoring becomes more discriminative, decoy matches can become so sparse near stringent cutoffs that the numerical support for decoy-based estimation deteriorates, making reported FDR thresholds increasingly fragile despite improved separation between the distributions of target and decoy scores. Applications to DIA-NN and MS2Rescore show that scope and aggregation choices can materially alter both estimated error rates and list reproducibility. We provide a practical reporting checklist and an open-source R package (diagFDR, available from CRAN) that generates diagnostic reports from standard software outputs.
As a minimal verifiable reporting standard, we recommend that any "FDR = %" claim specify the controlled unit and scope, report tail support at the operating cutoff, and make decoy-inclusive outputs available for independent verification.

Highlights
- FDR claims can be misleading without explicit scope, calibration, and stability assessment.
- diagFDR introduces pipeline-agnostic diagnostics from standard software outputs.
- The granularity paradox shows sparse decoy tails can make stringent cutoffs numerically fragile.
- Case studies show that scope misuse and rescoring can affect both error rates and stability.
- diagFDR produces reviewer-ready reports and a practical reporting checklist.
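The decoy-based FDR estimate and its "tail support" diagnostic can be sketched in a few lines of Python. This uses the naive decoys/targets ratio without the +1 correction or the q-value monotonisation that production pipelines apply, so it is an illustration of the granularity problem, not diagFDR itself.

```python
def decoy_fdr_curve(scores):
    """scores: list of (score, is_decoy); higher scores are better matches.
    Walks down the ranked list, tracking the naive decoy-based FDR estimate
    (decoys / targets) and the decoy tail support at each cutoff."""
    ranked = sorted(scores, key=lambda s: -s[0])
    curve, targets, decoys = [], 0, 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        curve.append((score, decoys / max(targets, 1), decoys))
    return curve

def cutoff_at_fdr(curve, alpha=0.01):
    """Most permissive cutoff whose estimated FDR is within alpha, together
    with the number of decoys supporting that estimate (None if none)."""
    best = None
    for score, fdr, decoys in curve:
        if fdr <= alpha:
            best = (score, decoys)
    return best
```

When the decoy count at the chosen cutoff is 0 or 1, the FDR estimate rests on almost no empirical support, which is exactly the fragility the granularity paradox describes.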

18
Predicting Traffic Accident Injury Severity Using Ensemble Machine Learning Models: Incident Level and Generalized Insights via Explainable AI

Zhang, E. R.; Mermer, O.; Demir, I.

2026-04-20 occupational and environmental health 10.64898/2026.04.13.26350778 medRxiv
Top 2%
1.0%

Road traffic accidents represent a global public safety crisis, necessitating advanced computational tools for accurate injury severity prediction and effective decision support. This study evaluates high-performing ensemble machine learning models, including AdaBoost, XGBoost, LightGBM, HistGBRT, CatBoost, Gradient Boosting, NGBoost, and Random Forest, using a comprehensive National Highway Traffic Safety Administration (NHTSA) dataset from 2018 to 2022. While all models demonstrated exceptional predictive accuracy, with HistGBRT achieving the highest overall accuracy of 92.26%, a defining achievement of this work is the perfect classification (100% precision and recall) of fatal injuries across all ensemble architectures. To bridge the gap between predictive performance and actionable intelligence, this research integrates SHapley Additive exPlanations (SHAP) to provide both global insights into dataset-wide risk factors and local, instance-specific rationales for individual crash events. The global analysis identified ethnicity, airbag deployment, and harmful event type as primary drivers of injury severity, while local force and waterfall plots revealed the precise "push and pull" of variables for specific incidents. The results offer a robust, interpretable framework for stakeholders tasked with improving traffic safety and mitigating crash-related harm.
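SHAP approximates Shapley values at scale; for intuition, exact Shapley attributions for a tiny model can be computed by brute force over feature orderings (feasible only for a handful of features). The linear model in the usage below is a made-up example, not the NHTSA ensemble.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one instance: average each feature's
    marginal contribution over every feature ordering, replacing features
    outside the coalition with their baseline value."""
    d = len(x)
    phi = [0.0] * d
    orders = list(permutations(range(d)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]  # add feature j to the coalition
            now = predict(current)
            phi[j] += now - prev
            prev = now
    return [p / len(orders) for p in phi]
```

The attributions satisfy the efficiency property: they sum to the gap between the instance prediction and the baseline prediction, which is what makes local "push and pull" force plots add up.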

19
ReviewBench: An Extensible Framework for Benchmarking Human and AI Manuscript Review

Khalil, N. N.; Reed, T. J.; Ciccozzi, M. R.

2026-04-20 scientific communication and education 10.64898/2026.04.17.719279 medRxiv
Top 2%
1.0%

The volume of scientific manuscripts is rising faster than the available pool of expert reviewers, and AI tools are emerging as a possible response, ranging from frontier large language models applied directly to peer review to purpose-built multi-agent systems. Scalable, standardized benchmarks are needed to regularly evaluate how these tools compare to one another and to human reviewers. We present ReviewBench, an open-source, venue-agnostic framework that compares human and AI reviews across structure, alignment with a paper's major claims, impact, and critique category. We apply ReviewBench to 145,021 review comments from human reviewers, frontier large language models (GPT-5.2 and Gemini 3 Pro), and Reviewer3.com (R3), a multi-agent peer review system. The dataset spans papers in computer science (ICLR 2025, n = 1,000), social science (Nature Human Behaviour, n = 142), and life science (eLife, n = 1,000). Across disciplines, AI reviews are more structured and engage more directly with a paper's major claims, with R3 more often surfacing consequential comments, defined as comments capable of undermining those claims. When restricting to critical comments, however, human reviewers rank first on consequential rate on more individual papers than any AI source, despite a lower average. We identify a bimodal reviewer distribution with peaks near 0% and 100%, indicating that many reviewers outperform AI on this metric, but a substantial fraction of reviewers near 0% brings the average down. Critique typing demonstrates systematic differences, where humans emphasize contribution and clarity, while AI emphasizes validity, sufficiency, and transparency. Together, these findings argue against framing AI as a replacement for human review and instead support a complementary model in which AI scales technical verification of major claims while human judgment remains essential for evaluating contribution and shaping editorial decisions.
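The per-reviewer "consequential rate among critical comments" metric can be computed with a small tally. The field names here are hypothetical stand-ins, not ReviewBench's actual schema.

```python
from collections import defaultdict

def consequential_rates(comments):
    """Per-reviewer consequential rate among critical comments only.
    comments: dicts with 'reviewer', 'critical', 'consequential' keys
    (hypothetical field names)."""
    critical = defaultdict(int)
    consequential = defaultdict(int)
    for c in comments:
        if c["critical"]:
            critical[c["reviewer"]] += 1
            if c["consequential"]:
                consequential[c["reviewer"]] += 1
    return {r: consequential[r] / critical[r] for r in critical}
```

A histogram of these per-reviewer rates is the kind of summary that would surface the bimodal 0%/100% pattern the authors report.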

20
DOME Copilot: Making transparency and reproducibility for artificial intelligence methods simple

Farrell, G.; Attafi, O. A.; Fragkouli, S.-C.; Heredia, I.; Fernandez Tobias, S.; Harrison, M.; Hermjakob, H.; Jeffryes, M.; Obregon Ruiz, M.; Pearce, M.; Pechlivanis, N.; Lopez Garcia, A.; Psomopoulos, F.; Tosatto, S. C. E.

2026-04-19 bioinformatics 10.64898/2026.04.16.718888 medRxiv
Top 2%
1.0%

Unprecedented breakthroughs are being made in life science research through the application of artificial intelligence (AI). However, adherence to method reporting guidelines is necessary to support their reusability and reproducibility. The DOME Copilot solution extracts structured reports of AI methods using a large language model to help interpret manuscripts. It is a fast and efficient resource capable of scaling to annotate the corpus of global AI literature, unlocking value and trust in published methods.