Patterns — Latest Matching Preprints

1

Harnessing AI and social media to understand real-world patient experiences in systemic lupus erythematosus

Yang, S.; Hawryluk, C.; Liu, J.; Eckert, N.; Otoo, J.; Vina, E. R.; Yao, L.

2026-02-22 rheumatology 10.64898/2026.02.20.26346724 medRxiv

Top 0.1%

23.3%

Objective: To apply large language models (LLMs) to Reddit posts referencing systemic lupus erythematosus (SLE) to identify patient-expressed unmet medical needs, symptom experiences, and healthcare challenges, demonstrating how AI-enabled social media listening complements traditional patient-experience research. Methods: We extracted 4,633 posts from ten SLE-related or health-focused Reddit communities using the public Reddit API (October-November 2025). After removing duplicates, promotional content, and posts with insufficient information, 2,603 posts remained. A thematic codebook was developed through manual review of 300 posts and iteratively refined. Two LLMs (Gemini 3.0 and GPT-5.2) were evaluated for automated thematic labeling using percent agreement, Cohen's κ, and a human-annotated reference set (n=100). The best-performing model was used to quantify theme prevalence, followed by qualitative review of representative narratives. Results: GPT-5.2 demonstrated higher performance (F1=0.844) than Gemini 3.0 (F1=0.811), with substantial inter-model agreement across main themes (mean κ=0.71). Posts reflected multidimensional experiences. The most frequent subtheme was Advice Seeking (84.1%), followed by Emotional Coping (55.6%). Common symptom-related themes included Pain (37.2%), Other Symptom Presentations (37.6%), Fatigue (24.7%), and Acute or Worsening Flares (30.2%). Diagnostic uncertainty was prominent, including confusion about laboratory results (24.0%) and emotional impact of uncertainty (33.0%). Qualitative review highlighted emotional distress, reliance on peer communities for interpretation of symptoms and labs, and difficulty managing complex treatment regimens. Conclusion: LLM-enabled social media listening offers a scalable method for synthesizing large volumes of unstructured patient narratives, providing timely insights into lived experiences and unmet needs among individuals discussing lupus online. 
Findings align with established qualitative literature while highlighting persistent gaps in patient education, communication, and care coordination. This analytical framework can be applied across disease areas to support patient-centered care, measurement development, and evidence generation relevant to therapeutic and health-services research.

What is already known on this topic
- People living with systemic lupus erythematosus (SLE) experience substantial unmet needs related to diagnostic uncertainty, symptom burden, emotional distress, medication challenges, and healthcare system barriers.
- Traditional qualitative methods (e.g., interviews, focus groups, surveys) capture valuable patient perspectives but are limited by small sample sizes, recall bias, and restricted question frameworks.
- Social media listening has emerged as a promising way to collect real-time patient insights, and recent regulatory guidance acknowledges its value as patient experience data. However, systematic, scalable analysis of large patient-generated datasets has historically been constrained by analytic burden and variability.

What this study adds
- This study is among the first to apply state-of-the-art large language models (LLMs) to a large corpus of SLE-related social media posts, enabling scalable thematic analysis of thousands of patient narratives.
- It provides a validated methodological framework for using dual-LLM agreement, human-annotated references, and performance benchmarking (precision, recall, F1) to ensure reliability in automated thematic labeling.
- Findings reveal a multidimensional patient burden consistent with prior studies while uncovering persistent gaps in patient education, confusion around laboratory testing, care coordination challenges, and heavy reliance on peer communities for advice.
- The approach demonstrates that LLM-enabled social media listening can generate timely, granular, patient-prioritized insights at a scale unattainable by traditional methods.

How this study might affect research, practice, or policy
- Research: Establishes a reproducible, scalable framework for integrating LLM-based thematic analysis into patient-focused evidence generation, accelerating insight extraction from large unstructured datasets across disease areas.
- Clinical practice: Highlights actionable gaps in patient education, communication, and care coordination, informing interventions to improve clinical encounters, shared decision-making, and symptom management support.
- Policy and regulatory science: Demonstrates how social media-derived patient experience data, when paired with rigorous quality controls, can complement formal qualitative studies and support patient-focused drug development, measurement development, and health-services planning.
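The agreement and benchmarking metrics named in this abstract (Cohen's κ between two labelers, F1 against a human reference) can be illustrated with a minimal sketch. The label vectors below are hypothetical toy data for a single binary theme, not values from the study, and the function names are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_score(reference, predicted, positive=1):
    """F1 of predicted labels against a reference, for one positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for one theme (1 = theme present, 0 = absent).
human_reference = [1, 1, 0, 0, 1, 0, 1, 0]
model_labels = [1, 0, 0, 0, 1, 0, 1, 1]

print("kappa:", cohens_kappa(human_reference, model_labels))
print("F1:", f1_score(human_reference, model_labels))
```

In a multi-theme codebook, one would presumably compute these per theme and then average, but the abstract does not specify the aggregation used.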

2

Decomposing Heterogeneity in Disease Progression Speeds and Pathways

Yada, Y.; Naoki, H.; The Pooled Resource Open-Access ALS Clinical Trials Consortium

2026-02-01 health informatics 10.64898/2026.01.30.26345194 medRxiv

Top 0.1%

18.6%

Understanding why patients with the same diagnosis exhibit markedly different disease progression (some progressing rapidly, others slowly, and through distinct symptom patterns) remains a major challenge in medicine. Here, we developed a machine learning framework called DiSPAH (Disease-progression Speed and Pathway Analysis based on a Hidden Markov model) to estimate both the pathway and speed of disease progression in individual patients. DiSPAH models disease progression as transitions of latent states evolving over continuous time with a patient-specific progression speed. We applied DiSPAH to longitudinal clinical scores from an amyotrophic lateral sclerosis (ALS) cohort and successfully inferred each patient's hidden disease trajectory and progression speed. These individualized dynamics were significantly associated with baseline clinical features and enabled prediction of future disease course from data available at the first clinical visit. Our results highlight that jointly modeling progression pathway and speed improves prediction of heterogeneous disease courses, offering a powerful tool for personalized care and research in ALS and other chronic conditions. Graphical Abstract: Schematic illustration of the computational framework proposed in this paper.

3

Parsing Neurometabolic Signatures of Multiple Sclerosis with MRSI and cPCA

Raghu, N.; Abbasi, M.; Tashi, Z.; Zamora, C.; Key, S.; Chong, C. D.; Zhou, Y.; Niklova, S.; Ofori, E.; Bartelle, B. B.

2026-02-16 radiology and imaging 10.64898/2026.02.13.26346248 medRxiv

Top 0.1%

14.2%

Magnetic Resonance Spectroscopy Imaging (MRSI) offers spatially resolved neurometabolic information, acquired non-invasively at whole-brain scales from human subjects. Analysis of MRSI, however, is extremely challenging. The metabolic information is highly convolved and sparsely distributed across millions of spatial-spectral datapoints, allowing for little direct human interpretation. Conversely, the overall low signal-to-noise ratio and high-intensity artifacts can confound unsupervised machine learning approaches. These technical barriers have left much of the potential of MRSI unrealized. We acquired MRSI data from 4 human subjects with a diagnosis of multiple sclerosis (MS), incorporating experimental design into an informed machine learning approach. MRSI acquisitions were registered to anatomical MRI to label 105k spectra from brain tissue and 162 spectra from white matter hyperintensities (WMHs), an imaging biomarker associated with MS lesions. Spectral labels were then used in contrastive principal component analysis (cPCA) to separate lesion-salient features from artifacts and background features in the MRSI data; the resulting spectra were clustered into statistically significant states based on features that could be interpreted from the original data. Our approach renders MRSI data into testable representations of neurometabolism, enabling the method for fundamental and clinical research. Graphical Abstract: Analysis workflow for neurometabolic profiling of MS lesions. MRSI and anatomical MRI are acquired and processed in parallel for spectral data and anatomical labels. Spectra are then labeled and separated into experimental vs background data for contrastive PCA. Spectra are clustered for similarity, further labeled, and projected onto a brain atlas for a neurometabolic view. 

4

MESSI: Multimodal Experiments with SyStematic Interrogation using nextflow

Liang, C.; Grewal, T.; Singh, A.; Singh, A.

2026-03-11 bioinformatics 10.64898/2026.03.09.710452 medRxiv

Top 0.1%

12.4%

Background: Multimodal biomedical studies increasingly profile multiple molecular and clinical modalities from the same samples, creating new opportunities for disease prediction and biological discovery. However, benchmarking multimodal integration methods remains difficult because studies often use inconsistent preprocessing, unequal tuning strategies, and non-comparable evaluation schemes, limiting fair assessment across methods. Results: We developed MESSI (Multimodal Experiments with SyStematic Interrogation), a reproducible Nextflow-based benchmarking framework for multimodal outcome prediction that standardizes data preparation, supports interoperable R and Python workflows, and enforces leakage-free nested cross-validation for model selection and model assessment. MESSI currently implements representative intermediate- and late-integration methods and supports bulk multiomics, bulk multimodal, and single-cell multiomics datasets. In simulation studies with known ground truth, most methods were well calibrated in the absence of signal and achieved high performance under strong signal, whereas differences emerged under weaker signal and in feature recovery. We then applied MESSI to 19 real datasets spanning cancer, neurodevelopmental, neurodegenerative, infectious, renal, transplant, and metastatic disease settings, with diverse modality combinations including transcriptomic, epigenomic, proteomic, imaging, electrical, clinical, and single-cell-derived features. Across bulk multimodal datasets, classification differences were generally modest, although DIABLO and multiview cooperative learning tended to rank highest, while MOFA+glmnet and MOGONET were weaker overall. Biological enrichment analyses revealed clearer differences: DIABLO, RGCCA, MOFA, and IntegrAO more consistently recovered significant Reactome, oncogenic, and tissue-relevant gene signatures. 
In single-cell multiomics benchmarks, method rankings were more dataset dependent, but DIABLO performed consistently well across all case studies, while RGCCA also showed strong performance in specific settings. Computational analyses further showed that DIABLO and MOFA had the most favorable runtime and memory profiles, whereas multiview was the most time-intensive and IntegrAO the most memory-demanding. Conclusions: MESSI provides a reproducible, extensible, and equitable framework for benchmarking multimodal integration methods under a common model assessment strategy. Our results indicate that no single method is uniformly optimal across datasets and objectives; instead, method choice should balance predictive performance, biological interpretability, and computational efficiency. MESSI establishes a foundation for transparent benchmarking and future extensions to broader multimodal learning tasks.
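The idea of leakage-free nested cross-validation mentioned in this abstract can be sketched as follows. This is not MESSI's implementation (which the abstract describes as a Nextflow workflow); it is a generic, stdlib-only illustration in which the function names and the unshuffled fold assignment are assumptions made for brevity. The key property is that inner folds used for model selection are built only from the outer training indices.

```python
def k_fold_indices(indices, k):
    """Split a list of indices into k near-equal folds (no shuffling)."""
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only
    from outer_train, so the outer test fold never influences tuning.
    """
    outer_folds = k_fold_indices(list(range(n_samples)), outer_k)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [
            idx
            for f, fold in enumerate(outer_folds)
            if f != i
            for idx in fold
        ]
        inner_folds = k_fold_indices(outer_train, inner_k)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [
                idx
                for f, fold in enumerate(inner_folds)
                if f != m
                for idx in fold
            ]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


for outer_train, outer_test, inner_splits in nested_cv_splits(20):
    # Tune hyperparameters on inner_splits, refit on outer_train,
    # then score once on outer_test.
    print(len(outer_train), len(outer_test), len(inner_splits))
```

Any preprocessing that learns from data (scaling, feature selection) would also need to be fit inside each training split for the procedure to remain leakage-free.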

5

Informative Missingness in Nominal Data: A Graph-Theoretic Approach to Revealing Hidden Structure

Zangene, E.; Schwammle, V.; Jafari, M.

2026-02-03 bioinformatics 10.1101/2025.08.22.670516 medRxiv

Top 0.1%

10.4%

Missing data is often treated as a nuisance, routinely imputed or excluded from statistical analyses, especially in nominal datasets where its structure cannot be easily modeled. However, the form of missingness itself can reveal hidden relationships, substructures, and biological or operational constraints within a dataset. In this study, we present a graph-theoretic approach that reinterprets missing values not as gaps to be filled, but as informative signals. By representing nominal variables as nodes and encoding observed or missing associations as edges, we construct both weighted and unweighted bipartite graphs to analyze modularity, nestedness, and projection-based similarities. This framework enables downstream clustering and structural characterization of nominal data based on the topology of observed and missing associations; edge prediction via multiple imputation strategies is included as an optional downstream analysis to evaluate how well inferred values preserve the structure identified in the non-missing data. Across a series of biological, ecological, and social case studies, including proteomics data, the BeatAML drug screening dataset, ecological pollination networks, and HR analytics, we demonstrate that the structure of missing values can be highly informative. These configurations often reflect meaningful constraints and latent substructures, providing signals that help distinguish between data missing at random and not at random. When analyzed with appropriate graph-based tools, these patterns can be leveraged to improve the structural understanding of data and provide complementary signals for downstream tasks such as clustering and similarity analysis. Our findings support a conceptual shift: missing values are not merely analytical obstacles but valuable sources of insight that, when properly modeled, can enrich our understanding of complex nominal systems across domains. 
Shiny app address: https://ehsan-zangene.shinyapps.io/nimaa_app/

6

MedResearchBench: A Multi-Domain Benchmark for Evaluating AI Research Agents on Clinical Medical Research

Tan, S.; Tian, Z.

2026-03-31 health informatics 10.64898/2026.03.30.26349749 medRxiv

Top 0.1%

9.3%

The rapid advancement of AI research automation systems--including AI Scientist, data-to-paper, and Agent Laboratory--has demonstrated the potential for autonomous scientific discovery. However, existing benchmarks for evaluating these systems focus predominantly on fundamental sciences (machine learning, physics, chemistry), overlooking the unique challenges of medical clinical research: complex survey designs, inferential statistics with confounding control, adherence to reporting standards (STROBE, CONSORT), and the requirement for clinically actionable interpretation. We present MedResearchBench, the first benchmark specifically designed to evaluate AI systems on medical clinical research tasks. MedResearchBench comprises 16 tasks spanning 7 clinical domains (cardiovascular, oncology, mental health, metabolic, respiratory, neurology, infectious disease), built on publicly available datasets (the National Health and Nutrition Examination Survey [NHANES] and the Surveillance, Epidemiology, and End Results [SEER] program) with ground truth from 16 high-quality published papers (IF range: 2.3-51.0). Each task is evaluated along 6 medical-specific dimensions: statistical methodology, results accuracy, visualization quality, clinical interpretation, confounding sensitivity, and reporting compliance. We describe the benchmark design rationale, task construction methodology, paper selection criteria with anti-paper-mill filtering, and a detailed analysis of task characteristics including methodological diversity, evaluation dimension coverage, and difficulty stratification. To demonstrate benchmark executability, we evaluate an agentic data2paper pipeline on 3 pilot tasks spanning all three difficulty tiers, achieving scores of 72/100 (Tier 1, Cardio_000), 69/100 (Tier 2, Mental_000), and 75/100 (Tier 3, Metabolic_002), with a mean score of 72/100 (B-level). 
Survey-weighted methodology was correctly implemented across all tasks; primary limitations included covariate incompleteness and reference group misspecification. MedResearchBench addresses a critical gap in AI research evaluation and provides a standardized, community-extensible platform for assessing whether AI systems can conduct clinically sound, publication-quality medical research. All task materials are publicly available at https://github.com/TerryFYL/MedResearchBench.

7

Data-efficient Self-Supervised Diffusion Learning for Detecting Myofascial Pain in Upper Trapezius Muscle with B-mode Ultrasound Videos

Lu, H.-E.; Koivisto, D.; Lou, Y.; Zeng, Z.; Yu, T.; Wang, J.; Meng, X.; Nowikow, C.; Wilson, R.; Kumbhare, D.; Pu, J.

2026-04-08 radiology and imaging 10.64898/2026.04.07.26350333 medRxiv

Top 0.1%

8.5%

Deep learning has transformed medical image and video analysis, but it usually requires large, well-annotated datasets. In many clinical domains, especially when testing novel mechanistic hypotheses, such retrospective datasets are hard to obtain because acquiring adequate cohorts is time-intensive, costly, and operationally difficult. This creates a critical translational gap: scientifically compelling early-stage ideas may remain untested because the sample size is too small to support conventional deep learning pipelines. Developing data-efficient strategies for evaluating new hypotheses within small prospective cohorts is therefore essential to de-risk innovation before large-scale validation. Myofascial Pain Syndrome (MPS) exemplifies this challenge, as quantitative ultrasound imaging biomarkers for MPS remain underexplored. We investigated whether MPS in the upper trapezius can be detected from full B-mode ultrasound videos in a small prospective cohort (11 controls, 13 patients). Videos were automatically preprocessed and resampled using a sliding window strategy to expand training samples (404 clips). We developed a self-supervised Video Diffusion Encoder (VDE) to learn spatiotemporal representations without relying on extensive labeled data and compared it with transfer-learning-based ResNet, VideoMAE, and SimCLR. Using subject-level stratified four-fold cross-validation, the VDE outperformed transfer learning baselines and achieved performance comparable to SimCLR, with subject-level AUC of 0.79 and accuracy of 0.86, and no significant differences between latent-only and combined trigger point analyses. These results demonstrate that self-supervised diffusion learning can support robust, data-efficient deep learning in small prospective studies, enabling early feasibility testing of innovative ultrasound biomarkers before large-scale clinical trials.
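Two data-handling steps named in this abstract, sliding-window resampling of videos into clips and subject-level stratified fold assignment, can be sketched generically as below. The clip length, stride, subject identifiers, and function names are hypothetical; the abstract does not report the windowing parameters actually used.

```python
def sliding_window_clips(n_frames, clip_len, stride):
    """Return (start, end) frame ranges for overlapping clips, end exclusive."""
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    return [
        (start, start + clip_len)
        for start in range(0, n_frames - clip_len + 1, stride)
    ]


def subject_level_folds(subject_labels, k):
    """Assign whole subjects to k folds, balancing labels across folds.

    subject_labels maps subject id -> class label. Every clip from a
    subject inherits that subject's fold, so no subject is split between
    training and test data.
    """
    by_label = {}
    for subject, label in sorted(subject_labels.items()):
        by_label.setdefault(label, []).append(subject)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        # Round-robin within each class; the offset keeps fold sizes even.
        for i, subject in enumerate(by_label[label]):
            folds[(i + offset) % k].append(subject)
        offset += len(by_label[label])
    return folds


# Hypothetical cohort matching the reported sizes: 11 controls, 13 patients.
cohort = {f"c{i:02d}": 0 for i in range(11)}
cohort.update({f"p{i:02d}": 1 for i in range(13)})

print(sliding_window_clips(n_frames=10, clip_len=4, stride=2))
print([len(fold) for fold in subject_level_folds(cohort, k=4)])
```

Splitting at the subject level matters here because overlapping clips from one video are highly correlated; splitting at the clip level would let near-duplicate frames appear on both sides of a fold boundary.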

8

Bridging the gap between genome-wide association studies and network medicine with GNExT

Arend, L.; Woller, F.; Rehor, B.; Emmert, D.; Frasnelli, J.; Fuchsberger, C.; Blumenthal, D. B.; List, M.

2026-02-02 bioinformatics 10.64898/2026.01.30.702559 medRxiv

Top 0.1%

8.3%

Motivation: A growing volume of large-scale genome-wide association study (GWAS) datasets offers unprecedented power to uncover the genetic determinants of complex traits, but existing web-based platforms for GWAS data exploration provide limited support for interpreting these findings within broader biological systems. Systems medicine is particularly well-suited to fill this gap, as its network-oriented view of molecular interactions enables the integration of genetic signals into coherent network modules, thereby opening opportunities for disease mechanism mining and drug repurposing. Results: We introduce GNExT (GWAS Network Exploration Tool), a web-based platform that significantly extends the scope of exploration of variant-level effects and significance beyond those provided by existing solutions. By including MAGMA and Drugst.One, GNExT allows its users to study genetic variants in the context of the latest systems medicine approaches, extending to the identification of potential drug repurposing candidates. Moreover, GNExT advances platform implementation well beyond the current state of the art by offering a highly standardized Nextflow pipeline for data import and preprocessing, allowing researchers to deploy their study results on a sophisticated web interface with minimal implementation overhead. We demonstrate the utility of GNExT using a genome-wide association meta-analysis of human olfactory identification, in which the framework translated isolated GWAS signals to potential pharmacological targets in human olfaction. Furthermore, the deployment of a GNExT instance on European-ancestry Pan-UK Biobank data demonstrates the framework's scalability, resulting in a comprehensive large-scale resource encompassing thousands of traits and enabling new network medicine-based investigations. 
Availability and Implementation: The complete GNExT ecosystem, including the Nextflow preprocessing pipeline, the backend service, and frontend interface, is publicly available on GitHub (https://github.com/dyhealthnet/gnext_nf_pipeline, https://github.com/dyhealthnet/gnext_platform). The public instances of the GNExT platform on olfaction and Pan-UKBB are available under https://olfaction.gnext.gm.eurac.edu and https://panukbb-eur.gnext.gm.eurac.edu.

9

Tracing cell communication programs across conditions at single cell resolution with CCC-RISE

Ramirez, A.; Thomas, N.; Calabrese, D. R.; Greenland, J. R.; Meyer, A. S.

2026-04-15 systems biology 10.64898/2026.04.14.718551 medRxiv

Top 0.1%

8.2%

Cell-cell communication (CCC) mediates coordinated cellular activities that vary dynamically across time, location, and biological context. While various tools exist to infer CCC, they typically aggregate data according to pre-defined cell types, obscuring critical single-cell heterogeneity. Furthermore, because signaling pathways and cell populations operate in a coordinated manner, an integrative analytical approach is essential. To address these challenges, we developed CCC-RISE, an extension of the tensor-based method Reduction and Insight in Single-cell Exploration (RISE). CCC-RISE identifies integrative patterns of single-cell variation by deconvolving communication into interpretable modules defined by unique sender cells, receiver cells, ligands, and condition associations. We applied this framework to a COVID-19 cohort with varying disease severity and a lung transplant cohort with acute allograft dysfunction. In both contexts, CCC-RISE successfully identified disease-relevant communication programs and traced them to specific cellular subpopulations, often crossing conventional cell-type boundaries. This approach offers a robust pipeline enabling the identification of disease-relevant signaling subpopulations that are invisible to aggregate methods.

Highlights
- CCC-RISE enables integrative analysis of cell-cell communication across multiple conditions at single-cell resolution.
- CCC-RISE deconvolves signaling patterns into modules defined by their sender cells, receiver cells, LR pairs, and experimental conditions/samples.
- Analysis at single-cell resolution uncovers signaling activity within and across conventional cell types.

10

Visualizing and sonifying neurodata (ViSoND) for enhanced observation

Blankenship, L.; Sterrett, S. C.; Martins, D. M.; Findley, T. M.; Abe, E. T. T.; Parker, P. R. L.; Niell, C.; Smear, M. C.

2026-03-24 neuroscience 10.64898/2026.03.21.713430 medRxiv

Top 0.1%

8.2%

Neuroscience needs observation. Observation lets us evaluate data quality, judge whether models are biologically realistic, and generate new hypotheses. However, high-dimensional behavioral and neural data are too complex to be easily displayed and eye-tested. Computational methods can reduce the dimensionality of data and reveal statistically robust dynamical structure but often yield results that are difficult to relate back to the underlying biology. In addition, the choice of what parameters to quantify may not capture unexpectedly relevant aspects of the data. To supplement quantification with enhanced qualitative observation, we developed Visualization and Sonification of NeuroData (ViSoND), an open-source approach for displaying multiple data streams using video and sonification. Sonification is nothing new to neuroscience. Scientists have sonified their physiological preparations since Lord Adrian's earliest recordings. We extend this tradition by mapping multiple physiological data streams to musical notes using MIDI. Synchronizing MIDI to video provides an opportunity to watch an animal's movement while listening to physiological signals such as action potentials. Here we provide two demonstrations of this approach. First, we used ViSoND to interpret behavioral structure revealed by a computational model trained on the breathing rhythms of freely behaving mice. Second, ViSoND revealed patterns of neural activity in mouse visual cortex corresponding to eye blinks, events that were previously filtered out of analysis. These use cases show that ViSoND can supplement quantitative rigor with observational interpretability. Additionally, ViSoND provides an accessible way to display data, which may broaden the audience for communication of neuroscientific findings.

11

Federated Learning Performance Depends on Site Variation in Global HIV Data Consortia

Jackson, N. J.; Yan, C.; Caro-Vega, Y.; Paredes, F.; Ismerio Moreira, R.; Cadet, S.; Varela, D.; Cesar, C.; Duda, S. N.; Shepherd, B. E.; Malin, B. A.

2026-03-27 health informatics 10.64898/2026.03.25.26349286 medRxiv

Top 0.1%

8.0%

Digital health technologies, including machine learning (ML), are transforming infectious disease management; however, ML models for HIV care have been limited by data sharing restrictions that prevent multi-site collaboration. Federated Learning (FL) offers a privacy-preserving solution, enabling cross-site model training without sharing patient-level data. We evaluated FL for developing clinical prediction models using data from 22,234 people living with HIV (PLWH) across six sites in five countries within the Caribbean, Central, and South America network for HIV epidemiology (CCASAnet). Across four prediction tasks (1-year mortality, 3-year mortality, tuberculosis incidence, and AIDS-defining cancer incidence), FL algorithms achieved near-centralized performance while substantially outperforming site-specific models. Performance gains varied across sites, driven by both site size and between-site heterogeneity. Local fine-tuning often improved FL performance, though benefits were task dependent. These findings support FL as a scalable, privacy-preserving infrastructure for multi-site ML in international HIV research.
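To make the core aggregation step of federated learning concrete, here is a minimal sketch of size-weighted parameter averaging (the FedAvg scheme). The abstract does not state which FL algorithms were evaluated, so FedAvg is an assumption chosen for illustration, and the site parameters and sizes below are hypothetical.

```python
def federated_average(site_params, site_sizes):
    """Combine per-site parameter vectors into one global vector.

    Each site trains locally and shares only its parameter vector and
    sample count; patient-level records never leave the site. Sites are
    weighted by their number of samples.
    """
    if not site_params or len(site_params) != len(site_sizes):
        raise ValueError("need one parameter vector and one size per site")
    n_params = len(site_params[0])
    if any(len(p) != n_params for p in site_params):
        raise ValueError("all sites must share the same parameter layout")
    total = sum(site_sizes)
    return [
        sum(params[i] * size for params, size in zip(site_params, site_sizes))
        / total
        for i in range(n_params)
    ]


# Hypothetical two-parameter models from two sites of different size.
global_params = federated_average(
    site_params=[[1.0, 0.0], [3.0, 2.0]],
    site_sizes=[100, 300],
)
print(global_params)
```

Because larger sites dominate the weighted average, a small or atypical site may be served less well by the global model, which is one plausible reading of why site size and between-site heterogeneity drive the per-site gains the abstract reports.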

12

Interpretability as stability under perturbation reveals systematic inconsistencies in feature attribution

Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.

2026-04-22 health informatics 10.64898/2026.04.20.26351354 medRxiv

Top 0.1%

7.0%

Interpreting machine learning models typically relies on feature attribution methods that quantify the contribution of individual variables to model predictions. However, it remains unclear whether attribution magnitude reflects the true functional importance of features for model performance. Here, we present a unified interpretability framework integrating permutation-based attribution, feature ablation, and stability under perturbation across multiple feature spaces. Using nested cross-validation and permutation-based null diagnostics, we systematically evaluate the relationship between attribution magnitude and functional dependence in clinical and biomarker-based prediction models. Attribution magnitude is frequently misaligned with functional importance, with weak to strong negative correlations observed across feature spaces (Spearman ρ ranging from -0.374 to -0.917). Features with high attribution often have limited impact on model performance when removed, whereas features with low attribution can be essential for maintaining predictive accuracy. These discrepancies define distinct classes of interpretability failure, including attribution excess and latent dependence. Interpretability further depends on feature space composition, and stable, functionally relevant features are not necessarily those with the highest attribution scores. By integrating attribution, functional impact, and stability into a composite Feature Reliability Score, we identify features that remain informative across perturbations and analytical contexts. These findings indicate that interpretability does not arise from attribution magnitude alone but is better characterized by stability under perturbation. This framework provides a basis for more robust model interpretation and highlights limitations of attribution-centric approaches in high-dimensional and correlated data settings.
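The comparison at the heart of this abstract, rank correlation between a feature's attribution score and the performance drop when that feature is ablated, can be sketched with a stdlib-only Spearman ρ. The attribution and ablation values below are hypothetical and are chosen only to show how a strongly negative ρ arises; they are not data from the study.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-feature scores: attribution magnitude, and the drop in
# model performance observed when that feature is removed (ablation).
attribution = [0.40, 0.25, 0.20, 0.10, 0.05]
ablation_drop = [0.01, 0.02, 0.03, 0.08, 0.06]

print("Spearman rho:", spearman_rho(attribution, ablation_drop))
```

A negative value means the features ranked highest by attribution tend to matter least when removed, which is the misalignment pattern the abstract describes.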

13

The results of Transcriptome-wide Mendelian Randomization (TWMR) in large-scale populations can directly validate, across scales, the results of causal inference from deep learning combined with double machine learning on single-cell transcriptomes of human samples.

Ye, W.; Jiang, X.; Shen, F.

2026-03-19 rheumatology 10.64898/2026.03.16.26348532 medRxiv

Top 0.1%

6.9%

Objective: To address core problems prevalent in biomedical research, including the "translational distance", the difficulty of aligning cross-scale studies, and the lack of direct validation of single-cell systems biology models in human samples, this study aims to verify whether the results of transcriptome-wide Mendelian randomization (TWMR) based on large-scale populations are consistent with the causal inference results of deep learning combined with double machine learning (DML) using single-cell transcriptome data from human samples, to clarify whether statistical biology and systems biology can converge to the same biological truth, and to provide methodological support for mechanism dissection and precision medicine research of complex diseases such as rheumatoid arthritis (RA). Methods: This study integrated multi-omics data to conduct a two-stage causal inference and cross-scale validation analysis. In the first stage, a two-sample Mendelian randomization approach was adopted, based on the summary statistics of an RA genome-wide association study (GWAS) from 456,348 individuals of European ancestry in the UK Biobank (UKB) and cis-expression quantitative trait locus (cis-eQTL) data from 31,684 individuals in the eQTLGen Consortium. Transcriptome-wide causal effect analysis was performed using the inverse-variance weighted (IVW) method, MR Egger regression, and the weighted median method, and gene-level causal effect values were obtained after strict quality control and multiple testing correction. 
In the second stage, single-cell RNA sequencing (scRNA-seq) data from RA patients and healthy controls were used (RA group: 11 samples, 211,867 cells; healthy control group: 38 samples, 456,631 cells). After preprocessing via the Seurat pipeline, batch effect correction, and cell type annotation, a hierarchical deep neural network was constructed to compress the high-dimensional expression data into features, and the DML framework was used to estimate the causal effects of genes on RA disease status. Finally, Pearson correlation analysis was performed to conduct cell type-specific cross-scale validation of gene-level causal effect values obtained by the two methods, and the validated model was used to quantify the causal effects of 16 RA-related pathways from the Reactome database. Results: This study confirmed that the gene causal effect values obtained from large-scale population TWMR analysis were significantly correlated with those calculated by the deep learning combined with DML model based on single-cell transcriptome data. The correlation was highly significant in core naive B cells (r=0.202, p=3.2e-05, n=414) and nominally significant in core naive CD4 T cells (r=0.102, p=0.037, n=412). The validated DML model successfully quantified the cell type-specific causal effect values of 16 RA-related signaling pathways. Conclusion: Statistical biology and systems biology can converge to the same biological truth. The cross-scale consistency between the two can significantly shorten the "translational distance" in biomedical research and enables direct validation of the single-cell systems biology causal model of human samples against large-scale population genetic data, reducing the excessive dependence on animal and cell experimental models in traditional research. 
This research paradigm not only provides a new path for mechanism dissection and therapeutic target screening of complex diseases such as RA, but also provides a feasible solution for rare disease research to break through the limitation of GWAS sample size, and lays an important theoretical and methodological foundation for constructing standardized systems biology models of human complex diseases and promoting the development of precision medicine.
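
As an illustration of the IVW step named in the Methods, the following is a minimal sketch of a fixed-effect inverse-variance weighted estimate for a single gene. It is not the authors' pipeline: the function name `ivw_estimate`, its arguments, and the example numbers are hypothetical, and the sketch omits MR-Egger, the weighted median, quality control, and multiple testing correction.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of a causal effect from per-variant summary statistics.

    beta_exposure: variant effects on gene expression (e.g. from cis-eQTL data)
    beta_outcome:  variant effects on the outcome (e.g. from a GWAS)
    se_outcome:    standard errors of the outcome effects
    All inputs are hypothetical illustrations, not values from the study.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    # Each variant is weighted by the inverse variance of its outcome effect.
    weights = [1.0 / (se ** 2) for se in se_outcome]
    numerator = sum(w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    if denominator == 0:
        raise ValueError("no informative variants (all exposure effects are zero)")
    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error


if __name__ == "__main__":
    # Made-up summary statistics for three instruments of one gene.
    est, se = ivw_estimate([0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01, 0.02, 0.01])
    print(f"IVW estimate = {est:.3f}, SE = {se:.4f}")
```

The estimate is equivalent to a weighted regression of outcome effects on exposure effects through the origin, which is why variants with larger exposure effects and smaller outcome standard errors dominate the result.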

14

FluNexus: a versatile web platform for antigenic prediction and visualization of influenza A viruses

Li, X.; Zhou, C.; Wu, H.; Xiao, K.; Hao, J.; Zhao, D.; Zhu, J.; Li, Y.; Peng, J.; Gu, J.; Deng, G.; Cai, W.; Li, M.; Liu, Y.; Shang, X.; Chen, H.; Kong, H.

2026-01-30 bioinformatics 10.64898/2026.01.29.702696 medRxiv

Top 0.1%

6.9%

Influenza A viruses continuously undergo antigenic evolution to escape host immunity induced by previous infections or vaccinations, consequently causing seasonal epidemics and occasional pandemics. Antigenic prediction and visualization of influenza A viruses are crucial for precise vaccine strain selection and robust pandemic preparedness. However, a user-friendly online platform for these capabilities remains notably absent, despite widespread demand. Here, we present FluNexus (https://flunexus.com), the first-of-its-kind, one-stop-shop web platform designed to facilitate the prediction and visualization of antigenic change in emerging variants. FluNexus features a data preprocessing module for hemagglutinin subunit 1 (HA1) and hemagglutination inhibition (HI) data across three major public health threat subtypes (H1, H3 and H5). FluNexus also provides an interactive interface for online antigenic prediction and offers practical guidance for researchers. Most notably, FluNexus offers visualization of influenza A virus antigenic evolution, providing intuitive insights into its antigenic dynamics. Specifically, FluNexus proposes a novel manifold-based method for positioning antigens and antisera, generating accurate antigenic cartographies even with sparse HI data. By alleviating the programming burden on biologists, FluNexus supports more informed decision-making in vaccine strain selection and strengthens surveillance and pandemic preparedness. Highlights: (1) FluNexus features a data preprocessing module for HA1 and HI data spanning the H1, H3, and H5 subtypes. (2) FluNexus facilitates online antigenic prediction utilizing ten state-of-the-art antigenic prediction tools, and offers practical guidance based on a comparative evaluation of their performance. (3) FluNexus provides a visualization module for mapping antigenic evolution of influenza A viruses, incorporating a novel manifold-based method for antigenic cartography.

15

Community needs for FAIR pathogen data

van Geest, G.; Thomas-Lopez, D.; Feitzinger, A. A.; Weissgold, L. A.; Halabi, S.; Cuesta, I.; Hjerde, E.; Gurwitz, K. T.; Arora, N.; Neves, A.; Palagi, P. M.; Williams, J. J.

2026-04-15 scientific communication and education 10.64898/2026.04.14.718420 medRxiv

Top 0.1%

6.4%

Background: Datasets related to infectious diseases are essential for public health decision-making, yet their reuse remains limited by persistent barriers to data sharing and integration. Achieving data that are Findable, Accessible, Interoperable, and Reusable (FAIR) is widely recognized as essential for accelerating scientific discovery and enabling coordinated responses to emerging threats, but the needs of the global pathogen data community have not been systematically characterized. Aim: This study, conducted by the Pathogen Data Network (PDN), aims to identify infrastructural and educational priorities among stakeholders working with infectious disease-related data in order to guide community-responsive support for data sharing and interoperability. Methods: A cross-sectional stakeholder survey was disseminated to a well-defined expert population within PDN networks and via open professional channels. A total of 136 responses from researchers, healthcare professionals, bioinformaticians, and educators were analyzed descriptively to identify prioritized barriers, training needs, and preferred support mechanisms. Results: Respondents consistently identified structural constraints as the primary impediments to effective data use, including limited funding (74%), data-aggregation challenges (68%), and a shortage of skilled personnel (52%). Respondents identified bioinformatics for infectious disease research (68%) as the highest priority for training, followed by guidance on using the integrated pathogen data and tools portal provided by the PDN, the Pathogens Portal (51%). The Pathogens Portal was also ranked as the most essential PDN resource (72%). Preferred training formats included virtual short courses (68%) and webinars (66%). Notably, while researchers emphasized technical subjects like machine learning, educators prioritized foundational case studies. 
Conclusion: These findings provide an evidence-based diagnostic of community needs and suggest that barriers to FAIR pathogen data are predominantly systemic rather than purely technological. The survey framework and openly available dataset offer a reusable template for assessing needs in other communities and regions. By aligning training, infrastructure development, and outreach with empirically identified priorities, organizations supporting infectious disease research can strengthen the interoperability and reuse of data and establish a benchmark for future community-driven improvements.

16

OpenScientist: evaluating an open agentic AI co-scientist to accelerate biomedical discovery

Roberts, K. F.; Abrams, Z. B.; Cappelletti, L.; Moqri, M.; Heugel, N.; Caufield, J. H.; Bourdenx, M.; Li, Y.; Banerjee, J.; Foschini, L.; Galeano, D.; Harris, N. L.; Li, M.; Ying, K.; Melendez, J. A.; Barthelemy, N. R.; Bollinger, J. G.; He, Y.; Ovod, V.; Benzinger, T. L. S.; Flores, S.; Gordon, B.; Ojewole, A. A.; Phatak, M.; Elbert, D. L.; Biber, S.; Landsness, E. C.; Mungall, C. J.; Bateman, R. J.; Reese, J.

2026-03-18 health informatics 10.64898/2026.03.15.26348338 medRxiv

Top 0.1%

6.3%

Background: Advances in medicine depend on analyzing large and complex data sources, but discovery is partly constrained by the limited time and domain expertise of human researchers. Agentic artificial intelligence (agentic AI) can accelerate discovery by automating components of the scientific workflow, including information retrieval, data analysis, and knowledge synthesis. Aim: OpenScientist, an open-source agentic AI co-scientist, aims to accelerate biomedical discovery by semi-autonomously investigating scientist-defined queries and generating clinically relevant, verifiable scientific insights. Methods: Domain experts evaluated OpenScientist for novel discoveries in four clinical case studies: (1) a prespecified analysis in a community-based Alzheimer's disease biomarker cohort, (2) unsupervised modeling for plasma proteomic survival prediction, (3) hypothesis investigation in single-cell transcriptomic data from neurons with neurofibrillary tangles, and (4) hypothesis generation with validation in a multiple myeloma dataset with a randomized negative control. Results: OpenScientist completed analyses in minutes that otherwise would take weeks to months of human time and expertise. It identified %ptau217 as the best predictor of amyloid PET status, generated a plasma proteomic survival model with performance comparable to published models, proposed a mechanism linking tau pathology to altered lysosomal acidification, and generated multiple myeloma hypotheses that were validated in an external cohort while distinguishing true signal from randomized controls. Conclusion: OpenScientist demonstrates that open, auditable, agentic AI can support real-world clinical research by generating hypotheses, executing analyses, and discovering insights from complex datasets.

17

Handling onset age inconsistencies in longitudinal healthcare survey data

Li, W.; Yuan, M.; Park, Y.; Dao Duc, K.

2026-02-23 health informatics 10.64898/2026.02.20.26346741 medRxiv

Top 0.1%

6.3%

Abstract: Longitudinal healthcare surveys frequently contain inconsistencies in self-reported onset ages, where participants report different ages for the same condition between enrollment and follow-up surveys. We propose two methods to handle this challenge. First, we introduce a procedure that aggregates inconsistency patterns to construct participant-level reliability scores, enabling researchers to stratify participants and prioritize analysis on high-reliability cohorts. Second, we present a Bayesian adjustment method that models enrollment and follow-up reports as noisy observations of a latent true onset age, producing adjusted estimates for the inconsistent observations that account for age-dependent and inter-survey-time effects. We evaluate both methods using data from the Canadian Partnership for Tomorrow's Health. In general, both methods substantially strengthen correlations between biologically related conditions and improve predictive performance across classification and regression tasks. In addition, high-reliability cohorts from reliability score-based stratification reveal more coherent and interpretable disease clustering networks, and Bayesian adjustment shows particularly notable gains when multiple inconsistent variables are adjusted simultaneously. Finally, we provide guidance on choosing between these methods for healthcare practitioners. Institutional Review Board (IRB): The study is approved by the University of British Columbia IRB (IRB #H23-03800). Data and Code Availability: CanPath data are available to researchers through a controlled access process via the CanPath Access Portal (https://portal.canpath.ca). The code is available at https://anonymous.4open.science/r/canpath-FCCF.
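
The idea of treating two reports as noisy observations of one latent onset age can be illustrated with a conjugate normal model. This is only a sketch under strong assumptions, not the authors' method: it assumes Gaussian noise with known standard deviations and a Gaussian prior, and it ignores the age-dependent and inter-survey-time effects the abstract mentions. The function name and all numbers are hypothetical.

```python
def posterior_onset_age(enroll_report, followup_report,
                        sd_enroll, sd_followup,
                        prior_mean, prior_sd):
    """Posterior mean and SD of a latent onset age under a normal-normal model.

    Both reports are treated as independent noisy observations of the same
    latent age. All noise levels and the prior are assumed known (hypothetical).
    """
    if min(sd_enroll, sd_followup, prior_sd) <= 0:
        raise ValueError("standard deviations must be positive")
    # Precision (inverse variance) of the prior and of each report.
    precisions = [1.0 / prior_sd ** 2, 1.0 / sd_enroll ** 2, 1.0 / sd_followup ** 2]
    values = [prior_mean, enroll_report, followup_report]
    total_precision = sum(precisions)
    # The posterior mean is the precision-weighted average of prior and reports.
    post_mean = sum(p * v for p, v in zip(precisions, values)) / total_precision
    post_sd = total_precision ** -0.5
    return post_mean, post_sd


if __name__ == "__main__":
    # A participant reports onset at 30 at enrollment and 34 at follow-up.
    mean, sd = posterior_onset_age(30, 34, sd_enroll=2.0, sd_followup=2.0,
                                   prior_mean=40.0, prior_sd=10.0)
    print(f"adjusted onset age = {mean:.2f} (posterior SD {sd:.2f})")
```

Giving the follow-up report a larger standard deviation than the enrollment report is one simple way to express the assumption that recall degrades with time between surveys.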

18

Hierarchical Multi-Omics Trajectory Prediction for Fecal Microbiota Transplantation: A Novel Machine Learning Framework for Small-Sample Longitudinal Multi-Omics Integration

Zhou, Y.-H.; Sun, G.

2026-02-23 bioinformatics 10.64898/2026.02.21.707174 medRxiv

Top 0.1%

6.3%

Fecal microbiota transplantation (FMT) has emerged as a highly effective treatment for recurrent Clostridioides difficile infection and is being actively investigated for numerous other conditions. While multi-omics studies have revealed dynamic changes in microbial communities and host metabolism following FMT, existing approaches are primarily descriptive and lack the ability to predict individual patient trajectories or identify early biomarkers of treatment response. Small-sample, multi-omics, longitudinal prediction problems present unique computational challenges: high dimensionality, multi-omics integration, temporal dynamics, and interpretability. Here, we present Hierarchical Multi-Omics Trajectory Prediction (HMOTP), a novel machine learning framework specifically designed for small-sample, multi-omics, longitudinal prediction that addresses these challenges through hierarchical feature construction using domain knowledge, multi-level attention mechanisms, and patient-specific trajectory prediction. HMOTP integrates multi-omics data at multiple biological levels (raw features, aggregated classes/categories, and cross-level interactions) while preserving biological interpretability. The framework employs multi-head attention to learn feature importance at different hierarchy levels and integrates information across omics layers. Patient-specific trajectory prediction enables personalized predictions despite limited sample sizes through transfer learning. We evaluated HMOTP on a cohort of 15 patients with recurrent Clostridioides difficile infection who underwent fecal microbiota transplantation, with comprehensive lipidomics (397 features) and metagenomics (10,634 pathways) profiling at four timepoints spanning six months. Using leave-one-patient-out cross-validation, HMOTP achieved 96.67% ± 10.54% accuracy, outperforming baseline methods including Random Forest (91.33% ± 21.33%) and Logistic Regression (86.33% ± 24.67%). 
The framework demonstrated robust generalization across timepoints. Through hierarchical interpretability, HMOTP identified key biomarkers and revealed mechanistically informative cross-omics associations, including 324 strong correlations (|r| > 0.7) involving top-predictive biomarkers, demonstrating its utility for both prediction and biological discovery in FMT applications. HMOTP provides a generalizable framework applicable to other small-sample multi-omics problems, offering a powerful tool for personalized medicine applications. Biographical Note: Prof. Zhou is an interdisciplinary statistician and machine learning expert whose work develops innovative computational methods for multi-omics integration, biomedical prediction, and precision medicine applications. Key Points: Our novel framework, HMOTP, addresses this challenge through three key innovations: (1) Hierarchical feature construction using domain knowledge, which reduces dimensionality while preserving biological interpretability, unlike PCA-based methods. (2) Multi-level attention mechanisms, which learn feature importance at multiple biological scales (individual features → classes → cross-omics interactions). (3) Patient-specific trajectory prediction with transfer learning, which enables personalized predictions despite limited sample sizes (parameter-sharing within the cohort, not external pre-training).
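
The leave-one-patient-out evaluation mentioned in the abstract can be sketched as a grouped split in which all timepoints from one patient are held out together. This is a generic illustration, not the authors' code: the function name `leave_one_patient_out` and the example patient labels are hypothetical.

```python
from collections import defaultdict


def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_indices, test_indices) for each patient.

    patient_ids has one entry per sample (e.g. one per patient-timepoint).
    Every sample from the held-out patient goes to the test fold, so no
    patient contributes data to both training and testing in the same fold.
    """
    samples_by_patient = defaultdict(list)
    for index, patient in enumerate(patient_ids):
        samples_by_patient[patient].append(index)

    for held_out, test_indices in samples_by_patient.items():
        train_indices = [
            index
            for patient, indices in samples_by_patient.items()
            if patient != held_out
            for index in indices
        ]
        yield held_out, train_indices, test_indices


if __name__ == "__main__":
    # Hypothetical layout: two timepoints for p1 and p2, one for p3.
    ids = ["p1", "p1", "p2", "p2", "p3"]
    for patient, train, test in leave_one_patient_out(ids):
        print(patient, "train:", train, "test:", test)
```

Splitting by patient rather than by sample matters for longitudinal data, because repeated measurements from the same person are correlated and would otherwise leak information from the training fold into the test fold.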

19

MetaOmixTools: an interactive web suite for meta-analysis of ranked features and functional enrichment

Grillo-Risco, R.; Kupchyk Tiurin, M.; Perpina-Clerigues, C.; Cordero Felipe, F. J.; Lozano, S.; de la Iglesia, M.; Garcia-Garcia, F.

2026-02-25 bioinformatics 10.64898/2026.02.24.707748 medRxiv

Top 0.1%

6.3%

The growing number of omics datasets in public repositories provides an opportunity to enhance data reusability through data integration; however, complex statistical barriers often hinder the effective combination of independent studies. To address this problem, we present MetaOmixTools, an interactive web-based suite that streamlines the meta-analysis of ranked feature lists and functional enrichment profiles. The platform integrates two primary modules, MetaRank and MetaEnrich, within a code-free environment. MetaRank generates robust consensus rankings from multiple lists by implementing weighted (e.g., Rank Product) and unweighted (e.g., Robust Rank Aggregation) strategies, while MetaEnrich performs functional meta-analyses by combining probability values from individual over-representation analyses using established statistical techniques. Using case studies, we established consensus rankings for acute spinal cord injury across heterogeneous platforms, identifying conserved inflammatory marker genes in the upregulated gene list (e.g., Slpi, Ccl2, Msr1) and synaptic loss genes in the downregulated gene list (e.g., Kcna2, Dao, Ppp1r1b), and also characterized inverse functional intersections between melanoma brain metastasis and neurodegenerative diseases. By providing intuitive, real-time visualization and reproducible workflows, MetaOmixTools empowers the research community to extract consistent biological insights from multi-study data. We have made MetaOmixTools freely available at https://bioinfo.cipf.es/metaomixtools/. [Graphical Abstract: Figure 1]
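
One established way to combine p-values across independent studies is Fisher's method. The abstract does not say which techniques MetaEnrich implements, so the choice of Fisher's method here is an assumption made for illustration, and the function name `fisher_combined_pvalue` is hypothetical. The sketch uses the closed-form chi-square survival function for even degrees of freedom, so it needs only the standard library.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    upper tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_statistic = -sum(math.log(p) for p in p_values)  # equals X / 2

    # Accumulate the series sum_{i=0}^{k-1} (X/2)^i / i! term by term.
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_statistic / i
        series += term
    return math.exp(-half_statistic) * series


if __name__ == "__main__":
    # Hypothetical enrichment p-values for one pathway in three studies.
    print(fisher_combined_pvalue([0.04, 0.10, 0.03]))
```

Fisher's method assumes the studies are independent; when the same samples or overlapping gene sets feed several analyses, the combined p-value will be too optimistic.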

20

Unsupervised Machine Learning for Adaptive Immune Receptors with immuneML

Pavlovic, M.; Wurtzen, C.; Kanduri, C.; Mamica, M.; Scheffer, L.; Lund-Andersen, C.; Gubatan, J. M.; Ullmann, T.; Greiff, V.; Sandve, G. K.

2026-04-18 bioinformatics 10.64898/2026.04.15.718648 medRxiv

Top 0.1%

6.2%

Machine learning (ML) enables adaptive immune receptor repertoire (AIRR) analyses for biomarker identification and therapeutic development. With the majority of AIRR data partially or imperfectly labeled, unsupervised ML is essential for motif discovery, biologically meaningful clustering, and generation of novel receptor sequences. However, no unified framework for unsupervised ML exists in the AIRR field, hindering the assessment of model robustness and generalizability. Here, we present an immuneML release advancing unsupervised ML in the AIRR field through unified clustering workflows, interpretable generative modeling, integration with protein language model embeddings, dimensionality reduction, and visualization. We demonstrate immuneML's utility in three use cases: (i) benchmarking generative models for epitope-specific sequence generation, assessing specificity and novelty, (ii) systematic evaluation of clustering approaches on experimental receptor sequences against biological properties, such as epitope specificity and MHC, and (iii) unsupervised analysis of an experimental AIRR dataset to examine potential confounding, a practice widespread in related fields but unexplored in AIRR analyses.