Patterns — Latest Matching Preprints

1

Harnessing AI and social media to understand real-world patient experiences in systemic lupus erythematosus

Yang, S.; Hawryluk, C.; Liu, J.; Eckert, N.; Otoo, J.; Vina, E. R.; Yao, L.

2026-02-22 rheumatology 10.64898/2026.02.20.26346724 medRxiv

Top 0.1%

23.3%

Objective: To apply large language models (LLMs) to Reddit posts referencing systemic lupus erythematosus (SLE) to identify patient-expressed unmet medical needs, symptom experiences, and healthcare challenges, demonstrating how AI-enabled social media listening complements traditional patient-experience research. Methods: We extracted 4,633 posts from ten SLE-related or health-focused Reddit communities using the public Reddit API (October-November 2025). After removing duplicates, promotional content, and posts with insufficient information, 2,603 posts remained. A thematic codebook was developed through manual review of 300 posts and iteratively refined. Two LLMs (Gemini 3.0 and GPT-5.2) were evaluated for automated thematic labeling using percent agreement, Cohen's kappa (κ), and a human-annotated reference set (n=100). The best-performing model was used to quantify theme prevalence, followed by qualitative review of representative narratives. Results: GPT-5.2 demonstrated higher performance (F1=0.844) than Gemini 3.0 (F1=0.811), with substantial inter-model agreement across main themes (mean κ=0.71). Posts reflected multidimensional experiences. The most frequent subtheme was Advice Seeking (84.1%), followed by Emotional Coping (55.6%). Common symptom-related themes included Pain (37.2%), Other Symptom Presentations (37.6%), Fatigue (24.7%), and Acute or Worsening Flares (30.2%). Diagnostic uncertainty was prominent, including confusion about laboratory results (24.0%) and emotional impact of uncertainty (33.0%). Qualitative review highlighted emotional distress, reliance on peer communities for interpretation of symptoms and labs, and difficulty managing complex treatment regimens. Conclusion: LLM-enabled social media listening offers a scalable method for synthesizing large volumes of unstructured patient narratives, providing timely insights into lived experiences and unmet needs among individuals discussing lupus online. Findings align with established qualitative literature while highlighting persistent gaps in patient education, communication, and care coordination. This analytical framework can be applied across disease areas to support patient-centered care, measurement development, and evidence generation relevant to therapeutic and health-services research.
What is already known on this topic: People living with systemic lupus erythematosus (SLE) experience substantial unmet needs related to diagnostic uncertainty, symptom burden, emotional distress, medication challenges, and healthcare system barriers. Traditional qualitative methods (e.g., interviews, focus groups, surveys) capture valuable patient perspectives but are limited by small sample sizes, recall bias, and restricted question frameworks. Social media listening has emerged as a promising way to collect real-time patient insights, and recent regulatory guidance acknowledges its value as patient experience data. However, systematic, scalable analysis of large patient-generated datasets has historically been constrained by analytic burden and variability.
What this study adds: This study is among the first to apply state-of-the-art large language models (LLMs) to a large corpus of SLE-related social media posts, enabling scalable thematic analysis of thousands of patient narratives. It provides a validated methodological framework for using dual-LLM agreement, human-annotated references, and performance benchmarking (precision, recall, F1) to ensure reliability in automated thematic labeling. Findings reveal a multidimensional patient burden consistent with prior studies while uncovering persistent gaps in patient education, confusion around laboratory testing, care coordination challenges, and heavy reliance on peer communities for advice. The approach demonstrates that LLM-enabled social media listening can generate timely, granular, patient-prioritized insights at a scale unattainable by traditional methods.
How this study might affect research, practice, or policy: Research: Establishes a reproducible, scalable framework for integrating LLM-based thematic analysis into patient-focused evidence generation, accelerating insight extraction from large unstructured datasets across disease areas. Clinical practice: Highlights actionable gaps in patient education, communication, and care coordination, informing interventions to improve clinical encounters, shared decision-making, and symptom management support. Policy and regulatory science: Demonstrates how social media-derived patient experience data, when paired with rigorous quality controls, can complement formal qualitative studies and support patient-focused drug development, measurement development, and health-services planning.
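
The agreement and benchmarking metrics named in this abstract (percent agreement, Cohen's kappa, F1) can be illustrated with a minimal sketch. The label vectors below are made-up toy data for a single binary theme, and the function names are hypothetical; nothing here reproduces the authors' pipeline or results.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled independently
    # according to their own marginal label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def f1_score(reference, predicted, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = theme present, 0 = theme absent.
model_a = [1, 1, 0, 0, 1, 0, 1, 0]
model_b = [1, 0, 0, 0, 1, 0, 1, 1]

print("percent agreement:", percent_agreement(model_a, model_b))
print("Cohen's kappa:", cohens_kappa(model_a, model_b))
print("F1 of model_b against model_a as reference:", f1_score(model_a, model_b))
```

Kappa is reported alongside raw agreement because two labellers who both mark a very common theme (such as one present in most posts) would agree often by chance alone.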

2

New Genetic Insights in Rheumatoid Arthritis using Taxonomy3(R), a Novel method for Analysing Human Genetic Data

Kozlowska, J.; Humphryes-Kirilov, N.; Pavlovets, A.; Connolly, M.; Kuncheva, Z.; Horner, J.; Sousa Manso, A.; Murray, C.; Fox, J. C.; McCarthy, A.

2023-02-24 rheumatology 10.1101/2023.02.21.23286176 medRxiv

Top 0.1%

19.6%

Genetic support for a drug target has been shown to increase the probability of success in drug development, with the potential to reduce attrition in the pharmaceutical industry alongside discovering novel therapeutic targets. It is therefore important to maximise the detection of genetic associations that affect disease susceptibility. Conventional statistical methods used to analyse genome-wide association studies (GWAS) only identify some of the genetic contribution to disease, so novel analytical approaches are required to extract additional insights. C4X Discovery has developed a new method Taxonomy3(R) for analysing genetic datasets based on novel mathematics. When applied to a previously published rheumatoid arthritis GWAS dataset, Taxonomy3(R) identified many additional novel genetic signals associated with this autoimmune disease. Follow-up studies using tool compounds support the utility of the method in identifying novel biology and tractable drug targets with genetic support for further investigation.

3

Decomposing Heterogeneity in Disease Progression Speeds and Pathways

Yada, Y.; Naoki, H.; The Pooled Resource Open-Access ALS Clinical Trials Consortium

2026-02-01 health informatics 10.64898/2026.01.30.26345194 medRxiv

Top 0.1%

18.6%

Understanding why patients with the same diagnosis exhibit markedly different disease progression--some progressing rapidly, others slowly, and through distinct symptom patterns--remains a major challenge in medicine. Here, we developed a machine learning framework called DiSPAH (Disease-progression Speed and Pathway Analysis based on a Hidden Markov model) to estimate both the pathway and speed of disease progression in individual patients. DiSPAH models disease progression as transitions of latent states evolving over continuous time with a patient-specific progression speed. We applied DiSPAH to longitudinal clinical scores from an amyotrophic lateral sclerosis (ALS) cohort and successfully inferred each patient's hidden disease trajectory and progression speed. These individualized dynamics were significantly associated with baseline clinical features and enabled prediction of future disease course from data available at the first clinical visit. Our results highlight that jointly modeling progression pathway and speed improves prediction of heterogeneous disease courses, offering a powerful tool for personalized care and research in ALS and other chronic conditions. Graphical Abstract: Schematic illustration of the computational framework proposed in this paper.
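
The idea of latent states evolving in continuous time with a patient-specific speed can be sketched generically as a time rescaling: a patient with speed s at clock time t sits where a baseline patient would sit at time s·t. The sketch below assumes a strictly forward-moving chain with exponential holding times; the rates, function names, and this one-pathway structure are illustrative assumptions, not DiSPAH's actual model.

```python
import bisect
import random


def intrinsic_jump_times(baseline_rates, rng):
    """Draw the times at which a speed-1 patient leaves states 0, 1, 2, ...

    baseline_rates[k] is the (assumed) exit rate of state k; holding times
    are exponential, as in a continuous-time Markov chain.
    """
    elapsed = 0.0
    jump_times = []
    for rate in baseline_rates:
        elapsed += rng.expovariate(rate)
        jump_times.append(elapsed)
    return jump_times


def states_at_visits(jump_times, speed, visit_times):
    """Latent state at each clinic visit for a patient with the given speed.

    Speed rescales the clock: the state at visit time t equals the number
    of intrinsic jumps that occurred by intrinsic time speed * t.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [bisect.bisect_right(jump_times, speed * t) for t in visit_times]


rng = random.Random(1)
jumps = intrinsic_jump_times([0.5, 0.3, 0.2], rng)  # hypothetical rates
visits = [1.0, 2.0, 3.0]
print("slow patient:", states_at_visits(jumps, 0.5, visits))
print("fast patient:", states_at_visits(jumps, 2.0, visits))
```

Separating the shared pathway (the jump sequence) from the scalar speed is what lets two patients follow the same route through the states at different paces.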

4

SuReCAN: a suite of user-friendly Galaxy machine learning workflows to predict survival and treatment response of cancer patients

Ju, J.; Koppes, D.; Stubbs, A. P.; Li, Y.

2025-08-13 health informatics 10.1101/2025.08.12.25333156 medRxiv

Top 0.1%

18.5%

Cancer is one of the leading causes of death worldwide, with an enormous impact on healthcare, the economy, and society. One of the main challenges of clinical treatment planning is that patients usually have diverse clinical outcomes given the same diagnosis and treatments. To enable personalized cancer therapeutic planning, (bio)medical data analyses using machine learning (ML) models have been introduced to efficiently extract informative biological patterns from the massive volume of complex biological data, aiding the stratification of cancer patients. For biomedical researchers without a computational biology background, the gap between clinical practice and computational approaches is prominent and hinders the use of machine learning in medical research. To fill this gap, we created a collection of ML workflows on the Galaxy platform named SUrvival and REsponse prediction for CANcer patients (SuReCAN) for clinicians and biologists to build and deploy predictive ML classifiers. Being freely available and accessible, SuReCAN automates the data analysis process and enables clinicians and researchers to perform a broad range of predictive tasks. It contains a toolkit of three ML modules with various existing and newly implemented methods on Galaxy: a data normalization module, a feature selection module, and an ML classifier module. We demonstrated the utility of SuReCAN with a few real-world datasets to identify survival-correlated subtypes of pancreatic ductal adenocarcinoma (PDAC) patients and to predict drug response outcomes based on various omics data from patient tumor samples. All workflows achieved a median accuracy of over 0.8 in PDAC survival-correlated subtype classification. In particular, the workflow combining the feature selection method "SVM-based RFECV" and the Support Vector Machine classifier consistently outperformed the other workflows, while each classifier showed its strengths on different omics data. 
Importantly, SuReCAN is not only applicable for the clinical prediction tasks shown in the test cases but also suitable for new classifier development and deployment with clinical observations provided by the users. Providing a collection of user-friendly ML workflows, SuReCAN stratifies patients based on their biomedical profiling in a data-driven way and assists biomedical researchers with clinical decision-making and scientific discoveries.

5

Machine learning models predict long COVID outcomes based on baseline clinical and immunologic factors

Jayavelu, N. D.; Samaha, H.; Wimalasena, S. T.; Hoch, A.; Gygi, J. P.; Gabernet, G.; Ozonoff, A.; Liu, S.; Milliren, C. E.; Levy, O.; Baden, L. R.; Melamed, E.; Ehrlich, L. I. R.; McComsey, G. A.; Sekaly, R. P.; Cairns, C. B.; Haddad, E. K.; Schaenman, J.; Shaw, A. C.; Hafler, D. A.; Montgomery, R. R.; Corry, D. B.; Kheradmand, F.; Atkinson, M. A.; Brakenridge, S. C.; Higuita, N. I. A.; Metcalf, J. P.; Hough, C. L.; Messer, W. B.; Pulendran, B.; Nadeau, K. C.; Davis, M. M.; Geng, L. N.; Sesma, A. F.; Simon, V.; Krammer, F.; Kraft, M.; Bime, C.; Calfee, C. S.; Erle, D. J.; Langelier, C. R.; IMP

2025-02-13 health informatics 10.1101/2025.02.12.25322164 medRxiv

Top 0.1%

18.3%

The post-acute sequelae of SARS-CoV-2 (PASC), also known as long COVID, remain a significant health issue that is incompletely understood. Predicting which acutely infected individuals will go on to develop long COVID is challenging due to the lack of established biomarkers, clear disease mechanisms, or well-defined sub-phenotypes. Machine learning (ML) models offer the potential to address this by leveraging clinical data to enhance diagnostic precision. We utilized clinical data, including antibody titers and viral load measurements collected at the time of hospital admission, to predict the likelihood of acute COVID-19 progressing to long COVID. Our machine learning models achieved median AUROC values ranging from 0.64 to 0.66 and AUPRC values between 0.51 and 0.54, demonstrating their predictive capabilities. Feature importance analysis revealed that low antibody titers and high viral loads at hospital admission were the strongest predictors of long COVID outcomes. Comorbidities, including chronic respiratory, cardiac, and neurologic diseases, as well as female sex, were also identified as significant risk factors for long COVID. Our findings suggest that ML models have the potential to identify patients at risk for developing long COVID based on baseline clinical characteristics. These models can help guide early interventions, improving patient outcomes and mitigating the long-term public health impacts of SARS-CoV-2.

6

An Interpretable Deep Learning Framework for Biomarker Discovery in Complex Disease Survival Outcomes

Wan, S.; Mi, X.; Zou, F.; Zou, B.

2025-10-01 bioinformatics 10.1101/2025.09.30.679415 medRxiv

Top 0.1%

17.2%

Identification of important biomarkers associated with complex disease survival outcomes is fundamental for gaining an in-depth understanding of disease mechanisms and advancing precision medicine in conditions such as cancer and cardiovascular disorders. However, these tasks are complicated by the unique nature of time-to-event data, which captures both the occurrence and timing of clinical events. Notably, complex associations such as the non-linear and non-additive biomarker interactions and the high-dimensionality challenge conventional survival data modeling approaches. To address these difficulties, we propose SurvDNN, an enhanced deep neural network framework specifically designed for survival outcomes modeling. SurvDNN incorporates a bootstrapping-based regularization strategy to mitigate overfitting and a novel stability-driven filtering algorithm to improve model robustness. To enable interpretable biomarker discovery, we extend the Permutation-based Feature Importance Test (PermFIT) to survival settings, allowing rigorous quantification of individual biomarker contributions under complex biomarker-outcome associations. Through extensive simulations and applications to real-world datasets, SurvDNN consistently outperforms existing machine learning approaches in both biomarker identification and predictive accuracy. Our results demonstrate the potential of SurvDNN coupled with PermFIT as an interpretable, robust, and powerful tool for biomarker-driven survival modeling in complex diseases. An open-source R package implementing SurvDNN is publicly available on GitHub (https://github.com/BZou-lab/SurvDNN).
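
To make the permutation-based importance idea concrete, here is a minimal sketch of plain permutation importance scored with Harrell's concordance index on right-censored data. It is not PermFIT and not the SurvDNN package: the risk model, the toy data, and all function names are hypothetical stand-ins, and no significance test is included.

```python
import random


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before subject j's recorded time.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def permutation_importance(model, rows, times, events, n_repeats=20, seed=0):
    """Mean drop in C-index after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = concordance_index(times, events, [model(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            score = concordance_index(times, events, [model(r) for r in permuted])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Toy data (hypothetical): column 0 drives risk, column 1 is unrelated.
rows = [[float(i), float((i * 3) % 8)] for i in range(8)]
times = [10.0 - i for i in range(8)]      # higher biomarker, earlier event
events = [1, 1, 1, 0, 1, 1, 0, 1]         # 0 marks a censored subject


def toy_model(row):
    """Stand-in risk model that uses only the first feature."""
    return row[0]


baseline, importances = permutation_importance(toy_model, rows, times, events)
print("baseline C-index:", baseline)
print("importance per feature:", importances)
```

A feature the model ignores shows an importance of exactly zero here, while shuffling the informative column breaks the ranking and lowers the C-index.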

7

TLCellClassifier: Machine Learning Based Cell Classification for Bright-Field Time-Lapse Images

Jiang, Q.; Sudalagunta, P. R.; Meads, M.; Zhao, X.; Achille, A.; Noyes, D.; Silva, M.; Canevarolo, R. R.; Shain, K.; Silva, A.; Zhang, W.

2024-06-14 cancer biology 10.1101/2024.06.11.598552 medRxiv

Top 0.1%

14.8%

Immunotherapies have shown promising results in treating patients with hematological malignancies like multiple myeloma, which is an incurable but treatable bone marrow-resident plasma cell cancer. Choosing the most efficacious treatment for a patient remains a challenge in such cancers. However, pre-clinical assays involving patient-derived tumor cells co-cultured in an ex vivo reconstruction of the immune-tumor micro-environment have gained considerable attention over the past decade. Such assays can characterize a patient's response to several therapeutic agents including immunotherapies in a high-throughput manner, where bright-field images of tumor (target) cells interacting with effector cells (T cells, Natural Killer (NK) cells, and macrophages) are captured once every 30 minutes for up to six days. Cell detection, tracking, and classification of thousands of cells of two or more types in each frame is bound to test the limits of some of the most advanced computer vision tools developed to date and requires a specialized approach. We propose TLCellClassifier (time-lapse cell classifier) for live cell detection, cell tracking, and cell type classification, with enhanced accuracy and efficiency obtained by integrating convolutional neural networks (CNN), metric learning, and long short-term memory (LSTM) networks, respectively. State-of-the-art computer vision software like KTH-SE and YOLOv8 are compared with TLCellClassifier, which shows improved accuracy in detection (CNN) and tracking (metric learning). A two-stage LSTM-based cell type classification method is implemented to distinguish between multiple myeloma (tumor/target) cells and macrophages/monocytes (immune/effector cells). Validation of cell type classification was done using both synthetic datasets and ex vivo experiments involving patient-derived tumor/immune cells. Availability and implementation: https://github.com/QibingJiang/cell classification ml

8

Exploratory electronic health record analysis with ehrapy

Heumos, L.; Ehmele, P.; Treis, T.; Upmeier zu Belzen, J.; Namsaraeva, A.; Horlava, N.; Shitov, V. A.; Zhang, X.; Zappia, L.; Knoll, R.; Lang, N. J.; Hetzel, L.; Virshup, I.; Sikkema, L.; Roellin, E.; Curion, F.; Eils, R.; Schiller, H. B.; Hilgendorff, A.; Theis, F.

2023-12-11 health informatics 10.1101/2023.12.11.23299816 medRxiv

Top 0.1%

14.4%

With progressive digitalization of healthcare systems worldwide, large-scale collection of electronic health records (EHRs) has become commonplace. However, an extensible framework for comprehensive exploratory analysis that accounts for data heterogeneity is missing. Here, we introduce ehrapy, a modular open-source Python framework designed for exploratory end-to-end analysis of heterogeneous epidemiology and electronic health record data. Ehrapy incorporates a series of analytical steps, from data extraction and quality control to the generation of low-dimensional representations. Complemented by rich statistical modules, ehrapy facilitates associating patients with disease states, differential comparison between patient clusters, survival analysis, trajectory inference, causal inference, and more. Leveraging ontologies, ehrapy further enables data sharing and training EHR deep learning models, paving the way for foundational models in biomedical research. We demonstrated ehrapy's features in five distinct examples: We first applied ehrapy to stratify patients affected by unspecified pneumonia into finer-grained phenotypes. Furthermore, we revealed biomarkers for significant differences in survival among these groups. Additionally, we quantify medication-class effects of pneumonia medications on length of stay. We further leveraged ehrapy to analyze cardiovascular risks across different data modalities. Finally, we reconstructed disease state trajectories in SARS-CoV-2 patients based on imaging data. Ehrapy thus provides a framework that we envision will standardize analysis pipelines on EHR data and serve as a cornerstone for the community.

9

Navigating the Privacy-Accuracy Tradeoff: Federated Survival Analysis with Binning and Differential Privacy

Gouthamchand, V.; van Soest, J.; Arcuri, G.; Dekker, A.; Damiani, A.; Wee, L.

2024-10-09 health informatics 10.1101/2024.10.09.24315159 medRxiv

Top 0.1%

14.4%

Federated learning (FL) offers a decentralized approach to model training, allowing for data-driven insights while safeguarding patient privacy across institutions. In the Personal Health Train (PHT) paradigm, local model gradients from each institution, aggregated over its own patients, are transmitted to a central server to be globally merged, rather than the patient data itself. However, certain attacks on a PHT infrastructure may risk compromising sensitive data. This study delves into the privacy-accuracy tradeoff in federated Cox Proportional Hazards (CoxPH) models for survival analysis by assessing two Privacy-Enhancing Techniques (PETs) added on top of the PHT approach. In one, we implemented a Discretized Cox model by grouping event times into finite bins to hide individual time-to-event data points. In another, we explored Local Differential Privacy by introducing noise to local model gradients. Our results demonstrate that both strategies can effectively mitigate privacy risks without significantly compromising numerical accuracy, reflected in only small variations of hazard ratios and cumulative baseline hazard curves. Our findings highlight the potential for enhancing privacy-preserving survival analysis within a PHT implementation and suggest practical solutions for multi-institutional research while mitigating the risk of re-identification attacks.
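
The two privacy techniques described here can be sketched in a few lines. The abstract does not specify the bin width, the noise distribution, or any clipping step, so everything below is an assumption made for illustration: event times are rounded up to a bin edge, and a local gradient is clipped in L1 norm and perturbed with Laplace noise (a standard mechanism, swapped in here rather than taken from the paper).

```python
import math
import random


def bin_event_times(times, bin_width):
    """Replace each event time with the upper edge of its bin."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    return [math.ceil(t / bin_width) * bin_width for t in times]


def clip_l1(gradient, max_l1_norm):
    """Scale the gradient down so its L1 norm is at most max_l1_norm."""
    norm = sum(abs(g) for g in gradient)
    if norm <= max_l1_norm:
        return list(gradient)
    factor = max_l1_norm / norm
    return [g * factor for g in gradient]


def privatize_gradient(gradient, epsilon, max_l1_norm, rng):
    """Clip, then add Laplace noise before the gradient leaves the site.

    After clipping, any two possible gradients differ by at most
    2 * max_l1_norm in L1 norm, so Laplace noise with scale
    2 * max_l1_norm / epsilon per coordinate gives epsilon-local DP.
    """
    clipped = clip_l1(gradient, max_l1_norm)
    scale = 2.0 * max_l1_norm / epsilon
    # The difference of two exponential draws is Laplace-distributed.
    return [
        g + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for g in clipped
    ]


rng = random.Random(42)
print(bin_event_times([1.2, 4.9, 5.0, 7.3, 12.0], bin_width=5.0))
print(privatize_gradient([0.8, -0.3, 0.1], epsilon=1.0, max_l1_norm=1.0, rng=rng))
```

Both knobs expose the tradeoff in the title: wider bins and smaller epsilon hide more about individual patients but move the transmitted quantities further from their exact values.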

10

Occupation Recognition and Exploitation in Rheumatology Clinical Notes: Employing Deep Learning Models for Named Entity Recognition and Knowledge Discovery in Electronic Health Records

Madrid Garcia, A.; Perez-Sancristobal, I.; Leon Mateos, L.; Abasolo Alcazar, L.; Fernandez Gutierrez, B.; Rodriguez Rodriguez, L.

2024-05-08 rheumatology 10.1101/2024.05.08.24306389 medRxiv

Top 0.1%

14.3%

Occupation is considered a Social Determinant of Health (SDOH) and its effects have been studied at multiple levels. Although the inclusion of such data in the Electronic Health Record (EHR) is vital for the provision of clinical care, especially in rheumatology where work disability prevention is essential, occupation information is often either not routinely documented or captured in an unstructured manner within conventional EHR systems. Encouraged by recent advances in natural language processing and deep learning models, we propose the use of novel architectures (i.e., transformers) to detect occupation mentions in rheumatology clinical notes of a tertiary hospital, and to identify to whom those occupations belong. We also aimed to evaluate the clinical and demographic characteristics that influence the collection of this SDOH, and the association between occupation and patients' diagnoses. Bivariate and multivariate logistic regression analyses were conducted for this purpose. A Spanish pre-trained language model, RoBERTa, fine-tuned with biomedical texts was used to detect occupations. The best model achieved an F1-score of 0.725 identifying occupation mentions. Moreover, highly disabling mechanical pathology diagnoses (i.e., back pain, muscle disorders) were associated with a higher probability of occupation collection. Ultimately, we determined the professions most closely associated with more than ten categories of musculoskeletal disorders. Highlights: Deep learning models hold significant potential for structuring and leveraging information in rheumatology. Diagnoses related to highly disabling mechanical pathology were associated with a higher probability of occupation collection. Occupations such as cleaners, helpers, and social workers are linked to mechanical pathologies such as back pain.

11

Parsing Neurometabolic Signatures of Multiple Sclerosis with MRSI and cPCA

Raghu, N.; Abbasi, M.; Tashi, Z.; Zamora, C.; Key, S.; Chong, C. D.; Zhou, Y.; Niklova, S.; Ofori, E.; Bartelle, B. B.

2026-02-16 radiology and imaging 10.64898/2026.02.13.26346248 medRxiv

Top 0.1%

14.2%

Magnetic Resonance Spectroscopy Imaging (MRSI) offers spatially-resolved, neurometabolic information, acquired non-invasively at whole-brain scales from human subjects. Analysis of MRSI, however, is extremely challenging. The metabolic information is highly convolved, and sparsely distributed across millions of spatial-spectral datapoints, allowing for little direct human interpretation. Conversely, the overall low signal-to-noise with high-intensity artifacts can confound unsupervised machine learning approaches. These technical barriers have left much of the potential of MRSI unrealized. We acquired MRSI data from 4 human subjects with a diagnosis of multiple sclerosis (MS), incorporating experimental design into an informed machine learning approach. MRSI acquisitions were registered to anatomical MRI to label 105k spectra from brain tissue and 162 spectra from white matter hyperintensities (WMHs), an imaging biomarker associated with MS lesions. Spectral labels were then used in contrastive principal component analysis (cPCA) to separate lesion-salient features from artifacts and background features in the MRSI data, and the spectra were clustered into statistically significant states based on features that could be interpreted from the original data. Our approach renders MRSI data into testable representations of neurometabolism, enabling the method for fundamental and clinical research. Graphical Abstract: Analysis workflow for neurometabolic profiling of MS lesions. MRSI and anatomical MRI are acquired and processed in parallel for spectral data and anatomical labels. Spectra are then labeled and separated into experimental vs background data for contrastive PCA. Spectra are clustered for similarity, further labeled, and projected onto a brain atlas for a neurometabolic view.

12

g-CXR-Net: A Graphic Application for the Rapid Recognition of SARS-CoV-2 from Chest X-Rays

Gatti, D. L.

2021-06-09 radiology and imaging 10.1101/2021.06.06.21258428 medRxiv

Top 0.1%

12.9%

g-CXR-Net is a graphic application for the rapid recognition of SARS-CoV-2 from Antero/Posterior chest X-rays. It employs the Artificial Intelligence engine of CXR-Net (arXiv:2103.00087) to generate masks of the lungs overlapping the heart and large vasa, probabilities for Covid vs. non-Covid assignment, and high resolution heat maps that identify the SARS associated lung regions.

13

Correcting under-reported COVID-19 case numbers

Lachmann, A.

2020-03-18 health informatics 10.1101/2020.03.14.20036178 medRxiv

Top 0.1%

12.5%

The COVID-19 virus has spread worldwide in a matter of a few months, while healthcare systems struggle to monitor and report current cases. Testing results have struggled with the relative capabilities, testing policies and preparedness of each affected country, making their comparison a non-trivial task. Since severe cases, which more likely lead to fatal outcomes, are detected at a higher rate than mild cases, the reported virus mortality is likely inflated in most countries. Lockdowns and changes in human behavior modulate the underlying growth rate of the virus. Under-sampling of infection cases may lead to the under-estimation of total cases, resulting in systematic mortality estimation biases. For healthcare systems worldwide it is important to know the expected number of cases that will need treatment. In this manuscript, we identify a generalizable growth rate decay reflecting behavioral change. We propose a method to correct the reported COVID-19 cases and death numbers by using a benchmark country (South Korea) with near-optimal testing coverage, with considerations on population demographics. We extrapolate expected deaths and hospitalizations with respect to observations in countries that passed the exponential growth curve. By applying our correction, we predict that the number of cases is highly under-reported in most countries and a significant burden on worldwide hospital capacity. The full analysis workflow and data is available at: https://github.com/lachmann12/covid19

14

Comparative Analysis of Association Networks Using Single-Cell RNA Sequencing Data Reveals Perturbation-Relevant Gene Signatures

Nouri, N.; Gaglia, G.; Mattoo, H.; de Rinaldis, E.; Savova, V.

2023-09-12 bioinformatics 10.1101/2023.09.11.556872 medRxiv

Top 0.1%

12.4%

Single-cell RNA sequencing (scRNA-seq) data has elevated our understanding of systemic perturbations to organismal physiology at the individual cell level. However, despite the rich information content of scRNA-seq data, the relevance of genes to a perturbation is still commonly assessed through differential expression analysis. This approach provides a one-dimensional perspective of the transcriptomic landscape, risking the oversight of tightly controlled genes characterized by modest changes in expression but with profound downstream effects. We present GENIX (Gene Expression Network Importance eXamination), a novel platform for constructing gene association networks, equipped with an innovative network-based comparative model to uncover condition-relevant genes. To demonstrate the effectiveness of GENIX, we analyze influenza vaccine-induced immune responses in peripheral blood mononuclear cells (PBMCs) collected from recovered COVID-19 patients, shedding light on the mechanistic underpinnings of gender differences. Our methodology offers a promising avenue to identify genes relevant to perturbation responses in biological systems, expanding the scope of response signature discovery beyond differential gene expression analysis. Highlights: Conventional methods used to identify perturbation-relevant genes in scRNA-seq data rely on differential expression analysis, susceptible to overlooking essential genes. GENIX leverages cell-type-specific inferred gene association networks to identify condition-relevant genes and gene programs, irrespective of their specific expression alterations. GENIX provides insight into the gene-regulatory response to the influenza vaccine in naive and recovered COVID-19 patients, expanding on previously observed gender-specific differences.

15

MESSI: Multimodal Experiments with SyStematic Interrogation using nextflow

Liang, C.; Grewal, T.; Singh, A.; Singh, A.

2026-03-11 bioinformatics 10.64898/2026.03.09.710452 medRxiv

Top 0.1%

12.4%

BackgroundMultimodal biomedical studies increasingly profile multiple molecular and clinical modalities from the same samples, creating new opportunities for disease prediction and biological discovery. However, benchmarking multimodal integration methods remains difficult because studies often use inconsistent preprocessing, unequal tuning strategies, and non-comparable evaluation schemes, limiting fair assessment across methods. ResultsWe developed MESSI (Multimodal Experiments with SyStematic Interrogation), a reproducible Nextflow-based benchmarking framework for multimodal outcome prediction that standardizes data preparation, supports interoperable R and Python workflows, and enforces leakage-free nested cross-validation for model selection and model assessment. MESSI currently implements representative intermediate- and late-integration methods and supports bulk multiomics, bulk multimodal, and single-cell multiomics datasets. In simulation studies with known ground truth, most methods were well calibrated in the absence of signal and achieved high performance under strong signal, whereas differences emerged under weaker signal and in feature recovery. We then applied MESSI to 19 real datasets spanning cancer, neurodevelopmental, neurodegenerative, infectious, renal, transplant, and metastatic disease settings, with diverse modality combinations including transcriptomic, epigenomic, proteomic, imaging, electrical, clinical, and single-cell-derived features. Across bulk multimodal datasets, classification differences were generally modest, although DIABLO and multiview cooperative learning tended to rank highest, while MOFA+glmnet and MOGONET were weaker overall. Biological enrichment analyses revealed clearer differences: DIABLO, RGCCA, MOFA, and IntegrAO more consistently recovered significant Reactome, oncogenic, and tissue-relevant gene signatures. 
In single-cell multiomics benchmarks, method rankings were more dataset dependent, but DIABLO performed consistently well across all case studies, while RGCCA also showed strong performance in specific settings. Computational analyses further showed that DIABLO and MOFA had the most favorable runtime and memory profiles, whereas multiview was the most time-intensive and IntegrAO the most memory-demanding. Conclusions: MESSI provides a reproducible, extensible, and equitable framework for benchmarking multimodal integration methods under a common model assessment strategy. Our results indicate that no single method is uniformly optimal across datasets and objectives; instead, method choice should balance predictive performance, biological interpretability, and computational efficiency. MESSI establishes a foundation for transparent benchmarking and future extensions to broader multimodal learning tasks.
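
The leakage-free nested cross-validation mentioned in the abstract above can be illustrated with a small index-splitting sketch. This is not MESSI's code: the function names `k_folds` and `nested_cv_splits`, the fold counts, and the plain random (unstratified) splitting are all assumptions made for illustration. The point it shows is that the inner folds used for tuning are drawn only from the outer training set, so the outer test fold never influences model selection.

```python
import random


def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (inner_train, inner_val) pairs built only
    from outer_train (hypothetical helper, not part of any library).
    """
    rng = random.Random(seed)
    outer_folds = k_folds(range(n_samples), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        # Everything outside the current outer test fold is available
        # for training and hyperparameter tuning.
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        inner_folds = k_folds(outer_train, inner_k, rng)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for j, fold in enumerate(inner_folds)
                           if j != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits
```

In use, hyperparameters would be chosen by averaging a score over `inner_splits`, the model refit on `outer_train` with the chosen setting, and performance reported only on `outer_test`.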

16

CovRadar: Continuously tracking and filtering SARS-CoV-2 mutations for molecular surveillance

Wittig, A.; Miranda, F.; Tang, M.; Hölzer, M.; Renard, B. Y.; Fuchs, S.

2021-02-03 bioinformatics 10.1101/2021.02.03.429146 medRxiv

Top 0.1%

12.4%

The SARS-CoV-2 pandemic underlined the importance of molecular surveillance to track the evolution of the virus and inform public health interventions. Fast analysis, easy visualization and convenient filtering of the latest virus sequences are essential for this purpose. However, limited access to computational resources, a lack of bioinformatics expertise, and the sheer volume of sequences in public databases complicate surveillance efforts. CovRadar combines an analytical pipeline and a web application designed for the molecular surveillance of the spike gene of SARS-CoV-2, an important vaccine target. The intuitive web front-end focuses on mutations rather than viral lineages and provides easy access to frequencies and spatio-temporal distributions from global sample collections. The data is regularly updated based on a scalable and reproducible analytical back-end. With this platform, we aim to give users, with or without bioinformatics skills or sufficient computational resources, the possibility to track and explore mutational changes in the SARS-CoV-2 spike gene and to filter, download, and further analyze data that meet their questions and needs. Advanced computational users can apply the analytical pipeline and data visualization methods locally on their own data. CovRadar is freely accessible at https://covradar.net; source code is available at https://gitlab.com/dacs-hpi/covradar.

17

Infection Inspection: Using the power of citizen science to help with image-based prediction of antibiotic resistance in Escherichia coli

Farrar, A.; Feehily, C.; Turner, P.; Zagajewski, A.; Chatzimichail, S.; Crook, D.; Andersson, M.; Oakley, S.; Barrett, L.; El Sayyed, H.; Fowler, P. W.; Nellaker, C.; Kapanidis, A. N.; Stoesser, N.

2023-12-11 infectious diseases 10.1101/2023.12.11.23299807 medRxiv

Top 0.1%

12.3%

Antibiotic resistance is an urgent global health challenge, necessitating rapid diagnostic tools to combat its escalating threat. This study introduces innovative approaches for expedited bacterial antimicrobial resistance profiling, addressing the critical need for swift clinical responses. Between February and April 2023, we conducted the Infection Inspection project, a citizen science initiative in which the public could participate in advancing an antimicrobial susceptibility testing method based on single-cell images of cellular phenotypes in response to ciprofloxacin exposure. A total of 5,273 users participated, classifying 1,045,199 images. Notably, aggregated user accuracy in image classification reached 66.8%, lower than our deep learning model's performance at 75.3%, but accuracy increased for both users and the model when ciprofloxacin treatment was greater than a strain's own minimum inhibitory concentration. We used the users' classifications to elucidate which visual features influence classification decisions, most importantly the degree of DNA compaction and heterogeneity. We paired our classification data with an image feature analysis which showed that most of the incorrect classifications were due to cellular features that varied from the expected response. This understanding informs ongoing efforts to enhance the robustness of our deep learning-based bacterial classifier and diagnostic methodology. Our successful engagement with the public through citizen science is another demonstration of the potential for collaborative efforts in scientific research, specifically increasing public awareness and advocacy on the pressing issue of antibiotic resistance, and empowering individuals to actively contribute to the development of novel diagnostics. Lay summary: Antibiotic resistance is a big health problem worldwide. We need fast ways to find out if bacteria are resistant to antibiotics. In our study, we develop new methods to do this quickly.
We ran an online project called Infection Inspection from February to April 2023, in which 5,273 people took part. Together, they classified more than a million pictures of bacterial cells, helping our project use these pictures to detect antibiotic resistance. The volunteers performed well, getting nearly 67% of the answers right. We also learned which pictures helped or confused them. This will help us make our computer program better. This project didn't just help science; it also taught people about antibiotic resistance. Partnerships between the public and scientists can make a difference to developing technologies that protect our health.
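
The abstract above reports an aggregated user accuracy but does not say how individual classifications were combined. As a minimal sketch, assuming a simple per-image majority vote (an assumption, not the authors' stated method), aggregation and scoring could look like the following. The function names, the `(image_id, label)` input format, and the labels "resistant" and "sensitive" are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_votes(votes):
    """Map each image_id to the most common label among its votes.

    votes: iterable of (image_id, label) pairs. Ties resolve to the
    label that was seen first for that image.
    """
    by_image = defaultdict(list)
    for image_id, label in votes:
        by_image[image_id].append(label)
    return {img: Counter(labels).most_common(1)[0][0]
            for img, labels in by_image.items()}


def accuracy(consensus, truth):
    """Fraction of consensus labels that match the reference labels."""
    correct = sum(1 for img, label in consensus.items()
                  if truth[img] == label)
    return correct / len(consensus)


# Hypothetical toy data: three images, seven user classifications.
votes = [
    ("a", "resistant"), ("a", "resistant"), ("a", "sensitive"),
    ("b", "sensitive"), ("b", "resistant"), ("b", "sensitive"),
    ("c", "resistant"),
]
truth = {"a": "resistant", "b": "sensitive", "c": "sensitive"}
consensus = aggregate_votes(votes)
print(accuracy(consensus, truth))
```

Majority voting is only one possible aggregation rule; weighting users by their track record is a common alternative.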

18

AutoMRAI: A Multi-Omics Causal Inference Platform Using Structural Equation Modelling

Ali, F.

2025-11-19 health informatics 10.1101/2025.11.17.25340455 medRxiv

Top 0.1%

12.3%

The integration of causal inference, artificial intelligence (AI), and multi-omics data represents a transformative frontier for unravelling the complex mechanisms underlying health and disease. Traditional observational epidemiology is limited by confounding and reverse causation, while statistical frameworks such as Mendelian randomization (MR) and structural equation modelling (SEM) enable more robust causal inference. Recent advances in AI and machine learning have further expanded these capabilities, providing scalable solutions for biomarker discovery, therapeutic target identification, and precision medicine. Here, we introduce AutoMRAI, a unified platform that integrates SEM with multi-omics data analysis in an AI-augmented environment. AutoMRAI enables causal modelling across genomic, epigenomic, transcriptomic, proteomic, metabolomic, microbiome, and clinical layers, supported by directed acyclic graph (DAG) representation, robust statistical modelling, and interactive visualization. We provide a proof-of-concept demonstration using simulated datasets, highlighting the platform's ability to recover known causal pathways across omics layers. AutoMRAI addresses key challenges of reproducibility, scalability, and interpretability in causal inference and establishes a foundation for future integration of MR pipelines, Bayesian networks, and advanced AI workflows. The platform offers researchers a scalable and accessible resource for multi-omics causal discovery.

19

EVA: a Foundation Model Advancing Translational Drug Development in Immuno-Inflammation

Fouche, A.; Bruley, A.; Corney, M.; Marschall, P.; Bouget, V.; Duquesne, J.

2025-05-07 bioinformatics 10.1101/2025.05.02.651839 medRxiv

Top 0.1%

12.2%

Drug development is a lengthy and high-risk process, with most investigational drug candidates failing in phase II randomized clinical trials (RCT) due to insufficient efficacy. This makes early prediction of trial outcomes crucial for reducing attrition and guiding strategic decisions, especially in immunology and inflammation (I&I) diseases. Herein, we present EVA, the first pre-trained foundation model in complex inflammatory diseases tailored to support drug development. EVA learns generalizable patterns from large-scale data of cell biology and immunology, enabling superior predictive performance and generalization compared to traditional approaches. EVA is pre-trained on tens of millions of single-cell RNA-seq samples and tens of thousands of bulk RNA-seq samples from I&I disease patients, enabling it to learn disease-relevant transcriptomic patterns in this therapeutic area. By fine-tuning EVA in few-shot settings on both preclinical (mouse) and clinical (human) data and harnessing its wide pre-training knowledge, EVA predicts drug responses in I&I with high precision at both cohort and patient levels, as illustrated by accurate forecasting of anti-TNF therapeutic activity in ulcerative colitis. By deciphering its decision process, we further highlight that EVA's ability to stratify patients based on predicted drug response can also be leveraged to discover drug response biomarkers as early as preclinical stages. EVA's applications in precision immunology encompass therapeutic target validation prior to clinical entry, identification of patient subpopulations most likely to benefit from treatment, and comparative efficacy analysis against competitor compounds.
EVA's versatility makes it an invaluable tool for strategic decision-making throughout the drug development pipeline: by leveraging it to prioritize the most promising drug candidates and optimize RCT designs, it can help reduce late-stage failures and accelerate the delivery of effective therapies. Overall, this work represents a significant advancement in utilizing a pre-trained foundation model for precision drug development in complex inflammatory diseases. Graphical abstract: EVA is a pre-trained foundation model specific to immune-mediated inflammatory diseases. It enables the prediction of therapeutic efficacy in patients leveraging data from preclinical disease models.

20

Understanding Spatial Heterogeneity of COVID-19 Pandemic Using Shape Analysis of Growth Rate Curves

Srivastava, A.; Chowell, G.

2020-06-02 epidemiology 10.1101/2020.05.25.20112433 medRxiv

Top 0.1%

10.9%

The growth rates of COVID-19 across different geographical regions (e.g., states in a nation, countries in a continent) follow different shapes and patterns. The overall summaries at coarser spatial scales that are obtained by simply averaging individual curves (across regions) obscure nuanced variability and blur the spatial heterogeneity at finer spatial scales. We employ statistical methods to analyze shapes of local COVID-19 growth rate curves and statistically group them into distinct clusters, according to their shapes. Using this information, we derive the so-called elastic averages of curves within these clusters, which correspond to the dominant incidence patterns. We apply this methodology to the analysis of the daily incidence trajectory of the COVID-19 pandemic at two spatial scales: a state-level analysis within the USA and a country-level analysis within Europe from mid-February to mid-May 2020. Our analyses reveal a few dominant incidence trajectories that characterize transmission dynamics across states in the USA and across countries in Europe. This approach results in broad classifications of spatial areas into different trajectories and adds to the methodological toolkit for guiding public health decision making at different spatial scales.

Highlights
- Coarsely summarizing epidemic data collected at finer spatial scales can result in a loss of heterogeneous spatial patterns that exist at finer scales. For instance, the average curves may give the impression that the epidemic's trajectory is declining when, in fact, the trajectory of the epidemic is increasing in certain areas.
- Shape analysis of COVID-19 growth rate curves discovers significant heterogeneity in epidemic spread patterns across spatial areas, which can be statistically clustered into distinct groups.
- At a higher level, clustering spatial patterns into distinct groups helps discern broad trends, such as rapid growth, leveling off, and slow decline in epidemic growth curves resulting from local transmission dynamics. At a finer level, it helps identify temporal patterns of multiple waves that characterize rate curves for different clusters.
- Quantitative methods for characterizing the spatial-temporal dynamics of evolving epidemic emergencies provide an objective framework to understand transmission dynamics for public health decision making.