Med

Elsevier BV

Preprints posted in the last 30 days, ranked by how well they match Med's content profile, based on 38 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
The FEES Dysphagia Index: a bias-resilient continuous score that captures expert clinical judgment in 2,943 neurological inpatients

Werner, C. J.; Sanchez-Garcia, E.; Mall, B.; Meyer, T.; Pinho, J.; Schulz, J. B.; Schumann-Werner, B.

2026-04-21 neurology 10.64898/2026.04.20.26351259 medRxiv
Top 0.1%
8.4%

Multi-consistency testing during flexible endoscopic evaluation of swallowing (FEES) is clinically necessary but introduces selection bias: worst scores inflate severity because the number of consistencies tested covaries with disease severity. In this retrospective observational study of hospitalized neurological patients, we derived and validated the FEES Dysphagia Index (FDI) in two temporally independent cohorts (Cohort 1: 2013-2018, N=1,257; Cohort 2: 2021-2025, N=1,686) from a single center. FDI-S averages Penetration-Aspiration Scale (PAS) scores across tested consistencies (0-100 scale); FDI-E uses Yale Pharyngeal Residue scores; FDI-C combines both. Selection bias was quantified using sequential branching-tree inverse probability weighting (IPW). Worst PAS overestimated severity by 24%; FDI deviated by <2%. FDI-C was significantly superior to Worst PAS for hospital-acquired pneumonia (HAP; AUC 0.70 vs. 0.60, p<0.001), mortality (0.71 vs. 0.62, p=0.040), and restricted oral intake (0.90 vs. 0.74, p<0.001), and statistically equivalent to clinician-rated severity. FDI-C mapped linearly onto ordinal Functional Oral Intake Scale values (FOIS; proportional odds RCS p=0.99). With functional status and diagnosis, FDI-C reconstructed the clinicians' oral intake recommendation with AUC up to 0.93. The FDI-C-mortality relationship was sigmoidal with a clinically relevant transition zone between ~50 and ~85. FDI-C is a bias-resilient, bedside-calculable score with interval-scale properties that captures expert clinical judgment, suitable as both a clinical decision support tool and a continuous research endpoint.
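The contrast between a worst-score summary and an averaged score can be illustrated with a minimal sketch. The abstract states only that FDI-S averages PAS scores across tested consistencies on a 0-100 scale; the linear rescaling from the 1-8 PAS range used here, the function names, and the example values are assumptions for illustration, not the authors' published formula.

```python
from statistics import mean

PAS_MIN, PAS_MAX = 1, 8  # range of the Penetration-Aspiration Scale


def rescale_pas(pas: float) -> float:
    """Map a PAS value linearly onto 0-100 (assumed rescaling)."""
    return (pas - PAS_MIN) / (PAS_MAX - PAS_MIN) * 100


def worst_pas_score(pas_by_consistency: dict) -> float:
    """Summarise an exam by its single worst (highest) PAS value."""
    return rescale_pas(max(pas_by_consistency.values()))


def averaged_pas_score(pas_by_consistency: dict) -> float:
    """Summarise an exam by the mean PAS over the consistencies tested."""
    return rescale_pas(mean(pas_by_consistency.values()))


# Hypothetical exam: four consistencies tested, one of them aspirated.
exam = {"thin liquid": 8, "nectar": 1, "puree": 1, "solid": 1}
print(worst_pas_score(exam))     # driven entirely by the single worst trial
print(averaged_pas_score(exam))  # reflects all tested consistencies
```

Because the maximum can only rise as more consistencies are tested, a worst-score summary is sensitive to how many trials were run, whereas the mean is not tied to trial count in the same way.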

2
The Gut-Vascular Axis in Intracranial Aneurysm Rupture: A Systematic Review and Meta-analysis of Human Microbiome Evidence

Fahim, F.; Hemmati, M.; Heshmaty, S.; Sharvirani, A.; Shahini, A.; Hosseini, A.; Hosseini Marvast, S. M.; Mojtahedzadeh, A.; Konarizadeh, M.; Dorisefat, F.; Maham, N.; Omranisarduiyeh, A.; Oveisi, S.; Fadaei Juibari, F.; Malekipour Kashan, B.; Sharifi, G.; Zali, A.

2026-04-07 neurology 10.64898/2026.04.05.26350207 medRxiv
Top 0.1%
5.0%

Background: Intracranial aneurysm rupture is the leading cause of spontaneous subarachnoid hemorrhage and is associated with substantial mortality and long-term neurological disability. Emerging evidence suggests that the gut microbiome may influence vascular inflammation and endothelial integrity through immune and metabolic pathways, yet human evidence linking gut microbial alterations to intracranial aneurysm remains fragmented and inconsistent. Objective: This systematic review and meta-analysis aimed to synthesize available human evidence on the association between gut microbiome alterations and intracranial aneurysm formation or rupture, with a primary focus on microbial dysbiosis and differences in gut microbial alpha diversity. Methods: This study was conducted according to PRISMA 2020 guidelines and the protocol was prospectively registered in PROSPERO (CRD420261360785). A comprehensive search of PubMed, Scopus, Web of Science, Embase, and Cochrane CENTRAL was performed from database inception until April 1, 2026, with additional screening of grey literature sources. Observational human studies evaluating gut microbiome characteristics in patients with intracranial aneurysm were included. Mendelian randomization (MR) studies investigating genetically predicted microbial taxa and aneurysm outcomes were also reviewed. Random-effects meta-analysis using standardized mean differences (SMD) was performed for alpha diversity outcomes. MR taxa reported in at least two independent studies were quantitatively synthesized using inverse variance weighting of log odds ratios. Results: The systematic search identified 396 records. After removal of duplicates and eligibility screening, 20 studies met inclusion criteria, including 12 observational clinical studies and 8 Mendelian randomization analyses.
Meta-analysis of three microbiome sequencing studies demonstrated significantly reduced gut microbial alpha diversity in patients with ruptured intracranial aneurysms compared with controls. Sensitivity analyses confirmed the robustness of pooled estimates. In addition, MR evidence identified several microbial taxa, including Ruminococcus1, Bilophila, Fusicatenibacter, and Porphyromonadaceae, as potentially protective factors against aneurysm-related outcomes. Across observational studies, gut dysbiosis was frequently associated with inflammatory pathways and alterations in microbial metabolites implicated in vascular dysfunction. Conclusion: Current human evidence suggests a potential association between gut microbiome dysbiosis and intracranial aneurysm pathophysiology, particularly in relation to aneurysm rupture. Reduced microbial diversity and specific microbial taxa may influence vascular inflammation and aneurysm wall stability. However, existing evidence remains limited and heterogeneous. Large prospective cohorts and mechanistic studies are required to clarify causal relationships and evaluate whether microbiome-targeted interventions could contribute to aneurysm risk stratification or prevention strategies.
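Inverse variance weighting of log odds ratios, as named in the Methods above, can be sketched as follows. This is a generic fixed-effect version; the abstract does not say whether the MR synthesis used a fixed- or random-effects variant, and the function name and input values are hypothetical.

```python
import math


def ivw_pool(odds_ratios, log_or_std_errors):
    """Pool odds ratios on the log scale, weighting each by 1 / SE^2.

    Returns the pooled odds ratio and its 95% confidence interval.
    """
    log_ors = [math.log(o) for o in odds_ratios]
    weights = [1.0 / se ** 2 for se in log_or_std_errors]
    total_weight = sum(weights)
    pooled_log_or = sum(w * b for w, b in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    lower = math.exp(pooled_log_or - 1.96 * pooled_se)
    upper = math.exp(pooled_log_or + 1.96 * pooled_se)
    return math.exp(pooled_log_or), (lower, upper)


# Hypothetical estimates for one taxon reported by two MR studies.
pooled_or, (ci_low, ci_high) = ivw_pool([0.80, 0.90], [0.10, 0.20])
print(round(pooled_or, 3), round(ci_low, 3), round(ci_high, 3))
```

The more precise study (smaller standard error) dominates the pooled estimate, which is the intended behaviour of the weighting.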

3
Scaling Multiplex qPCR Primer Design to 1000-plex using the Degenerate Incomplete Multiplex Primer List Extension (DIMPLE) Algorithm

Pinto, A.; Dong, X.; Wu, W.; Johnson, S. J.; Wen, Q.; Zhang, C.; Havey, J.; Wang, B.; Tang, G.; Farhat, A.; Zhang, D. Y.; Issa, G. C.; Zhang, X.

2026-04-21 bioengineering 10.64898/2026.04.17.719221 medRxiv
Top 0.1%
5.0%

Massively multiplexed qPCR is primarily constrained by increasing primer dimer formation as the number of distinct primers in a single reaction increases. Previous multiplex primer design algorithms either fail to sufficiently suppress primer dimers at 100+ plex, or require excessive computational resources to complete. Here, we present DIMPLE, a linear-runtime primer design algorithm that effectively generates 10,000+ primers to amplify thousands of potential amplicons in a single qPCR reaction. As one clinical demonstration of this algorithm, we designed an assay to detect 2,302 distinct KMT2A gene fusion subtypes using 204 primers in a single tube. In contrast to FISH and conventional NGS approaches with a 2% variant allele frequency (VAF) limit of detection, our DIMPLE qPCR assay was able to analytically detect gene fusions down to 0.05% VAF. We also constructed proof-of-concept multiplex qPCR panels for additional oncology gene fusions, multiplex pathogen detection, and DNA methylation markers. The scalability and low computational cost of DIMPLE are complementary to new instrument platforms for massively multiplex qPCR readout, enabling rapid, point-of-care nucleic acid testing.

4
Cerebrospinal fluid metabolomic profiles associate with neurological recovery after shunt surgery in normal pressure hydrocephalus

Duan, L.; Tiemeyer, M. E.; Leary, O. P.; Hasbrouck, A.; Sayied, S.; Amaral-Nieves, N.; Meier, R.; Brook, J. R.; Kanarek, N.; Alushaini, S.; Guglielmo, M.; Svokos, K. A.; Klinge, P. M.; Fleischmann, A.; Ruocco, M. G.; Petrova, B.

2026-03-31 neurology 10.64898/2026.03.29.26349660 medRxiv
Top 0.1%
4.3%

Normal pressure hydrocephalus (NPH) is a potentially reversible neurological disorder characterized by urinary incontinence, gait impairment, and cognitive decline. However, postoperative improvement after shunt placement is variable, and reliable preoperative predictors are lacking, leaving patients exposed to uncertain surgical benefit and procedural risk. We therefore asked whether preoperative cerebrospinal fluid (CSF) metabolic profiles capture biological states associated with recovery potential. We analyzed ventricular CSF from patients undergoing shunt placement and identified metabolic patterns that differed between patients who improved postoperatively and those who did not. These signatures were detectable prior to intervention and were consistent across analytical approaches and patient cohorts. Multivariate models based on metabolite features were associated with postoperative improvement, with strongest performance observed for cognitive outcomes. Pathway-level analyses indicated coordinated alterations in processes related to redox balance, immune-metabolic signaling, and energy substrate utilization. These findings indicate that preoperative CSF metabolite profiles reflect biological states associated with recovery potential in NPH. The results further suggest that metabolic and immune-metabolic processes contribute to variability in surgical responsiveness and support the development of predictive biomarkers for patient stratification.

5
HLA-Resolve: High-Resolution HLA Haplotyping Using Long-Read Hybrid Capture

Glasenapp, M. R.; Yee, M.-C.; Symons, A. E.; Cornejo, O. E.; Garcia, O. A.

2026-03-30 genetic and genomic medicine 10.64898/2026.03.27.26349549 medRxiv
Top 0.1%
3.8%

Accurate HLA typing is critical for transplantation, pharmacogenomics, and disease risk prediction, yet short-read approaches cannot resolve the HLA region's extreme polymorphism. Long-read sequencing improves resolution, but its adoption has been limited by higher cost, reduced base accuracy, limited throughput, and reliance on long-range PCR. To overcome these limitations, we present a multiplexed long-read hybrid capture workflow for PacBio and Oxford Nanopore sequencing that enriches all classical HLA loci and the complete HLA Class III region. A single-step enzymatic fragmentation and barcoding strategy enables automated library prep. We also introduce HLA-Resolve, an HLA typing program optimized for HiFi reads, and validate workflow performance against the Genome in a Bottle, Human Pangenome Reference Consortium, and International Histocompatibility Working Group benchmarks using 32 geographically diverse samples. These advances offer a cost-effective approach for high-resolution HLA typing with clinical applicability and enable investigation of the role of HLA Class III variation in disease.

6
Colibactin-associated mutations in the human colon appear to reflect anatomy and early exposure, not oncogenesis

Hiatt, L.; Peterson, E. V.; Happ, H. C.; Major-Mincer, J.; Avvaru, A.; Goclowski, C. L.; Garretson, A.; Sasani, T. A.; Hotaling, J. M.; Neklason, D. W.; Uchida, A. M.; Quinlan, A. R.

2026-04-15 genetic and genomic medicine 10.64898/2026.04.13.26350783 medRxiv
Top 0.1%
3.6%

Colorectal cancer (CRC) is the second leading cause of cancer death globally and the number one cause of cancer death in people under 50 years old. The reasons for the rise of early-onset CRC are unknown, and while anatomically distinct subtypes of CRC have substantial clinical and molecular associations, the etiology of region-specific disease, such as early-onset CRC's enrichment in the distal colon, remains unclear. Understanding regional mutagenesis may identify risk factors for this public health concern and CRC more broadly. To evaluate mutational dynamics across the premalignant colon, we performed whole-genome sequencing of 125 individual colon crypts taken from six standardized regions biopsied during colonoscopy, collected from 11 donors without polyps and 10 with polyps. We observed mutation spectra and accumulation rates consistent with previous whole-organ studies, with greater subclonal mutation capture enabled by experimental design. T>[A,C,G] mutations, which are associated with colibactin genotoxicity from pks+ Escherichia coli, were significantly enriched in the rectum of donors with and without polyps (adjusted p-values < 0.01). Moreover, when comparing findings to crypts from individuals with CRC and sequenced CRC tumors, we observed consistent enrichment of the colibactin-associated mutational signature "ID18" in the rectum in both normal colon crypts and CRC tumors, without significant difference in colibactin-specific single nucleotide variant or insertion-deletion burden in crypts across the three clinical groups (i.e., no polyp, polyp, and CRC). These findings argue against a causal or prognostic role for colibactin in CRC, instead indicating that the proposed association with early-onset disease reflects anatomic specificity rather than cancer-specific clinical relevance.

7
Inflammatory Biomarkers & Interpretable ML for SAP Risk Stratification in AIS Patients Undergoing Bridging Therapy

Wang, X.-Y.; Li, M.-M.; Zhao, S.-M.; Jia, X.-Y.; Yang, W.-S.; Chang, L.-L.; Wang, H.-M.; Zhao, J.-T.

2026-04-17 neurology 10.64898/2026.04.15.26350997 medRxiv
Top 0.1%
3.5%

Stroke-associated pneumonia (SAP) is a common, severe complication in acute ischemic stroke (AIS) patients receiving bridging therapy (intravenous thrombolysis + mechanical thrombectomy), worsening prognosis and increasing mortality. Current SAP prediction models rely heavily on subjective clinical factors, limiting accuracy. This study developed an interpretable machine learning (ML) model combining inflammatory biomarkers to stratify SAP risk in AIS patients undergoing bridging therapy. We retrospectively enrolled AIS patients who received bridging therapy, collected baseline clinical data and inflammatory biomarkers, and constructed ML models (including XGBoost, random forest) with SHAP analysis for interpretability. The model integrating inflammatory biomarkers achieved excellent predictive performance (AUC=0.XX, 95%CI: XX-XX), outperforming traditional clinical models. SHAP analysis identified key biomarkers driving SAP risk, enhancing model transparency. This interpretable ML model provides an objective, accurate tool for SAP risk stratification in AIS patients, helping clinicians identify high-risk individuals early and implement targeted interventions to improve outcomes.

8
RD-Embed: Unified representations of rare-disease knowledge from clinical records

Groza, T.; Tan, F.; Lim, N. T. R.; Shanmugasundar, M. W.; Kappaganthu, J.; Lieviant, J. A.; Karnani, N.; Chen, H.; Wong, T. Y.; Jamuar, S. S.

2026-04-04 genetic and genomic medicine 10.64898/2026.04.02.26350083 medRxiv
Top 0.1%
3.2%

Rare diseases often present with incomplete, evolving symptoms and signs scattered across clinical notes and coded records, making diagnosis and gene discovery difficult even when genomic data are available. Existing approaches either depend on curated phenotype profiles or use general biomedical language models that are not aligned to rare-disease knowledge, limiting performance in early or ambiguous clinical presentations. Here, we show that RD-Embed, a three-stage representation framework that builds a base space that preserves domain knowledge, aligns clinical text and SNOMED-derived signals, and refines relationships with graph-based learning, enables robust rare-disease retrieval from heterogeneous clinical records. Across ten rare-disease datasets, RD-Embed attains up to >50% top-ten diagnostic retrieval using combined text and phenotype features, compared with ~30% on average for other embedding models and similarly sized large language models. On an EHR stress test, clinical alignment substantially improves text-based retrieval compared with ontology-only representations, supporting use in routine EHR data. We suggest RD-Embed is a lightweight model that can be incorporated into existing hospital systems to support rare-disease identification, diagnosis, and gene prioritization.

9
HealthFormer: Dual-level time-aware Transformers for irregular electronic health record events

Körösi-Szabo, P.; Kovacs, G.; Csiszarik, A.; Forrai, B.; Laki, J.; Szocska, M.; Kovats, T.

2026-03-27 health informatics 10.64898/2026.03.25.26349262 medRxiv
Top 0.1%
2.9%

Longitudinal electronic health records (EHRs) form irregular event sequences that mix multiple clinical coding systems and care settings. Learning transferable patient representations requires modeling both within-encounter code composition and long-range temporal dependencies. We aim to develop a pretraining framework that preserves event structure and explicitly uses elapsed time, while remaining straightforward to fine-tune for new supervised endpoints without task-specific feature engineering. We propose HealthFormer, a dual-level Transformer for event-centric EHR modeling. An Intra-Event Encoder aggregates heterogeneous domain tokens within each typed clinical event into an event embedding via code-specific embedding modules and attention pooling. Event embeddings are combined with a Date Encoder and a continuous-time attention bias based on attention with linear biases (ALiBi) inside an Inter-Event Encoder. We pretrain on Hungarian national administrative health records from a large-scale nationwide longitudinal cohort (spanning millions of individuals over a decade) using multi-task self-supervision with (i) per-domain masked token prediction (masked language modeling, MLM), (ii) event-type prediction under full-event masking (Event-level MLM), (iii) next-event type prediction, and (iv) time-to-next-event (Δt) regression. Pretraining induces hierarchy-consistent organization in learned diagnosis (ICD-10) embedding geometry conducive to analysis and interpretation. On incident cancer prediction, end-to-end fine-tuning achieves test AUCs of 0.81/0.75/0.73 for colorectal cancer (CRC) and 0.94/0.87/0.84 for prostate cancer across 30/60/90-day horizons on balanced cohorts, outperforming logistic-regression baselines, including time-decayed bag-of-codes. HealthFormer provides an event-centric, time-aware representation that transfers via standard fine-tuning without endpoint-specific designs.
Using ICD-10 diagnoses and ATC codes can facilitate adoption beyond Hungary. Learned diagnosis embeddings align with the hierarchy, enabling clinical inspection. Broader benchmarking across endpoints remains needed.
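A continuous-time attention bias in the spirit of ALiBi can be sketched in a few lines. Standard ALiBi penalises attention scores linearly in token distance; the version below substitutes elapsed time between events, which is one plausible reading of the abstract rather than HealthFormer's confirmed formulation. The slope value, function names, and event times are assumptions.

```python
import math


def time_bias(event_days, slope):
    """bias[i][j] = -slope * |t_i - t_j|: events further apart in time are penalised more."""
    return [[-slope * abs(t_i - t_j) for t_j in event_days] for t_i in event_days]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(scores, event_days, slope):
    """Add the time bias to raw attention scores, then normalise each row."""
    bias = time_bias(event_days, slope)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


# Three hypothetical events on days 0, 3, and 10 with identical raw scores.
raw_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = biased_attention(raw_scores, [0, 3, 10], slope=0.5)
print([round(w, 3) for w in weights[0]])
```

With equal raw scores, the first event attends most to itself and least to the event ten days away, so irregular gaps between events shape the attention pattern directly.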

10
Methylation profiling in the Million Veteran Program: design, quality control, and smoking-associated epigenetic signatures

Schreiner, P. A.; Markianos, K.; Francis, M.; Despard, B.; Gorman, B. R.; Said, I.; Dong, F.; Gautam, S.; Dochtermann, D.; Shi, Y.; Devineni, P.; Kirkpatrick, C.; Khazanov, N.; Moser, J.; Million Veteran Program; Huang, G. D.; Muralidhar, S.; Tsao, P. S.; Pyarajan, S.

2026-04-23 genetic and genomic medicine 10.64898/2026.04.22.26351491 medRxiv
Top 0.1%
2.7%

The Million Veteran Program (MVP) represents the largest and one of the most diverse single cohorts associated with longitudinal electronic health record (EHR) data. We profiled a subset of samples from MVP using the Illumina Infinium MethylationEPIC Beadchip (EPIC array) to generate one of the largest single-cohort methylation datasets to date. Methylation profiles were analyzed for 45,460 total individuals, with the most populous ancestries composed of 27,455 Europeans, 11,798 African Americans, and 4,859 Admixed Americans. We detail the strict quality control standards implemented to ensure the most robust method of methylation profiling of the MVP cohort. This dataset was then applied to evaluate the effects of smoking exposure on DNA methylation in MVP participants. Ancestry-stratified epigenome-wide association studies (EWAS) of smoking status (ever/never) were performed using over 750,000 probes with certifiable signal. Our multi-ancestry meta-analysis demonstrates replicability with existing EWAS and identifies 3,207 novel probe-smoking associations unlocked via the depth and breadth of data in this cohort.

11
TELF: An End-to-End Temporal Encoder with Late Fusion for Interpretable Disease Risk Prediction from Longitudinal Real-World Data

Liu, Y.; Zhang, Z.

2026-04-06 health informatics 10.64898/2026.04.04.26350180 medRxiv
Top 0.1%
2.6%

Deep learning models utilizing longitudinal healthcare data have significantly advanced epidemiological research. However, contemporary transformer-based models increasingly rely on computationally intensive pre-training steps that entail processing massive real-world datasets with cost-prohibitive hardware. We introduce the Temporal Encoder with Late Fusion (TELF), a lightweight end-to-end predictive model featuring an encoder-only architecture for processing medical codes, followed by post-encoder concatenation with demographic variables. TELF learns code embeddings on-the-fly, thereby bypassing the resource-intensive pre-training bottleneck. Furthermore, its late-fusion design preserves the integrity of the temporal attention mechanism before integrating static demographic predictors. We evaluated TELF using an administrative claims database across three distinct cohorts: pancreatic cancer (n=53,661), type 2 diabetes (n=78,756), and heart failure (n=72,540). TELF consistently outperformed traditional machine learning baselines, including XGBoost, LightGBM, and logistic regression. Specifically, TELF achieved AUCs of 0.9150, 0.8199, and 0.8721 for pancreatic cancer, type 2 diabetes, and heart failure, respectively, compared with 0.9044, 0.7908, and 0.8535 for XGBoost and 0.9014, 0.7800, and 0.8466 for logistic regression. Beyond predictive superiority, TELF's isolated temporal attention mechanism enables population-level motif mining. By extracting high-attention temporal sequences, we mapped aggregated patient journey pathways, revealing interpretable clinical trajectories preceding disease onset. Collectively, these results demonstrate that TELF provides a resource-efficient and accessible framework for advanced temporal modeling in clinical and epidemiological research.

12
High-Throughput Observational Evidence Generation Using Linked Electronic Health Record and Claims Data

Gombar, S.; Shah, N.; Sanghavi, N.; Coyle, J.; Mukerji, A.; Chappelka, M.

2026-04-07 health informatics 10.64898/2026.04.07.26350300 medRxiv
Top 0.1%
2.4%

Background: The observational literature on comparative effectiveness is expanding rapidly but remains difficult to synthesize. Discordant findings often stem from structural differences in cohort definitions, inclusion criteria, and follow-up windows, leaving stakeholders without a cohesive evidence base. Furthermore, studies typically focus on a narrow subset of outcomes, neglecting the broader needs of diverse healthcare stakeholders 1,2,3,4. Methods: We developed a high-throughput evidence generation workflow using linked EHR and administrative claims data. The cornerstone is a prespecified measurement architecture applied uniformly across clinical scenarios: six post-index windows (acute to two-year follow-up); 28 Elixhauser comorbidities; 14 healthcare resource utilization (HCRU) categories; 29 laboratory measures with 52 binary thresholds; and 42 adverse event categories. We generated unadjusted treatment comparisons across ~1,038 outcomes per scenario, including effect-measure modification (EMM) assessments across 130 baseline features. Results: Across 40 clinical domains, the workflow produced approximately 32,982,552 outcome evaluations. Each evaluation comprised an effect estimate for a given treatment comparison, outcome, and population, with uncertainty bounds and supporting diagnostics. Approximately 5,000 narrative summaries underwent structured clinical and statistical quality control before dissemination. Conclusions: Standardized, high-throughput workflows can shift evidence generation away from fragmented studies toward comprehensive evidence packages. This shared evidence base supports precision medicine by making treatment effect heterogeneity visible across clinically meaningful subpopulations, reducing the need for redundant, stakeholder-specific studies.

13
LLM-Driven Target Trial Emulation with Human-in-the-Loop Validation for Randomized Trial: Automated Protocol Extraction and Real-World Outcome Evaluation Ψ

Dey, S. K.; Qureshi, A. I.; Shyu, C.-R.

2026-04-13 health informatics 10.64898/2026.04.09.26350523 medRxiv
Top 0.1%
2.2%

Target trial emulation (TTE) enables causal inference from observational data but remains bottlenecked by manual, expert-dependent protocol operationalization. While large language models (LLMs) have advanced clinical knowledge extraction and code generation, their ability to automate end-to-end TTE workflows remains largely unexplored. We present an LLM-driven framework using retrieval-augmented generation to extract the five core TTE design parameters from the Carotid Revascularization and Medical Management for Asymptomatic Carotid Stenosis Trial (CREST-2) protocol and generate executable phenotyping pipelines for real-world EHR data. The performance of the framework was evaluated along two dimensions. First, protocol extraction accuracy was assessed against a gold-standard checklist of trial design components using precision, recall, and F1-score metrics. Second, outcome validity was evaluated through population-level concordance analyses comparing EHR-derived outcomes with published trial endpoints using standardized mean difference, observed-to-expected ratios, confidence interval overlap, and two-proportion z-tests. Further, human-in-the-loop validation assessed the correctness of extracted clinical logic and phenotype definitions. Together, these evaluations demonstrate a structured approach for assessing LLM-driven protocol-to-pipeline translation for scalable real-world evidence generation.
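One of the concordance checks named above, the two-proportion z-test, can be sketched with the standard pooled-variance formula. The function name and the event counts are hypothetical; the abstract reports no counts, and the authors' exact test configuration is not stated.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p_a - p_b) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical comparison: outcome rate in an EHR-derived cohort vs. a trial arm.
z, p = two_proportion_z_test(events_a=30, n_a=100, events_b=10, n_b=100)
print(round(z, 2), p < 0.001)
```

A non-significant result in such a check would be read as concordance between the emulated and published event rates, subject to the usual caveat that failing to reject is not proof of equivalence.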

14
Medicalbench: Evaluating Large Language Models Towards Improved Medical Concept Extraction

Yang, Z.; Lyng, G. D.; Batra, S. S.; Tillman, R. E.

2026-04-16 health informatics 10.64898/2026.04.12.26350704 medRxiv
Top 0.1%
2.2%

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. Annotators provide sentence-level evidence spans and concise medical rationales. The final dataset contains 823 high-quality examples. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs and a supervised baseline reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts.
We further show that explicitly incorporating reasoning cues and prompting to extract implicit evidence substantially improves medical concept extraction, while performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

15
Genome-Wide Variations of End Motif in Cell-Free DNA Fragments Distinguish Immunotherapy Responders from Non-Responders in Head and Neck Cancer: A Multi-Institute Prospective Study

Bandaru, R.; Fu, H.; Zheng, H.; Liang, J.; Wang, L.; Gulati, S.; Hinrichs, B. H.; Teng, M.; Zhang, B.; Kocherginsky, M.; Lin, D.; Hildeman, D. A.; Worden, F. P.; Old, M. O.; Dunlap, N. E.; Kaczmar, J. M.; Gillison, M.; El-Gamal, D.; Wise-Draper, T.; Liu, Y.

2026-03-30 genetic and genomic medicine 10.64898/2026.03.24.26348354 medRxiv
Top 0.1%
2.1%

Reliable, minimally invasive biomarkers for predicting immunotherapy response in head and neck squamous cell carcinoma (HNSCC) remain an unmet clinical need. Here, using patients from a prospective, multi-institutional phase II clinical trial (NCT02641093), we performed whole genome sequencing of 185 plasma cell-free DNA (cfDNA) samples collected longitudinally from 68 patients with locally advanced, surgically resectable HNSCC undergoing neoadjuvant and adjuvant pembrolizumab treatment. We developed the regional motif diversity score (rMDS), a novel fragmentomic metric quantifying the entropy of cfDNA 5' end motifs across genomic regions. Remarkably, unsupervised analysis revealed that rMDS robustly distinguished immunotherapy responders from non-responders, outperforming established cfDNA fragmentomic metrics and copy number alterations, while demonstrating independence from technical confounders. Longitudinal analysis revealed dynamic rMDS changes in genomic regions enriched for immune, lectin, and keratinization-related genes, hallmarks of squamous cell carcinoma, reflecting the interplay between tumor and peripheral immunity during the immunotherapy treatment. Interestingly, the regions with the most dynamic rMDS changes were highly enriched in telomere proximal loci, suggesting a novel link between telomere biology and cfDNA fragmentation. A machine learning classifier based on rMDS achieved robust predictive performance across multiple validation settings (AUC 0.89-0.99), with the highest accuracy at post-treatment timepoints and superior to PD-L1 expression and tumor fraction in the same sample. Predicted responders demonstrated significant trends toward improved disease-free survival (log rank test p=0.035, hazard ratio: 2.67, 95% confidence interval: 1.03-6.92), underscoring the clinical utility of rMDS-based stratification. 
These findings position rMDS as a biologically meaningful and clinically actionable biomarker for immunotherapy response in HNSCC, supporting its integration into future risk assessment frameworks and broader cancer care.
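The idea of scoring the entropy of cfDNA 5' end motifs per genomic region can be sketched as a normalised Shannon entropy. The abstract does not give the exact rMDS definition; the 4-mer motif length, the normalisation by the log of the number of possible motifs, and all names and example motifs below are assumptions for illustration.

```python
import math
from collections import Counter


def motif_diversity(end_motifs, k=4):
    """Normalised Shannon entropy of k-mer end-motif frequencies (0 = one motif, 1 = uniform)."""
    valid = [m for m in end_motifs if len(m) == k and set(m) <= set("ACGT")]
    if not valid:
        raise ValueError("no valid end motifs in this region")
    counts = Counter(valid)
    total = len(valid)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(4 ** k)


def regional_motif_diversity(motifs_by_region, k=4):
    """Apply the diversity score separately to each genomic region."""
    return {
        region: motif_diversity(motifs, k)
        for region, motifs in motifs_by_region.items()
    }


# Hypothetical 5' end motifs of fragments grouped into two regions.
regions = {
    "region_1": ["CCCA", "CCCA", "CCCA", "CCCA"],
    "region_2": ["CCCA", "CCTG", "AAAA", "TGGA"],
}
print(regional_motif_diversity(regions))
```

A region whose fragments all start with the same motif scores 0, while a region with more varied fragment ends scores higher, which is the kind of regional contrast a classifier could then use as a feature.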

16
SPLIT: Safety Prioritization for Long COVID Drug Repurposing via a Causal Integrated Targeting Framework

Pinero, S. L.; Li, X.; Lee, S. H.; Liu, L.; Li, J.; Le, T. D.

2026-04-16 health informatics 10.64898/2026.04.12.26350701 medRxiv
Top 0.2%
2.0%

Long COVID affects millions of people worldwide, yet no disease-modifying treatment has been approved, and existing interventions have shown only modest and inconsistent benefits. A key reason for this limited progress is that current computational drug repurposing pipelines do not match well with the clinical reality of Long COVID. These patients often have persistent, multisystemic symptoms and may already be taking multiple medications, making treatment safety a primary concern. However, most repurposing workflows still treat safety as a downstream filter and rely on disease-associated targets rather than causal drivers. They also assume that the findings of one analysis would generalize across the diverse presentations of Long COVID. We introduce SPLIT, a safety-first repurposing framework that addresses these limitations. SPLIT prioritizes safety at the start of the candidate evaluation, integrates complementary causal inference strategies to identify likely driver genes, and uses a counterfactual substitution design to compare drugs within specific cohort contexts. When applied to cognitive and respiratory Long COVID cohorts, SPLIT revealed three main findings. First, drugs with similar predicted efficacy could have very different predicted safety profiles. Second, the drugs flagged as unfavorable were often different between the two cohorts, showing that drug prioritization is phenotype-specific. Third, SPLIT flagged 18 drugs currently under active investigation in Long COVID trials as having unfavorable predicted profiles. SPLIT provides a practical framework to identify safer, more context-appropriate candidates earlier in the process, supporting more targeted and better-tolerated treatment strategies for Long COVID.

17
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology 10.64898/2026.04.22.26351488 medRxiv
Top 0.2%
2.0%

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% of cases. However, diagnostic competence was not associated with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old. Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.

18
OpenEvidence errs on the safe side in a structured test of triage recommendations

Jia, E.; Omar, M.; Barash, Y.; Brook, O. R.; Ahmed, M.; Kruskal, J. B.; Gorenshtein, A.; Klang, E.

2026-04-24 health informatics 10.64898/2026.04.23.26351526 medRxiv
Top 0.2%
1.9%

Ramaswamy et al. recently reported in Nature Medicine that ChatGPT Health, a consumer-facing health AI tool, undertriaged 51.6% of true emergencies. It was also susceptible to social anchoring in a structured stress test of triage recommendations. We applied the same vignette-based benchmark to OpenEvidence, a widely used physician-facing AI platform for clinical decision support. The benchmark included 960 prompts across 21 clinical domains (Supplementary Table S3). OpenEvidence undertriaged 12.5% of emergencies, a four-fold reduction relative to ChatGPT Health. It also showed no anchoring effect. Its errors skewed in a safer direction, including 68.0% overtriage of Home presentations. In 65 of 960 responses (6.8%), it declined to assign a triage level. These refusals occurred only in symptom-only prompts and never in urgent or emergency cases. Performance improved when objective clinical data were provided. Under the same benchmark, a widely used physician-facing system showed a different safety profile from a consumer-facing one. This suggests that who a health AI is built for can shape how it fails.

19
Ensemble Approaches to Screening, Diagnosis, and Subtyping of Multiple Sclerosis

Yang, I. Y.; Patil, A.; Jin, O.; Loud, S.; Buxhoeveden, S.; Zhang, D. Y.

2026-04-21 genetic and genomic medicine 10.64898/2026.04.19.26351230 medRxiv
Top 0.2%
1.8%

Multiple sclerosis (MS) is a debilitating disease affecting more than 1 million Americans, and today is assessed primarily through magnetic resonance imaging (MRI) and observational clinical symptoms. Given the autoimmune nature of MS, we hypothesized that high-dimensional gene expression data from peripheral blood mononuclear cells (PBMCs), when analyzed with the assistance of AI, may collectively serve as valuable biomarkers for the real-time risk and progression of MS. Here, we present PBMC RNA sequencing (RNAseq) results from N=997 samples, including 540 MS, 221 neuromyelitis optica (NMO), and 149 healthy controls. We constructed and optimized ensemble models for three clinical outcomes: (1) discrimination of early MS (EDSS ≤ 2.0) from healthy individuals with 74% AUC at 100% coverage, (2) differential diagnosis of MS from NMO with 91% AUC at 80% coverage, and (3) subtyping RRMS from progressive MS with 79% AUC at 80% coverage. To our knowledge, no prior molecular test has been reported for any of these three MS clinical tasks, and these results may have immediate impact on clinical management of MS patients. Two innovations improved the stratification accuracy of our models: selection of gene sets based on expression variance in disease states, and use of non-linear rank sort and conviction weighting in the ensemble score calculation.

20
Distinct Metabolic Signatures Distinguish Lung, Colorectal and Ovarian Cancer

Tsiara, I.; Vouzaxaki, E.; Ekström, J.; Rameika, N.; Yang, F.; Jain, A.; Iglesias Alonso, A.; Sjöblom, T.; Globisch, D.

2026-04-13 oncology 10.64898/2026.04.08.26350309 medRxiv
Top 0.2%
1.8%

Cancer is one of the most common causes of death worldwide. The discovery of biomarkers is of utmost importance for diagnosis and disease monitoring. Herein, we performed a comprehensive metabolomics biomarker discovery effort in plasma from 615 lung, ovarian and colorectal cancer patients at diagnosis and 95 non-cancerous control subjects. This pan-cancer investigation identified specific panels of metabolites in the entire sample cohort with high discriminating power, demonstrated by combined ROC AUC values of up to 0.95. The identified metabolites are mainly associated with lipid and amino acid metabolism as well as xenobiotic transformation. These metabolite panels of high predictive power provide new metabolic insights into these cancers and demonstrate the potential of metabolomics for improved diagnosis and monitoring of disease progression.