Med — Latest Matching Preprints

1

High Norovirus False Discovery Rates and Noro-1 Assay Cross-Reactivity in the BioFire FilmArray Gastrointestinal Panel

Mauer, C.; Reed, J. C.; Mack, A. R.; Theriault, E. A.; Tansarli, G. S.; Fang, F. C.; Bourassa, L.; Greninger, A. L.

2026-05-20 infectious diseases 10.64898/2026.05.15.26353342 medRxiv

Top 0.1%

7.2%

Molecular syndromic panels such as the BioFire FilmArray Gastrointestinal Panel (BF-GIP) have been widely adopted for gastrointestinal illness diagnosis due to their fast turnaround times and broad pathogen coverage. Recently, the BF-GIP demonstrated increased rates of norovirus false-positive detections, prompting a Class II recall of more than two million tests in February 2024. We examined the prevalence of BF-GIP norovirus false positives across four hospitals from December 2024 to June 2025. Among 185 BF-GIP norovirus-positive results confirmed with the BD MAX Enteric Viral Panel, the false discovery rate ranged from 31 to 74% across sites, with the highest rate seen at a specialized cancer care hospital. Deep sequencing of BF-GIP pouches (n=42) confirmed the Noro-1 assay as the primary source of off-target amplification, identifying 78 off-target species, predominantly commensal stool bacteria, compared to only two species for the Noro-2 assay. Off-target species amplified by the Noro-1 assay were recovered from both false-positive and true-negative pouches, suggesting no single species accounted for the false-positive results. Partial primer complementarity at off-target loci and amplicon Tm values within the acceptable range support mispriming of gut microbiota as the underlying cause. False-positive pouches exhibited significantly higher Cp values than true positives for both assays (Noro-1: 26.6 vs. 11.1, p=0.013; Noro-2: 30.0 vs. 13.1, p<0.001), consistent with low-level off-target amplification. These findings highlight the high false discovery rate of the Noro-1 assay, identify bacterial species involved in mispriming, and demonstrate the need to redesign this assay to ensure reliable testing and improved patient care.
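The false discovery rate quoted in this abstract is the share of positive panel results that confirmatory testing did not support. The following is a minimal sketch of that arithmetic. The per-site counts are hypothetical and chosen only so that the two results land on the reported 31% and 74% endpoints; the preprint's abstract does not give per-site numbers.

```python
def false_discovery_rate(false_positives: int, true_positives: int) -> float:
    """FDR = FP / (FP + TP): the fraction of positive calls that were wrong."""
    total_positives = false_positives + true_positives
    if total_positives == 0:
        raise ValueError("no positive results to evaluate")
    return false_positives / total_positives


# Hypothetical (FP, TP) counts per site; NOT taken from the preprint.
site_counts = {"site_a": (9, 20), "site_b": (37, 13)}

site_fdr = {
    site: false_discovery_rate(fp, tp) for site, (fp, tp) in site_counts.items()
}

for site, fdr in site_fdr.items():
    print(f"{site}: FDR = {fdr:.0%}")
```

Note that FDR depends on prevalence as well as assay specificity, so the same assay can plausibly show different FDRs at sites with different patient populations.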

2

A liquid biopsy-centered, pan-cancer, open next generation sequencing panel to support clinical decision-making (LION panel)

Feierabend, S.; Künstner, A.; Forster, M.; Helbing, T.; Gebauer, N.; Gemoll, T.; Axt, F.; Nimmagadda, S. C.; Ranganathan, L.; Schwandt, J.; Heber, M.; Szymczak, S.; Hohensee, I.; Fliedner, S. M. J.; Scherer, F.; Oberländer, M.; Derer-Petersen, S.; Busch, H.; von Bubnoff, N.; Dazert, E.

2026-06-08 oncology 10.64898/2026.06.05.26354976 medRxiv

Top 0.1%

7.1%

Cancer treatment has shifted toward personalized therapy based on molecular profiling, particularly in advanced disease. Existing circulating tumor DNA panels are often broad, generating many non-actionable variants and incurring costs that limit routine use in molecular tumor boards. We developed and validated a manufacturer-independent, 109-gene liquid biopsy-centered pan-cancer open next generation sequencing panel (LION panel), combined with an in-house bioinformatic pipeline to support clinical decision-making. A total of 87 samples were analyzed, including 17 reference samples, 21 healthy blood donor controls, and 49 patient samples including nine tumor entities. The LION panel achieved 92% sensitivity and 99% specificity in reference samples, with high concordance to digital droplet PCR (r = 0.99). It detected variant allele frequencies as low as 0.05% (tumor-informed) and 0.5% (tumor-uninformed). Clinical concordance reached 82% with blood-based digital droplet PCR and 75% with whole exome tissue sequencing. In representative cases, variant dynamics correlated with disease progression and revealed additional targetable variants. Overall, the LION panel supports clinical decision-making by enabling identification of targetable variants, disease monitoring, and detection of treatment resistance, particularly when tumor tissue is unavailable.

3

The SARS-CoV-2 Integrated Genomic Epidemiology Database (IGED): Linking viral genomes with patient-level metadata to advance statewide genomic surveillance in California

Ryder, R.; Elder, J.; Panditrao, M.; Grosgebauer, K.; Katz, R.; Tello, L.; Carroll, E.; Borthwick, D.; Kaur, C.; Smith, R.; Shiau, V.; Wheeler, W.; Reilly, E.; Myers, J.; Nelson, L.; Lim, E.; Arunleung, P.; Baylis, E.; Gilliam, S.; Hennesy-Burt, T.; Bregman, B.; Silver, E.; Kapsak, C.; Wright, S.; Leon, T.; Bell, J.; Morales, C.; Wadford, D. A.

2026-05-19 health informatics 10.64898/2026.05.14.26353263 medRxiv

Top 0.1%

6.3%

In July 2021, the California Code of Regulations Title 17 required all laboratories performing SARS-CoV-2 whole genome sequencing (WGS) to report their sequencing results to the California Department of Public Health (CDPH). These viral genomic data and patient metadata were compiled into the Integrated Genomic Epidemiology Database (IGED). Linking anonymized viral sequences with patient-level information enabled monitoring of infectiousness, pathogenicity, transmission dynamics, evolution, and vaccine evasion among emerging SARS-CoV-2 lineages. Laboratories performing SARS-CoV-2 WGS transmitted sequencing results to CDPH through Electronic Laboratory Reporting (ELR) and non-ELR pathways. CDPH applied uniform reporting requirements but allowed flexibility in specific data formats to accommodate diverse data systems. To preserve data quality and interoperability across heterogeneous sources, CDPH implemented standardization, validation, and deduplication protocols. Snowflake, a cloud-based data storage and analytics platform, and Posit Connect, a cloud deployment and automation platform, supported the management, processing, and integration of data within the IGED. The IGED established links between SARS-CoV-2 WGS data and epidemiologic metadata for 801,418 sequences, representing 81.7% of all sequences reported in California. Lineages reported to the IGED showed strong concordance with lineage proportions in GISAID. Sequences reported to the IGED had average turnaround times longer than one month, and the majority of sequencing was performed in Southern California and Los Angeles. The IGED enhanced genomic surveillance through predictive modeling and monitoring concerning evolutionary trends such as recombination and saltations in persistent infections. 
Development of the IGED highlighted the need for standardized data requirements, sustained funding for sequencing, incentives for data submission, and interdisciplinary collaboration to build an effective genomic surveillance system. This framework for linking genomic and epidemiologic data has not only generated critical insights for SARS-CoV-2 but also provided the foundation for CDPH and other public health organizations to develop similar IGED-like systems for other priority pathogens as genomic surveillance expands.

4

Noninvasive MRD monitoring and profiling of clonal evolution by ctDNA in patients with advanced cancers treated within molecular tumor boards

Ranganathan, L.; Kuehn, J. C.; Klingler, C.; Pauli, T.; Metzger, P.; Bleul, S.; Philipp, U.; Hummel, F.; Weinschenk, S.; Deuter, M.; Rapp, J.; Winter, C.; Sueltmann, H.; Tinhofer, I.; Mouliere, F.; Rawluk, J.; von Bubnoff, N.; Dazert, E.; Illert, A. L.; Nieters, A.; Wehrle, J.; Peters, C.; Brummer, T.; Schultheis, A.; Lassmann, S.; Miething, C.; Becker, H.; Werner, M.; Boerries, M.; Duyster, J.; the MTB-FR Network; the DKTK EXLIQUID consortium; Scherer, F.

2026-06-03 oncology 10.64898/2026.05.27.26353937 medRxiv

Top 0.1%

4.9%

Circulating tumor DNA (ctDNA) from blood plasma has emerged as a promising biomarker for noninvasive profiling of tumor mutational landscapes and disease monitoring across cancers. In this study, we developed a targeted next-generation sequencing approach to explore the role of ctDNA for comprehensive tumor genotyping, early response prediction, and characterization of clonal heterogeneity in patients with advanced and rare cancers treated within molecular tumor boards. We applied our technology to 157 plasma specimens from 57 patients at distinct disease milestones and detected tumor variants in 96% of baseline samples, with 65% of them harboring actionable aberrations. Longitudinal monitoring of baseline mutations in on-treatment plasma revealed that ctDNA dynamics were significantly associated with clinical outcomes and enabled early prediction of disease progression. Finally, we observed substantial clonal heterogeneity over time, identifying emerging mutations in all analyzed plasma samples obtained at progression, including potentially targetable variants for subsequent personalized therapies.

5

Beyond Identifier Matching: An Empirical Characterization of Failure Modes in Biomedical Knowledge Graph Integration

Hu, S.; Cheng, H.; Gillenwater, L.; Manpearl, K.; Mandava, A.; Wang, Y.; Pividori, M.; Stranger, B.; Krishnan, A.; Greene, C.; Gao, Y.

2026-05-28 health informatics 10.64898/2026.05.26.26354182 medRxiv

Top 0.1%

4.4%

Objective. Biomedical knowledge graphs (KGs) such as PrimeKG, Hetionet, UMLS, and PharmGKB are increasingly used as the substrate for downstream machine-learning, retrieval-augmented generation, drug-repurposing, and electronic health record (EHR) augmentation pipelines. The dominant assumption in published work is that integrating two or more such KGs is a tractable engineering step solved by identifier (ID) matching. This paper interrogates that assumption empirically. We quantify how much concept overlap survives realistic alignment, and we characterize the new failure modes introduced by the methods that practitioners reach for when ID matching is insufficient. Materials and Methods. We compared four widely used biomedical KGs (PrimeKG, Hetionet v1.0, the full UMLS Metathesaurus, and PharmGKB) across eleven node types using a tiered alignment pipeline: (1) direct ID matching for nodes sharing a primary vocabulary; (2) cross-ontology bridging using standard mappings (e.g., MONDO-DOID, HPO-UMLS, HPO-UMLS-MeSH for side effects, NCBI Gene-HGNC-UMLS, UBERON-FMA/SNOMEDCT_US/NCI/MeSH for anatomy); (3) ClinicalBERT cosine-similarity grouping at threshold >= 0.98 for over-segmented disease nodes, with a deterministic suffix-stripping canonicalizer; (4) exact name matching for ontology-poor types (anatomy, REACTOME pathways); and (5) embedding-based fuzzy matching with UMLS lookup (SapBERT and ClinicalBERT) for free-text microbiome concepts. We applied the pipeline to a 698-concept gut-microbiome benchmark spanning taxa, pathways, and disease labels, validated grouping decisions against the curated SSSOM mappings released by the MONDO project, and audited the ClinicalBERT consolidation against five clinical-genetics case studies drawn from the literature. Results. Per-type pairwise coverage was strikingly asymmetric. 
Genes/proteins and the three Gene Ontology categories aligned cleanly across PrimeKG and Hetionet (mutual coverage 94-99%), but disease overlap was sparse: only 0.7% of PrimeKG individual disease nodes mapped to Hetionet, rising to 2.0% after MONDO grouping (versus 78.7% and 18.4% from the Hetionet side). PrimeKG-to-UMLS coverage spanned 100% (effect/phenotype via HPO) down to 20.8% (REACTOME pathways), with drugs at 73.7% and anatomy at 58.8%. PrimeKG-to-PharmGKB drug coverage required up to two bridging hops (DrugBank -> UMLS -> RxNorm/ATC/MeSH). Bigger was not uniformly more complete: on a 698-concept microbiome drug benchmark, Hetionet missed 0 concepts while PrimeKG missed 16. ClinicalBERT-based grouping consolidated 22,205 raw MONDO disease nodes into 17,080 groups but introduced three reproducible failure modes documented in case studies: (i) peer over-merging: for example, all 22 osteogenesis imperfecta subtypes collapsed into a single node despite distinct severity classes; (ii) parent-child collapse: e.g. acute myeloid leukemia merged with myeloid leukemia, erasing the acute/chronic distinction that drives clinical management; and (iii) lexical false positives: neurofibromatosis and schwannomatosis grouped together despite cellular-pathology differences. Discussion. Identifier matching alone is a weak baseline for biomedical KG integration. Cross-ontology bridges and embedding-based consolidation expand coverage but do so at the cost of clinically meaningful resolution, and the resulting failures are systematic rather than random. Reporting only aggregate coverage statistics obscures these losses, which propagate silently into downstream tasks. Conclusion. We provide reusable per-type coverage tables, a taxonomy of three integration failure modes, and concrete recommendations for downstream studies that depend on a unified biomedical KG. 
We argue that future KG integration work should report per-type coverage and per-cluster confidence rather than aggregate match rates.
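Step (3) of the alignment pipeline, similarity grouping at a cosine threshold of 0.98, can be illustrated with a small sketch. Everything here except the threshold is an assumption: the three-dimensional vectors are toy stand-ins for ClinicalBERT embeddings, and linking every above-threshold pair with union-find (single linkage) is one plausible grouping rule, not necessarily the one the authors used.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_by_similarity(embeddings, threshold=0.98):
    """Merge every pair of names whose cosine similarity meets the threshold."""
    names = list(embeddings)
    parent = {name: name for name in names}

    def find(name):
        # Walk to the root of the set, compressing the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for name in names:
        groups[find(name)].add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Toy vectors (hypothetical): the first two are deliberately near-parallel.
toy = {
    "acute myeloid leukemia": [0.90, 0.43, 0.05],
    "myeloid leukemia": [0.88, 0.47, 0.04],
    "neurofibromatosis": [0.10, 0.20, 0.97],
}

groups = group_by_similarity(toy)
print(groups)
```

With these toy vectors the parent and child leukemia terms fall into one group, which mirrors the parent-child collapse the abstract describes: a high similarity score says the labels are close, not that the concepts are clinically interchangeable.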

6

Automated Macrolinguistic Discourse Analysis for Transdiagnostic Detection of Language Impairments

Lee, S. H.; Wang, S.; Varkanitsa, M.; Kiran, S.

2026-05-21 neurology 10.64898/2026.05.19.26353614 medRxiv

Top 0.1%

4.4%

Macrolinguistic discourse analysis offers valuable insight into how patients with neurogenic communication disorders organize and produce informative speech, yet it remains a largely manual and labor-intensive process. We report an automated pipeline for macrolinguistic discourse analysis for individuals with aphasia and dementia that integrates automatic speech recognition (ASR), utterance segmentation, sentence-level embeddings, centroid-based main-concept matching, and rule-based coherence error classification. These algorithms were applied to Cinderella story retellings from 309 participants (113 controls, 102 post-stroke aphasia (PWA), and 94 dementia). The algorithm reliably identified main concepts (83% accuracy against human labels) and derived interpretable features such as semantic distance to a main concept centroid, main concept coverage, and coherence error rates. Crucially, diagnostic classification results showed that logistic-regression classifiers trained on 10 macrolinguistic features distinguished aphasia from controls with high accuracy (AUC ≈ 0.94) but showed weaker separation for dementia (controls vs dementia AUC ≈ 0.66; aphasia vs dementia AUC ≈ 0.58). Semantic distance to the centroid emerged as a robust, informative predictor for diagnostic classification, demonstrating that the ability to produce narrative-aligned speech is clinically important. The automated pipeline enables scalable macrolinguistic discourse analysis that could support screening and longitudinal monitoring of discourse impairments across neurogenic populations.

7

Tracking the Dynamic Trajectories: A Global-to-Local Pharmacovigilance Analysis of GLP-1 Receptor Agonists

Lu, S.; Ruan, X.; Wang, L.; Wang, X.; Sameer, M.; Liu, H.

2026-06-01 health informatics 10.64898/2026.05.28.26354401 medRxiv

Top 0.1%

4.3%

Although GLP-1/GIP receptor agonists demonstrate unprecedented weight-loss efficacy, their rapid clinical adoption has revealed significant real-world tolerability challenges. To evaluate their dynamic safety profiles, we developed a macro-to-micro pharmacovigilance framework that combines global FAERS reports with local UT Physician EHR data. Macroscopically, we used disproportionality analysis of FAERS to distill 17 adverse events shared across the drug class. Microscopically, local EHR data (289,655 longitudinal treatment sessions across 71,316 patients) revealed that 51.6% of GLP-1 sessions terminated within 90 days. Furthermore, temporally stratified logistic regression demonstrated that initial exposure (0 to 30 days) correlated strongly with nausea and vomiting, which attenuated in extended sessions, whereas extended exposure (>2 years) uncovered late-onset risks, notably incident hepatic steatosis. Ultimately, this time-aware framework reveals that GLP-1 safety profiles are profoundly duration-dependent, providing critical insights into both acute intolerances and long-term medication safety.
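The abstract does not say which disproportionality measure was applied to FAERS. As an illustration only, the sketch below computes the reporting odds ratio (ROR), one commonly used measure, with its standard log-scale 95% confidence interval. The 2x2 counts are hypothetical, and the "lower bound above 1" signal rule is a common convention assumed here, not a detail taken from the preprint.

```python
import math


def reporting_odds_ratio(a, b, c, d, z=1.96):
    """ROR with a log-scale confidence interval from a 2x2 report table.

    a: reports with the target drug and the target event
    b: reports with the target drug and any other event
    c: reports with other drugs and the target event
    d: reports with other drugs and any other event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    ror = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - z * se_log)
    upper = math.exp(math.log(ror) + z * se_log)
    return ror, lower, upper


# Hypothetical counts; NOT taken from FAERS or from the preprint.
ror, lower, upper = reporting_odds_ratio(a=120, b=880, c=300, d=8700)
signal = lower > 1.0  # assumed convention for flagging a signal

print(f"ROR = {ror:.2f} (95% CI {lower:.2f} to {upper:.2f}), signal = {signal}")
```

A disproportionality signal of this kind reflects reporting patterns in a spontaneous-report database and does not by itself estimate incidence or establish causation.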

8

Analytical Validation of Minimally Invasive Capillary Blood Microsampling using Tasso+ for Multiplexed Neurological Biomarkers

Swann, O.; Hicks, S.; Lynch, C.; Wallman-Jones, A.; Shoai, M.; Mulvaney, R.; Fernandes Gomes, B.; Kodosaki, E.; Tecilla, M.; Ghajari, M.; Jones, B.; Kemp, S.; TBI-REPORTER Biomarker group; Sylvester, R.; Cross, M.; Stokes, K.; Wilson, M. G.; Menon, D. K.; Heslegrave, A.; Zetterberg, H.; Sharp, D. J.; Parker, T. D.

2026-05-15 neurology 10.64898/2026.05.15.26353201 medRxiv

Top 0.1%

4.1%

Blood-based biomarkers are increasingly used to investigate brain health, but collecting venous blood is difficult in remote and field settings. Capillary microsampling offers a practical alternative, although the ability to delay processing and its agreement with gold-standard venous blood require validation. We evaluated Tasso+, a minimally invasive upper-arm capillary blood collection system, for measuring neurological and host-response biomarkers in plasma and serum during an exercise-based protocol. Sampling occurred before, immediately after, and approximately 24-to-36 hours after exercise; Tasso+ samples were processed with or without a 72-hour room-temperature delay. Tasso+ samples were compared with matched venous blood, and Capitainer SEP10 dried plasma spots were also evaluated, using Quanterix Simoa and Alamar Biosciences NULISAseq CNS panel. Tasso+ enabled reliable measurement of several key biomarkers, including GFAP and NfL, even after delayed processing. These findings support capillary microsampling for neurological biomarker studies where venepuncture is challenging, including field-based research and participant-led remote sampling.

9

General-purpose large language models can achieve physician-level accuracy in complex medical data extraction

Rajeev, M.; Narayan, A.

2026-06-10 gastroenterology 10.64898/2026.06.06.26354838 medRxiv

Top 0.1%

4.0%

Background: Unstructured data represent about 80% of total electronic health records (EHR) data. Structuring this free text is essential for advancing clinical research, including cohort selection for trials, retrospective studies, and the development of disease registries. While manual chart review (MCR) remains the gold standard for extracting this clinical data, the process is inherently slow, resource-intensive, and susceptible to errors from human fatigue. We evaluated the extraction accuracy, safety, and efficiency of the HeLIX (Hepatology Logic-Integrated Extraction) framework, a Large Language Model (LLM) protocol using Google Gemini 3 Pro, compared to a gold-standard Manual Chart Review (MCR). Methods: A prospective validation study was conducted using 50 high-complexity, simulated hepatology discharge summaries designed to replicate the real-world heterogeneity of EHRs. The HeLIX framework employed a Zero-Shot, Structured Chain-of-Thought (CoT) prompting strategy enforced by a three-layer architecture: Clinical Reasoning Trace, Schema Enforcement, and Evidence Verification. The model extracted 45 distinct clinical variables. Performance was benchmarked against a consensus MCR. Results: Across 2,250 evaluated data points, the model achieved an overall Extraction Accuracy of 99.24% (95% CI: 98.8%-99.5%), with perfect concordance in 35/45 (77.8%) variables. For binary diagnostic variables, the model demonstrated an overall F1-score of 0.98, Recall of 0.99 and substantial inter-rater reliability (Cohen's κ = 0.97). Hallucinations were exceptionally rare (2/2250; 0.08%). Critical errors affecting clinical management occurred in only 2 instances (<0.1% of total data), both involving etiological misattribution in complex multifactorial diagnoses. The AI workflow was 13.4-fold faster and 95.1% more cost-effective than manual extraction.
Conclusion: The HeLIX framework demonstrates physician-level accuracy and reliability in extracting complex hepatology data. It offers a scalable, efficient, and economical alternative to manual chart review. Such frameworks could accelerate clinical research, enabling healthcare systems globally to build comprehensive patient registries for a fraction of the traditional cost.
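The headline accuracy figures can be checked with a little arithmetic. Fifty summaries times 45 variables gives the 2,250 data points, and 99.24% of those corresponds to 2,233 correct extractions. The abstract does not state how the confidence interval was computed; the sketch below uses a Wilson score interval, which is an assumption on our part, and it produces a range that rounds to the reported 98.8% to 99.5%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


n_points = 50 * 45                    # summaries x variables, from the abstract
correct = round(0.9924 * n_points)    # back-calculated from the reported accuracy
low, high = wilson_interval(correct, n_points)

print(f"{correct}/{n_points} correct, 95% CI {low:.1%} to {high:.1%}")
```

The interval treats all 2,250 data points as independent. Variables extracted from the same discharge summary are likely correlated, so the true uncertainty may be wider than this calculation suggests.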

10

An interpretable and interactive clinical AI agent for personalized anti-infective decision support in carbapenem-resistant Gram-negative bacterial infection

Cao, X.; Shi, D.; Du, Z.; Zhou, J.; Wang, Z.; Liu, Z.; Wang, Q.

2026-05-19 health informatics 10.64898/2026.05.18.26353005 medRxiv

Top 0.1%

4.0%

Carbapenem-resistant Gram-negative bacteria (CRGNB) infections remain difficult to manage because treatment decisions must balance heterogeneous patient risk, limited antibiotic options, potential toxicity and emerging resistance. Clinical care in this setting requires not only single-endpoint risk prediction, but also decision-support frameworks that can jointly enable prognosis assessment, result interpretation, and individualized treatment comparison. Here we present Dr.BUG, an interactive clinical AI agent for personalized decision support in CRGNB infection. Dr.BUG integrates stable feature-set selection, multi-task prognostic modelling, interpretability analysis and model-based simulation of antibiotic regimen recommendation into a unified workflow. Using a development cohort, a temporally independent validation cohort, and external cohorts from the MIMIC-IV dataset, we developed and validated models for four clinically relevant tasks: clinical efficacy, survival outcome, polymyxin resistance and treatment duration. Model inputs were derived primarily from routinely available and relatively low-cost clinical variables, supporting translational feasibility. Across the major tasks, selected-feature models matched or exceeded the performance of their full-feature counterparts while using fewer variables, as reflected in 82.0% of optimized-metric comparisons in the development cohort, and remained robust in both temporal and external validation. Dr.BUG further provided both population-level and patient-level interpretability and generated individualized rankings of candidate antibiotic regimens. In the retrospective analysis of non-survivors, clinician review suggested that regimens recommended by Dr.BUG might be associated with higher predicted survival probabilities. 
These findings support a broader role for clinical AI in complex drug-resistant infections, extending its utility from offline risk prediction to interpretable, deployable, and personalized decision support.

11

Multi-region sampling of the human small intestine using an ingestible device

Fu, B.; DeSchepper, L. B.; Sun, J.; McKeithen-Mead, S. A.; Kapili, B.; Ochoa-Andersen, P.; Spencer, S. P.; Fardeen, T.; Ricardo, M.; El Kamari, V.; Sinha, S.; Relman, D. A.; Grembi, J. A.; Shalon, D.; Estrela, S.; Huang, K. C.

2026-06-10 gastroenterology 10.64898/2026.06.09.26353912 medRxiv

Top 0.1%

3.7%

The human small intestine (SI) plays a central role in nutrient processing, host-microbe interactions, and immune regulation, yet remains poorly characterized due to the lack of minimally disruptive sampling methods. Here, we present a protocol for deploying, recovering, and analyzing samples collected using an ingestible device that enables multi-region, lumen-targeted SI sampling during normal digestion. The device incorporates a ~30-cm collapsible tube wound into pH- or time-responsive layers that sequentially unfurl in situ, typically capturing three spatially ordered samples with high yield and reliable retrieval. This protocol outlines study design, participant handling, device recovery, contamination control, and standardized workflows for analyses, including cell quantification, culturomics, sequencing, and metabolomics. We further describe benchmarking approaches for evaluating spatial resolution and strategies for assay prioritization when sample volume is limiting. By reducing participant burden and facilitating integration with stool, saliva, and clinical metadata, this approach enables longitudinal and large-cohort studies linking SI microbial ecology and host physiology to human health.

12

Genosolver: Rare Disease Diagnosis through Holistic Integration of Unstructured Clinical Narratives Using Large Language and Reasoning Models

Islam, T.; Danner, M.; Ziad, Z.; Begemann, M.; Beijer, D.; Lischka, A.; Lausberg, E.; Mattern, L.; Suh, J.; Wittig, P.; Guezel, N.; Schlaich, E.; Karaivanova, R.; D'Augello, S.; Franken, L.; Ruedebusch, J.; Mueller, R.; Perchalla, E.; Zempel, H.; Haag, N.; Eggermann, K.; Eggermann, T.; Meyer, R.; Kraft, F.; Elbracht, M.; Kurth, I.; Krause, J.

2026-06-05 health informatics 10.64898/2026.06.04.26354845 medRxiv

Top 0.1%

3.6%

Background: Molecular medicine has made genetic diagnostics crucial for rare diseases, but the majority of patients remain without a diagnosis even after state-of-the-art assessment. Standardized systems for integrating clinical features, such as the Human Phenotype Ontology (HPO), offer assistance, but are often insufficiently detailed and fail to capture crucial clinical parameters such as age at onset, longitudinal changes in symptoms, detailed characteristics of a clinical symptom, or the absence of a feature. Results: We present Genosolver, an integrated workflow that utilizes machine learning to address this bottleneck. Using Large Language Models (LLMs) and Large Reasoning Models (LRMs) on unstructured clinical notes and electronic health care data, we generate a workflow that unifies phenotype extraction, generates differential diagnoses, and prioritizes genetic variants from genome data. We evaluated the performance on 233 previously genetically solved cases, where Genosolver ranked the causative gene first in 72% of cases and within the top 10 genes in 94% of cases, outperforming the existing benchmarking tool Exomiser by 9%. Semi-automated reanalysis of 1,875 unsolved rare disease cases yielded an additional diagnostic rate of 1.7%. Incorporating rich, unstandardized clinical narratives substantially enhanced model performance beyond HPO-only inputs and demonstrated competitive results using data-security-compliant local models. Conclusion: Integrating unstandardized clinical data with local LLMs and reasoning offers a scalable, data-secure workflow that increases molecular diagnoses in rare diseases.

13

Liquid Biopsy of HPV Cell-Free DNA Enables Blood-Based Early Detection and Molecular Stratification of HPV-Associated Cancer and Precancer Stages

Wang, Q.; Eldfors, S.; Lee, S. S.; Das, D.; Al-Inaya, Y.; Lumaj, G.; Epstein, E. T.; Shukla, S.; Ricart, E.; Dhillon, H.; Lake, J.; Hirayama, S.; Adalsteinsson, V. A.; Drage, M. G.; Gulhan, D. C.; Davis, B. T.; Faden, D.

2026-05-14 oncology 10.64898/2026.05.11.26352922 medRxiv

Top 0.1%

3.6%

Liquid biopsies targeting circulating tumor DNA enable noninvasive cancer detection but lack sensitivity in pre- and early-cancer stages, where clinical benefits would be greatest. Human papillomavirus (HPV) causes six cancer types, accounting for 5% of all cancers worldwide. Targeting HPV cell-free (cf)DNA offers a compelling opportunity to overcome current liquid biopsy constraints due to its unique tumor-specific origin, lack of sequence homology to the human genome, and the high viral-to-human copy ratio per cell. Utilizing HPV-associated anal cancer and precancer as a model, here we applied a custom, multi-feature HPV whole-genome liquid biopsy to biobanked and prospective screening cohorts spanning the HPV infection-precancer-cancer continuum. HPV cfDNA was detected years before cancer diagnosis and as early as the infection stage, with increasing detection as stages advanced. Genomic hallmarks of HPV malignancy, including HPV integration, PIK3CA mutations, and 3q amplification, were detected exclusively in cancer, while precancers exhibited distinct HPV genotypes. Fragmentomics analysis of HPV cfDNA revealed stage-informative signatures reflecting viral epigenetic changes during carcinogenesis. A unified classifier incorporating genomic and fragmentomics features achieved a mean AUC of 0.77 for identifying cancer and high-grade precancer, stages requiring clinical intervention. Together, these findings demonstrate the feasibility of blood-based screening and molecular risk stratification for HPV-associated cancer and precancer.

Teaser: Profiling blood HPV cell-free DNA detects cancer years early and distinguishes precancers needing intervention from surveillance.

14

Agentic Authoring of OMOP Concept Sets from Natural Language

Chen, H.; He, X.; Dai, H.; Huang, Y.; Liu, M.; Bian, J.

2026-06-03 health informatics 10.64898/2026.06.02.26354704 medRxiv

Top 0.1%

3.1%

Authoring OMOP concept sets from free-text descriptions remains a major bottleneck in scalable computable phenotyping for observational research. Existing tools support parts of this workflow but are designed primarily for interactive expert use rather than autonomous large language model (LLM) agents. We present an agentic framework that automatically generates OMOP concept sets by combining vocabulary tools, ontology extensions (RxClass, LOINC, and Disease Ontology), and procedural guidance. In ablation studies, the best configuration achieved Recall@100 of 0.965 and AP@100 of 0.875 on the development set. Cohort-level validation against OMOP-mapped EHR data yielded precision of 0.970, recall of 0.998, and a Jaccard index of 0.968. On an independent silver-standard benchmark of 457 concept-vocabulary pairs from 15 AD/ADRD target trial emulation studies, Recall@100 reached 0.835 and AP@100 reached 0.786. Task-specific tools outperformed unrestricted SQL access and PHOEBE 2.0, while progressive guidance performed best.
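The three cohort-level metrics in this abstract are tied together by a simple identity. Because precision is TP/(TP+FP), recall is TP/(TP+FN), and the Jaccard index is TP/(TP+FP+FN), it follows that 1/J = 1/P + 1/R - 1. The sketch below applies that identity to the reported precision and recall. It assumes all three numbers were computed on the same pooled patient sets; if they were averaged across concept sets instead, the identity need not hold exactly.

```python
def jaccard_from_precision_recall(precision: float, recall: float) -> float:
    """Jaccard index implied by precision and recall on the same two sets.

    From P = TP/(TP+FP), R = TP/(TP+FN), J = TP/(TP+FP+FN):
    1/J = 1/P + 1/R - 1.
    """
    if precision <= 0 or recall <= 0:
        raise ValueError("precision and recall must be positive")
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)


# Precision and recall as reported in the abstract.
jaccard = jaccard_from_precision_recall(0.970, 0.998)
print(f"implied Jaccard index = {jaccard:.3f}")
```

The implied value rounds to the reported 0.968, so the three figures are internally consistent under the pooled-set assumption.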

15

A Longitudinal Clinical Foundation Model on Nationwide Veteran Health Trajectories

Zamora-Resendiz, R.; Yin, J.; Kimbrel, N. A.; Beckham, J. C.; Crivelli, S.

2026-05-17 health informatics 10.64898/2026.05.13.26353133 medRxiv

Top 0.1%

3.1%

We present VA-LLM, a 1.62-billion-parameter autoregressive transformer pre-trained from scratch on 1.74 trillion tokens of clinical text spanning 22 years of care for 13.8 million patients in the Veterans Health Administration, with mortality outcomes confirmed through the National Death Index for 7.8 million patients. In a retrospective-prospective evaluation on 107,555 withheld patients, VA-LLM achieved higher 5-year AUPRC than Llama-2 (7 billion parameters), BioGPT-Large (1.57 billion parameters), and GatorTron (3.91 billion parameters), matching GatorTron's 100,000-patient performance with only 10,000 labeled patients. In a clinical validation against the VA's operational Care Assessment Need (CAN) score on 5.5 million patients one year beyond the pre-training corpus, VA-LLM achieved a 90-day mortality AUROC of 90.00% versus 87.74% (p < 0.001) and a 45% relative improvement in AUPRC; post-hoc recalibration recovered calibration comparable to CAN (Brier 0.0091 versus 0.0093) without sacrificing discrimination. Across 21 pre-training checkpoints, discriminative performance correlated more strongly with cumulative mortality experience (CME), the total person-years contributed by patients with confirmed deaths, than with token count (ΔR² = 0.15; Williams p < 10⁻⁶). Performance plateaued once marginal cohorts added fewer confirmed deaths, even as pre-training loss continued to decrease. These findings suggest that the clinical composition of pre-training data, particularly the completeness of documented patient trajectories, correlates with predictive performance more closely than corpus size alone.

16

Personalized clinical reference intervals for routine precision medical care

Zhang, C.; Chen, Y.-L.; Jamilov, A.; Liu, E.; Shree, S.; Lam, B. D.; Foy, B. H.

2026-05-30 health informatics 10.64898/2026.05.28.26354363 medRxiv

Top 0.1%

3.0%

Most routine clinical markers are interpreted using population-based reference intervals, despite being regulated around patient-specific homeostatic setpoints. This mismatch obscures physiologic shifts, inhibiting detection of early disease signatures. Here, we develop a novel Bayesian inference method that adaptively constructs personalized reference intervals using each patient's existing health records. In analysis of >100 million lab tests in >800,000 patients, these personalized intervals can be accurately constructed with only minimal prior data, meaning this method can be applied near universally. We show that across 43 common lab markers, patient setpoints are strongly associated with future morbidity, with signal strength increasing as more test data are collected. Deviation from personalized reference intervals provides strong and novel risk signatures across diverse disease states, including hypothyroidism, hematologic cancers, kidney disease, and pregnancy complications. Importantly, personalized reference intervals capture a different risk signature from existing population-based approaches, with the highest-risk patients being those who deviate from both intervals simultaneously. In a targeted clinical use case study of iron infusion, use of personalized reference intervals greatly improved prediction of treatment efficacy and allowed precise tracking of treatment responses. Our results illustrate how existing health records can be used to construct personalized benchmarks for nearly all common clinical tests, driving a new paradigm for precision laboratory medicine.

17

Integrative Genetic Analyses of Lipid Metabolism and Multiple Sclerosis Severity Using Metabolome-Wide and Cis-Mendelian Randomization

Noroozi, R.; Higgins Tejera, C.; Chen, M.; Briggs, F. B. S.; Bhargava, P.; Fitzgerald, K. C.

2026-05-29 neurology 10.64898/2026.05.27.26354239 medRxiv

Top 0.1%

2.7%

The course of multiple sclerosis (MS) is highly heterogeneous, yet the biological mechanisms underlying this variability remain incompletely understood. Although metabolic alterations have increasingly been associated with disease progression, existing observational evidence is limited by confounding, reverse causation, and an inability to establish causal mechanisms. To bridge this gap, we used a metabolome-wide Mendelian Randomization (MR) framework, including thorough sensitivity analyses, to identify metabolites genetically linked to MS severity that can causally affect it. Bidirectional MR analyses revealed a subset of amino acid and lipid pathways with strong, consistent effects across different MR approaches, confirmed by tests for heterogeneity, horizontal pleiotropy, and LD confounding. For metabolites prioritized by metabolome-wide MR with evidence of causal effects, we conducted genetic colocalization at loci encompassing proximal enzyme-encoding genes, leveraging the corresponding instrumental variants to assess shared underlying genetic signals. This process revealed shared genetic signals between metabolite levels and MS severity, mapped to the FADS1/2 and CYP4F2 loci. A subsequent pathway-resolved set of cis-MR analyses across FADS1/2-derived polyunsaturated fatty acid (PUFA) metabolites, using a functional variant that proxies reduced Δ5-desaturase activity, showed consistent effects indicating that FADS1 perturbation is associated with MS severity. Collectively, these results highlight FADS1 as a key driver of PUFA-related causal effects on MS severity in both systemic (circulating metabolites) and brain cell-specific contexts. Additional supportive cis-MR evidence implicates the disruption of CYP4F2 as another PUFA-metabolizing enzyme.

18

A genome-resolved view of the wastewater RNA virome

Kantor, R. S.; Shakya, M.; Ruth, N.; Rothman, J. A.; Rushford, C.; Gregory, D. A.; Epstein, A.; Kaufman, J. T.; Allen, J. E.; Chain, P. S. G.; O'Connor, D. H.; Johnson, M. C.

2026-05-22 infectious diseases 10.64898/2026.05.19.26353600 medRxiv

Top 0.1%

2.4%

Sequencing-based wastewater surveillance is emerging as an important tool in pathogen-agnostic threat detection, potentially enabling early identification before capture through clinical surveillance systems. However, virus sequences of human pathogens are typically low in abundance in wastewater while much of the data is unclassifiable at the read level. This presents a challenge because genomes may not assemble well for novel pathogens of interest, but read-based methods cannot currently separate novel from previously seen unclassified sequences. Using ultra-deep untargeted sequencing of the wastewater RNA virome performed by the CASPER consortium (321 samples), we constructed a wastewater virus genome database (WVDB) with the goal of expanding the set of available high-quality non-redundant reference genomes. The first version of this database contains 21,015 near-complete viral genomes, of which the majority are ssRNA bacteriophage (79%). We additionally recovered genomes for putative plant and vertebrate-infecting viruses, human enteric viruses, and viruses whose host could not be predicted. Fewer than 4000 genomes had matches in previously published virus genome databases, and WVDB captured around one fifth of the reads that could not be classified by Kraken2. Further expansion of WVDB will provide a comprehensive resource of RNA virus genomes for characterization of viral diversity and dynamics in wastewater across space and time.

19

Plasma Proteomic Networks Reveal Shared Biology with Brain Linked to Alzheimer's Disease Pathology

Guo, Q.; Ping, L.; Rathore, S.; Duong, D. M.; Shantaraman, A.; Fox, E. J.; Johnson, E. C.; Lah, J. J.; Levey, A. I.; Seyfried, N. T.

2026-06-02 neurology 10.64898/2026.05.26.26353866 medRxiv

Top 0.1%

2.4%

Alzheimer's disease (AD) drives widespread molecular changes beyond the brain that are increasingly detectable in plasma. To map plasma proteomic signatures of AD in a broadly unbiased manner with high depth and reproducibility, we profiled plasma from 214 individuals spanning cognitively normal controls, mild cognitive impairment, and AD using microbead-based enrichment and data-independent acquisition mass spectrometry (DIA-MS). We reliably quantified 5,823 proteins across samples, and network analysis identified 29 plasma modules enriched for functions related to lipid metabolism, extracellular matrix remodeling, immune signaling, mitochondrial function, and proteostasis. Several modules were associated with cognition, APOE4, sex, race, and cerebrospinal fluid (CSF) amyloid and tau biomarkers. Among 129 individuals with paired CSF and plasma biomarker measurements, over 1,500 proteins differed between CSF biomarker positive and negative groups, including amyloid-linked matrisome proteins such as SMOC1, FRZB, SPON1 and CTHRC1. A 10-protein plasma panel classified CSF biomarker positivity with performance similar to plasma pTau217 (AUC = 0.91), and combining both improved accuracy (AUC = 0.99). Integration with a human brain proteomic network revealed that two-thirds of plasma modules were preserved in brain, with many AD-altered modules changing concordantly across compartments. This study establishes a scalable DIA-MS plasma proteomics platform that captures systemic and brain-linked AD biology and identifies complementary biomarkers beyond phosphorylated tau.

20

Extraction of Human Phenotype Ontology (HPO) Concepts from Clinical Notes Utilizing Large Language Models (LLM) with Model Context Protocol (MCP)

Larsen, M. E.; Campbell, I. M.; Orlando, L. A.; Robinson, P.; Walton, N. A.

2026-05-25 health informatics 10.64898/2026.05.23.26353963 medRxiv

Top 0.1%

2.2%

Background: Accurate extraction of Human Phenotype Ontology (HPO) terms from clinical notes is essential for variant prioritization and genetic diagnosis. Large language models (LLMs) often struggle to balance precision, hallucination avoidance, and ontology mapping accuracy, and prior work has shown that retrieval-based grounding can improve performance for individual models. We hypothesized that real-time ontology grounding through external tools would improve these metrics across heterogeneous LLMs, and we evaluated the Model Context Protocol (MCP), a standardized open framework for integrating external tools, as a vendor-agnostic mechanism for delivering such grounding. Methods: Five LLMs (Claude Sonnet 4.5, GPT-5.1, Gemini 2.5 Pro, Grok 4.1, and Qwen3 30B) extracted HPO terms from four synthetic clinical genetics notes under two conditions: baseline ("No Tools," internal knowledge only) and tool-augmented ("With Tools"), with real-time HPO retrieval delivered through MCP for models with native support and through functionally equivalent native tool-calling interfaces otherwise. Each model performed ≥50 runs per note per condition (>2,000 total runs). Performance was evaluated using Precision, Recall, and F1-score. Outputs were manually adjudicated to classify mapping errors and hallucinations. Results were benchmarked against a commercial EHR-based HPO extraction tool. Results: Tool augmentation significantly improved performance across all models. Mean aggregate F1-score increased from 0.46 (SD 0.22) in the baseline condition to 0.72 (SD 0.15) with tools (p < 0.001). Mapping Error Rate decreased from 40.9% to 7.8% (p < 0.001), and Precision increased from 56% to 90%. Performance gains were observed across all model families, including the open-weight Qwen3 model (F1 0.11→0.50). For inferred phenotypes, F1 improved from 0.20 to 0.34 (p < 0.001) without a significant increase in hallucination rate (p = 0.08).
Compared with the commercial benchmark, tool-augmented LLMs achieved higher F1-scores and substantially greater recall for inferred phenotypes. Conclusions: Real-time ontology grounding substantially improves HPO extraction across diverse LLMs by reducing mapping errors and enhancing phenotype inference. The Model Context Protocol provides a standardized, interoperable mechanism for delivering such grounding, supporting reproducible, vendor-agnostic deployment of clinical LLM pipelines in genomic medicine.