Patterns

15 training papers 2019-06-25 – 2026-03-07

Top medRxiv preprints most likely to be published in this journal, ranked by match strength.

1
Exploring Zero-Shot Cross-Lingual Biomedical Concept Normalization via Large Language Models
2025-02-27 health informatics 10.1101/2025.02.27.25323007
#1 (5.7%)

Over the past few years, discriminative and generative large language models (LLMs) have emerged as the predominant approaches in natural language processing. However, despite significant advancements, there remains a gap in comparing the performance of discriminative and generative LLMs in cross-lingual biomedical concept normalization. In this paper, we perform a comparative study across several LLMs on the challenging task of cross-lingual biomedical concept normalization via dense retrieval....

2
LLM-Based Web Data Collection for Research Dataset Creation
2025-05-25 health informatics 10.1101/2025.05.23.25328249
#1 (3.9%)

Researchers across many fields rely on web data to gain new insights and validate methods. However, assembling accurate and comprehensive datasets typically demands manual review of numerous web pages to identify and record only those data points relevant to specific research objectives. The vast and scattered nature of online information makes this process time-consuming and prone to human error. To address these challenges, we present a human-in-the-loop framework that automates web-scale data...

3
How does DeepSeek-R1 perform on USMLE?
2025-02-10 health informatics 10.1101/2025.02.06.25321749
#1 (3.9%)

DeepSeek, a Chinese artificial intelligence company, released its first free chatbot app based on its DeepSeek-R1 model. DeepSeek provides its models, algorithms, and training details to ensure transparency and reproducibility. Their new model is trained with reinforcement learning, allowing it to learn through interactions and feedback rather than relying solely on supervised learning. Reports showcase that DeepSeek's model shows competitive performance against established large language models...

4
Probing Hidden States for Calibrated, Alignment-Resistant Predictions in LLMs
2025-09-19 health informatics 10.1101/2025.09.17.25336018
#1 (3.8%)

Scientific applications of large language models (LLMs) demand reliable, well-calibrated predictions, but standard generative approaches often fail to fully access relevant knowledge contained in their internal representations. As a result, models appear less capable than they are, with useful information remaining latent. We present PING (Probing INternal states of Generative models), an open-source framework that trains lightweight probes on frozen, HuggingFace-compatible transformers to deliv...

5
A clinical specific BERT developed with huge size of Japanese clinical narrative
2020-07-09 health informatics 10.1101/2020.07.07.20148585
#1 (3.8%)

Generalized language models that pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate a development of a clinical specific BERT model with a huge size of Japanese clinical narrative and evaluated it on the NTCIR-13 MedWeb that has pseudo-Twitter messages about medical concerns with eight labe...

6
STM-GNN: Space-Time-and-Memory Graph Neural Networks for Predicting Multi-Drug Resistance Risks in Dynamic Patient Networks
2025-05-28 health informatics 10.1101/2025.05.27.25327491
#1 (3.8%)

Hospital-acquired infections (HAIs), particularly those caused by multidrug-resistant (MDR) bacteria, pose significant risks to vulnerable patients. Accurate predictive models are important for assessing infection dynamics and informing infection prediction and control (IPC) programmes. Graph-based methods, including graph neural networks (GNNs), offer a powerful approach to model complex relationships between patients and environments but often struggle with data sparsity, irregularity, and het...

7
Stanford Screenomics: An Open-source Platform for Unobtrusive Multimodal Digital Trace Data Collection from Android Smartphones
2025-06-26 health informatics 10.1101/2025.06.24.25329707
#1 (3.8%)

Smartphone-based digital trace data can offer powerful insights for identifying behavioral patterns and health risks. However, existing tools for comprehensive data collection lack scalability, customizability, transparency, and accessibility. To address these gaps, we developed an open-source platform that enables in-situ capture of multimodal digital traces from smartphones (e.g., moment-by-moment capture of screenshots, application usage logs, interaction histories, and phone sensor readings)...

8
Detection of Patients at Risk of Enterobacteriaceae Infection Using Graph Neural Networks: a Retrospective Study
2023-06-04 health informatics 10.1101/2023.06.01.23290386
#1 (3.7%)

While Enterobacteriaceae bacteria are commonly found in healthy human gut, their colonisation of other body parts can potentially evolve into serious infections and health threats. We aim to design a graph-based machine learning model to assess risks of inpatient colonisation by multi-drug resistant (MDR) Enterobacteriaceae. The colonisation prediction problem was defined as a binary classification task, where the goal is to predict whether a patient is colonised by MDR Enterobacteriaceae in an ...

9
Network-based proactive contact tracing: A pre-emptive, degree-based alerting framework for privacy-preserving COVID-19 apps
2025-08-05 health informatics 10.1101/2025.08.01.25332740
#1 (3.6%)

Most COVID-19 exposure-notification apps still use binary contact tracing (BCT): once a test is positive, every contact whose accumulated risk exceeds a fixed threshold receives the same quarantine order. Because those alerts are late and blunt, BCT can miss early spread while triggering mass isolation. We propose Network-based Proactive Contact Tracing (NPCT), a privacy-preserving, fully decentralized intervention scheme that can run on existing exposure-notification infrastructure. Each user's ...

10
An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation
2025-07-10 health informatics 10.1101/2025.07.09.25331207
#1 (3.6%)

Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty e...

11
An Interpretable Risk Prediction Model for Healthcare with Pattern Attention
2020-07-29 health informatics 10.1101/2020.07.26.20162479
#1 (3.1%)

Background: The availability of massive amount of data enables the possibility of clinical predictive tasks. Deep learning methods have achieved promising performance on the tasks. However, most existing methods suffer from three limitations: (i) There are lots of missing value for real value events, many methods impute the missing value and then train their models based on the imputed values, which may introduce imputation bias. The model's performance is highly dependent on the imputation accurac...

12
Quantum Neural Network Tuning and Performance Evaluation for a Breast Cancer Dataset
2025-10-09 health informatics 10.1101/2025.10.03.25336905
#1 (3.1%)

Model tuning with the optimization of pipeline configuration is a well-established practice for the development of machine learning models. However, this often entails an exhaustive search process, especially as the parameter space expands with increasing model complexity. In the emerging field of quantum machine learning (QML), there is limited literature on the effects of configuration parameters, especially quantum-specific ones, and their choices on model performance. To address this gap, he...

13
EchoGraph: A Specialized Solution for Automatic Echocardiography Report Quality Evaluation
2025-05-08 cardiovascular medicine 10.1101/2025.05.07.25327158
#1 (3.0%)

Generative AI needs automatic clinical text accuracy metrics, but none exist for echocardiography. To address this, we developed EchoGraph, a BERT-based model trained on 600 densely annotated echocardiography reports from the Mayo Clinic (2017), split 7:2:1 for training, validation, and testing, using a tailored schema with 48,256 entities and 29,731 relations annotated. Sixty random MIMIC-EchoNote reports were annotated (3,672 entities and 2,360 relations) for external validation. EchoGraph dem...

14
From Rule-Based to DeepSeek R1: A Robust Comparative Evaluation of Fifty Years of Natural Language Processing (NLP) Models To Identify Inflammatory Bowel Disease Cohorts
2025-07-07 health informatics 10.1101/2025.07.06.25330961
#1 (2.8%)

Background: Natural language processing (NLP) can identify cohorts of patients with inflammatory bowel disease (IBD) from free text. However, limited sharing of code, models, and datasets continues to hinder progress, and bias in foundation large language models (LLMs) remains a significant obstacle. Objective: To evaluate 15 open-source NLP models for identifying IBD cohorts, reporting on document-to-patient-level classification, while exploring explainability, generalisability, bia...

15
Worldwide and Regional Forecasting of Coronavirus (Covid-19) Spread using a Deep Learning Model
2020-05-26 health informatics 10.1101/2020.05.23.20111039
#1 (2.8%)

In December 2019, Covid-19 epidemic was identified in Wuhan, China. Covid-19 may cause fatality especially among elderly, and people with chronic health problems. After human to human transmissions of the disease, it has rapidly spread throughout China, and then the outbreak has reached to neighboring countries in Asia. Now, the spread of the virus is accelerating in the world, and increasing number of new cases has been reported daily in Europe, Middle East, Africa and America regions. Recently...

16
Increasing the Value of Digital Phenotyping Through Reducing Missingness: A Retrospective Analysis
2022-05-17 psychiatry and clinical psychology 10.1101/2022.05.17.22275182
#1 (2.8%)

Objectives: Digital phenotyping methods present a scalable tool to realize the potential of personalized medicine. But underlying this potential is the need for digital phenotyping data to represent accurate and precise health measurements. This requires a focus on the data quality of digital phenotyping and assessing the nature of the smartphone data used to derive clinical and health-related features. Design: Retrospective cohorts. Representing the largest combined dataset of smartphone digital p...

17
Large Scale Application of Named Entity Recognition to Biomedicine and Epidemiology.
2022-09-24 health informatics 10.1101/2022.09.22.22280246
#1 (2.7%)

Background: Despite significant advancements in biomedical named entity recognition methods, the clinical application of these systems continues to face many challenges: (1) most of the methods are trained on a limited set of clinical entities; (2) these methods are heavily reliant on a large amount of data for both pretraining and prediction, making their use in production impractical; (3) they do not consider non-clinical entities, which are also related to patients' health, such as social, econo...

18
The Wearipedia Project: a free and open-source resource for understanding and using wearables in decentralized clinical trials
2025-05-13 health informatics 10.1101/2025.05.12.25327465
#1 (2.7%)

Background: Finding the optimal wearable biomedical sensor (ref. wearable) for a clinical research study can be challenging. Many wearables are consumer electronics and are not designed for clinical research and their clinical variables vary widely. We aimed to build a resource for clinical researchers to select the best device for their research study, and programming tools to facilitate wearable research. Methods: For each wearable entry, we document the following-- Open-source coding tools: we b...

19
Leveraging Generative Artificial Intelligence for Enhanced Data Augmentation in Emotion Intensity Classification: A Comprehensive Framework for Cross-Dataset Transfer Learning
2026-03-03 health informatics 10.64898/2026.02.23.26346928
#1 (2.1%)

Data scarcity and stylistic heterogeneity pose major challenges for emotion intensity classification. This paper presents a cross-dataset augmentation framework that leverages prompt-conditioned generative models alongside deterministic and heuristic transformations to synthesize target-style examples for improved transfer learning. We introduce a unified taxonomy of augmentation strategies--Heuristic Lexical Perturbation (HLA), Prompt-Conditioned Generative Augmentation (CGA), Sequential Hybrid...

20
DeepSOCIAL: Social Distancing Monitoring and Infection Risk Assessment in COVID-19 Pandemic
2020-09-01 health informatics 10.1101/2020.08.27.20183277
#1 (2.1%)

Social distancing is a recommended solution by the World Health Organisation (WHO) to minimise the spread of COVID-19 in public places. The majority of governments and national health authorities have set the 2-meter physical distancing as a mandatory safety measure in shopping centres, schools and other covered areas. In this research, we develop a generic Deep Neural Network-Based model for automated people detection, tracking, and inter-people distances estimation in the crowd, using common C...