Neurophotonics — Latest Matching Preprints

1

VOGeo-Gaze: Calibration-Free, Geometry-Aware Deep Learning for Real-Time Gaze Tracking in Clinical Video-Oculography

Zhao, J.; Ahmadi, S.-A.; Decker, J.; Zwergal, A.; Eulenburg, P. z.; Flanagin, V. L.; Wuehr, M.

2026-05-29 health informatics 10.64898/2026.05.27.26354254 medRxiv

Top 0.2%

2.6%

Quantitative eye movement analysis is important for neurological diagnostics, yet existing video-oculography (VOG) systems typically require calibration, device-specific settings, or accurate gaze labels. We present VOGeo-Gaze, a real-time, calibration-free, geometry-aware neural network that estimates gaze by reconstructing anatomically meaningful eyeball parameters from image features. The method combines segmentation-driven projection geometry, a refraction-aware pupil correction module, and temporal anatomical stabilization, so gaze is derived from interpretable eye geometry rather than direct angular regression. Trained only on the public TEyeD dataset with weak gaze supervision, VOGeo-Gaze was evaluated on 116 clinical recordings from 17 patients and 19 healthy subjects using EyeSeeCam, a clinical gold-standard VOG system. It achieved median absolute angular errors of 0.33° horizontally and 0.35° vertically, with nearly 92% of recordings below 1° error while operating at >300 FPS. These results demonstrate sub-degree clinical gaze estimation without subject-specific calibration, camera intrinsics, or accurate gaze labels, providing a scalable and interpretable alternative to conventional VOG pipelines. Code is available at https://github.com/DSGZ-MotionLab/VOGeo-Gaze.

2

Personalized Brain-Based Analgesia Detection with Portable fNIRS and AI

Minoccheri, C.; Joo, P.; Hu, X.-S.; Affendi, H.; Elayyan, F.; Harville, A.; McDonald, N. J.; Botero, T.; DaSilva, A. F.

2026-05-28 dentistry and oral medicine 10.64898/2026.05.20.26353377 medRxiv

Top 0.3%

1.4%

Neuroimaging-based pain decoding faces two underappreciated challenges: between-subject variability that prevents classifiers from generalizing across patients, and within-session cross-validation designs that inflate reported accuracy by conflating within-person and between-person variance. Here we address both using portable functional near-infrared spectroscopy (fNIRS) during pharmacologically verified local nerve anesthesia. Twenty-five patients with clinically painful teeth underwent 36-channel bilateral fNIRS during percussion before ("Pre") and after ("Post") local nerve anesthesia. In 13 block-success patients, a paired Pre versus Post comparison with healthy-tooth control identified three temporal hemodynamic response function (HRF) features (late slope, mean first derivative, and baseline-normalized amplitude) whose analgesia interaction effects (d = 0.63 to 0.79) exceeded that of raw general linear model (GLM) amplitude (d = 0.56), with a significant difference-in-differences interaction (p = 0.011). Per-patient calibration with these features yielded leave-one-subject-out (LOSO) AUC = 0.68 to 0.76 for nonlinear classifiers (permutation p = 0.002), with HbO-specific feature selection achieving the best performance (RF AUC = 0.760); a healthy-tooth negative control was non-significant. End-to-end deep learning on raw time series (CNN-LSTM AUC = 0.719) was competitive with feature-based classifiers, while linear models did not reach significance. Critically, head-to-head comparison of within-session CV and LOSO on the same data revealed mean inflation of +0.13 AUC across all model types, including deep learning, demonstrating that high within-session accuracy alone does not establish subject-independent validity.
Exploratory analyses suggested complementary roles for oxyhemoglobin (HbO; within-patient analgesia detection) and deoxyhemoglobin (HbR; cross-patient information), and that trial-to-trial response variability may complement amplitude for cross-patient pain detection. These results show that per-patient calibration with temporal HRF features supports subject-independent analgesic-state detection under strict LOSO evaluation, and that within-session validation (standard in the fNIRS pain-decoding literature) can substantially overestimate performance.

3

Intravital mid-infrared biosensing by normalized spatial probing of self-referenced optothermal signals

Berger, C. G.; Puttfarcken, B.; Qiu, J.; Hauer, I.; Herr, S.; Juestel, D.; Pleitez, M. A.

2026-05-28 endocrinology 10.64898/2026.05.27.26354202 medRxiv

Top 0.4%

1.0%

We present a compact pump-and-probe mid-infrared Optothermal Spectrometer (OTHES) equipped with Spatial Probing and Autocorrection (SPAC) optimized for robust intravital application in humans. SPAC-OTHES facilitates alignment stability and spectral comparability across different measurement sessions involving different skin types. Unlike state-of-the-art systems, SPAC-OTHES uses camera-based beam detection and an auto-calibration mechanism that enables approximately 73% better spectral reproducibility in intravital measurements in human volunteers than non-calibrated readouts. Moreover, SPAC-OTHES has the potential to lower the glucose quantification error, as demonstrated here in artificial skin phantoms, where an improvement of 52% compared to conventional diode-based detection was observed. The compactness of OTHES, combined with reliable SPAC readout, has the potential to accelerate commercialization and broad application of biosensors based on mid-infrared spectroscopy.

4

High Resolution Multi-depth Quantification of the Retinal Nerve Fiber Layer

Callet, C.; Bertrand, M.; Guzman, K.; Mece, P.; Rossi, E. A.; Grieve, K.

2026-06-01 ophthalmology 10.64898/2026.05.22.26353127 medRxiv

Top 0.5%

0.8%

The retinal nerve fiber layer, composed of axon bundles converging toward the optic nerve, is a key biomarker for diagnosing and monitoring glaucoma and other neurodegenerative diseases. High-resolution en face imaging of individual nerve fiber bundles offers morphological information beyond what conventional optical coherence tomography provides, yet clinical integration remains limited by the lack of automated analysis tools and normative data. Here, we imaged 14 healthy volunteers using time-domain full-field optical coherence tomography and adaptive optics scanning laser ophthalmoscopy, and developed automated pipelines to quantify bundle width, trajectory, tortuosity, and orientation. Bundles were on average 25% wider at shallower retinal depths, width measurements were consistent across imaging modalities, and estimated axon count per bundle decreased significantly with age. Global trajectory analysis revealed systematic deviations of high resolution data from existing mathematical models, particularly in the temporal sector, leading us to propose two refined trajectory models. These normative results provide a foundation for high resolution biomarkers for use in investigations of retinal neurodegeneration.

5

The emotional impact of gambling-related advertising: an experimental functional Near-Infrared Spectroscopy study protocol

Daniel, L.-I.; Ros-Leon, A.; Molina-Rodriguez, S.; Pellicer-Porcar, O.; Cabrera-Perona, V.; Ibanez-Ballesteros, J.

2026-05-27 addiction medicine 10.64898/2026.05.20.26353682 medRxiv

Top 0.8%

0.5%

The proliferation of gambling advertising has intensified concerns regarding its influence on vulnerable populations, yet the neural mechanisms underlying cue-reactivity to these stimuli remain underexplored in ecologically valid settings. This study protocol proposes a novel methodological framework to investigate prefrontal cortical responses to gambling advertisements in individuals with varying degrees of gambling experience. Materials and methods: This cross-sectional study will recruit 44 participants, divided into a clinical group (individuals with high-frequency gambling or gambling disorder) and a matched control group. Neural activity will be recorded using fNIRS while participants view gambling-related, neutral, violent, and sexual stimuli. Secondary measures include validated scales for gambling severity (SOGS), impulsivity, sensation seeking, and alexithymia. Data analysis will primarily utilize inter-subject correlation (ISC) to quantify neural synchronization and multiband frequency decomposition to capture dynamic affective processing. Advanced preprocessing, including short-channel regression, will be applied to ensure signal robustness. Discussion: By combining portable neuroimaging with a data-driven ISC approach, this study aims to identify objective neural markers of gambling vulnerability. The findings will provide novel insights into the idiosyncratic processing of commercial stimuli, potentially informing public health policies and the development of more effective evidence-based regulations for gambling marketing.
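
The protocol names inter-subject correlation (ISC) as its primary analysis but does not say which variant will be used. The following is a minimal sketch, assuming the common leave-one-out form in which each participant's time series is correlated with the average of all other participants. The function names and the toy signals are hypothetical and are not taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def leave_one_out_isc(signals):
    """signals: one time series per subject, all for the same channel.

    Returns, for each subject, the correlation between that subject's
    series and the mean series of the remaining subjects.
    """
    scores = []
    for i, own in enumerate(signals):
        others = [s for j, s in enumerate(signals) if j != i]
        mean_others = [sum(vals) / len(vals) for vals in zip(*others)]
        scores.append(pearson(own, mean_others))
    return scores

# Toy data (hypothetical): three subjects respond in phase, one in anti-phase.
toy_signals = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.1, 0.9, 2.1, 1.2, 0.1],
    [0.0, 1.1, 1.9, 0.9, 0.2],
    [2.0, 1.0, 0.0, 1.0, 2.0],
]
isc = leave_one_out_isc(toy_signals)
```

In this toy case the in-phase subjects receive positive ISC values and the anti-phase subject a negative one, which is the kind of idiosyncratic response the protocol aims to quantify.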

6

Assessing Lipid Core Burden Index with Depolarization-Sensitive Optical Frequency Domain Imaging

Jones, G.; Otsuka, K.; Fujisawa, N.; Yamaura, H.; Matsumoto, K.; Okamoto, A.; Yamaguchi, T.; Shimada, T.; Kagawa, S.; Yamazaki, T.; Akasaka, T.; Bouma, B. E.; Villiger, M.; Fukuda, D.

2026-06-01 cardiovascular medicine 10.64898/2026.05.22.26353889 medRxiv

Top 0.9%

0.5%

Background: Quantitative lipid assessment is central to identifying rupture-prone coronary plaques and represents a therapeutic target for lipid-lowering therapy. Near-infrared spectroscopy (NIRS)-derived lipid core burden index (LCBI) is well validated and widely used for detecting lipid-rich lesions. Optical frequency domain imaging (OFDI) is increasingly adopted for guiding percutaneous coronary intervention (PCI) due to its high-resolution structural imaging capabilities. Depolarization-sensitive OFDI (depOFDI) provides intrinsic lipid contrast and may enable combined structural and compositional plaque characterization within a single OFDI-based platform. Objective: To define an OFDI-derived lipid metric and evaluate its agreement with NIRS-derived LCBI. Methods: Thirty-three patients underwent both polarization-sensitive OFDI and NIRS-intravascular ultrasound imaging during PCI. After exclusion of 4 datasets, 29 co-registered pullbacks were analyzed. A signal-to-noise-corrected depolarization metric was used to identify lipid-rich regions and generate depOFDI chemograms. maxLCBI4mm value and location, as well as total LCBI, were computed and compared with NIRS. Results: depOFDI demonstrated strong agreement with NIRS, showing high correlation for maxLCBI4mm (r^2 = 0.862) and total LCBI (r^2 = 0.867), along with strong spatial concordance for the location of the maxLCBI4mm (r^2 = 0.900). Bland-Altman analysis of LCBI4mm showed minimal bias (10.7) with 95% limits of agreement of [-81.4 to 102.8]. Conclusions: depOFDI enables accurate quantification of lipid burden alongside the high-resolution structural information inherently provided by OFDI. Because depolarization metrics can be derived from polarization-diverse detection available in many commercial OFDI systems, this approach provides a practical pathway toward comprehensive plaque characterization within existing PCI workflows, without the need for additional imaging modalities.
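
For readers less familiar with Bland-Altman analysis, the sketch below shows how the bias and 95% limits of agreement are conventionally obtained: the bias is the mean of the paired differences, and the limits are the bias plus or minus 1.96 standard deviations of those differences. Because the limits are symmetric about the bias, a bias of 10.7 with an upper limit of 102.8 implies a lower limit of -81.4. The paired values below are hypothetical toy numbers, not study data.

```python
from statistics import mean, stdev

def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample SD of the differences
    return bias, bias - half_width, bias + half_width

# Hypothetical paired maxLCBI4mm readings (illustration only).
dep_ofdi = [310.0, 120.0, 455.0, 280.0, 90.0, 505.0]
nirs = [290.0, 150.0, 430.0, 300.0, 60.0, 480.0]
bias, lower, upper = bland_altman(dep_ofdi, nirs)

# Symmetry check on the figures quoted in the abstract.
reported_bias, reported_upper = 10.7, 102.8
implied_lower = 2 * reported_bias - reported_upper  # about -81.4
```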

7

Comparing Pathway-Informed Polygenic Risk Score Strategies: A multi-cohort evaluation of Amyloid-β

Zhang, X.; Goudey, B.; Laws, S.; Masters, C.; Baldwin, T.; Faux, N.

2026-05-27 health informatics 10.64898/2026.05.25.26354071 medRxiv

Top 2%

0.2%

Objective: To systematically evaluate pathway-informed polygenic risk score (PRS) strategies and determine which approaches most effectively leverage biological annotations for risk prediction, using brain amyloid-beta positivity as a case study. Methods: We systematically benchmarked approaches for integrating pathway information into PRS construction to predict brain Aβ positivity. Using two cohorts, the Alzheimer's Disease Neuroimaging Initiative (ADNI, n = 969) and Australian Imaging, Biomarkers and Lifestyle (AIBL, n = 251), we compared Apolipoprotein E (APOE) genetic risk score (GRS), clumping and thresholding (C+T) PRS, pathway-guided single nucleotide polymorphism (SNP) selection PRS, and pathway-specific PRSs ensembled via machine learning. Pathways were derived from manually curated literature or from pathway databases via Functional Mapping and Annotation (FUMA). Results: In cross-validation on the ADNI cohort, pathway-informed PRS using a narrow set of pathways to guide SNP selection (PathPRS-SNPLit without APOE locus) significantly outperformed the standard PRS model (median AUC = 0.742, p = 0.006) and the APOE locus model (median AUC = 0.736, p = 5.1 × 10^-5) based on the Mann-Whitney U test, achieving a median AUC of 0.763. This model showed enhanced ability to identify subgroups within the 10% lowest- and highest-risk groups compared to the current standard of APOE locus alone (odds ratio = 0.67, 95% CI: 0.56-0.81; and OR = 13.23, 95% CI: 10.23-17.11), highlighting its clinical potential. Using a focused set of literature-curated pathways outperformed using a broader set of database-derived pathways across configurations. When contrasting strategies for aggregating information across pathways, we observed that using pathways to guide selection of SNPs and then building a single PRS performed comparably to building PRS for each pathway and using machine learning (ML) to aggregate these, though the latter enabled pathway-level interpretability.
Similar trends were observed in the external AIBL validation dataset. Interpretation: Pathway-informed PRS can meaningfully improve genetic risk enrichment for Aβ positivity beyond APOE and standard C+T approaches, provided pathway definitions are carefully curated. The choice of pathway source has the strongest impact on predictive performance, with aggregation strategies or ML model choice having far less impact. Our findings highlight the utility of literature-curated, pathway-informed PRSs for Aβ prediction and offer practical guidance for pathway-informed PRS construction in other polygenic traits.
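
The two aggregation strategies contrasted in this abstract can be illustrated with a small sketch. It assumes the conventional definition of a PRS as a weighted sum of allele dosages; the SNP identifiers, effect weights, and pathway memberships are hypothetical, and the authors' actual pipeline (clumping, thresholding, ML ensembling) is not reproduced here.

```python
def polygenic_score(dosages, weights, snps=None):
    """Weighted sum of allele dosages over the chosen SNPs.

    dosages: SNP id -> allele count (0, 1 or 2) for one person.
    weights: SNP id -> effect-size weight.
    snps:    optional subset of SNP ids; defaults to every weighted SNP.
    """
    chosen = weights.keys() if snps is None else snps
    return sum(weights[s] * dosages.get(s, 0) for s in chosen if s in weights)

# Hypothetical inputs for illustration only.
weights = {"snpA": 0.20, "snpB": -0.10, "snpC": 0.05, "snpD": 0.30}
pathway_snps = {"pathway_1": {"snpA", "snpC"}, "pathway_2": {"snpD"}}
person = {"snpA": 2, "snpB": 1, "snpC": 0, "snpD": 1}

# Standard score: every weighted SNP contributes.
standard_prs = polygenic_score(person, weights)

# Strategy 1: pathways guide SNP selection, then a single PRS is built.
selected = set().union(*pathway_snps.values())
pathway_guided_prs = polygenic_score(person, weights, selected)

# Strategy 2: one PRS per pathway, to be combined later by an ML model.
per_pathway_prs = {
    name: polygenic_score(person, weights, snps)
    for name, snps in pathway_snps.items()
}
```

The per-pathway scores are what make pathway-level interpretation possible, since each pathway keeps its own feature rather than being folded into one number.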

8

Optical coherence tomography as a biomarker for frontotemporal dementia: a systematic review & meta-analysis

Wang, E.; Kohli, A.; Taha, H. B.

2026-05-27 neurology 10.64898/2026.05.19.26353366 medRxiv

Top 2%

0.2%

Background: Frontotemporal dementia (FTD) lacks widely accessible disease-specific biomarkers. Optical coherence tomography (OCT) and OCT angiography (OCTA) may provide non-invasive measures of retinal changes associated with neurodegeneration. We conducted a systematic review and meta-analysis evaluating retinal biomarkers in FTD compared with Alzheimer disease (AD) and controls. Methods: A systematic search of PubMed and Embase was conducted through April 25, 2026 according to PRISMA guidelines. Studies evaluating OCT/OCTA biomarkers in FTD with comparator groups were included. Inverse weighted random-effects models, publication bias assessments, and meta-regressions were performed. Results: Ten studies involving 139 individuals with FTD, 87 with AD, 29 with mild cognitive impairment, 14 with TDP-43 proteinopathy, 5 with tauopathy, and 255 controls were included in the systematic review; five studies were eligible for meta-analysis. Compared with AD, individuals with FTD demonstrated significantly thinner retinal nerve fiber layer (RNFL) thickness (SMD = -0.61, 95% CI -0.98, -0.24). Compared with controls, individuals with FTD exhibited significantly thinner ganglion cell layer-inner plexiform layer (GCL-IPL) thickness (SMD = -0.55, 95% CI -1.02, -0.08), whereas pooled analyses across multiple retinal biomarkers were non-significant (SMD = -0.19, 95% CI -0.52, 0.14). RNFL thickness correlated negatively with the percentage of female participants in FTD and positively with age in both AD and controls. Conclusions: Individuals with FTD exhibit lower RNFL thickness than AD and lower GCL-IPL thickness than controls, suggesting retinal alterations may reflect neurodegeneration. However, larger longitudinal studies with standardized OCT/OCTA protocols are needed to determine the diagnostic and prognostic utility of retinal biomarkers in FTD.
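
The abstract reports pooled standardized mean differences from inverse weighted random-effects models without naming the between-study variance estimator. The sketch below assumes the widely used DerSimonian-Laird estimator purely for illustration; the per-study effects and variances are hypothetical and are not the values from the included studies.

```python
from math import sqrt

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate with a 95% confidence interval.

    Returns (pooled, ci_low, ci_high, tau2), where tau2 is the
    estimated between-study variance.
    """
    w = [1.0 / v for v in variances]  # fixed-effect inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, tau2

# Hypothetical per-study SMDs and their variances (illustration only).
toy_effects = [-0.8, -0.5, -0.3]
toy_variances = [0.04, 0.09, 0.06]
pooled, ci_low, ci_high, tau2 = dersimonian_laird(toy_effects, toy_variances)
```

A pooled interval that excludes zero corresponds to the "significant" findings in the abstract, while an interval spanning zero corresponds to the non-significant pooled analysis.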

9

Preliminary Reliability and Validity of SynapTrack, a Smartphone-Based Digital Biomarker Platform for Remote Assessment of Cervical Spondylotic Myelopathy

Yakdan, S.; Singh, P.; Arkam, F.; Chen, E.; Lewis, A.; Steel, B.; Becker, I.; Guo, W.; Naveed, H.; Wang, C.; Yang, D.; Wang, Z.; Ray, W. Z.; Hassenstab, J.; Steinmetz, M. P.; Ghogawala, Z.; Kelleher, C.; Greenberg, J.

2026-06-01 surgery 10.64898/2026.05.29.26354454 medRxiv

Top 2%

0.1%

Background and Objectives: Cervical spondylotic myelopathy (CSM) is a leading cause of neurological disability in older adults. However, validated, scalable tools to quantify disease severity and changes over time are lacking. Recent advances in smartphone technology have opened new avenues for longitudinal, objective, and remote monitoring of neurological conditions. We performed a preliminary evaluation of the reliability and validity of SynapTrack, a smartphone-based digital platform for objective remote CSM assessments. Methods: In this single-center prospective cohort study, 265 participants (151 with CSM, 114 healthy controls) completed in-person SynapTrack assessments related to tapping, pinching, and vibratory detection, along with reference laboratory measures of dexterity (Box and Block Test, 9-Hole Peg Test) and vibratory sensation (tuning fork). A subset completed repeated home-based testing to assess test-retest reliability. We evaluated convergent validity, construct validity against the modified Japanese Orthopedic Association (mJOA) score, known-groups validity, and test-retest reliability (intraclass correlation coefficient, ICC). Results: Smartphone-derived metrics demonstrated good-to-excellent test-retest reliability, with the strongest stability for vibratory detection threshold (ICC = 0.92), overall and non-dominant tapping speed (ICC = 0.90 each), and pinching successful targets (ICC = 0.90). Convergent validity was supported by moderate-to-strong correlations between digital metrics and reference laboratory dexterity tests (ρ up to 0.60 for tapping speed; up to -0.65 for the vibratory threshold). Construct validity against the mJOA was strongest for the vibratory threshold (ρ = -0.53 to -0.54) and Level 2 non-dominant pinching errors (ρ = -0.45).
Selected metrics distinguished CSM patients from controls with good discrimination, including non-dominant tapping speed (AUROC = 0.76, 95% CI 0.68-0.85), Level 2 dominant pinching successful targets (AUROC = 0.78, 95% CI 0.62-0.94), and the non-dominant vibratory threshold (AUROC = 0.77, 95% CI 0.64-0.90). Conclusions and Relevance: A smartphone-based battery of upper-extremity sensorimotor tasks demonstrated preliminary reliability and validity in CSM. Furthermore, to our knowledge, the novel vibratory detection task represents the first smartphone-based sensory assessment used for CSM. Collectively, these findings position SynapTrack as a scalable platform for objective, remote neurological monitoring of CSM.

10

Using artificial intelligence for radiotherapy clinical trial quality assurance: analysis of a multi-institutional clinical trial for neurovascular-sparing prostate stereotactic ablative radiotherapy

Doucette, M.; Zhang, Y.; Liao, C.-Y.; Lin, M.-H.; Yan, Y.; Dess, R. T.; Tendulkar, R. D.; Garant, A.; Hannan, R.; Jiang, S.; Nguyen, D.; Desai, N.; Yang, D. X.

2026-05-29 health informatics 10.64898/2026.05.27.26354252 medRxiv

Top 2%

0.1%

Our study evaluated whether a deep learning auto segmentation model combined with machine learning triage can streamline radiotherapy clinical trial quality assurance (QA). We analyzed 107 stereotactic ablative radiotherapy (SABR) cases from a multi-institutional phase II clinical trial of neurovascular sparing prostate SABR, focusing on physician contours of the internal pudendal artery (IPA) as a novel organ-at-risk with substantial interobserver variability. Contours were scored by the trial principal investigator as Per-Protocol or Minor Deviation/Unacceptable. We applied a deep learning model for IPA auto-segmentation. Agreement between human and AI contours was then quantified using 14 overlap, distance, and surface metrics, and a supervised classifier was trained on these metrics to flag clinical trial protocol deviations. While AI segmentation achieved only modest geometric accuracy with mean Dice similarity coefficient of 0.446 and 95th percentile Hausdorff distance of 14.23, when incorporating all 14 metrics, a machine learning classifier yielded AUROC of 0.836, flagging all Minor Deviation/Unacceptable cases with 100% sensitivity on the 27 case hold-out set with 6 false positives and no false negatives. AI segmentation combined with metrics-based machine learning can triage protocol deviations within a multi-institution radiotherapy clinical trial, supporting prospective evaluation of AI-assisted trial QA.
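
Two of the quantities reported above have simple standard definitions, sketched here under stated assumptions. The Dice similarity coefficient is twice the overlap divided by the total size of the two contours, and sensitivity is the fraction of true deviations that are flagged. The voxel sets and the deviation count are hypothetical; the abstract does not give the number of true deviations in the hold-out set.

```python
def dice(a, b):
    """Dice similarity coefficient between two collections of voxels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty contours are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))

def sensitivity(true_positives, false_negatives):
    """Fraction of actual deviations that were flagged."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical contours: 8 voxels each, 4 of them shared.
physician_contour = {(10, 4, z) for z in range(8)}
model_contour = {(10, 4, z) for z in range(4, 12)}
overlap_score = dice(physician_contour, model_contour)

# Zero false negatives gives 100% sensitivity for any number of
# true deviations; the count of 5 is an illustrative placeholder.
flag_sensitivity = sensitivity(true_positives=5, false_negatives=0)
```

This also shows why a modest Dice score does not prevent useful triage: the classifier is judged on whether deviations are flagged, not on how closely the AI contour matches the physician's.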

11

Towards A Foundation Model for Clinical Voice Biomarkers

Elemento, O.; Sigaras, A.; Colonel, J.; Hajirasouliha, I.; Ghosh, S.; Bensoussan, Y.; Bridge2AI-Voice Consortium; Rameau, A.

2026-05-30 health informatics 10.64898/2026.05.28.26354346 medRxiv

Top 3%

0.1%

Vocal biomarkers, encompassing voice and speech, have largely been developed for individual conditions in isolation, limiting their generalizability across diseases and recording settings. To address this, we introduce VoiceFM, a contrastive model that learns general-purpose clinical voice representations by aligning audio embeddings with rich clinical metadata. Using the Bridge2AI-Voice dataset (984 primarily English-speaking adult participants, 846 used for training and 138 held out as a temporally separated validation cohort, 40,056 recordings totaling 176 hours across 5 academic medical centers), VoiceFM pairs a fine-tuned Whisper large-v2 encoder with a tabular transformer over 44 clinical features via symmetric InfoNCE loss. Linear probes on frozen VoiceFM embeddings achieve mean AUROC 0.952 +/- 0.005 across five evaluation tasks (control vs disease screening plus four disease categories), significantly outperforming Frozen Whisper (0.926 +/- 0.013, p = 0.013), Frozen HuBERT (0.885 +/- 0.017, p = 0.0009), and the contrastively trained VoiceFM-HuBERT (0.938 +/- 0.006, p = 0.012). On the 138-participant held-out cohort, VoiceFM-Whisper achieves AUROCs of 0.99 for Alzheimer's/dementia/MCI and 0.89 for airway stenosis, demonstrating that the learned representations generalize to participants the model has never seen. VoiceFM representations transfer to three external datasets without retraining and improve few-shot classification. Recording task attribution identifies a small set of speech tasks that match or exceed the full battery's performance, suggesting shorter screening protocols are feasible. Trained predominantly on English audio, VoiceFM transfers without fine-tuning to Spanish-language Parkinson's disease (PD) detection (NeuroVoz, 107 participants, AUROC 0.93 +/- 0.02), with the signal dominated by articulatory rather than phonatory features. 
A fine-tuned classifier achieves participant-level AUROC 0.87 (sustained 0.85, countdown 0.80) on the mPower smartphone study (585 held-out participants). Together, these results show that contrastive alignment between voice and rich clinical metadata can serve as the basis for a clinical voice foundation model, producing a single set of transferable representations that generalize across diseases, languages, recording conditions, and patients enrolled after model freeze.
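
The symmetric InfoNCE objective mentioned in this abstract can be sketched in plain Python. It assumes cosine similarity between L2-normalized embeddings and a temperature of 0.1, neither of which is stated in the abstract, and the two-dimensional embeddings are toy values. The loss averages two cross-entropy terms, audio-to-clinical and clinical-to-audio, with matched pairs on the diagonal of the similarity matrix.

```python
from math import exp, log, sqrt

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + log(sum(exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(audio, clinical, temperature=0.1):
    """Average of audio-to-clinical and clinical-to-audio InfoNCE terms."""
    a = [normalize(v) for v in audio]
    c = [normalize(v) for v in clinical]
    logits = [
        [sum(x * y for x, y in zip(ai, cj)) / temperature for cj in c]
        for ai in a
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy embeddings (hypothetical): pair i of audio matches pair i of metadata.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(audio_emb, [[0.9, 0.1], [0.1, 0.9]])
shuffled_loss = symmetric_info_nce(audio_emb, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each recording sits closest to its own participant's metadata and large when the pairs are swapped, which is the pressure that aligns the two embedding spaces.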

12

Bridging Acoustic and Semantic Spaces for Interpretable Voice Scoring via Zero-Shot Semantic Expansion

Hsiao, C.; Cheng, Y.-R.; Yang, C.-Y.; Hsu, F.-S.

2026-06-01 health informatics 10.64898/2026.05.29.26354442 medRxiv

Top 3%

0.1%

Subjective auditory-perceptual evaluation and uninterpretable deep learning models limit the clinical assessment of voice disorders. This study proposes a two-phase zero-shot framework to evaluate voice pathology. First, an Audio Spectrogram Transformer is fine-tuned on the Perceptual Voice Quality Database to generate an acoustic latent space. Second, Orthogonal Procrustes analysis maps these acoustic embeddings directly onto the semantic space of a pre-trained Sentence Transformer. The geometric alignment produced continuous semantic axes that outperformed a supervised machine learning baseline in regressing clinician-rated GRBAS (Grade, Roughness, Breathiness, Asthenia, and Strain) severity scales. Furthermore, these axes correlate with traditional acoustic measures, including Harmonics-to-Noise Ratio and local jitter, while remaining robust when applied to aperiodic signals by not requiring fundamental frequency extraction. Most importantly, the model achieved zero-shot semantic expansion, successfully evaluating voices using an untrained, natural clinical vocabulary beyond the GRBAS scale. External validation on the Voice ICarus Database confirmed cross-corpus stability and demonstrated the capacity for zero-shot differential phenotyping of specific etiologies, such as hypokinetic dysphonia and reflux laryngitis. By bridging acoustic and semantic latent spaces, this framework offers an objective, continuous, and transparent metric for evaluating voice quality using voice descriptive vocabulary.

13

PFAS exposure and neuroimmune and Alzheimer's Disease-related plasma biomarkers in a rural, cognitively unimpaired population: a pilot study

Souza-Talarico, J. N.; Lehmler, H.-J.; Li, X.; Hefti, M.; Fu, Y.; Harb, A.; Hein, M.; Ding, L.; Perkhounkova, Y.

2026-06-01 neurology 10.64898/2026.05.23.26353843 medRxiv

Top 3%

0.0%

INTRODUCTION: Alzheimer's disease (AD) is a multifactorial disorder, yet current research largely focuses on downstream biomarkers with limited attention to environmental contributors. Experimental studies suggest that per- and polyfluoroalkyl substances (PFAS) may contribute to neuroimmune and neurodegenerative pathways relevant to AD. OBJECTIVE: To examine associations between PFAS exposure and neuroimmune and AD-related plasma biomarkers in cognitively unimpaired rural adults. METHODS: In a cross-sectional pilot study (n=48), serum concentrations of 33 PFAS were measured, including four legacy compounds (PFOS, PFHxS, PFOA, PFNA). Plasma neuroimmune-related (ITGB2, SMOC1, TREM2, GFAP) and AD-related biomarkers (Aβ42/40, p-tau217) were detected using proteomic analysis. RESULTS: PFOS showed moderate associations with ITGB2, SMOC1, and Aβ42/40 in unadjusted analyses, which attenuated after adjustment for age. PFOA and PFNA demonstrated consistent inverse associations with TREM2 before and after adjustment. DISCUSSION: Findings suggest possible compound-specific PFAS associations with immune and amyloid-related biomarkers, supporting further investigation in longitudinal and PFAS mixture-based studies.

14

The dangers of data double dipping in assessing the classification accuracies of blood biomarkers in Alzheimer's disease and related disorder research

Liu, T.; Zeng, X.; Snitz, B. E.; Karikari, T. K.; Deek, R. A.

2026-06-01 neurology 10.64898/2026.05.22.26353848 medRxiv

Top 3%

0.0%

Blood biomarker models are increasingly used in Alzheimer's disease and related dementia translational research, but predictive performance can be inflated when the same dataset is used for both model development and evaluation. We assess the effect of data double dipping using simulations and NULISA proteomic data from the MYHAT-NI community-based cohort to predict brain amyloid-beta neuroimaging status. In both settings, training AUC increased as more biomarkers were added, while testing AUC peaked earlier and then declined. These findings show that data double dipping can inflate model performance and highlight the need for external validation or internal validation with data partitioning.

15

SeGA-GNN: Semantically Gated Augmented Graph Neural Networks for Wearable-Based Emotion Detection

Kurt, F.; Subasi, S. N.; Yakisan, E. S.; Subasi, A.

2026-06-01 health informatics 10.64898/2026.05.29.26354434 medRxiv

Top 3%

0.0%

Background: Wearable technologies enable scalable and continuous monitoring of emotional states through passive sensing of physiological and behavioral signals. However, conventional learning approaches often struggle to model the complex temporal, contextual, and relational dependencies underlying human emotions. To address these limitations, we propose a graph-based framework that represents multimodal wearable observations as heterogeneous knowledge graphs enriched with semantic information derived from Large Language Models (LLMs), enabling richer contextual understanding beyond raw sensor measurements. Methods: We constructed a heterogeneous knowledge graph using multimodal Fitbit physiological signals and affective self-report data collected from 45 users. Mood prediction and emotion detection were formulated as both binary and ternary node classification tasks. We evaluated five baseline heterogeneous Graph Neural Network (GNN) architectures and compared them with the proposed Semantically Gated Augmented Graph Neural Network (SeGA-GNN) framework, which dynamically integrates LLM-generated semantic embeddings into graph representations through a gated cross-modal fusion mechanism. Results: The baseline GNN models achieved strong performance, with classification accuracies ranging from 0.7525 to 0.9739 for binary classification and 0.6249 to 0.9699 for ternary classification. The proposed SeGA framework consistently improved predictive performance across most architectures. In particular, semantic augmentation transformed the HAN model from moderate baseline performance into near-perfect emotion recognition capability, achieving SeGA-HAN Accuracy = 0.9988 and AUC = 1.0000 for binary classification and Accuracy = 0.9979 and AUC = 1.0000 for ternary classification.
Discussion and Conclusion: Integrating LLM-derived semantic contextualization into heterogeneous graph learning enables effective modeling of contextual information that is not directly captured by wearable physiological signals alone. The proposed SeGA-GNN framework demonstrates that adaptive semantic fusion substantially improves the accuracy, robustness, and interpretability of wearable-based emotion detection. These findings establish a promising direction for next-generation wearable affective computing systems and intelligent emotion-aware applications.

16

Quantifying the Optimism of Naive Cross-Validation for Binary Outcome Prediction with Repeated-Measures Predictors: A Simulation Study and Clinical Illustration

Hagan, J.

2026-05-29 epidemiology 10.64898/2026.05.27.26354222 medRxiv

Top 3%

0.0%

Background. Cross-validation (CV) is widely used to estimate predictive performance, but can overestimate performance when applied at the observation level to repeated-measures data. When continuous predictor variables are measured repeatedly within subjects and the binary outcome is defined at the subject level, naive observation-level CV introduces data leakage through within-subject dependence, producing optimistically biased estimates of the area under the receiver operating characteristic curve (AUROC). The magnitude of this bias and the performance of alternative partitioning strategies have not been formally characterized for this data structure. Methods. Three CV strategies were compared for estimating subject-level AUROC in ridge logistic regression models: naive observation-level 10-fold CV, subject-level 10-fold CV, and leave-one-cluster-out (LOCO) CV. The framework was applied to a motivating clinical dataset of daily oxygenation measures and retinopathy of prematurity outcomes among 101 extremely low birth weight infants. A factorial simulation study was conducted across 162 parameter combinations varying cluster count (20-150), intraclass correlation (0.1-0.5), within-cluster autocorrelation (0.2-0.8), and outcome prevalence (10-35%), with 500 simulated datasets per condition (76,389 valid datasets total). Results. In the motivating dataset, naive CV produced optimism of +0.078 AUROC units for severe ROP prediction (15 events, 101 subjects) and +0.031 for any ROP prediction (48 events). Subject-level 10-fold CV closely approximated LOCO (deviation ≤ 0.015). In the simulation, naive CV optimism ranged from +0.039 to +0.204 across all conditions, increasing monotonically with higher ICC, higher autocorrelation, fewer clusters, and lower event rates. Subject-level 10-fold CV was essentially unbiased relative to LOCO across all 162 conditions (mean absolute deviation = 0.002). Conclusions.
Naive observation-level CV meaningfully overestimates discriminative performance in the repeated-measures binary outcome setting and should not be used. Subject-level CV partitioning effectively eliminates this bias. Accordingly, subject-level partitioning should be considered essential, not optional, when validating prediction models using repeated-measures data with subject-level outcomes.
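
Illustrative sketch (not from the preprint): the difference between observation-level and subject-level partitioning comes down to whether all rows from one subject are forced into the same fold. The function name `subject_level_folds` and its parameters are hypothetical, and the authors' actual implementation is not described in the abstract. Using only the Python standard library, one way to build subject-level folds might look like this:

```python
import random
from collections import defaultdict


def subject_level_folds(subject_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) lists where each subject's rows stay together.

    subject_ids: one entry per observation (row), naming the subject it belongs to.
    Hypothetical helper for illustration; not the preprint's code.
    """
    # Group row indices by subject.
    rows_by_subject = defaultdict(list)
    for row, subject in enumerate(subject_ids):
        rows_by_subject[subject].append(row)

    # Shuffle subjects (not rows), then deal them round-robin into folds.
    subjects = sorted(rows_by_subject)
    random.Random(seed).shuffle(subjects)

    for k in range(n_folds):
        held_out = subjects[k::n_folds]
        held_out_set = set(held_out)
        test_idx = [r for s in held_out for r in rows_by_subject[s]]
        train_idx = [
            r for s in subjects if s not in held_out_set for r in rows_by_subject[s]
        ]
        yield train_idx, test_idx


# Example: 25 subjects with 4 repeated measurements each.
ids = [s for s in range(25) for _ in range(4)]
for train_idx, test_idx in subject_level_folds(ids, n_folds=10):
    train_subjects = {ids[i] for i in train_idx}
    test_subjects = {ids[i] for i in test_idx}
    assert not (train_subjects & test_subjects)  # no subject on both sides
```

Setting `n_folds` equal to the number of subjects holds out exactly one subject per fold, which corresponds to the leave-one-cluster-out scheme. A naive observation-level split would instead shuffle the row indices directly, letting measurements from one subject land in both training and test sets.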

17

Deriving OCT-Equivalent Retinal Nerve Fiber Layer Thickness Maps from Fundus Photographs with Deep Learning Improves Glaucoma Diagnosis

Shi, L.; Shi, M.; Chung, I. Y.; Pasquale, L. R.; Shen, L. Q.; Wang, M.

2026-05-27 ophthalmology 10.64898/2026.05.26.26354047 medRxiv

Top 3%

0.0%

Purpose: To develop and evaluate a deep learning model that predicts optical coherence tomography (OCT)-equivalent retinal nerve fiber layer thickness (RNFLT) maps directly from color fundus photographs and to assess their diagnostic value for glaucoma detection. Design: Retrospective model development and evaluation study. Participants: 15,031 paired fundus photographs and spectral-domain OCT scans collected at Massachusetts Eye and Ear between 2011 and 2022. Methods: Paired fundus and OCT images were used to train a U-Net-based model to predict pixel-wise RNFLT maps with artifact-corrected supervision. Diagnostic performance was evaluated across single-modality models (fundus photos only, real RNFLT maps, predicted RNFLT maps) and multimodal fusion models (fundus + predicted RNFLT maps). Stratified analyses examined model performance across glaucoma severity and demographic subgroups. Glaucoma was defined based on standard criteria applied to Humphrey 24-2 visual field testing. Main Outcome Measures: Mean absolute error (MAE) and structural similarity index (SSIM) for RNFLT map prediction. Area under the ROC curve (AUC) and accuracy for glaucoma detection. Results: RNFLT map prediction achieved an MAE of 15.4 µm and an SSIM of 0.65, measured against artifact-corrected RNFLT maps derived from OCT. For glaucoma detection, the predicted RNFLT-only classifier outperformed the fundus-only classifier (AUC 0.889 vs 0.883, p < 0.005; Accuracy 82.0% vs 78.0%), but performed worse than the real-RNFLT-only classifier (AUC 0.889 vs 0.903, p < 0.005). Multimodal fusion of fundus images with predicted RNFLT maps improved performance, achieving an AUC of 0.909, outperforming all single-modality inputs (p < 0.005 vs fundus-only, predicted-RNFLT-only, and real-RNFLT-only). Performance gains between the fundus-only and the multimodal classifier were greater in early-stage glaucoma compared to severe cases: accuracy increased from 55.3% to 64.0% in mild cases, from 71.5% to 80.4% in moderate cases, and from 90.0% to 94.6% in severe cases. Conclusions: Predicted RNFLT maps derived from fundus photographs provide quantitative, OCT-like structural information and improve glaucoma detection. Unlike prior work that predicted only summary RNFLT values, our model generates full RNFLT maps that better support glaucoma classification than fundus images alone. This approach offers a scalable pathway for early glaucoma screening and expands diagnostic access in resource-limited settings.

18

Developing and Evaluating Deep Learning Approaches for Visual Field Denoising in Glaucoma

Baek, J. S.; Lokhande, A.; Neuenschwander, D.; Shi, M.; Wang, M.

2026-06-01 ophthalmology 10.64898/2026.05.29.26354019 medRxiv

Top 3%

0.0%

Purpose: To investigate the relative efficacy of nine distinct visual field (VF) denoising artificial intelligence (AI) methods and a pathology-aware AI strategy to discourage over-correction of glaucomatous defects. Design: Retrospective study. Participants: 87,940 paired visual field (VF) and optical coherence tomography (OCT) samples from a tertiary academic center. Methods: Denoising models were trained on a separate VF-only dataset and evaluated on an independent structure-function dataset of paired VF-OCT samples. We implemented and evaluated nine distinct VF denoising strategies representing three broad categories: baseline measurements, self-supervised and image restoration models (including Noise2Noise, Noise2Void, and NAFNet), and latent variable compression-based models (autoencoders and variational autoencoders). All models were designed to reconstruct VF sensitivity maps. We then predicted retinal nerve fiber layer thickness (RNFLT) maps from the denoised VFs using a fixed, independently trained VF-to-RNFLT prediction model. Main Outcome Measures: Predicted VF and RNFLT maps and resultant evaluation metrics. Results: The raw VF baseline achieved a global R² of 0.5468 and MAE of 16.83 µm. Restoration-based models maintained or slightly improved concordance, with the pathology-aware NAFNet achieving the highest global R² of 0.5485 and a comparable MAE of 16.82 µm. In contrast, compression-based models degraded concordance, with CNN-VAE showing a significant reduction (R² approximately 0.50). In severe glaucoma, concordance decreased across all methods; however, compression architectures exhibited disproportionately greater degradation compared with restoration-based approaches. Conclusions: We present a comparative benchmark of AI-based VF denoising strategies paired with structure-function evaluation. While restoration-based models can reduce variability without loss of biological signal, latent compression risks attenuating clinically meaningful defects. Visually smoother fields are not necessarily more biologically accurate.
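
Illustrative sketch (not from the preprint): the abstract reports concordance as a global R² and an MAE but does not define how they were pooled. Assuming the conventional coefficient of determination and mean absolute error computed over all paired values, the two metrics could be calculated as below. The function names and the toy thickness values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values standing in for measured vs. predicted thickness (µm).
measured = [80.0, 90.0, 100.0, 110.0]
predicted = [82.0, 88.0, 101.0, 109.0]
print(mean_absolute_error(measured, predicted))  # 1.5
print(r_squared(measured, predicted))            # 0.98
```

Under these definitions a denoiser that smooths away real defects can lower R² even while producing cleaner-looking fields, because the downstream predictions drift from the measured structure.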

19

Noninvasive Hypokalemia Detection from Single-Lead AI-ECG: Development, Multicenter Validation, and Prospective Pilot Study in the Emergency Department

Tang, G.; Li, X.; Xiao, Y.; Wang, K.; Wu, M.; Wei, Z.; Yu, M.; Chen, X.; Hong, W.; Cheng, F.; Li, X.; Zhang, J.; Wu, X.; Hong, S.

2026-06-01 health informatics 10.64898/2026.05.23.26353774 medRxiv

Top 4%

0.0%

Hypokalemia is a common and potentially life-threatening electrolyte abnormality in emergency care, yet rapid noninvasive screening remains difficult in time-critical triage settings. We developed PocketED-K, a single-lead AI-ECG prescreening model initialized from ECGFounder, and evaluated it in retrospective multicenter cohorts and a prospective handheld pilot. Retrospective development and validation included 37,115 patients from MC-MED and MIMIC-ED, and the pilot enrolled 18 patients at Peking University First Hospital. Hypokalemia was defined as venous serum potassium < 3.5 mmol/L. PocketED-K achieved AUROCs of 0.8189 (95% CI 0.8172–0.8207) in internal testing, 0.8104 (95% CI 0.8092–0.8115) in temporal validation, and 0.7889 (95% CI 0.7692–0.8074) in independent external validation; external negative predictive value was 0.9911 (95% CI 0.9895–0.9925). Higher predicted risk was associated with ST-segment depression, T-wave flattening or inversion, and relative U-wave prominence. The prospective handheld pilot provided an initial signal of workflow feasibility in real-world acquisition. These findings support single-lead AI-ECG as a low-burden prescreening tool to prioritize potassium testing in emergency care.

20

Frontier Large Language Models for Comprehensive Medication Review in CKD Patients with Polypharmacy: A Trap-Embedded Synthetic Benchmark

Chuang, K.-C.; Lin, H.-J.; Lin, H.-M.

2026-05-26 health informatics 10.64898/2026.05.23.26353939 medRxiv

Top 4%

0.0%

Background: Patients with CKD and polypharmacy face high rates of drug-related problems, yet comprehensive medication review remains time-intensive and inconsistently performed. Large language models (LLMs) may augment this process, but existing benchmarks use multiple-choice formats that do not reflect open-ended, nephrology-specific review. We developed a trap-embedded synthetic CKD benchmark and evaluated five current-generation LLMs (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.1 Fast, DeepSeek R1; tested April-May 2026) for open-ended medication review. Methods: Fifty synthetic CKD cases across three complexity groups (G3a-G3b [n=20], G4 [n=15], G5/G5D/transplant [n=15]) with 8-12 medications and ≥2 embedded clinical traps each were scored against nephrologist-adjudicated gold standards. Each model produced three independent responses per case (temperature 0; 750 total outputs). Primary endpoint was per-case macro F1; secondary endpoints were safety-critical omission rate, PI-adjudicated hallucination rate, and intra-model consistency. Blinded inter-rater reliability for gold-standard item detection was assessed on a 30% sample. Results: Consensus-level macro F1 ranged from 0.41 (Claude Sonnet 4.6) to 0.49 (Grok 4.1 Fast) (Friedman P < 0.001). Phosphate binder timing (11%) and hyperkalemia combinations (33%) were poorly detected across all models. Safety-critical omission rate ranged from 22% to 48% (P < 0.001); PI-adjudicated hallucination ranged from 0% (GPT-5.4) to 54% (DeepSeek R1), including fabricated dose caps and non-existent guideline citations. Blinded reliability for gold-standard item detection was high (kappa = 0.934, n = 92). Conclusions: This nephrology-specific benchmark exposes clinically important LLM blind spots that generic multiple-choice evaluations would not detect. Heterogeneous hallucination and omission rates indicate that model selection and domain-specific guardrails should precede any clinical deployment of LLM-assisted CKD medication review. Prospective validation with real patient data and human comparators is required before deployment recommendations can be made.
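
Illustrative sketch (not from the preprint): the abstract names per-case macro F1 as the primary endpoint without spelling out the scoring rule. One possible reading, assumed here, is that each case has a set of gold-standard items, each model response yields a set of flagged items, an F1 score is computed per case, and the scores are averaged over cases. The function names and item labels below are hypothetical.

```python
def case_f1(gold_items, flagged_items):
    """F1 between the gold-standard items and the items a model flagged for one case."""
    gold, flagged = set(gold_items), set(flagged_items)
    true_positives = len(gold & flagged)
    if true_positives == 0:
        return 0.0  # also covers empty gold or empty flagged sets
    precision = true_positives / len(flagged)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_case_f1(cases):
    """Average the per-case F1 over (gold_items, flagged_items) pairs."""
    scores = [case_f1(gold, flagged) for gold, flagged in cases]
    return sum(scores) / len(scores)


# Hypothetical item labels for two synthetic cases.
cases = [
    ({"dose_adjustment", "hyperkalemia_combo"}, {"dose_adjustment", "spurious_flag"}),
    ({"binder_timing"}, {"binder_timing"}),
]
print(mean_case_f1(cases))  # 0.75
```

Averaging per case gives every case equal weight regardless of how many items it contains, which differs from pooling all items into a single micro-averaged score.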