Model Development and Real-World Deployment of Multimodal Input-Based Subtyping of Depression in Tele-Counseling for Scalable Mental Health Assessment

Francis, A. J. A.; Raza, A.; Patel, N.; Gajbhiye, R.; Kumar, V.; T, A.; Saikia, A.; Mibang, O.; K, V.; Joshi, K.; Tony, L.; Balasubramani, P. P.

2026-02-18 · psychiatry and clinical psychology
medRxiv preprint · doi:10.64898/2026.02.11.25342657
The rapid growth of tele-counseling and the use of lay counselors in high-volume, low-resource mental health services has created a need for scalable tools for early detection and triage. Effective personalization now requires stratifying individuals by dominant symptom profiles, such as appetite, agency, anxiety, and sleep disturbances. Depression symptoms vary widely, even among those with similar scores, reflecting distinct psychophysiological and cognitive-affective patterns. In tele-mental-health settings, where contextual cues are limited, multimodal behavioral signals from natural interactions can complement traditional assessments. Using synchronized audio, video, and text data from the EDAIC dataset (N=275), we propose a multimodal learning framework to classify five clinically validated outcomes: Depression, Appetite disturbance, Agency impairment, Anxiety, and Sleep problems. We developed a comprehensive multimodal machine-learning pipeline incorporating automated dataset construction, modality-specific feature extraction (acoustic, facial action unit, linguistic), and supervised learning with cross-validation. Labels were derived from validated scoring rules to ensure clinical relevance. Sentiment analysis revealed lower sentiment scores in participants with high Depression, Anxiety, or Agency scores, but no significant differences by Appetite or Sleep severity. Model performance was assessed across three scenarios: text (transcripts), phone calls (audio + transcript), and video calls (audio + video + transcript). Temporal models (CNN+BiLSTM) achieved over 65% accuracy across modalities, while a fine-tuned temporal model for depression detection from video calls reached 81% accuracy with an F1-score of 0.79, on par with state-of-the-art methods. XGBoost performed best in the phone- and video-call scenarios, while Ridge classifiers performed best for text-only inputs.
SHAP (Shapley additive explanations) analysis identified the audio and video features most informative for detecting Depression and the other symptoms. A translational avatar-based interface validated system operability, demonstrating the potential for scalable, objective mental-health assessment in tele-counseling.
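The label-derivation step ("labels were derived from validated scoring rules") can be sketched as follows. This is a hypothetical illustration assuming PHQ-8 item scores with the conventional cutoff of 10 for probable depression; the paper's exact per-symptom rules are not spelled out in the abstract, the item-to-symptom mapping here is an assumption, and Anxiety (not a PHQ-8 item) is omitted.

```python
# Hypothetical sketch: binary symptom labels from PHQ-8 item scores.
# Item indices, cutoffs, and the agency proxy are illustrative assumptions,
# not the paper's exact scoring rules.

PHQ8_CUTOFF = 10  # conventional PHQ-8 threshold for probable depression

def derive_labels(items):
    """items: list of 8 integers (0-3), one per PHQ-8 item."""
    assert len(items) == 8 and all(0 <= v <= 3 for v in items)
    total = sum(items)
    return {
        "depression": int(total >= PHQ8_CUTOFF),
        "sleep": int(items[2] >= 2),     # item 3: sleep disturbance
        "appetite": int(items[4] >= 2),  # item 5: appetite change
        "agency": int(items[5] >= 2),    # item 6: self-worth (assumed proxy)
    }
```

Thresholding individual items at 2 ("more than half the days") mirrors common symptom-level dichotomizations, but any deployment would need the study's validated rules.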
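The three evaluation scenarios (text, phone call, video call) amount to assembling different modality subsets per participant, which a simple late-fusion sketch makes concrete. Function name, argument names, and feature dimensions are invented for illustration; the paper's actual fusion strategy is not detailed in the abstract.

```python
# Hypothetical sketch of scenario-dependent feature assembly by concatenation.
# 'text' uses linguistic features only; 'phone' adds acoustic features;
# 'video' adds facial-action-unit features on top of both.

def assemble_features(scenario, text_feats, audio_feats=None, video_feats=None):
    if scenario == "text":
        return list(text_feats)
    if scenario == "phone":
        return list(audio_feats) + list(text_feats)
    if scenario == "video":
        return list(audio_feats) + list(video_feats) + list(text_feats)
    raise ValueError(f"unknown scenario: {scenario!r}")
```

The resulting vectors would feed the classical models (XGBoost, Ridge), while the temporal CNN+BiLSTM models would instead consume per-frame sequences of the same modality features.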
