Trustworthy Multimodal Depression Screening via Cross-Attention Fusion and Calibrated Uncertainty

Aakash Gupta; Umme Salma M. Pirzada

doi:10.36548/jtcsst.2026.3.004

Open Access

https://doi.org/10.36548/jtcsst.2026.3.004

Vol. 8, No. 3 (2026)

Published: 19 June, 2026

Pages: 494-516

Aakash Gupta

School of Engineering and Technology, Navrachana University, Vadodara, India

Umme Salma M. Pirzada

School of Engineering and Technology, Navrachana University, Vadodara, India

How to Cite

Gupta, Aakash, and Umme Salma M. Pirzada. 2026. “Trustworthy Multimodal Depression Screening via Cross-Attention Fusion and Calibrated Uncertainty”. Journal of Trends in Computer Science and Smart Technology 8 (3): 494-516. https://doi.org/10.36548/jtcsst.2026.3.004.

Keywords

Multimodal Depression Detection

Cross-Attention Fusion

Trustworthy AI

Speech and Language Processing

Uncertainty-Aware Deep Learning

Explainable AI

DAIC-WOZ

Mental Health Screening

Abstract

Conventional automated depression screening systems rely on speech- or text-based features, which makes them sensitive to background noise, errors in automatic speech recognition (ASR), and other modality-specific limitations. In addition, most current approaches provide no uncertainty estimates, no properly calibrated output probabilities, and no explanations for their predictions, which limits their use in clinical settings. In this work, we present an approach for building trustworthy depression screening systems that fuses acoustic and linguistic features through a novel cross-attention-based method. Acoustic features are extracted from raw audio recordings with the self-supervised wav2vec 2.0 and HuBERT models. Text is processed with the DistilBERT and RoBERTa language representation models. A multi-head cross-attention module allows the model to exploit interactions between linguistic content and acoustic features. Predictive uncertainty is estimated by incorporating Monte Carlo dropout into the model architecture, and temperature scaling is applied to calibrate the output probabilities. Token-level attributions explain predictions for the linguistic input, while attention weights over audio segments serve as explanations for the acoustic input. Experiments on clinical interviews from the DAIC-WOZ corpus show that our method significantly outperforms audio-only, text-only, and fusion baselines, reaching an accuracy of 0.82, a Macro-F1 of 0.80, a Weighted-F1 of 0.81, an AUROC of 0.87, and an expected calibration error (ECE) of 0.034. Our system also shows increased robustness to noisy audio, ASR-generated transcripts, and missing data.
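To make the calibration and uncertainty steps in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a binary classifier that outputs one logit per interview. The function names, the grid search over the temperature, the 10-bin ECE, and the toy logits are all illustrative assumptions; the abstract does not specify how the temperature is fitted or how ECE is binned.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def nll(logits, labels, temperature):
    # Mean binary negative log-likelihood of labels under logits / temperature.
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    # Assumption: a simple grid search on held-out data stands in for the
    # gradient-based optimizer that is often used to fit the temperature.
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(191)]  # 0.5 .. 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))

def expected_calibration_error(probs, labels, n_bins=10):
    # ECE: weighted average gap between confidence and accuracy per bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

def mc_dropout_summary(sample_probs):
    # Aggregate probabilities from several stochastic forward passes
    # (dropout kept active at inference): predictive mean and its entropy.
    eps = 1e-12
    mean = sum(sample_probs) / len(sample_probs)
    entropy = -(mean * math.log(mean + eps)
                + (1.0 - mean) * math.log(1.0 - mean + eps))
    return mean, entropy

if __name__ == "__main__":
    # Hypothetical, overconfident validation logits and labels.
    val_logits = [4.0, 3.5, -4.0, 3.0, -3.5, 4.5, -3.0, 2.5]
    val_labels = [1, 0, 0, 1, 1, 1, 0, 0]
    t = fit_temperature(val_logits, val_labels)
    before = expected_calibration_error(
        [sigmoid(z) for z in val_logits], val_labels)
    after = expected_calibration_error(
        [sigmoid(z / t) for z in val_logits], val_labels)
    print("temperature:", t, "ECE before:", before, "ECE after:", after)
    print("MC dropout summary:", mc_dropout_summary([0.62, 0.55, 0.71, 0.48]))
```

Temperature scaling divides every logit by a single positive scalar, so it never moves a logit across zero and the predicted class is unchanged; only the confidence is rescaled. In practice the temperature would be fitted on a held-out validation split and then applied, for example, to the mean logit or mean probability obtained from the Monte Carlo dropout passes.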
