Abstract
In medicine, identifying pathological conditions is challenging because it typically relies on invasive, contact-based data acquisition. Non-invasive, non-contact signals such as speech offer an alternative. Speech signals have distinguishing phonetic characteristics that change when a pathological condition occurs in the body, so machine learning and deep learning models trained on the acoustic features of speech can classify pathological signals. This work proposes an acoustic spectrogram transformer in which all transformer layers are trained on acoustic characteristics extracted from the speech signals of patients with voice and lung diseases. Mel-frequency cepstral coefficients (MFCCs), Mel spectrograms, and spectral variables such as centroid, bandwidth, roll-off, and zero-crossing rate are extracted as features from the voice and lung datasets. These acoustic features train the transformer blocks and depth-adaptive parameters, enabling the model to capture complex patterns for effective signal classification. The model also includes frequency-focused attention mechanisms that extract the spectral characteristics most indicative of pathological conditions, and multiple pooling strategies that aggregate temporal information. With this targeted design, the system serves as an effective clinical tool for classification, minimizing computational complexity while achieving an accuracy of about 83% in voice pathology classification and 99% in lung pathology classification.
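As an illustration of the spectral variables named above, the following is a minimal sketch of how centroid, bandwidth, roll-off, and zero-crossing rate could be computed for a single audio frame. It is not the authors' implementation, which the abstract does not describe. The function names, the 0.85 roll-off fraction, the definition of bandwidth as the magnitude-weighted standard deviation around the centroid, and the synthetic 1000 Hz test tone are all assumptions made for this example. A naive DFT is used only to keep the sketch dependency-free; a real pipeline would normally rely on an audio library with windowed FFT frames.

```python
import cmath
import math


def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies in Hz, magnitudes) for bins 0..N/2 using a naive DFT."""
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def spectral_descriptors(frame, sample_rate, rolloff_fraction=0.85):
    """Compute per-frame descriptors (rolloff_fraction is an assumed default)."""
    freqs, mags = magnitude_spectrum(frame, sample_rate)
    total = sum(mags)
    zcr = zero_crossing_rate(frame)
    if total == 0:
        # Silent frame: the spectral descriptors are undefined, so report zeros.
        return {"centroid_hz": 0.0, "bandwidth_hz": 0.0,
                "rolloff_hz": 0.0, "zero_crossing_rate": zcr}

    # Centroid: magnitude-weighted mean frequency.
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    # Bandwidth: magnitude-weighted standard deviation around the centroid.
    bandwidth = math.sqrt(
        sum(m * (f - centroid) ** 2 for f, m in zip(freqs, mags)) / total)

    # Roll-off: lowest frequency at which the cumulative magnitude
    # reaches rolloff_fraction of the total magnitude.
    threshold = rolloff_fraction * total
    cumulative = 0.0
    rolloff = freqs[-1]
    for f, m in zip(freqs, mags):
        cumulative += m
        if cumulative >= threshold:
            rolloff = f
            break

    return {"centroid_hz": centroid, "bandwidth_hz": bandwidth,
            "rolloff_hz": rolloff, "zero_crossing_rate": zcr}


# Hypothetical input: a 1000 Hz tone sampled at 8000 Hz (not data from the paper).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate + math.pi / 8)
         for t in range(64)]
features = spectral_descriptors(frame, sample_rate)
print(features)
```

For a pure tone that falls exactly on a DFT bin, as here, the centroid and roll-off should sit at the tone frequency and the bandwidth should be close to zero. Pathological voices would be expected to shift these values, which is what makes them candidate inputs alongside MFCCs and Mel spectrograms.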
