Volume - 7 | Issue - 4 | December 2025
Published: 07 November 2025
In the medical field, identifying various pathological conditions is challenging because it typically requires invasive, contact-based data collection. Non-invasive, non-contact vital data, such as speech signals, can therefore be used to identify pathological conditions. Speech signals carry distinguishing phonetic characteristics that change when a pathological condition arises in the human body, and these changes allow pathological signals to be classified by training machine learning and deep learning models on acoustic features extracted from speech. This work proposes an acoustic spectrogram transformer in which all transformer layers are trained on acoustic characteristics extracted from the speech signals of voice- and lung-disease patients. Mel-frequency cepstral coefficients (MFCCs), Mel spectrograms, and spectral variables such as centroid, bandwidth, roll-off, and zero-crossing rate are extracted from the voice and lung datasets. These acoustic features train the transformer blocks and depth-adaptive parameters, enabling the model to capture complex patterns for effective signal classification. The architecture also includes frequency-focused attention mechanisms that extract the spectral characteristics most indicative of pathological conditions, while multiple pooling strategies aggregate temporal information effectively. Owing to this targeted design, the system serves as an effective clinical classification tool, minimizing computational complexity while achieving approximately 83% accuracy in voice pathology classification and 99% in lung pathology classification.
Keywords: Voice Pathology, Lung Pathology, Acoustic Spectrogram Transformer, Mel Spectrogram
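As a rough illustration of the feature-extraction step described in the abstract, the sketch below computes the named features with librosa. The file path, sampling rate, and dimensions (n_mfcc, n_mels) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import librosa

def extract_acoustic_features(path, sr=16000, n_mfcc=13, n_mels=128):
    """Compute the acoustic features named in the abstract for one recording."""
    y, sr = librosa.load(path, sr=sr)

    # Mel-frequency cepstral coefficients and (log-)Mel spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # Spectral descriptors: centroid, bandwidth, roll-off, zero-crossing rate
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)

    # Stack the frame-level features into one (n_features, n_frames) matrix;
    # all use the same default hop length, so frame counts line up
    return np.vstack([mfcc, log_mel, centroid, bandwidth, rolloff, zcr])
```

The combination of attention and multiple pooling strategies could look like the following PyTorch sketch. The paper's exact frequency-focused attention design, layer sizes, and depth-adaptive parameters are not given in the abstract, so every name and dimension here is a placeholder: generic self-attention runs over time frames whose embeddings encode per-frame frequency content, and mean and max pooling are concatenated as two temporal aggregation strategies.

```python
import torch
import torch.nn as nn

class SpectrogramAttentionClassifier(nn.Module):
    """Illustrative stand-in for the paper's architecture: attention over
    spectrogram frames plus mean+max temporal pooling. Not the authors'
    exact frequency-focused attention mechanism."""

    def __init__(self, n_freq_bins=128, d_model=64, n_heads=4, n_classes=2):
        super().__init__()
        self.proj = nn.Linear(n_freq_bins, d_model)   # embed each time frame
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, spec):                          # spec: (batch, time, freq)
        x = self.proj(spec)                           # (batch, time, d_model)
        x, _ = self.attn(x, x, x)                     # self-attention over frames
        # Multiple pooling strategies: aggregate the time axis two ways
        pooled = torch.cat([x.mean(dim=1), x.amax(dim=1)], dim=-1)
        return self.classifier(pooled)
```

Under these assumptions, a spectrogram of shape (batch, time, n_freq_bins), e.g. the transposed log-Mel matrix from the first sketch, would be classified with model(spec).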