Abstract
Speech emotion recognition (SER) is a rapidly emerging area of emotion detection within the scope of affective computing. In this work, recordings of emotional speech produced during verbal communication are of interest. The emotional content of speech is investigated through its acoustic properties and modeled with machine learning. We performed a series of experiments on the RAVDESS, TESS, SAVEE, and EMO-DB datasets to assess whether a Recurrent Neural Network (RNN) and CLAF-SER, a Cross-Lingual Attention-Based Adversarial Framework for SER, can detect and classify emotions such as sadness, anger, happiness, neutrality, and fear. Features such as MFCCs, LPCCs, pitch, energy, and chroma were extracted before training the RNN. With this model, TESS achieved the highest accuracy among the datasets, whereas CLAF-SER gave the best performance when all datasets were combined.
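As a minimal illustration of the kind of frame-level acoustic features listed above, the sketch below computes two of the simplest ones, short-time energy and an autocorrelation-based pitch estimate, using only NumPy on a synthetic tone. This is not the paper's pipeline: the function names, frame sizes, and sampling rate are illustrative assumptions, and in practice MFCC, LPCC, and chroma would typically be extracted with a dedicated library such as librosa.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Slice the signal into overlapping frames
    # (25 ms windows with a 10 ms hop at 16 kHz).
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    # Per-frame energy: sum of squared samples.
    return np.sum(frames ** 2, axis=1)

def pitch_autocorr(frame, sr=16000, fmin=80, fmax=400):
    # Crude pitch estimate: pick the autocorrelation peak
    # whose lag falls in the plausible voice range [fmin, fmax].
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = sr // fmax, sr // fmin
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# Synthetic 200 Hz "voiced" tone as a stand-in for real speech.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t)

frames = frame_signal(x)
energy = short_time_energy(frames)
f0 = pitch_autocorr(frames[0], sr)
print(energy.shape[0], round(f0))
```

In a full SER front end, per-frame vectors like these (stacked with MFCCs and chroma) would form the time-ordered input sequence that the RNN consumes.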
References
Singh, Jagjeet, Lakshmi Babu Saheer, and Oliver Faust. "Speech emotion recognition using an attention model." International Journal of Environmental Research and Public Health, 20(6), 2023: 5140.
Sun, C., H. Li, and L. Ma. "Speech emotion recognition based on improved masking EMD and convolutional recurrent neural network." Frontiers in Psychology, 13, 2023: 1075624.
Alluhaidan, Ala Saleh, Oumaima Saidani, Rashid Jahangir, Muhammad Asif Nauman, and Omnia Saidani Neffati. "Speech emotion recognition through hybrid features and convolutional neural network." Applied Sciences, 13(8), 2023: 4750.
Saumard, Matthieu. "Enhancing Speech Emotions Recognition Using Multivariate Functional Data Analysis." Big Data and Cognitive Computing, 7(3), 2023: 146.
Ahmed, Md Rayhan, Salekul Islam, A. K. M. Muzahidul Islam, and Swakkhar Shatabda. "An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition." Expert Systems with Applications, 218, 2023: 119633.
Bagadi, Kesava Rao, and Chandra Mohan Reddy Sivappagari. "An evolutionary optimization method for selecting features for speech emotion recognition." 21(1), 2023: 159-167.
Wani, Taiba Majid, Teddy Surya Gunawan, Syed Asif Ahmad Qadri, Mira Kartiwi, and Eliathamby Ambikairajah. "A comprehensive review of speech emotion recognition systems." IEEE Access, 9, 2021: 47795-47814.
Langari, Shadi, Hossein Marvi, and Morteza Zahedi. "Efficient speech emotion recognition using modified feature extraction." Informatics in Medicine Unlocked, 20, 2020: 100424.
Khalil, Ruhul Amin, Edward Jones, Mohammad Inayatullah Babar, Tariqullah Jan, Mohammad Haseeb Zafar, and Thamer Alhussain. "Speech emotion recognition using deep learning techniques: A review." IEEE Access, 7, 2019: 117327-117345.
Akçay, Mehmet Berkehan, and Kaya Oğuz. "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers." Speech Communication, 116, 2020: 56-76.
Livingstone, Steven R., and Frank A. Russo. "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English." PLoS ONE, 13(5), 2018: e0196391.
Burkhardt, F., A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. "A database of German emotional speech." Proc. Interspeech 2005, 2005: 1517-1520. doi: 10.21437/Interspeech.2005-446.
Haq, S., and P. J. B. Jackson. "Multimodal Emotion Recognition." In W. Wang (ed.), Machine Audition: Principles, Algorithms and Systems, IGI Global Press, chapter 17, 2010: 398-423. ISBN 978-1615209194.
Pichora-Fuller, M. Kathleen, and Kate Dupuis. "Toronto emotional speech set (TESS)." Borealis, V1, 2020. https://doi.org/10.5683/SP2/E8H2MF.
Avro, Shamin Bin Habib, Taieba Taher, and Nursadul Mamun. "EmoTech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information with Hybrid Recurrent Network." arXiv preprint arXiv:2501.12674, 2025.
Bautista, John Lorenzo, and Hyun Soon Shin. "Speech Emotion Recognition Model Based on Joint Modeling of Discrete and Dimensional Emotion Representation." Applied Sciences, 15(2), 2025: 623.
Agrawal, Akshat, and Anurag Jain. "Brhamo: metaheuristic optimization algorithm for speech emotion recognition using spectral and hybrid features." Evolutionary Intelligence, 18(1), 2025: 4.
Mishra, Siba Prasad, Pankaj Warule, and Suman Deb. "Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition." Speech Communication, 166, 2025: 103148.
Qi, Xin, et al. "MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition." Neurocomputing, 611, 2025: 128646.
Ahn, Chung-Soo, et al. "Multitask Transformer for Cross-Corpus Speech Emotion Recognition." IEEE Transactions on Affective Computing, 2025.
Upadhyay, Shreya G., et al. "Phonetically-Anchored Domain Adaptation for Cross-Lingual Speech Emotion Recognition." IEEE Transactions on Affective Computing, 2025.
Kang, Xueliang. "Speech emotion recognition algorithm of intelligent robot based on ACO-SVM." International Journal of Cognitive Computing in Engineering, 6, 2025: 131-142.
Mishra, Siba Prasad, Pankaj Warule, and Suman Deb. "Speech emotion recognition using MFCC-based entropy feature." Signal, Image and Video Processing, 18, 2024: 153-161. https://doi.org/10.1007/s11760-023-02716-7.
Chen, W., X. Xing, P. Chen, and X. Xu. "Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition." IEEE Transactions on Affective Computing, 15(3), 2024: 1711-1724. doi: 10.1109/TAFFC.2024.3369726.
Deepika, C., and S. Kuchibhotla. "Design an Optimum Feature Selection Method to Improve the Accuracy of the Speech Recognition System." SN Computer Science, 4(5), 2023: 655.
Deepika, C., and S. Kuchibhotla. "Deep-CNN based knowledge learning with Beluga Whale optimization using chaogram transformation using intelligent sensors for speech emotion recognition." Measurement: Sensors, 32, 2024: 101030.
