Advanced speech technology is pushing human-machine interaction to a new frontier. Most existing models address either speech-to-text transcription or emotion detection in isolation. This work develops an integrated solution that couples a Conformer-based speech recognition system with an XGBoost-driven emotion classification component, aiming to transcribe spoken utterances and estimate the speaker's emotional state as accurately as possible so that interaction can become context-sensitive. The transcription component is trained on large, multilingual speech corpora; its Conformer architecture captures both short- and long-range temporal dependencies, achieving a word error rate (WER) of 0.322 and a character error rate (CER) of 0.146. For emotion recognition, several emotional speech datasets were combined and a range of acoustic features was extracted under noisy conditions; an XGBoost model trained on these features attains 86.58% accuracy in emotion detection. These results demonstrate the feasibility of integrating speech transcription with emotion recognition and lay the groundwork for more human-like, empathic, and adaptive voice systems.
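To make the pipeline concrete, the three sketches below illustrate its main ingredients; they are hedged reconstructions from the abstract, not the authors' code. First, the encoder: a Conformer block stack interleaves self-attention (long-range context) with depthwise convolution (short-range context). This sketch uses torchaudio's `Conformer` class with illustrative dimensions, which are assumptions rather than the paper's configuration.

```python
# Minimal sketch of a Conformer encoder over log-mel features, using
# torchaudio's Conformer implementation. All dimensions are illustrative,
# not the configuration reported in the paper.
import torch
from torchaudio.models import Conformer

encoder = Conformer(
    input_dim=80,                   # 80-dim log-mel filterbank frames
    num_heads=4,                    # self-attention heads (long-range context)
    ffn_dim=256,                    # feed-forward width inside each block
    num_layers=4,                   # stacked Conformer blocks
    depthwise_conv_kernel_size=31,  # convolution module (short-range context)
)

feats = torch.randn(2, 200, 80)     # (batch, frames, mel bins)
lengths = torch.tensor([200, 160])  # valid frames per utterance
out, out_lengths = encoder(feats, lengths)
print(out.shape)  # torch.Size([2, 200, 80]): frame-level encodings
```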
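Second, the reported transcription figures are the standard word and character error rates: edit distance between reference and hypothesis, normalized by reference length at the word or character level. A minimal sketch using the jiwer library, with hypothetical example strings:

```python
# Computing word error rate (WER) and character error rate (CER)
# between a reference transcript and an ASR hypothesis.
# Requires: pip install jiwer. The sentences below are hypothetical.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # word-level edit distance / word count
cer = jiwer.cer(reference, hypothesis)  # char-level edit distance / char count

print(f"WER: {wer:.3f}")  # the paper reports 0.322 on its test data
print(f"CER: {cer:.3f}")  # the paper reports 0.146
```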
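Third, the emotion stage is a gradient-boosted classifier over per-utterance acoustic features. The abstract does not name the exact feature set, so this sketch assumes MFCC means as the features and substitutes random stand-in waveforms for the emotional speech corpora so that it runs end to end:

```python
# Sketch of the emotion-recognition stage: XGBoost over per-utterance
# acoustic features. The MFCC-mean feature and the random stand-in
# corpus are assumptions for illustration only.
# Requires: pip install numpy librosa xgboost scikit-learn
import numpy as np
import librosa
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

SR = 16000

def utterance_features(y: np.ndarray, sr: int = SR) -> np.ndarray:
    """Summarize one utterance as the mean of 40 MFCC coefficients."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    return mfcc.mean(axis=1)

# Stand-in corpus: random 1-second waveforms with random labels so the
# sketch is self-contained. In practice these would be labeled utterances
# from emotional speech datasets such as RAVDESS, CREMA-D, TESS, or SAVEE.
rng = np.random.default_rng(0)
waveforms = [rng.standard_normal(SR).astype(np.float32) for _ in range(200)]
labels = rng.integers(0, 7, size=200)  # e.g., 7 emotion classes

X = np.stack([utterance_features(w) for w in waveforms])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print(f"held-out accuracy: {accuracy_score(y_te, clf.predict(X_te)):.3f}")
```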
@article{c.2025,
author = {Mohan Bikram K C. and Smita Adhikari and Tara Bahadur Thapa},
title = {{Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems}},
journal = {Journal of Artificial Intelligence and Capsule Networks},
volume = {7},
number = {4},
pages = {343-361},
year = {2025},
publisher = {Inventive Research Organization},
doi = {10.36548/jaicn.2025.4.003},
url = {https://doi.org/10.36548/jaicn.2025.4.003}
}

