Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems

Mohan Bikram K C.; Smita Adhikari; Tara Bahadur Thapa

doi:10.36548/jaicn.2025.4.003

Open Access

Volume 7 • Issue 4 • December 2025

https://doi.org/10.36548/jaicn.2025.4.003

Pages 343-361

Abstract

Advances in speech technology are pushing human-machine interaction to a new frontier. Most existing models, however, address either speech-to-text transcription or emotion detection, not both. This work presents an integrated solution that couples a Conformer-based speech recognition system with an XGBoost-driven emotion classification component. The system aims to transcribe spoken utterances and estimate the speaker's emotional state as accurately as possible, in order to improve context-sensitive interaction. The transcription component draws on a combination of large, multilingual speech corpora, and its Conformer architecture captures both short- and long-range temporal dependencies. It achieved a word error rate of 0.322 and a character error rate of 0.146. For emotion recognition, several emotional speech datasets were collected, and various acoustic features were extracted under noisy conditions. An XGBoost model trained on these features attained 86.58% accuracy in emotion detection. These results demonstrate the feasibility of integrating speech transcription with emotion recognition and form a basis for developing more human-like, empathic, and adaptive voice systems.
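The transcription error rates reported in the abstract are conventionally computed as an edit distance normalized by the reference length. The sketch below illustrates that standard definition in plain Python. It is not the authors' evaluation code; the function names and the example sentences are hypothetical, and the abstract does not say what text normalization (casing, punctuation) was applied before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example pair, not taken from the paper's data.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
wer = word_error_rate(reference, hypothesis)
cer = char_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}, CER = {cer:.3f}")
```

In this toy pair, one word is substituted and one is dropped, so two of the six reference words are in error. Under this definition, a word error rate of 0.322 corresponds to roughly one word-level edit for every three reference words.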

Cite this article

Chicago: C., Mohan Bikram K, Smita Adhikari, and Tara Bahadur Thapa. "Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems." Journal of Artificial Intelligence and Capsule Networks 7, no. 4 (2025): 343-361. doi: 10.36548/jaicn.2025.4.003

APA: C., M. B. K., Adhikari, S., & Thapa, T. B. (2025). Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems. Journal of Artificial Intelligence and Capsule Networks, 7(4), 343-361. https://doi.org/10.36548/jaicn.2025.4.003

MLA: C., Mohan Bikram K, et al. "Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems." Journal of Artificial Intelligence and Capsule Networks, vol. 7, no. 4, 2025, pp. 343-361. DOI: 10.36548/jaicn.2025.4.003.

Vancouver: C. MBK, Adhikari S, Thapa TB. Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems. Journal of Artificial Intelligence and Capsule Networks. 2025;7(4):343-361. doi: 10.36548/jaicn.2025.4.003

IEEE: M. B. K. C., S. Adhikari, and T. B. Thapa, "Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems," Journal of Artificial Intelligence and Capsule Networks, vol. 7, no. 4, pp. 343-361, Dec. 2025, doi: 10.36548/jaicn.2025.4.003.

Harvard: C., M.B.K., Adhikari, S. and Thapa, T.B. (2025) 'Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems', Journal of Artificial Intelligence and Capsule Networks, vol. 7, no. 4, pp. 343-361. Available at: https://doi.org/10.36548/jaicn.2025.4.003.

BibTeX:

@article{c.2025,
  author    = {Mohan Bikram K C. and Smita Adhikari and Tara Bahadur Thapa},
  title     = {{Integrating Automatic Speech Recognition and Emotion Detection: A Conformer-XGBoost Framework for Human-Centered Speech Systems}},
  journal   = {Journal of Artificial Intelligence and Capsule Networks},
  volume    = {7},
  number    = {4},
  pages     = {343-361},
  year      = {2025},
  publisher = {Inventive Research Organization},
  doi       = {10.36548/jaicn.2025.4.003},
  url       = {https://doi.org/10.36548/jaicn.2025.4.003}
}

Keywords

Automatic Speech Recognition; Speech Emotion Recognition; Conformer Architecture; XGBoost Classifier; Human Computer Interaction; Multimodal Speech Processing

Published

02 December, 2025

e-ISSN: 2582-2012
4 issues per year
DOI: https://doi.org/10.36548/jaicn

Indexing
GoogleScholar | Crossref | MicrosoftAcademic | ScienceGate | J-Gate

Publisher

Inventive Research Organization

Publication Charges: Nil
