Volume - 7 | Issue - 4 | December 2025
Published
02 December 2025
Advanced speech technology pushes human-machine interaction to a new frontier. Most existing models address this either as speech-to-text transcription or as emotion detection, but rarely both. This work develops an integrated system that combines a Conformer-based speech recognition model with an XGBoost-driven emotion classification component. The system transcribes spoken utterances and estimates the speaker's emotional state as accurately as possible, enabling more context-sensitive interaction. The transcription model is trained on large, multilingual speech corpora, and the Conformer architecture captures both short- and long-range temporal dependencies; transcription achieved a word error rate of 0.322 and a character error rate of 0.146. For emotion recognition, several emotional speech datasets were collected and various acoustic features were extracted under noisy conditions; an XGBoost classifier attained 86.58% accuracy in emotion detection. These results demonstrate the feasibility of integrating speech transcription with emotion recognition and form a basis for the further development of more human-like, empathetic, and adaptive voice systems.
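As a minimal sketch of the emotion-recognition half of this pipeline, the Python code below extracts a fixed-length acoustic descriptor from an utterance and trains an XGBoost classifier on it. The feature choice (40 MFCCs summarized by mean and standard deviation), the hyperparameters, and the synthetic stand-in data are illustrative assumptions, not the paper's exact configuration.

import numpy as np
import librosa
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def acoustic_features(path, sr=16000):
    """Fixed-length utterance descriptor: per-coefficient MFCC means and stds."""
    signal, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)        # shape (40, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # shape (80,)

# Synthetic stand-in data so the sketch runs end to end; in practice X would be
# built with acoustic_features() over an emotional speech corpus, and y would be
# integer-encoded emotion labels (e.g. 0=neutral, 1=happy, 2=sad, 3=angry).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 80))
y = rng.integers(0, 4, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("emotion accuracy:", accuracy_score(y_te, clf.predict(X_te)))

Gradient-boosted trees such as XGBoost are a common choice for fixed-length acoustic feature vectors because they handle heterogeneous, noisy features without requiring large training sets, which fits the noisy-condition feature extraction described above.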
Keywords: Automatic Speech Recognition, Speech Emotion Recognition, Conformer Architecture, XGBoost Classifier, Human-Computer Interaction, Multimodal Speech Processing

