Volume - 7 | Issue - 1 | March 2025
Published
24 April, 2025
Neural text-to-speech (TTS) synthesis generates speech using neural networks, and one of its most remarkable features is the ability to produce speech in the voices of different speakers. This research builds on neural TTS synthesis, focusing on voice cloning and speech synthesis for Indian accents, in contrast to most existing systems, which are predominantly trained on Western accents. The system comprises four stages. First, an LSTM-based speaker verification system identifies distinctive speaker traits. Next, a synthesizer, acting as a sequence-to-sequence model, translates text into Mel spectrograms representing speech acoustics. A WaveRNN vocoder then transforms these spectrograms into the corresponding audio waveforms. Finally, noise reduction algorithms refine the generated speech for enhanced clarity and naturalness. Training on a diverse multi-accent dataset, 80% of which is Indian-accented speech, significantly improved the system's cloning performance. The improvement is attributed to the model's exposure to 600 hours of speech from 3,000 speakers. This research offers an open-source Python package designed for professionals seeking to integrate voice cloning and speech synthesis capabilities into their devices. The package aims to generate synthetic speech that sounds like an individual's natural voice; it is not intended to replace the natural human voice.
Keywords: Speech Synthesis, Voice Cloning, Speaker Characteristics, Mean Opinion Score (MOS), Speech Disorders
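The four-stage data flow described in the abstract (speaker encoder, synthesizer, vocoder, noise reduction) can be illustrated with a minimal Python sketch. This is not the package's API, which the abstract does not describe: the class name `VoiceCloningPipeline`, the method `clone`, and all four `toy_*` functions are hypothetical stand-ins, and each toy stage is trivial arithmetic in place of the LSTM encoder, sequence-to-sequence synthesizer, and WaveRNN vocoder. Only the order of the stages and the kind of data passed between them are taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases; the real package's data structures are not given.
Waveform = List[float]               # audio samples
Embedding = List[float]              # fixed-size speaker vector
MelSpectrogram = List[List[float]]   # frames x mel bins


@dataclass
class VoiceCloningPipeline:
    """Chains the four stages named in the abstract (names are assumptions)."""
    encoder: Callable[[Waveform], Embedding]
    synthesizer: Callable[[str, Embedding], MelSpectrogram]
    vocoder: Callable[[MelSpectrogram], Waveform]
    denoiser: Callable[[Waveform], Waveform]

    def clone(self, reference_audio: Waveform, text: str) -> Waveform:
        embedding = self.encoder(reference_audio)   # 1. speaker traits
        mel = self.synthesizer(text, embedding)     # 2. text -> Mel spectrogram
        raw_audio = self.vocoder(mel)               # 3. spectrogram -> waveform
        return self.denoiser(raw_audio)             # 4. noise reduction


# Toy stand-ins so the sketch runs; none of these is a real model.
def toy_encoder(audio: Waveform) -> Embedding:
    if not audio:
        raise ValueError("reference audio is empty")
    mean = sum(audio) / len(audio)
    peak = max(abs(s) for s in audio)
    return [mean, peak]


def toy_synthesizer(text: str, embedding: Embedding) -> MelSpectrogram:
    # One frame per character, conditioned on the speaker embedding.
    return [[(ord(ch) % 32) / 32.0 + e for e in embedding] for ch in text]


HOP = 4  # samples emitted per spectrogram frame (arbitrary toy value)


def toy_vocoder(mel: MelSpectrogram) -> Waveform:
    # Repeat each frame's mean HOP times to form a waveform.
    return [sum(frame) / len(frame) for frame in mel for _ in range(HOP)]


def toy_denoiser(audio: Waveform, threshold: float = 0.01) -> Waveform:
    # Simple noise gate: zero out samples below the threshold.
    return [s if abs(s) >= threshold else 0.0 for s in audio]


pipeline = VoiceCloningPipeline(toy_encoder, toy_synthesizer,
                                toy_vocoder, toy_denoiser)
audio = pipeline.clone([0.1, -0.2, 0.3], "hi")
print(len(audio))  # 2 characters x HOP samples per frame
```

Passing the stages in as callables keeps the sketch independent of any particular model: a trained encoder, synthesizer, or vocoder could be substituted for the matching toy function without changing `clone`.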