Abstract
Neural text-to-speech (TTS) synthesis generates speech from text using neural networks. One of its most remarkable features is the ability to produce speech in the voices of different speakers. This research builds on neural TTS synthesis, focusing on voice cloning and speech synthesis for Indian accents, in contrast to most existing systems, which are trained predominantly on Western accents. The proposed system works in four stages. First, an LSTM-based speaker verification network extracts distinctive speaker traits from a reference utterance. Next, a sequence-to-sequence synthesizer translates text into mel spectrograms representing the acoustics of the target speech. A WaveRNN vocoder then transforms these spectrograms into audio waveforms. Finally, noise reduction algorithms refine the generated speech for enhanced clarity and naturalness. Cloning quality improved significantly after training on a diverse multi-accent dataset (80% Indian accent) comprising 600 hours of speech from 3000 speakers. This research offers an open-source Python package for professionals seeking to integrate voice cloning and speech synthesis capabilities into their devices. The package aims to generate synthetic speech that sounds like an individual's natural voice; it does not replace the natural human voice.
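The four-stage pipeline described above can be sketched as follows. This is an illustrative outline only: every class, function, and dimension below is a hypothetical stand-in (a real system would use trained networks such as an LSTM speaker encoder, a Tacotron-style synthesizer, and a WaveRNN vocoder), and it is not the implementation released with this work.

```python
# Illustrative sketch of the voice-cloning pipeline from the abstract.
# All names and shapes are hypothetical placeholders, not the released package.
import numpy as np

class SpeakerEncoder:
    """Stage 1: map a reference utterance to a fixed-length speaker embedding."""
    def embed(self, reference_audio: np.ndarray) -> np.ndarray:
        # A trained LSTM would run here; we summarize with frame statistics.
        frames = reference_audio.reshape(-1, 40)      # fake 40-dim frames
        mean = frames.mean(axis=0)
        return mean / (np.linalg.norm(mean) + 1e-8)   # L2-normalized embedding

class Synthesizer:
    """Stage 2: sequence-to-sequence model, text + embedding -> mel spectrogram."""
    def synthesize(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        n_frames = max(1, len(text))                  # roughly one frame per char
        return np.tile(speaker_embedding, (n_frames, 1))  # (frames, 40) placeholder

class Vocoder:
    """Stage 3: WaveRNN-style model, mel spectrogram -> waveform samples."""
    def generate(self, mel: np.ndarray, hop: int = 200) -> np.ndarray:
        # Placeholder upsampling: one hop of samples per spectrogram frame.
        return np.repeat(mel.mean(axis=1), hop)

def denoise(wav: np.ndarray) -> np.ndarray:
    """Stage 4: stand-in for noise reduction (e.g. spectral gating)."""
    return np.clip(wav, -1.0, 1.0)

def clone_voice(reference_audio: np.ndarray, text: str) -> np.ndarray:
    emb = SpeakerEncoder().embed(reference_audio)
    mel = Synthesizer().synthesize(text, emb)
    wav = Vocoder().generate(mel)
    return denoise(wav)

reference = np.random.default_rng(0).standard_normal(16000)
wav = clone_voice(reference, "Hello from an Indian-accent TTS system")
```

The key design point, as in the abstract, is that the speaker embedding is computed once from a short reference sample and conditions every downstream stage, so new voices can be cloned without retraining the synthesizer or vocoder.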
