7 Conclusion, Acknowledgement and References
In this work, we propose HierSpeech++, which achieves human-level, high-quality zero-shot speech synthesis. We introduce an efficient and powerful speech synthesis framework that disentangles semantic modeling, speech synthesis, and speech super-resolution. We thoroughly analyze the components of our model to demonstrate how human-level speech synthesis can be achieved even in zero-shot scenarios. Moreover, we achieve this performance with only a small-scale open-source dataset, LibriTTS, and our model has a significantly faster inference speed than recently proposed zero-shot speech synthesis models. In addition, we introduce style prompt replication for 1 s voice cloning and noise-free speech synthesis by adopting a denoised style prompt, while SpeechSR simply upsamples the audio to 48 kHz for high-resolution audio generation. We will release the source code and checkpoints of all components, including TTV, the hierarchical speech synthesizer, and SpeechSR. For future work, we will extend the model to cross-lingual and emotion-controllable speech synthesis by utilizing pre-trained models such as [77]. We also see that our hierarchical speech synthesis framework could be applied to speech-to-speech translation by introducing non-autoregressive generation [82].
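To make the pipeline above concrete, the following is a minimal sketch of how the three disentangled stages (TTV, the hierarchical speech synthesizer, and SpeechSR), together with style prompt replication and an optional denoised style prompt, could be chained at inference time. All module names, signatures, and tensor shapes are illustrative assumptions rather than the released API.

```python
# Minimal sketch of the three-stage inference flow described above.
# Module names, signatures, and shapes are assumptions, not the released API.
import torch


def replicate_style_prompt(prompt: torch.Tensor, target_sec: float = 5.0,
                           sample_rate: int = 16000) -> torch.Tensor:
    """Style prompt replication: tile a short (e.g., 1 s) prompt along time so
    the style encoder receives a longer reference. prompt: [batch, samples]."""
    n_copies = max(1, int(target_sec * sample_rate) // prompt.shape[-1])
    return prompt.repeat(1, n_copies)


@torch.no_grad()
def synthesize(text_to_vec, hier_synthesizer, speech_sr, denoiser,
               text: str, prompt_16k: torch.Tensor) -> torch.Tensor:
    """Chain the disentangled stages: text -> semantic representation and F0
    -> 16 kHz waveform -> 48 kHz waveform."""
    # Optional denoised style prompt for noise-free synthesis.
    style_prompt = denoiser(prompt_16k) if denoiser is not None else prompt_16k
    # 1 s voice cloning: replicate the short prompt before style encoding.
    style_prompt = replicate_style_prompt(style_prompt)
    # Stage 1 (TTV): predict a self-supervised semantic representation and F0
    # from text, conditioned on the style prompt.
    semantic, f0 = text_to_vec(text, style_prompt)
    # Stage 2: the hierarchical speech synthesizer generates a 16 kHz waveform.
    wav_16k = hier_synthesizer(semantic, f0, style_prompt)
    # Stage 3 (SpeechSR): upsample 16 kHz -> 48 kHz for high-resolution audio.
    return speech_sr(wav_16k)
```

Because the stages are disentangled and released as separate checkpoints, the final SpeechSR step in this sketch can be skipped when 16 kHz output is sufficient.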
We would like to thank Hongsun Yang for helpful discussions and contributions to our work. This study used open-source Korean speech datasets: the NIKL dataset from the NIA and the multi-speaker speech synthesis (MSSS) dataset from AIHub.
[1] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli. XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Proc. Interspeech, pages 2278–2282, 2022.
[2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Proc. Adv. Neural Inf. Process. Syst., 33:12449–12460, 2020.
[3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.
[4] M. Bernard and H. Titeux. Phonemizer: Text to phones transcription for multiple languages in python. Journal of Open Source Software, 6(68):3958, 2021.
[5] J. Betker. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243, 2023.
[6] Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, and M. Tagliasacchi. Soundstorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636, 2023.
[7] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone. In Proc. Int. Conf. on Mach. Learn., pages 2709–2720. PMLR, 2022.
[8] M. Chen, X. Tan, B. Li, Y. Liu, T. Qin, S. Zhao, and T.-Y. Liu. Adaspeech: Adaptive text to speech for custom voice. arXiv preprint arXiv:2103.00993, 2021.
[9] H.-S. Choi, J. Yang, J. Lee, and H. Kim. Nansy++: Unified voice synthesis with neural analysis and synthesis. arXiv preprint arXiv:2211.09407, 2022.
[10] H.-Y. Choi, S.-H. Lee, and S.-W. Lee. Dddm-vc: Decoupled denoising diffusion models with disentangled representation and prior mixup for verified robust voice conversion. arXiv preprint arXiv:2305.15816, 2023.
[11] H.-Y. Choi, S.-H. Lee, and S.-W. Lee. Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation. In Proc. Interspeech, pages 2283–2287, 2023.
[12] H. Chung, S.-H. Lee, and S.-W. Lee. Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech. In Proc. Interspeech, pages 3635–3639, 2021.
[13] J. S. Chung, J. Huh, S. Mun, M. Lee, H.-S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han. In Defence of Metric Learning for Speaker Recognition. In Proc. Interspeech, pages 2977–2981, 2020.
[14] J. S. Chung, A. Nagrani, and A. Zisserman. Voxceleb2: Deep speaker recognition. In Proc. Interspeech, pages 1086–1090, 2018.
[15] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[16] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High fidelity neural audio compression. Trans. Mach. Learn. Research, 2023.
[17] C. Du, Y. Guo, X. Chen, and K. Yu. Vqtts: High-fidelity text-to-speech synthesis with self-supervised vq acoustic feature. arXiv preprint arXiv:2204.00768, 2022.
[18] C. Du, Y. Guo, X. Chen, and K. Yu. Speaker adaptive text-to-speech with timbre-normalized vector-quantized feature. IEEE/ACM Trans. Audio, Speech, Lang. Process., pages 1–12, 2023.
[19] Z. Guo, Y. Leng, Y. Wu, S. Zhao, and X. Tan. Prompttts: Controllable text-to-speech with text descriptions. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 1–5. IEEE, 2023.
[20] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell., 45(1):87–110, 2023.
[21] S. Han and J. Lee. NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates. In Proc. Interspeech, pages 4401–4405, 2022.
[22] R. Huang, Y. Ren, J. Liu, C. Cui, and Z. Zhao. Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech. In Proc. Adv. Neural Inf. Process. Syst., 2022.
[23] R. Huang, C. Zhang, Y. Wang, D. Yang, L. Liu, Z. Ye, Z. Jiang, C. Weng, Z. Zhao, and D. Yu. Make-a-voice: Unified voice synthesis with discrete representation. arXiv preprint arXiv:2305.19269, 2023.
[24] J.-S. Hwang, S.-H. Lee, and S.-W. Lee. Hiddensinger: High-quality singing voice synthesis via neural audio codec and latent diffusion models. arXiv preprint arXiv:2306.06814, 2023.
[25] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. Lopez Moreno, Y. Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Proc. Adv. Neural Inf. Process. Syst., 31, 2018.
[26] Z. Jiang, J. Liu, Y. Ren, J. He, C. Zhang, Z. Ye, P. Wei, C. Wang, X. Yin, Z. Ma, et al. Mega-tts 2: Zero-shot text-to-speech with arbitrary length speech prompts. arXiv preprint arXiv:2307.07218, 2023.
[27] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-Light: A benchmark for ASR with limited or no supervision. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 7669–7673, 2020.
[28] H. Kameoka, W.-C. Huang, K. Tanaka, T. Kaneko, N. Hojo, and T. Toda. Many-to-many voice transformer network. IEEE/ACM Trans. Audio, Speech, Lang. Process., 29:656–670, 2021.
[29] M. Kang, D. Min, and S. J. Hwang. Grad-stylespeech: Any-speaker adaptive text-to-speech synthesis with diffusion models. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 1–5. IEEE, 2023.
[30] K. Kasi and S. A. Zahorian. Yet another algorithm for pitch tracking. In IEEE Int. Conf. Acoust., Speech, Signal Process., volume 1, pages I–361, 2002.
[31] E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540, 2023.
[32] H. Kim, S. Kim, J. Yeom, and S. Yoon. Unitspeech: Speakeradaptive speech synthesis with untranscribed data. arXiv preprint arXiv:2306.16083, 2023.
[33] H. Kim, S. Kim, and S. Yoon. Guided-TTS: A diffusion model for text-to-speech via classifier guidance. In Proc. Int. Conf. on Mach. Learn., pages 11119–11133, 2022.
[34] J. Kim, S. Kim, J. Kong, and S. Yoon. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Proc. Adv. Neural Inf. Process. Syst., 33:8067–8077, 2020.
[35] J. Kim, J. Kong, and J. Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proc. Int. Conf. on Mach. Learn., pages 5530–5540. PMLR, 2021.
[36] J.-H. Kim, S.-H. Lee, J.-H. Lee, and S.-W. Lee. Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis. In Proc. Interspeech, pages 2197–2201, 2021.
[37] M. Kim, M. Jeong, B. J. Choi, S. Ahn, J. Y. Lee, and N. S. Kim. Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus. In Proc. Interspeech, pages 788–792, 2022.
[38] S. Kim, H. Kim, and S. Yoon. Guided-TTS 2: A diffusion model for high-quality adaptive text-to-speech with untranscribed data. arXiv preprint arXiv:2205.15370, 2022.
[39] S. Kim, K. J. Shih, R. Badlani, J. F. Santos, E. Bakhturina, M. T. Desta, R. Valle, S. Yoon, and B. Catanzaro. P-flow: A fast and data-efficient zero-shot TTS through speech prompting. In Proc. Adv. Neural Inf. Process. Syst., 2023.
[40] J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Proc. Adv. Neural Inf. Process. Syst., 33:17022–17033, 2020.
[41] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar. High-fidelity audio compression with improved RVQGAN. In Proc. Adv. Neural Inf. Process. Syst., 2023.
[42] Y. Kwon, H. S. Heo, B.-J. Lee, and J. S. Chung. The ins and outs of speaker recognition: lessons from VoxSRC 2020. In IEEE Int. Conf. Acoust., Speech, Signal Process., 2021.
[43] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz, M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar, and W.-N. Hsu. Voicebox: Text-guided multilingual universal speech generation at scale. In Proc. Adv. Neural Inf. Process. Syst., 2023.
[44] J.-H. Lee, S.-H. Lee, J.-H. Kim, and S.-W. Lee. Pvae-tts: Adaptive text-to-speech via progressive style adaptation. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 6312–6316. IEEE, 2022.
[45] S.-H. Lee, H.-Y. Choi, H.-S. Oh, and S.-W. Lee. HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer. In Proc. Interspeech, pages 4439–4443, 2023.
[46] S.-H. Lee, J.-H. Kim, H. Chung, and S.-W. Lee. VoiceMixer: Adversarial voice style mixup. Proc. Adv. Neural Inf. Process. Syst., 34:294–308, 2021.
[47] S.-H. Lee, J.-H. Kim, K.-E. Lee, and S.-W. Lee. Fre-gan 2: Fast and efficient frequency-consistent audio synthesis. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 6192–6196, 2022.
[48] S.-H. Lee, S.-B. Kim, J.-H. Lee, E. Song, M.-J. Hwang, and S.-W. Lee. HierSpeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis. In Proc. Adv. Neural Inf. Process. Syst., 2022.
[49] S.-H. Lee, H.-R. Noh, W.-J. Nam, and S.-W. Lee. Duration controllable voice conversion via phoneme-based information bottleneck. IEEE/ACM Trans. Audio, Speech, Lang. Process., 30:1173–1183, 2022.
[50] S.-H. Lee, H.-W. Yoon, H.-R. Noh, J.-H. Kim, and S.-W. Lee. Multi-spectrogan: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis. In Proc. AAAI Conf. Artif. Intell., volume 35, pages 13198–13206, 2021.
[51] Y. Lee and T. Kim. Robust and fine-grained prosody control of end-to-end speech synthesis. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 5911–5915. IEEE, 2019.
[52] Y. Leng, Z. Guo, K. Shen, X. Tan, Z. Ju, Y. Liu, Y. Liu, D. Yang, L. Zhang, K. Song, et al. Prompttts 2: Describing and generating voices with text prompt. arXiv preprint arXiv:2309.02285, 2023.
[53] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu. Neural speech synthesis with transformer network. In Proc. AAAI Conf. Artif. Intell., volume 33, pages 6706–6713, 2019.
[54] Y. A. Li, C. Han, V. S. Raghavan, G. Mischler, and N. Mesgarani. Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. arXiv preprint arXiv:2306.07691, 2023.
[55] H. Liu, K. Chen, Q. Tian, W. Wang, and M. D. Plumbley. Audiosr: Versatile audio super-resolution at scale. arXiv preprint arXiv:2309.07314, 2023.
[56] S. Liu, Y. Cao, D. Wang, X. Wu, X. Liu, and H. Meng. Any-to-many voice conversion with location-relative sequence-to-sequence modeling. IEEE/ACM Trans. Audio, Speech, Lang. Process., 29:1717–1728, 2021.
[57] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In Proc. Int. Conf. Learn. Representations, 2019.
[58] Y.-X. Lu, Y. Ai, and Z.-H. Ling. MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra. In Proc. Interspeech, pages 3834–3838, 2023.
[59] D. Min, D. B. Lee, E. Yang, and S. J. Hwang. Meta-stylespeech: Multi-speaker adaptive text-to-speech generation. In Proc. Int. Conf. on Mach. Learn., pages 7748–7759. PMLR, 2021.
[60] M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y. Bengio. Chunked autoregressive GAN for conditional waveform synthesis. In Proc. Int. Conf. Learn. Representations, 2022.
[61] T. A. Nguyen, W.-N. Hsu, A. D’Avirro, B. Shi, I. Gat, M. Fazel-Zarandi, T. Remez, J. Copet, G. Synnaeve, M. Hassid, F. Kreuk, Y. Adi, and E. Dupoux. Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis. In Proc. Interspeech, pages 4823–4827, 2023.
[62] W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proc. of the IEEE/CVF Int. Conf. on Computer Vision, pages 4195–4205, 2023.
[63] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov. Grad-TTS: A diffusion probabilistic model for text-to-speech. In Proc. Int. Conf. on Mach. Learn., pages 8599–8608, 2021.
[64] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, M. S. Kudinov, and J. Wei. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. In Proc. Int. Conf. Learn. Representations, 2022.
[65] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516, 2023.
[66] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson. AutoVC: Zero-shot voice style transfer with only autoencoder loss. In Proc. Int. Conf. on Mach. Learn., pages 5210–5219, 2019.
[67] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
[68] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu. Fastspeech: Fast, robust and controllable text to speech. Proc. Adv. Neural Inf. Process. Syst., 32, 2019.
[69] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari. UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. In Proc. Interspeech, pages 4521–4525, 2022.
[70] K. Shen, Z. Ju, X. Tan, Y. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116, 2023.
[71] H. Siuzdak, P. Dura, P. van Rijn, and N. Jacoby. Wavthruvec: Latent speech representation as intermediate features for neural speech synthesis. arXiv preprint arXiv:2203.16930, 2022.
[72] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. Weiss, R. Clark, and R. A. Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. In Proc. Int. Conf. on Mach. Learn., pages 4693–4702. PMLR, 2018.
[73] S. Son, J. Kim, W.-S. Lai, M.-H. Yang, and K. M. Lee. Toward real-world super-resolution via adaptive downsampling models. IEEE Trans. Pattern Anal. Mach. Intell., 44(11):8657–8670, 2022.
[74] H. Sun, D. Wang, L. Li, C. Chen, and T. F. Zheng. Random cycle loss and its application to voice conversion. IEEE Trans. Pattern Anal. Mach. Intell., 45(8):10331–10345, 2023.
[75] X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He, et al. Naturalspeech: End-to-end text to speech synthesis with human-level quality. arXiv preprint arXiv:2205.04421, 2022.
[76] C. Veaux, J. Yamagishi, K. MacDonald, et al. Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. 2017.
[77] J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller. Dawn of the transformer era in speech emotion recognition: Closing the valence gap. IEEE Trans. Pattern Anal. Mach. Intell., 45(9):10745–10759, 2023.
[78] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
[79] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous. Tacotron: Towards End-to-End Speech Synthesis. In Proc. Interspeech, pages 4006–4010, 2017.
[80] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proc. Int. Conf. on Mach. Learn., pages 5180–5189. PMLR, 2018.
[81] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell., 41(9):2251–2265, 2019.
[82] Y. Xiao, L. Wu, J. Guo, J. Li, M. Zhang, T. Qin, and T.-Y. Liu. A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Trans. Pattern Anal. Mach. Intell., 45(10):11407–11427, 2023.
[83] H. Xue, S. Guo, P. Zhu, and M. Bi. Multi-gradspeech: Towards diffusion-based multi-speaker text-to-speech using consistent diffusion models. arXiv preprint arXiv:2308.10428, 2023.
[84] D. Yang, S. Liu, R. Huang, G. Lei, C. Weng, H. Meng, and D. Yu. Instructtts: Modelling expressive tts in discrete latent space with natural language style prompt. arXiv preprint arXiv:2301.13662, 2023.
[85] D. Yang, S. Liu, R. Huang, J. Tian, C. Weng, and Y. Zou. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. arXiv preprint arXiv:2305.02765, 2023.
[86] D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, X. Chang, J. Shi, S. Zhao, J. Bian, X. Wu, et al. Uniaudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704, 2023.
[87] Z. Ye, W. Xue, X. Tan, J. Chen, Q. Liu, and Y. Guo. Comospeech: One-step speech and singing voice synthesis via consistency model. arXiv preprint arXiv:2305.06908, 2023.
[88] C.-Y. Yu, S.-L. Yeh, G. Fazekas, and H. Tang. Conditioning and sampling in variational diffusion models for speech super-resolution. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 1–5, 2023.
[89] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE/ACM Trans. Audio, Speech, Lang. Process., 30:495–507, 2021.
[90] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proc. Interspeech, pages 1526–1530, 2019.
This paper is available on arXiv under CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).