2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
Diffusion models have also demonstrated powerful generative performance in speech synthesis. Grad-TTS [63] first introduced a score-based decoder to generate a Mel-spectrogram, and Diff-VC demonstrated the strong adaptation performance of diffusion models in zero-shot voice conversion scenarios. DiffSinger achieved state-of-the-art performance in the singing voice synthesis (SVS) task by generating a high-quality singing voice with powerful adaptation performance. DDDM-VC [10] significantly improved speech representation disentanglement [3] and voice conversion performance with a disentangled denoising diffusion model and prior mixup. Diff-HierVC [11] introduced a hierarchical voice style transfer framework that generates the pitch contour and voice hierarchically based on diffusion models. Guided-TTS [33] and Guided-TTS 2 [38] have also shown good speaker adaptation performance for TTS, and UnitSpeech [32] introduced unit-based speech synthesis with diffusion models. Furthermore, recent studies have applied diffusion models to latent representations. NaturalSpeech 2 [70] and HiddenSinger [24] utilized the acoustic representation of an audio autoencoder as a latent representation and developed a conditional latent diffusion model for speech or singing voice synthesis, and StyleTTS 2 [54] proposed a style latent diffusion for style adaptation. Although all of the above models have shown powerful adaptation performance, their iterative generation results in slow inference. To reduce inference time, CoMoSpeech [87] and Multi-GradSpeech [83] adopted a consistency model for diffusion-based TTS. Recently, VoiceBox [43] and P-Flow [39] utilized flow matching with optimal transport for fast sampling. However, these models still suffer from a training-inference mismatch problem that arises from two-stage speech synthesis frameworks, and they are vulnerable to noisy target voice prompts.
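Because the speed limitation above is central to the paper's motivation, a brief sketch of the two sampling formulations may help. The equations below use generic notation and are only a simplified summary, not the exact objectives of the cited works: score-based models such as Grad-TTS simulate a reverse-time SDE whose score must be evaluated at every denoising step, whereas flow matching with an optimal-transport path, as used by VoiceBox and P-Flow, regresses a velocity field along a nearly straight interpolation that can be integrated in only a few ODE steps.

\[
\mathrm{d}x_t = -\tfrac{1}{2}\beta_t x_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}W_t,
\qquad
\mathrm{d}x_t = \left(-\tfrac{1}{2}\beta_t x_t - \beta_t \nabla_{x_t}\log p_t(x_t)\right)\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\bar{W}_t,
\]

where the unknown score \(\nabla_{x_t}\log p_t(x_t)\) is approximated by a network \(s_\theta(x_t, t)\) and the reverse SDE must be solved iteratively at inference time. Conditional flow matching with an optimal-transport path instead defines

\[
x_t = \bigl(1-(1-\sigma_{\min})t\bigr)x_0 + t\,x_1,
\qquad
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}\bigl\| v_\theta(x_t, t) - \bigl(x_1 - (1-\sigma_{\min})x_0\bigr) \bigr\|_2^2,
\]

with noise \(x_0 \sim \mathcal{N}(0, I)\) and data \(x_1\), so that sampling reduces to integrating the learned velocity field \(v_\theta\) along a nearly straight path in far fewer steps.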
Zero-shot learning [81] for voice cloning is the task of synthesizing speech for a novel speaker that has not been observed during training. A majority of studies on voice cloning [25], [74] focus on cloning voice styles, such as timbre and environment, and speaking styles, such as prosody and pronunciation. [72] presented a reference encoder for prosody modeling, and GST [80] utilized learnable tokens for style modeling from reference speech or manual control. [51] proposed fine-grained prosody control for expressive speech synthesis from reference speech. Multi-SpectroGAN [50] utilized adversarial feedback and a mixup strategy for expressive and diverse zero-shot TTS. Meta-StyleSpeech [59] introduced meta-learning for style modeling, and GenerSpeech [22] utilized mix-style layer normalization for better generalization in out-of-domain style transfer. PVAE-TTS [44] utilized progressive style adaptation for high-quality zero-shot TTS, and AdaSpeech [8] introduced adaptive layer normalization for adaptive speech synthesis. YourTTS [7] trained VITS [35] with a speaker encoder, and Grad-StyleSpeech [29] utilized a style-conditioned prior in a score-based Mel-spectrogram decoder for better adaptive TTS. Built upon VQTTS [17], TN-VQTTS [18] introduced timbre-normalized vector-quantized acoustic features for speaking style and timbre transfer. Meanwhile, there are text prompt-based style generation models that can control a speaking or voice style through text descriptions [19], [52], [84].
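Several of the adaptation mechanisms above (e.g., the adaptive layer normalization of AdaSpeech and the style-adaptive normalization of Meta-StyleSpeech) share a common idea: a style embedding extracted from reference speech predicts the normalization parameters of the synthesis network. The PyTorch module below is a minimal sketch of that idea only; the class name, shapes, and dimensions are illustrative assumptions, not the implementation of any cited paper.

```python
import torch
import torch.nn as nn


class StyleAdaptiveLayerNorm(nn.Module):
    """Minimal sketch of style-conditioned layer normalization.

    A style (speaker) embedding predicts the per-channel gain and bias
    that replace the fixed affine parameters of standard LayerNorm,
    letting the model adapt its hidden features to an unseen voice.
    Shapes and names here are illustrative, not from a specific paper.
    """

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # LayerNorm without its own affine parameters
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Style embedding -> (gain, bias) for each hidden channel
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        # Broadcast the predicted gain/bias over the time axis
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)


if __name__ == "__main__":
    saln = StyleAdaptiveLayerNorm(hidden_dim=256, style_dim=128)
    h = torch.randn(2, 100, 256)  # hidden frames from a text encoder
    s = torch.randn(2, 128)       # reference-speech style embedding
    print(saln(h, s).shape)       # torch.Size([2, 100, 256])
```

In this sketch, zero-shot adaptation comes entirely from the style embedding at inference time; no parameters of the synthesizer are fine-tuned for the unseen speaker.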
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.