2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
Before comparing the model with other baselines on the TTS and VC tasks, we conducted ablation studies comparing Reconstruction [10], Resynthesis [11], and VC performance to verify the effectiveness of each component of HierSpeech++. First, although previous E2E models have shown high-quality waveform audio generation, their zero-shot speech synthesis performance was considerably lower, and some studies had to fine-tune the model or use a speaker ID for speaker adaptation. Recently, HierVST significantly improved the voice style transfer performance of E2E models; we therefore conducted the ablation studies by building on HierVST.
AMP Block. We first replaced the MRF block of HiFi-GAN with the AMP block of BigVGAN for out-of-distribution (OOD) generation. The AMP block improved performance on all tasks in terms of all metrics except F0 consistency. The results show that the BigVGAN-based HAG performs better, although the loss balance may lean toward optimizing waveform reconstruction rather than F0 prediction; nevertheless, the naturalness and similarity of the converted speech improved on all metrics, with objective naturalness in particular showing a better UTMOS.
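For concreteness, the core of BigVGAN's AMP module is a periodic Snake activation inside a dilated residual block. The following is a minimal PyTorch sketch, not the paper's implementation: the channel count, kernel size, and dilations are illustrative assumptions, and the anti-aliased up/down-sampling filters that BigVGAN wraps around each activation are omitted.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation x + (1/a) * sin^2(a * x) with a learnable
    per-channel frequency a, as used in BigVGAN."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + (1.0 / (self.alpha + 1e-9)) * torch.sin(self.alpha * x) ** 2

class AMPResBlock(nn.Module):
    """Dilated residual block with Snake activations (illustrative sizes).
    BigVGAN additionally applies anti-aliased up/down-sampling around
    each activation, omitted here for brevity."""
    def __init__(self, channels: int = 256, kernel_size: int = 3,
                 dilations=(1, 3, 5)):
        super().__init__()
        self.acts = nn.ModuleList([Snake(channels) for _ in dilations])
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=d * (kernel_size - 1) // 2)
            for d in dilations
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for act, conv in zip(self.acts, self.convs):
            x = x + conv(act(x))  # residual connection
        return x
```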
SF Encoder. To address F0 consistency, we utilized a source-filter (SF) encoder (SFE) to build a dual-path semantic encoder, which enhances the semantic prior and improves all metrics. In particular, it significantly improved F0 consistency in the inference scenario. It is worth noting that F0 can also be controlled manually.
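A minimal sketch of such a dual-path encoder is given below, assuming a content path over self-supervised speech features and a source path over coarsely quantized F0; all names and dimensions are hypothetical rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SourceFilterEncoder(nn.Module):
    """Hypothetical dual-path semantic encoder: a content (filter) path
    over self-supervised features and a source path over quantized F0,
    fused into an F0-aware semantic representation."""
    def __init__(self, ssl_dim: int = 1024, hidden: int = 256,
                 n_f0_bins: int = 256):
        super().__init__()
        self.content = nn.Conv1d(ssl_dim, hidden, kernel_size=3, padding=1)
        self.f0_embed = nn.Embedding(n_f0_bins, hidden)  # coarse log-F0 bins
        self.fuse = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)

    def forward(self, ssl_feats: torch.Tensor, f0_bins: torch.Tensor):
        # ssl_feats: [B, ssl_dim, T]; f0_bins: [B, T] integer bin indices
        c = self.content(ssl_feats)
        s = self.f0_embed(f0_bins).transpose(1, 2)  # [B, hidden, T]
        return self.fuse(c + s)
```

Because F0 enters the encoder as an explicit input, it can be scaled or replaced at inference time, which is what makes manual F0 control possible.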
Dual-audio Encoder. We also utilized a dual-audio posterior encoder (DAE) to increase the capacity of the acoustic representation, which significantly improved reconstruction performance. Although the linear spectrogram contains useful information for reconstructing waveform audio, it still cannot reproduce all of the information; therefore, additional features from the waveform itself can complement the wave-level acoustic representation. It is worth noting that the DAE is used only during training, yet it significantly improved reconstruction and pronunciation. However, we found that the enhanced acoustic posterior carries so much information that it reduces VC performance.
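This can be sketched as two parallel encoders whose features are fused into a single posterior. The strided waveform stack below, which downsamples raw audio to the spectrogram frame rate, is a hypothetical stand-in for the paper's actual encoder; the waveform branch is simply dropped after training.

```python
import torch
import torch.nn as nn

class DualAudioPosterior(nn.Module):
    """Hypothetical dual-audio posterior encoder: a spectrogram branch
    plus a waveform branch, fused and projected to Gaussian statistics.
    The waveform branch is only needed at training time."""
    def __init__(self, n_spec: int = 513, hidden: int = 192):
        super().__init__()
        self.spec_enc = nn.Conv1d(n_spec, hidden, kernel_size=5, padding=2)
        # strides 4*4*4*5 = 320 samples per frame (assumed hop size)
        self.wave_enc = nn.Sequential(
            nn.Conv1d(1, hidden, 8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(hidden, hidden, 8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(hidden, hidden, 8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(hidden, hidden, 10, stride=5, padding=3),
        )
        self.proj = nn.Conv1d(2 * hidden, 2 * hidden, 1)  # -> mean, log-var

    def forward(self, spec: torch.Tensor, wav: torch.Tensor):
        # spec: [B, n_spec, T]; wav: [B, 1, T * 320]
        w = self.wave_enc(wav)[..., :spec.size(-1)]  # align frame counts
        h = torch.cat([self.spec_enc(spec), w], dim=1)
        mean, logvar = self.proj(h).chunk(2, dim=1)
        return mean, logvar
```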
T-Flow. To bridge the gap between the representations, we replaced the WaveNet-based normalizing flow with a Transformer-based normalizing flow (T-Flow) using AdaLN-Zero for style adaptation. This improved performance across all metrics; in particular, speaker similarity improved significantly.
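AdaLN-Zero, popularized by Diffusion Transformers, conditions each sub-layer on a style vector through a zero-initialized modulation, so every block starts as an identity map. A minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """AdaLN-Zero: the style vector predicts a per-channel shift, scale,
    and residual gate; the projection is zero-initialized so the wrapped
    sub-layer initially contributes nothing."""
    def __init__(self, dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(style_dim, 3 * dim)
        nn.init.zeros_(self.to_mod.weight)
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x: torch.Tensor, style: torch.Tensor, sublayer):
        # x: [B, T, dim]; style: [B, style_dim]; sublayer: attention or FFN
        shift, scale, gate = self.to_mod(style).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift
        return x + gate * sublayer(h)  # gated residual, zero at init
```

In a T-Flow coupling block, each attention and feed-forward sub-layer would presumably be wrapped this way, with the speaker style embedding as the conditioning input.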
Bi-Flow. Moreover, we adopted a bidirectional normalizing flow (Bi-Flow) to reduce the train-inference mismatch. The results show that Bi-Flow slightly decreases reconstruction quality; however, it regularizes the posterior by conditioning on the information used in the inference scenario, thereby improving VC performance. We also found that a high Bi-Flow weight significantly decreased reconstruction performance, so we use λ = 0.5 for weak regularization.
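One plausible way to write such an objective is sketched below, assuming a flow interface that returns the transformed latent and its log-determinant; gaussian_nll, cond_train, and cond_infer are our illustrative names, and the paper's exact formulation may differ.

```python
import torch

def gaussian_nll(z, mean, logvar):
    """Negative log-likelihood under a diagonal Gaussian (constants dropped)."""
    return 0.5 * (((z - mean) ** 2) * torch.exp(-logvar) + logvar).sum(1).mean()

def bi_flow_loss(flow, z_post, prior_mean, prior_logvar,
                 cond_train, cond_infer, lam=0.5):
    """Forward term uses training-time conditions; the extra term re-runs
    the flow with the conditions available at inference, down-weighted by
    lam (0.5 here) so it acts only as weak regularization."""
    z_f, logdet_f = flow(z_post, cond_train)
    loss_f = gaussian_nll(z_f, prior_mean, prior_logvar) - logdet_f.mean()
    z_b, logdet_b = flow(z_post, cond_infer)
    loss_b = gaussian_nll(z_b, prior_mean, prior_logvar) - logdet_b.mean()
    return loss_f + lam * loss_b
```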
Large-scale Data. In addition, we demonstrated that our model is robust to data scale-up. We did not use any labels to train the model, and thanks to SpeechSR we only needed a low-resolution 16 kHz speech dataset, which is easy to obtain. For scaling up, we did not conduct any preprocessing beyond down-sampling (any sampling rate to 16 kHz); as a result, our dataset contains noisy samples, but we did not experience any problems.
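As a sketch of that single down-sampling step, resampling to 16 kHz with torchaudio might look like the following (the function name is ours):

```python
import torch
import torchaudio

def load_16k(path: str) -> torch.Tensor:
    """Load an audio file and down-sample it to 16 kHz if needed."""
    wav, sr = torchaudio.load(path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
    return wav
```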
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
[10] Reconstruction: Posterior Encoder → Generator → Audio
[11] Resynthesis: Prior Encoder → Generator → Audio
Authors:
(1) Sang-Hoon Lee, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.