2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
Before comparing the model with other baselines on the TTS and VC tasks, we conducted ablation studies comparing reconstruction [10], resynthesis [11], and VC performance to verify the effectiveness of each component of HierSpeech++. First, although previous end-to-end (E2E) models have shown high-quality waveform generation, their zero-shot speech synthesis performance was considerably low, and some studies must be fine-tuned or require a speaker ID for speaker adaptation. Recently, HierVST significantly improved the voice style transfer performance of the E2E model; therefore, we conducted the ablation studies by building on HierVST.
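The three ablation settings differ only in which encoder feeds the generator, as defined in footnotes [10] and [11]. The following is a minimal sketch of the two inference paths; the module and argument names (posterior_encoder, prior_encoder, generator, style) are assumptions for illustration, not the released implementation.

```python
import torch

@torch.no_grad()
def reconstruct(model, linear_spec, style):
    # Reconstruction path [10]: posterior encoder -> generator -> audio.
    # The acoustic latent is inferred directly from the ground-truth spectrogram.
    z_acoustic = model.posterior_encoder(linear_spec, style)
    return model.generator(z_acoustic, style)

@torch.no_grad()
def resynthesize(model, semantic_feats, f0, style):
    # Resynthesis path [11]: prior encoder -> generator -> audio.
    # The acoustic latent is sampled from the semantic/F0 prior instead.
    z_prior = model.prior_encoder(semantic_feats, f0, style)
    return model.generator(z_prior, style)
```

Voice conversion then follows the resynthesis path while swapping in the target speaker's style representation.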
AMP Block. We first replaced the MRF block of HiFi-GAN with the AMP block of BigVGAN for out-of-distribution (OOD) generation. The AMP block improved performance on all tasks in terms of all metrics except F0 consistency. The results show that the BigVGAN-based HAG performs better, but the loss balance may lean toward optimizing waveform reconstruction rather than F0 prediction; nevertheless, the naturalness and similarity of the converted speech improved in terms of all metrics. Specifically, objective naturalness improved in terms of UTMOS.
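For readers unfamiliar with the swap, the sketch below shows the periodic Snake activation at the core of BigVGAN's AMP block inside a simplified residual stack. This is an assumption-laden simplification: BigVGAN additionally wraps each activation in anti-aliased up/down-sampling filters, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Periodic Snake activation: x + (1/alpha) * sin^2(alpha * x)."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return x + (1.0 / (self.alpha + 1e-9)) * torch.sin(self.alpha * x) ** 2

class SimplifiedAMPBlock(nn.Module):
    """Residual stack of dilated convolutions with Snake activations
    (anti-aliasing filters of the full AMP block omitted)."""
    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d, padding=d)
            for d in dilations
        ])
        self.acts = nn.ModuleList([Snake(channels) for _ in dilations])

    def forward(self, x):
        for act, conv in zip(self.acts, self.convs):
            x = x + conv(act(x))  # residual connection around each dilated conv
        return x
```

The periodic inductive bias of Snake is what primarily targets OOD waveform generation, replacing the plain LeakyReLU-based MRF stack of HiFi-GAN.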
SF Encoder. To address F0 consistency, we utilize a source-filter (SF) encoder (SFE) in a dual-path semantic encoder, which enhances the semantic prior in terms of all metrics. This significantly improved F0 consistency in the inference scenario. It is worth noting that F0 can also be controlled manually.
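A minimal sketch of such a dual-path semantic encoder is given below, assuming the linguistic branch consumes self-supervised speech features and the SF branch consumes quantized F0 indices; all dimensions and module names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualPathSemanticEncoder(nn.Module):
    """Hypothetical sketch: a linguistic branch over SSL speech features and a
    source-filter (SF) branch over quantized F0, fused into one semantic prior."""
    def __init__(self, ssl_dim=1024, f0_bins=256, hidden=192):
        super().__init__()
        self.linguistic = nn.Conv1d(ssl_dim, hidden, kernel_size=1)
        self.f0_embed = nn.Embedding(f0_bins, hidden)
        self.sf_encoder = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.fuse = nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=1)  # -> mean / log-variance

    def forward(self, ssl_feats, f0_ids):
        # ssl_feats: [B, ssl_dim, T]; f0_ids: [B, T] (assumed frame-aligned).
        ling = self.linguistic(ssl_feats)
        sf = self.sf_encoder(self.f0_embed(f0_ids).transpose(1, 2))
        stats = self.fuse(torch.cat([ling, sf], dim=1))
        mean, logvar = stats.chunk(2, dim=1)
        return mean, logvar
```

Because F0 enters the prior as an explicit input, replacing the extracted F0 contour with a user-edited one gives the manual pitch control mentioned above.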
Dual-audio Encoder. We also utilized a dual-audio posterior encoder (DAE) to increase the capacity of the acoustic representation, which significantly improves reconstruction performance. Although the linear spectrogram contains useful information for reconstructing waveform audio, this representation still cannot reproduce all of the information; therefore, additional information from the waveform audio can complement a wave-level acoustic representation. It is worth noting that the DAE is utilized only during training, yet it significantly improves reconstruction and pronunciation. However, we found that the enhanced acoustic posterior contains a large amount of information, which reduces VC performance.
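The sketch below illustrates one way such a dual-audio posterior could be formed, with a spectrogram branch and a frame-aligned waveform branch; the strided-convolution design and all names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DualAudioPosteriorEncoder(nn.Module):
    """Hypothetical sketch of a dual-audio posterior: a linear-spectrogram branch
    plus a raw-waveform branch. Used only at training time; inference samples
    from the prior instead."""
    def __init__(self, spec_dim=513, hidden=192, hop=320):
        super().__init__()
        self.spec_enc = nn.Conv1d(spec_dim, hidden, kernel_size=1)
        # Strided convolution aligns waveform features to the spectrogram frame rate.
        self.wave_enc = nn.Conv1d(1, hidden, kernel_size=2 * hop, stride=hop, padding=hop // 2)
        self.proj = nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=1)

    def forward(self, linear_spec, waveform):
        # linear_spec: [B, spec_dim, T]; waveform: [B, 1, T * hop]
        s = self.spec_enc(linear_spec)
        w = self.wave_enc(waveform)
        T = min(s.size(-1), w.size(-1))
        stats = self.proj(torch.cat([s[..., :T], w[..., :T]], dim=1))
        mean, logvar = stats.chunk(2, dim=1)
        return mean, logvar
```

Since the waveform branch is only needed to enrich the training-time posterior, it adds no cost at inference.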
T-Flow. To bridge the gap between the representations, we replace the WaveNet-based normalizing flow with a Transformer-based normalizing flow (T-Flow) using AdaLN-Zero for style adaptation. This also improved performance in terms of all metrics; in particular, speaker similarity improved significantly.
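The following is a minimal sketch of AdaLN-Zero-style conditioning as it is commonly described: the style vector predicts a shift, scale, and gate, and the gate projection is zero-initialized so each conditioned sub-layer starts as an identity mapping. Module names and shapes are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """Hypothetical AdaLN-Zero wrapper around a Transformer sub-layer."""
    def __init__(self, dim, style_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(style_dim, 3 * dim)
        nn.init.zeros_(self.to_mod.weight)  # zero-init -> block starts as identity
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, style, sublayer):
        # x: [B, T, dim]; style: [B, style_dim]; sublayer: attention or FFN callable.
        shift, scale, gate = self.to_mod(style).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift
        return x + gate * sublayer(h)
```

In a Transformer-based flow block, each self-attention and feed-forward sub-layer can be wrapped this way so that the voice style modulates the flow without destabilizing early training.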
Bi-Flow. Moreover, we adopt a bidirectional normalizing flow (Bi-Flow) to reduce the train-inference mismatch. The results show that Bi-Flow slightly decreases reconstruction quality. However, it can regularize the posterior by conditioning on the information used in the inference scenario, thereby improving VC performance. We also found that a high Bi-Flow weight significantly decreased reconstruction performance; thus, we use a λ of 0.5 for weak regularization.
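As a hypothetical sketch of this weighting (the loss names are placeholders, not the paper's notation), the reverse-direction term conditioned only on inference-time information is simply down-weighted by λ:

```python
def total_flow_loss(forward_loss, backward_loss, lam=0.5):
    # forward_loss: usual forward-direction flow objective (training path).
    # backward_loss: reverse-direction term conditioned on inference-time inputs.
    # A larger lam regularizes the posterior harder but hurts reconstruction,
    # hence the weak setting lam = 0.5 reported above.
    return forward_loss + lam * backward_loss
```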
Large-scale Data. In addition, we demonstrated that our model is robust to data scale-up. We did not use any labels to train the model and only used low-resolution 16 kHz speech data, which is simple to obtain because SpeechSR restores the high-resolution audio afterward. For scaling up, we did not perform any preprocessing other than down-sampling (any sampling rate to 16 kHz), so the dataset contains noisy samples; however, we did not experience any problems.
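A sketch of the only preprocessing step described above, assuming a torchaudio-based loader (the helper name is hypothetical):

```python
import torchaudio
import torchaudio.functional as AF

def load_16k(path, target_sr=16000):
    """Load a training clip and down-sample it to 16 kHz.
    No denoising or filtering is applied; SpeechSR handles
    super-resolution to 24/48 kHz at inference time."""
    wav, sr = torchaudio.load(path)
    if sr != target_sr:
        wav = AF.resample(wav, sr, target_sr)
    return wav
```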
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
[10] Reconstruction: Posterior Encoder → Generator → Audio
[11] Resynthesis: Prior Encoder → Generator → Audio
Authors:
(1) Sang-Hoon Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.