HierSpeech++: How Does It Compare to Vall-E, Natural Speech 2, and StyleTTS2?

Too Long; Didn't Read

We compared the zero-shot TTS performance of our model with Vall-E, NaturalSpeech 2, and StyleTTS 2.

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 HierSpeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.11 Additional Experiments with Other Baselines

We compared the zero-shot TTS performance of our model with Vall-E, NaturalSpeech 2, and StyleTTS 2. Because there are no official implementations of these models, we used the demo samples from NaturalSpeech 2 and StyleTTS 2, and we compared only four samples in this experiment. We also added the audio samples to our demo page. For naturalness, we used UTMOS, and our model shows a significantly higher score than the others. We also compared speaker similarity with the prompt and with the ground truth (GT). TABLE 11 shows that our model has a much higher similarity with the prompt than the other models, but a lower similarity with the GT. We found that the SECS between the prompt and the GT is also low for these samples, which means the prompt and GT have slightly different styles even though they come from the same speaker.
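
To make the prompt/GT comparison above concrete, the following is a minimal sketch of how SECS (speaker embedding cosine similarity) between two utterances can be computed. It assumes Resemblyzer's pretrained d-vector encoder as the speaker-verification model and uses hypothetical file names; the paper's actual embedding model and evaluation pipeline may differ.

```python
# Sketch of SECS: cosine similarity between speaker embeddings of two utterances,
# used to compare generated speech against the prompt and the ground truth (GT).
# Assumption: Resemblyzer's d-vector encoder stands in for the speaker-verification
# model; the file names below are hypothetical.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained d-vector speaker encoder


def speaker_embedding(path: str) -> np.ndarray:
    """Load an utterance and embed it as a fixed-size speaker vector."""
    wav = preprocess_wav(path)           # resamples to 16 kHz and trims long silences
    return encoder.embed_utterance(wav)  # returns a 256-dim embedding


def secs(path_a: str, path_b: str) -> float:
    """Speaker Embedding Cosine Similarity between two audio files."""
    a, b = speaker_embedding(path_a), speaker_embedding(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    print("prompt vs. generated:", secs("prompt.wav", "generated.wav"))
    print("prompt vs. GT:       ", secs("prompt.wav", "ground_truth.wav"))
```

Resemblyzer's embeddings are already unit-normalized, so the dot product alone would suffice, but dividing by the norms keeps the function correct if a different embedding model is swapped in.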


TABLE 12: Results on Speech Prompts with Noise Suppression. HierSpeech++♠ denotes the cascaded denoising results of HierSpeech++ after speech synthesis. We only utilize the denoised audio as the speech prompt for the style encoder to extract a denoised style representation.


In addition, we used only four samples, so these results should be interpreted with caution. Nevertheless, we consider the similarity between the prompt and the generated speech to be more important than similarity with the GT for zero-shot speech synthesis.


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.