HierSpeech++: How Does It Compare to Vall-E, Natural Speech 2, and StyleTTS2?

Too Long; Didn't Read

We compared the zero-shot TTS performance of our model with Vall-E, NaturalSpeech 2, and StyleTTS 2.

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 HierSpeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.11 Additional Experiments with Other Baselines

We compared the zero-shot TTS performance of our model with Vall-E, NaturalSpeech 2, and StyleTTS 2. Because there are no official implementations of these models, we used the demo samples from NaturalSpeech 2 and StyleTTS 2, and we compared only four samples in this experiment. We also added the audio samples to our demo page. For naturalness, we used UTMOS, and our model shows a significantly higher score than the others. We also compared speaker similarity with the prompt and with the ground truth (GT). TABLE 11 shows that our model has a much higher similarity with the prompt than the other models, but a lower similarity with the GT. We found that the SECS between the prompt and the GT is also low for these samples, which means the prompt and GT have slightly different styles even though they come from the same speaker.
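
To make the prompt/GT comparison above concrete, the following is a minimal sketch of how SECS (speaker embedding cosine similarity) between two utterances can be computed. It assumes Resemblyzer's pretrained d-vector encoder as the speaker-verification model and uses hypothetical file names; the paper's actual embedding model and evaluation pipeline may differ.

```python
# Sketch of SECS: cosine similarity between speaker embeddings of two utterances,
# used to compare generated speech against the prompt and the ground truth (GT).
# Assumption: Resemblyzer's d-vector encoder stands in for the speaker-verification
# model; the file names below are hypothetical.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained d-vector speaker encoder


def speaker_embedding(path: str) -> np.ndarray:
    """Load an utterance and embed it as a fixed-size speaker vector."""
    wav = preprocess_wav(path)           # resamples to 16 kHz and trims long silences
    return encoder.embed_utterance(wav)  # returns a 256-dim embedding


def secs(path_a: str, path_b: str) -> float:
    """Speaker Embedding Cosine Similarity between two audio files."""
    a, b = speaker_embedding(path_a), speaker_embedding(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    print("prompt vs. generated:", secs("prompt.wav", "generated.wav"))
    print("prompt vs. GT:       ", secs("prompt.wav", "ground_truth.wav"))
```

Resemblyzer's embeddings are already unit-normalized, so the dot product alone would suffice, but dividing by the norms keeps the function correct if a different embedding model is swapped in.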


TABLE 12: Results on Speech Prompts with Noise Suppression. HierSpeech++♠ denotes the cascaded denoising results of HierSpeech++ after speech synthesis. We only utilize the denoised audio as the speech prompt for the style encoder to extract a denoised style representation.


In addition, we used only four samples, so these results should be interpreted with caution. Nevertheless, we consider the similarity between the prompt and the generated speech to be more important than similarity with the GT for zero-shot speech synthesis.


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.