
How We Used the LibriTTS Dataset to Train the Hierarchical Speech Synthesizer


Table of Links

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 HierSpeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5 EXPERIMENT AND RESULT

TABLE 1: Training datasets. We utilize publicly available speech datasets to train the model. For TTV, we utilize only the LibriTTS dataset.

5.1 Dataset

We utilized the LibriTTS dataset [90] to train the hierarchical speech synthesizer. First, we trained the model with the train-clean subsets of LibriTTS (train-clean-100 and train-clean-360) for a fair comparison. Additionally, we utilized the train-other-500 subset of LibriTTS for better voice style transfer. Furthermore, we scaled up the dataset to 1k hours to improve robustness and diversity, as indicated in TABLE 1 [2]. For the Libri-light [27] and Multi-Speaker Speech Synthesis (MSSS) dataset of AIHub [3], we sampled a small portion of speech from each speaker. We also used the EXPRESSO [61] and NIKL [4] datasets. We downsampled the audio to 16 kHz and normalized it to the range [-0.95, 0.95]. For text-to-vec, we utilized all the train subsets of LibriTTS. For SpeechSR, we used the VCTK dataset [76], which has a sampling rate of 48 kHz, to compare the models. However, we also trained the model on a large-scale dataset for better speech super-resolution performance by including the MSSS dataset, VCTK, and EXPRESSO.
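As a concrete illustration of the preprocessing step described above, the sketch below resamples a waveform to 16 kHz and peak-normalizes it into the [-0.95, 0.95] range. The library choices (librosa, soundfile), function name, and file names are assumptions made for illustration; the paper does not specify the authors' actual tooling.

import librosa
import soundfile as sf

def preprocess(in_path: str, out_path: str, target_sr: int = 16000, peak: float = 0.95) -> None:
    # Load the audio and resample it to the target sampling rate (16 kHz).
    wav, _ = librosa.load(in_path, sr=target_sr)
    # Peak-normalize so the largest absolute sample equals `peak`,
    # i.e. all samples end up within [-0.95, 0.95].
    max_amp = abs(wav).max()
    if max_amp > 0:
        wav = wav * (peak / max_amp)
    sf.write(out_path, wav, target_sr)

# Hypothetical usage on a 24 kHz LibriTTS clip (file names are illustrative only).
preprocess("libritts_24khz.wav", "libritts_16khz.wav")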


This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.


[2] Although we hope to increase the data scale to over 10k hours, this is the maximum possible with our academic resources.


[3] https://aihub.or.kr


[4] https://www.nia.or.kr/

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.

