
A Deeper Look at Speech Super-Resolution


Table of Links

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 HierSpeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.10 Speech Super-resolution

TABLE 11: Comparison with VALL-E, NaturalSpeech 2, and StyleTTS 2. Only the four samples from the demo pages of NaturalSpeech 2 and StyleTTS 2 were compared, so this experiment is for reference only.


We introduce SpeechSR for simple and efficient speech super-resolution in real-world practical applications [73]. Because we train a target-specific SpeechSR that upsamples 16 kHz audio to 48 kHz, our model achieves the best performance even with a simple architecture, as shown in TABLE 10. For a fair comparison, we trained the model on the VCTK dataset and compared it with several publicly available super-resolution models. While the other models perform multi-task super-resolution, we focus solely on 16 kHz to 48 kHz upsampling for the speech synthesis model. In addition, the DWT (discrete wavelet transform)-based discriminators further improve super-resolution performance. We also scaled up the dataset for more robust speech super-resolution, and we will release the source code and a checkpoint trained on a large-scale, high-resolution open-source dataset. For the preference test, Fig. 9 shows that the upsampled speech outperforms the original speech. It is also worth noting that SpeechSR runs 742× faster than AudioSR (speech version) with a 1,986× smaller parameter size (SpeechSR has only 0.13M parameters, whereas AudioSR has 258.20M). However, we acknowledge that the other models also perform well and can upsample input audio at sampling rates as low as 2 kHz to high-resolution 48 kHz audio.
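To make the efficiency argument concrete, below is a minimal, hypothetical PyTorch sketch of a 16 kHz → 48 kHz upsampler in this spirit: a fixed 3× nearest-neighbour upsampler paired with a small convolutional stack. Everything here (the `ToySpeechSR` name, the channel width, and the layer layout) is an illustrative assumption, not the released SpeechSR architecture, which the authors describe in Section 3.4.

```python
import torch
import torch.nn as nn

class ToySpeechSR(nn.Module):
    """Illustrative 16 kHz -> 48 kHz upsampler; NOT the released SpeechSR.

    Captures only the design intent from the paper: keep the model tiny by
    pairing a fixed 3x upsampler with a light convolutional stack, rather
    than using a large diffusion-based or multi-task SR model.
    """

    def __init__(self, channels: int = 32):  # channel width is an arbitrary choice
        super().__init__()
        # 3x upsampling turns 16,000 samples/s into 48,000 samples/s.
        self.upsample = nn.Upsample(scale_factor=3, mode="nearest")
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=7, padding=3),
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, channels, kernel_size=7, padding=3),
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # keep the output waveform in [-1, 1]
        )

    def forward(self, wav_16k: torch.Tensor) -> torch.Tensor:
        # wav_16k: (batch, 1, num_samples) at 16 kHz
        return self.net(self.upsample(wav_16k))

model = ToySpeechSR()
one_second = torch.randn(1, 1, 16000)              # 1 s of 16 kHz audio
print(model(one_second).shape)                     # torch.Size([1, 1, 48000])
print(sum(p.numel() for p in model.parameters()))  # ~7.7k params (SpeechSR: 0.13M)
```

As a quick sanity check on the quoted efficiency numbers, 258.20M / 0.13M ≈ 1,986, which matches the reported parameter-size ratio.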


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (Corresponding Author).

