
A Deeper Look at Speech Super-Resolution


Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.10 Speech Super-resolution

We introduced SpeechSR for simple and efficient speech super-resolution in real-world practical applications [73]. Because we train a target-specific SpeechSR that upsamples 16 kHz audio to 48 kHz, our model achieves the best performance even with a simple architecture, as indicated in TABLE 10. For a fair comparison, we trained the model on the VCTK dataset and compared it with several publicly available super-resolution models. Although other models perform multi-task super-resolution, we focus solely on 16 kHz-to-48 kHz


TABLE 11: Comparison with VALL-E, NaturalSpeech 2, and StyleTTS 2. We compared only four samples from the demo pages of NaturalSpeech 2 and StyleTTS 2, so this experiment is for reference only.


upsampling for the speech synthesis model. In addition, DWT-based discriminators also improve super-resolution performance. Furthermore, we scaled up the dataset for more robust speech super-resolution, and we will release the source code and a checkpoint trained on a large-scale, high-resolution open-source dataset. For the preference test, Fig. 9 shows that the upsampled speech is preferred over the original speech. It is also worth noting that SpeechSR has a 742× faster inference speed than AudioSR (speech version) and a 1,986× smaller parameter size (SpeechSR has only 0.13M parameters, while AudioSR has 258.20M). However, we acknowledge that the other models also perform well and can upsample input audio with sampling rates as low as 2 kHz to high-resolution 48 kHz audio.
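The DWT-based discriminators mentioned above operate on wavelet subbands of the waveform rather than on the raw signal. As a minimal illustration (not the paper's implementation; the function name `haar_dwt` is our own), a single-level Haar discrete wavelet transform in NumPy splits a waveform into low- and high-frequency subbands, each at half the original sampling rate:

```python
import numpy as np

def haar_dwt(x):
    # Single-level Haar discrete wavelet transform:
    # splits a waveform into a low-band (approximation) and a
    # high-band (detail) signal, each half the input length.
    x = x[: len(x) // 2 * 2]  # truncate to even length
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2)
    high = (even - odd) / np.sqrt(2)
    return low, high

# Toy 1 kHz sine sampled at 48 kHz (10 ms).
t = np.arange(480) / 48000.0
sig = np.sin(2 * np.pi * 1000 * t)
low, high = haar_dwt(sig)
print(len(low), len(high))  # prints: 240 240
```

A discriminator applied to each subband separately can then penalize artifacts in the reconstructed high band, which is where naive upsamplers tend to fail.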
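The reported model-size gap can be sanity-checked directly from the parameter counts stated above (0.13M for SpeechSR vs. 258.20M for AudioSR):

```python
# Parameter counts as reported in the text (in millions).
speechsr_params_m = 0.13
audiosr_params_m = 258.20

ratio = audiosr_params_m / speechsr_params_m
print(f"AudioSR / SpeechSR parameter ratio: {ratio:.0f}x")  # prints: ... 1986x
```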


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.