Style Prompt Replication: A Simple Trick That Helped Us In Our Journey

by The FewShot Prompting Publication, December 19th, 2024

Too Long; Didn't Read

We found a simple trick to transfer the style even with a one second speech prompt by introducing style prompt replication (SPR).

4.3 Style Prompt Replication

We found a simple trick for transferring style even from a one-second speech prompt: style prompt replication (SPR). Similar to DNA replication, we copy the same prompt sequence, as shown in Fig. 7. The prompt, replicated n times, is fed to the style encoder to extract the style representation. The motivation is that the style encoder usually encounters long prompts of over 3 s, so speech synthesized from a short prompt may be generated incorrectly. SPR makes a short prompt look like a long one to the style encoder, so we can synthesize speech even from a 1 s speech prompt.
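The replication step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the text only says the prompt is replicated n times, so the helper `choose_n`, the 3 s target, the 16 kHz sampling rate, and the choice to replicate raw samples (rather than, say, spectrogram frames) are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def replicate_prompt(samples: Sequence[float], n: int) -> List[float]:
    """Return the prompt repeated n times back to back (the SPR step)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(samples) * n


def choose_n(prompt_seconds: float, target_seconds: float = 3.0) -> int:
    """Hypothetical helper: smallest n whose replicated prompt reaches target_seconds.

    The 3 s default mirrors the prompt length mentioned in the text; the
    source does not say how n is chosen.
    """
    if prompt_seconds <= 0:
        raise ValueError("prompt_seconds must be positive")
    return max(1, math.ceil(target_seconds / prompt_seconds))


# Toy 1 s prompt of silence at an assumed 16 kHz sampling rate.
sample_rate = 16_000
prompt = [0.0] * sample_rate

n = choose_n(len(prompt) / sample_rate)
replicated = replicate_prompt(prompt, n)
# `replicated` would be passed to the style encoder in place of `prompt`.
```

Because the replicated sequence contains no new information, the extracted style should still reflect only the original one-second prompt; the repetition just brings the input length closer to what the style encoder usually sees.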

This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).