The 7 Objective Metrics We Used for the Reconstruction and Resynthesis Tasks

Table of Links

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.4 Evaluation Metrics

For the reconstruction and resynthesis tasks, we used seven objective metrics: log-scale Mel error distance (Mel), perceptual evaluation of speech quality (PESQ) [5], pitch, periodicity (Period.), voice/unvoice (V/UV) F1 score, and log-scale F0 consistency (F0c). We used the official implementation of CARGAN [60] for pitch, periodicity, and V/UV F1 [6]. For F0c, we calculated the L1 distance between the log-scale ground-truth F0 and the F0 predicted by the HAG.
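To make the reconstruction metrics concrete, here is a minimal sketch of the Mel distance, PESQ, and F0c computations. The mel parameters, file names, and helper functions are illustrative assumptions; the paper relies on the cited PESQ package [5] and the CARGAN implementation [6] (which extracts pitch and periodicity) rather than this exact code.

```python
# Sketch of the reconstruction metrics, assuming 16 kHz audio and
# standard 80-bin mel parameters (the paper does not pin these down here).
import librosa
import numpy as np
from pesq import pesq  # pip install pesq (https://github.com/ludlows/PESQ)

def log_mel(wav, sr=16000, n_mels=80):
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return np.log(np.clip(mel, 1e-5, None))

def mel_distance(ref, syn, sr=16000):
    """L1 distance between log-scale mel-spectrograms (hypothetical helper)."""
    m_ref, m_syn = log_mel(ref, sr), log_mel(syn, sr)
    n = min(m_ref.shape[1], m_syn.shape[1])  # trim to the shorter utterance
    return np.abs(m_ref[:, :n] - m_syn[:, :n]).mean()

def f0_consistency(f0_ref, f0_pred, eps=1e-8):
    """L1 distance between log-scale ground-truth and predicted F0,
    over frames both contours mark as voiced (F0 > 0)."""
    voiced = (f0_ref > 0) & (f0_pred > 0)
    return np.abs(np.log(f0_ref[voiced] + eps) - np.log(f0_pred[voiced] + eps)).mean()

ref, _ = librosa.load("ground_truth.wav", sr=16000)
syn, _ = librosa.load("resynthesized.wav", sr=16000)
print("Mel:", mel_distance(ref, syn))
print("PESQ (wb):", pesq(16000, ref, syn, "wb"))
```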


For VC, we used two subjective metrics, naturalness mean opinion score (nMOS) and voice similarity MOS (sMOS), each reported with a 95% confidence interval; three objective metrics for naturalness, UTMOS [69], character error rate (CER), and word error rate (WER); and two objective metrics for similarity, automatic speaker verification equal error rate (EER) and speaker encoder cosine similarity (SECS).

We utilized the open-source UTMOS [7], an MOS prediction model, as a naturalness metric. Although it cannot be considered an absolute evaluation metric, we believe it is a simple way to estimate the audio quality of synthetic speech, and it requires neither ground-truth audio nor labels. We therefore highly recommend using this simple metric during validation by adding a single line.

For CER and WER, we utilized Whisper's official implementation. We used the large model with 1,550M parameters and calculated the CER and WER after text normalization, as provided in the official implementation.

For EER, we utilized a pre-trained automatic speaker verification model [42] [8] trained on the large-scale VoxCeleb2 dataset [14]. In [13], the effectiveness of metric learning in automatic speaker verification was demonstrated, and [42] introduced online data augmentation, which decreased the EER from 2.17% to 1.17%. For SECS, we utilized the pre-trained speaker encoder Resemblyzer [9] to extract speaker representations, and we calculated the cosine similarity between the representations of the target speech and the synthetic speech.
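The sketch below shows how these objective VC metrics can be computed with the open-source tools cited above ([7], [8], [9], and Whisper). The file paths, the placeholder transcript, the jiwer scoring, the sklearn-based EER helper, and the exact hub tag are illustrative assumptions, not the paper's pipeline.

```python
import librosa
import numpy as np
import torch
import whisper  # pip install openai-whisper
import jiwer    # pip install jiwer
from resemblyzer import VoiceEncoder, preprocess_wav
from sklearn.metrics import roc_curve
from whisper.normalizers import EnglishTextNormalizer

# --- UTMOS: reference-free naturalness prediction (tarepan/SpeechMOS) ---
wave, sr = librosa.load("synthetic.wav", sr=None, mono=True)
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
                           trust_repo=True)
utmos = predictor(torch.from_numpy(wave).unsqueeze(0), sr)
print("UTMOS:", utmos.item())

# --- CER/WER: Whisper large + its text normalizer, scored with jiwer ---
asr = whisper.load_model("large")
normalize = EnglishTextNormalizer()
hyp = normalize(asr.transcribe("synthetic.wav")["text"])
ref = normalize("the ground truth transcript")  # placeholder transcript
print("WER:", jiwer.wer(ref, hyp), "CER:", jiwer.cer(ref, hyp))

# --- SECS: cosine similarity between Resemblyzer speaker embeddings ---
encoder = VoiceEncoder()
emb_tgt = encoder.embed_utterance(preprocess_wav("target.wav"))
emb_syn = encoder.embed_utterance(preprocess_wav("synthetic.wav"))
print("SECS:", float(np.dot(emb_tgt, emb_syn)))  # embeddings are L2-normalized

# --- EER: computed from verification scores of a pre-trained ASV model ---
def equal_error_rate(labels, scores):
    """Approximate EER from binary same/different-speaker labels and scores
    (hypothetical helper; the ASV trainer repo ships its own evaluation)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.nanargmin(np.abs(fpr - (1.0 - tpr)))
    return (fpr[idx] + (1.0 - tpr[idx])) / 2.0
```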


For TTS, we additionally used a prosody MOS (pMOS). Sixty samples were randomly selected for each model. The nMOS was rated by 10 listeners on a scale of 1-5, and the sMOS and pMOS were rated by 10 listeners on a scale of 1-4. All MOS values are reported with a 95% confidence interval.
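As a worked example of the reported interval, here is one common way to compute a 95% confidence interval for MOS, assuming a Student-t interval over per-sample mean ratings; the paper does not specify its exact CI computation, and the ratings below are made up.

```python
# Hypothetical 95% CI for MOS via a Student-t interval over mean ratings.
import numpy as np
from scipy import stats

ratings = np.array([4.2, 3.9, 4.5, 4.1, 3.8, 4.4, 4.0, 4.3])  # fabricated per-sample nMOS
mean = ratings.mean()
half_width = stats.t.ppf(0.975, df=len(ratings) - 1) * stats.sem(ratings)
print(f"nMOS: {mean:.2f} +/- {half_width:.2f} (95% CI)")
```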


This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.


[5] https://github.com/ludlows/PESQ


[6] https://github.com/descriptinc/cargan


[7] https://github.com/tarepan/SpeechMOS


[8] https://github.com/clovaai/voxceleb_trainer


[9] https://github.com/resemble-ai/Resemblyzer

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and corresponding author.

