## Forewords

My last article on voice cloning was more than a year ago, and here we are again to adopt some of the latest advancements. Referring to Chinese sources such as this blog and this video, I set out to adopt new tools for my audiobook service: CosyVoice, F5-TTS, GPT-SoVITS, and fish-speech.

Before we start, I recommend to:

- Install miniconda for dependency sanity:

  ```
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && sudo chmod +x Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh
  ```

- Set up the PyTorch environment as needed and confirm it with `python -m torch.utils.collect_env`.
- Install nvtop if preferred: `sudo apt install nvtop`.

## GPT-SoVITS

This project is made by the same group of people behind so-vits-svc. The model's quality improved greatly from v2 to v4. Although errors are unavoidable when doing long-text TTS, it is good enough for my use case. By the time of writing this article, they had released a new version, 20250606v2pro, which may differ in some details since I was using version 20250422v4.

You can also always use their "windows package", which comes packed with all models and, despite the name, works on Linux servers; it is intended to provide a more user-friendly "one-click" experience.

### Install on Linux

```
git clone https://github.com/RVC-Boss/GPT-SoVITS.git && cd GPT-SoVITS
conda create -n GPTSoVits python=3.10
conda activate GPTSoVits
# auto install script
bash install.sh --source HF --download-uvr5
# (optional) manual install
pip install -r extra-req.txt --no-deps
pip install -r requirements.txt
```

### Install FFmpeg and other deps

```
sudo apt install ffmpeg
sudo apt install libsox-dev
# (optional, for troubleshooting)
conda install -c conda-forge 'ffmpeg<7'
pip install -U gradio
python -m nltk.downloader averaged_perceptron_tagger_eng
```

### (optional) Download pretrained ASR models for Chinese

```
git lfs install
git clone https://www.modelscope.cn/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch.git tools/asr/models/speech_fsmn_vad_zh-cn-16k-common-pytorch
git clone https://www.modelscope.cn/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch.git tools/asr/models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch
git clone https://www.modelscope.cn/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git tools/asr/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
```

After all that is done, run `GRADIO_SHARE=0 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python webui.py` to start the server, then access it via http://ip:9874/
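The same long prefix of Gradio/telemetry variables shows up in front of nearly every launch command in this post, so I keep it in a small wrapper script. The script is just my own convenience (name and location are mine), not part of GPT-SoVITS or any of the other projects:

```
#!/usr/bin/env bash
# launch.sh: run any command with Gradio sharing and telemetry disabled.
# Usage: chmod +x launch.sh && ./launch.sh python webui.py
export GRADIO_SHARE=0
export GRADIO_ANALYTICS_ENABLED=0
export DISABLE_TELEMETRY=1
export DO_NOT_TRACK=1
exec "$@"
```

The commands below keep the inline form so they can be copy-pasted as-is.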
### 0-Fetch dataset

For recording preparation, see the preparation-for-vocal-recording section of my last article. I used Audacity instead of UVR5 because the recording is clean and clear.

Use the WebUI's built-in audio slicer to slice the new `recording.wav`, and put old recordings (if any) together with the result under `output/slicer_opt`.

Use the built-in batch ASR tool with faster-whisper, since I'm training a multilingual model this time.

The following issues are related exclusively to old GPU architectures; users of newer GPUs (30x0/40x0) have nothing to worry about.
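To see which camp a card falls into, PyTorch can report its compute capability directly (a generic check, not GPT-SoVITS-specific):

```
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
# prints e.g. "NVIDIA GeForce GTX 970 (5, 2)", i.e. SM_52, which hits the errors described in this post
```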
Troubleshooting 1: `RuntimeError: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device`

```
pip uninstall -y ctranslate2
pip install ctranslate2==3.24.0
```

Troubleshooting 2: `'iwrk': array([], dfitpack_int), 'u': array([], float),`

```
pip uninstall numpy scipy
pip install numba==0.60.0 numpy==1.26.4 scipy
```

After transcription finishes, use the built-in labeling tool (subfix) to remove bad samples. If the web page doesn't pop up, open `ip:9871` manually. Choose Audio, Delete Audio, and Save File when finished.

### 1-GPT-SOVITS-TTS

#### 1A-Dataset formatting

Fill in the empty fields and click Set One-Click Formatting:

```
# Text labelling file
/home/username/GPT-SoVITS/output/asr_opt/slicer_opt.list
# Audio dataset folder
output/slicer_opt
```

#### 1B-Fine-tuned training

1Ba-SoVITS training: use `batch size` 1, `total epoch` 5 and `save_every_epoch` 1.

1Bb-GPT training: for this part, my `batch size` is 6 with DPO enabled. Total training epochs should be around 5-15; adjust Save frequency based on needs.

Troubleshooting for old GPUs: `cuFFT doesn't support signals of half type with compute capability less than SM_53, but the device containing input half tensor only has SM_52`

- Fix 1: edit `webui.py` and add a new line `is_half = False` after `from multiprocessing import cpu_count`.
- Fix 2: edit `GPT_SoVITS/s2_train.py` and add `hps.train.fp16_run = False` at the beginning (alongside `torch.backends.cudnn.benchmark = False`).

#### 1C-inference

Click refreshing model paths and select the new model in both lists. Check Enable Parallel Inference Version, then open TTS Inference WebUI; it needs a while to load, so access `ip:9872` manually if needed.

Troubleshooting: for `ValueError: Due to a serious vulnerability issue in torch.load`, fix with `pip install transformers==4.43`.

Inference settings:

- GPT weight `e3.ckpt`, SoVITS weight `e15.pth`
- Primary Reference Audio with Text, plus multiple reference audios
- Slice by every punct
- top_k 5
- top_p 1
- temperature 0.9
- Repetition Penalty 2
- speed_factor 1.3

Keep everything else at default.

To find the best GPT weight, parameters and random seed, first run inference on a large block of text and pick out a few problematic sentences for the next round. Then adjust the GPT weight and parameters until the problems go away, while keeping the seed fixed. Once the best GPT weight and parameters are found, fix them and play with different seed numbers to refine the final result. Take note of the parameters once inference gets perfect, for future use.

## Fish Speech

Fish-speech is contributed by the same people behind Bert-vits2, which I used for a long time.

Follow their official docs to install version 1.4 (unfortunately, v1.5 has a sound-quality problem when fine-tuning):

```
git clone --branch v1.4.3 https://github.com/fishaudio/fish-speech.git && cd fish-speech
conda create -n fish-speech python=3.10
conda activate fish-speech
pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1
apt install libsox-dev ffmpeg
apt install build-essential \
    cmake \
    libasound-dev \
    portaudio19-dev \
    libportaudio2 \
    libportaudiocpp0
pip3 install -e .
```
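As the training notes below mention, fine-tuning needs bf16 support, so it is worth confirming the GPU has it before spending time on data prep (plain PyTorch, nothing fish-speech-specific):

```
python -c "import torch; print(torch.cuda.is_bf16_supported())"
# True on Ampere (RTX 30x0) and newer; False on older cards
```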
Download the required models:

```
huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4
```

Prepare the dataset:

```
mkdir data
cp -r /home/username/GPT-SoVITS/output/slicer_opt data/
python tools/whisper_asr.py --audio-dir data/slicer_opt --save-dir data/slicer_opt --compute-type float32
python tools/vqgan/extract_vq.py data \
    --num-workers 1 --batch-size 16 \
    --config-name "firefly_gan_vq" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"
python tools/llama/build_dataset.py \
    --input "data" \
    --output "data/protos" \
    --text-extension .lab \
    --num-workers 16
```

Edit parameters with `nano fish_speech/configs/text2semantic_finetune.yaml` and start training:

```
python fish_speech/train.py --config-name text2semantic_finetune \
    project=$project \
    +lora@model.model.lora_config=r_8_alpha_16
```

Notes:

- The default numbers are too high for my setup, so both `num_workers` and `batch_size` need to be lowered according to CPU cores and VRAM (see the grep one-liner after this list for locating the keys).
- For the first run, I set `max_steps: 10000` and `val_check_interval: 1000` to get 5 lower-step models with some diversity.
- Things like `lr`, `weight_decay` and `num_warmup_steps` can be further adjusted according to this article. My settings are `lr: 1e-5`, `weight_decay: 1e-6`, `num_warmup_steps: 500`.
- To check training metrics such as the loss curve, run `tensorboard --logdir fish-speech/results/tensorboard/version_xx/` and open `localhost:6006` in a browser. Determine overfitting from the graph AND by actually listening to the inference result of each checkpoint.
- At first I found that overfitting starts around step 5000. I then ran a second training for 5000 steps and found the best result to be `step_000004000.ckpt`.
- Training requires a newer GPU with bf16 support; no workaround so far. When training a model that will be inferenced on an older GPU, set `precision: 32-true` in `fish_speech/configs/text2semantic_finetune.yaml` and change the LoRA line in `/home/username/miniconda3/envs/fish-speech/lib/python3.10/site-packages/loralib/layers.py` to:

  ```
  result += (self.lora_dropout(x).to(torch.float32) @
             self.lora_A.to(torch.float32).transpose(0, 1) @
             self.lora_B.to(torch.float32).transpose(0, 1)) * self.scaling.to(torch.float32)
  ```
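Rather than memorising where each of those keys sits in the YAML (the nesting moves around between releases), I just grep for them and edit in place (plain grep, nothing project-specific):

```
grep -nE "num_workers|batch_size|max_steps|val_check_interval|lr:|weight_decay|num_warmup_steps|precision" \
    fish_speech/configs/text2semantic_finetune.yaml
```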
Training takes many hours on a weak GPU. After it finishes, convert the LoRA weights:

```
python tools/llama/merge_lora.py \
    --lora-config r_8_alpha_16 \
    --base-weight checkpoints/fish-speech-1.4 \
    --lora-weight results/$project/checkpoints/step_000005000.ckpt \
    --output checkpoints/fish-speech-1.4-yth-lora/
```

Generate prompt and semantic tokens:

```
python tools/vqgan/inference.py \
    -i "1.wav" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"
```

Troubleshooting for old GPU 1: `Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so}`

```
pip uninstall -y ctranslate2
pip install ctranslate2==3.24.0
```

Troubleshooting for old GPU 2: `ImportError: cannot import name 'is_callable_allowed' from partially initialized module 'torch._dynamo.trace_rules'`

```
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=11.8 -c pytorch -c nvidia
```

To make it accessible from the LAN, edit `tools/run_webui.py` (`nano tools/run_webui.py`) so the launch call reads:

```
app.launch(server_name="0.0.0.0", server_port=7860, show_api=True)
```

Change `--llama-checkpoint-path` to the newly trained LoRA and start the WebUI (I added `--half` for my old GPU to avoid the bf16 error):

```
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python -m tools.webui \
    --llama-checkpoint-path "checkpoints/fish-speech-1.4-yth-lora" \
    --decoder-checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
    --decoder-config-name firefly_gan_vq \
    --half
```

Parameters for inferencing:

- Enable Reference Audio
- check Text Normalization
- Iterative Prompt Length 200
- Top-P 0.8
- Temperature 0.7
- Repetition Penalty 1.5
- Set Seed

Notes:

- Use higher values to compensate for an overfitted model, lower values for an underfitted one.
- Certain punctuation or tab characters may trigger noise generation. Text normalization is supposed to address this, but sometimes I still need to find & replace (see the snippet after this list).
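That find & replace step is easy to script. This is just my own throwaway clean-up; the substitutions are simply what worked for my texts, and `chapter.txt` is a placeholder name:

```
# squash tabs to plain spaces and turn ellipses into commas before feeding the text to TTS
sed -i 's/\t/ /g; s/…/, /g' chapter.txt
```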
However, a `Negative code found` bug occurs quite frequently during inference, with no solution so far. I gave up on it.

## CosyVoice

CosyVoice is one of the FunAudioLLM toolkits, developed by the same Alibaba team behind the Qwen models I use a lot.

Install:

```
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git && cd CosyVoice
git submodule update --init --recursive
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
conda install -y -c conda-forge pynini==2.1.5
sudo apt-get install sox libsox-dev -y
pip install -r requirements.txt
```

Download the pretrained models:

```
git lfs install
mkdir -p pretrained_models
git clone https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B pretrained_models/CosyVoice2-0.5B
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M pretrained_models/CosyVoice-300M
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M-SFT pretrained_models/CosyVoice-300M-SFT
git clone https://huggingface.co/FunAudioLLM/CosyVoice-300M-Instruct pretrained_models/CosyVoice-300M-Instruct
```
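If those clones finish suspiciously fast, the large weight files may still be git-lfs pointer stubs rather than the real models; checking sizes and pulling explicitly sorts that out (standard git-lfs usage, nothing CosyVoice-specific):

```
du -sh pretrained_models/*                          # each model directory should be hundreds of MB or more
git -C pretrained_models/CosyVoice-300M lfs pull    # re-fetch the real files for a given model if not
```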
Run it with:

```
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
```

Troubleshooting "GLIBCXX_3.4.29 not found" with this:

```
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
strings $CONDA_PREFIX/lib/libstdc++.so.6 | grep GLIBCXX
nano ~/.bashrc
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
find / -name "libstdc++.so*"
rm /home/username/anaconda3/lib/python3.11/site-packages/../../libstdc++.so.6
ln -s /home/username/text-generation-webui/installer_files/env/lib/libstdc++.so.6.0.29 /home/username/anaconda3/lib/python3.11/site-packages/../../libstdc++.so.6
```

CosyVoice ends up working fine, but not as well as GPT-SoVITS. I hope their 3.0 version can pump it up.

## Voice Conversion

Both RVC and Seed-VC are intended to replace my good old so-vits-svc instance.

### Retrieval-based-Voice-Conversion

Install:

```
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI && cd Retrieval-based-Voice-Conversion-WebUI
conda create -n rvc -y python=3.8
conda activate rvc
pip install torch torchvision torchaudio
pip install pip==24.0
pip install -r requirements.txt
python tools/download_models.py
sudo apt install ffmpeg
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt
```

Run it with `python infer-web.py`, fill in the experiment name and the raw audio path (`/path/to/raw/`), then click the buttons step by step with default settings.

Troubleshooting: "enabled=hps.train.fp16_run"

### Seed-VC

Install:

```
git clone https://github.com/Plachtaa/seed-vc && cd seed-vc
conda create -n seedvc -y python=3.10
conda activate seedvc
pip install -r requirements.txt
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app.py --enable-v1 --enable-v2
```

Settings:

```
# V2
Diffusion Steps: 100
Length Adjust: 1
Intelligibility CFG Rate: 0
Similarity CFG Rate: 1
Top-p: 1
Temperature: 1
Repetition Penalty: 2
convert style/emotion/accent: check

# V1
Diffusion Steps: 100
Length Adjust: 1
Inference CFG Rate: 1
Use F0 conditioned model: check
Auto F0 adjust: check
Pitch shift: 0
```
Training:

```
python train.py --config /home/username/seed-vc/configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --dataset-dir /home/username/GPT-SoVITS-v4/output/slicer_opt --run-name username --batch-size 6 --max-steps 10000 --max-epochs 10000 --save-every 1000 --num-workers 1

accelerate launch train_v2.py --dataset-dir /home/username/GPT-SoVITS-v4/output/slicer_opt --run-name username-v2 --batch-size 6 --max-steps 2000 --max-epochs 2000 --save-every 200 --num-workers 0 --train-cfm
```

Using the checkpoints:

```
# Voice Conversion Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_vc.py --checkpoint ./runs/test01/ft_model.pth --config ./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 False

# Singing Voice Conversion Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_svc.py --checkpoint ./runs/username/DiT_epoch_00029_step_08000.pth --config ./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --fp16 False

# V2 model Web UI
GRADIO_SHARE=0 GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 GRADIO_ANALYTICS_ENABLED=0 DISABLE_TELEMETRY=1 DO_NOT_TRACK=1 python app_vc_v2.py --cfm-checkpoint-path runs/Satine-V2/CFM_epoch_00000_step_00600.pth
```

It turned out that the V1 model with the Singing Voice Conversion Web UI (`app_svc.py`) performs the best.