# A Deep Dive into Voice Cloning with SoftVC VITS and Bert-VITS2

In the previous post, I tried a little bit of TTS Generation WebUI and found it interesting. So, I decided to train a usable model with my own voice.

This voice cloning project explores both SVC for voice changing and VITS for text-to-speech. There is no single tool that does every job.

I have tested several tools for this project. Many of the good guides, like this, this, and this, are in Chinese, so I thought it would be useful to post my notes in English.

Although so-vits-svc has been archived for a few months, probably due to oppression, it is still the tool for the best results. Other related tools such as so-vits-svc-fork, so-vits-svc-5.0, DDSP-SVC, and RVC offer faster or lighter optimization, more features, or better interfaces. But given enough time and resources, none of these alternatives can compete with the superior results generated by the original so-vits-svc.

For TTS, a new tool called Bert-VITS2 works fantastically and has already matured with its final release last month. It has some very different use cases, for example, audio content creation.

## Prepare Dataset

The audio files of the dataset should be in WAV format, 44100 Hz, 16-bit, mono, ideally 1-2 hours in total.
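If the source recordings aren't already in that format, converting them can be scripted. Here is a minimal sketch (not part of the original toolchain), assuming librosa and soundfile are installed; the `input/` and `output/` folder names are just placeholders:

```python
# Convert arbitrary audio files to 44100 Hz, 16-bit, mono WAV.
# Assumes: pip install librosa soundfile; "input/" and "output/" are placeholder folders.
from pathlib import Path

import librosa
import soundfile as sf

SRC = Path("input")     # hypothetical folder with raw recordings
DST = Path("output")    # converted files land here
DST.mkdir(exist_ok=True)

for f in sorted(SRC.glob("*")):
    if f.suffix.lower() not in {".wav", ".mp3", ".flac", ".ogg"}:
        continue
    # librosa resamples to 44100 Hz and downmixes to mono on load
    y, sr = librosa.load(f, sr=44100, mono=True)
    # PCM_16 gives the 16-bit depth expected by the tools below
    sf.write(DST / (f.stem + ".wav"), y, sr, subtype="PCM_16")
    print(f"converted {f.name}")
```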
### Extract From a Song

Ultimate Vocal Remover is the easiest tool for this job. There is a thread that explains everything in detail.

#### UVR Workflows

Remove and extract the instrumental:

- Model: VR - UVR(4_HP-Vocal-UVR)
- Settings: 512 - 10 - GPU
- Output: instrumental and unclean vocal

Remove and extract background vocals:

- Model: VR - UVR(5_HP-Karaoke-UVR)
- Settings: 512 - 10 - GPU
- Output: unclean main vocal and background vocal

Remove reverb and noise:

- Model: VR - UVR-DeEcho-DeReverb & UVR-DeNoise
- Settings: 512 - 10 - GPU - No Other Only
- Output: clean main vocal

(Optional) Use RipX (non-free) to perform a manual fine-cleaning.

### Preparation for Vocal Recording

It's better to record in a treated room with a condenser microphone; otherwise, use a directional or dynamic microphone to reduce noise.

#### Cheapskate's Audio Equipment

The very first time I got into music was during high school, with the blue Sennheiser MX500 and the Koss Porta Pro. I still remember recording a song for the first time on a Sony VAIO with Cool Edit Pro.

Nowadays, as an amateur, I still resist spending a lot of money on audio hardware because it is literally a money-sucking black hole. Nonetheless, I really appreciate the reliability of this cheap production equipment.

The core part of my setup is a Behringer UCA202, and it's perfect for my use cases. I bought it for $10 during a price drop. It is a so-called "audio interface" but basically just a sound card with multiple ports. I use RCA to 3.5mm TRS cables for my headphones: a semi-open K240s for regular output and a closed-back HD669/MDR7506 for monitor output. All three headphones are under $100 at their normal prices. And there are clones from Samson, Tascam, Knox Gear, and more out there for less than $50.

For the input device, I'm using a dynamic microphone because of my environmental noise. It is an SM58 copy (Pyle) plus a Tascam DR-05 recorder (as an amplifier). Other clones such as the SL84c or wm58 would do it too. I use an XLR to 3.5mm TRS cable to connect the microphone to the MIC/external input of the recorder, and then an AUX cable between the line-out of the recorder and the input of the UCA202.

It's not recommended to buy an "audio interface" and a dedicated amplifier just to replicate my setup. A $10 C-Media USB sound card should be good enough. The Syba model that I owned is capable of "pre-amping" dynamic microphones directly and even some lower-end phantom-powered microphones. The setup can go extremely cheap ($40-60), but with the UCA202 and DR-05, the sound is much cleaner. And I really like the physical controls, versatility, and portability of my good old digital recorder.

### Audacity Workflows

When I was getting paid as a designer, I was pretty happy with Audition. But for personal use on a fun project, Audacity is the way to avoid the chaotic evil of Adobe.

- Noise Reduction
- Dereverb
- Truncate Silence
- Normalize

### audio-slicer

Use audio-slicer or audio-slicer (GUI) to slice the audio file into small pieces for later use. The default settings work great.

### Cleaning Dataset

Remove the very short clips and re-slice those that are still over 10 seconds. For a large dataset, remove everything under 4 seconds; for a small dataset, only remove clips under 2 seconds. If necessary, perform a manual inspection of every single file.

### Match Loudness

Use Audacity again with Loudness Normalization; 0 dB should do it.
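Loudness matching can also be scripted outside Audacity. A minimal sketch with pyloudnorm, assuming the sliced clips live in a `dataset/` folder; the -23 LUFS target is only an example value, so match it to whatever your Audacity setting produces:

```python
# Loudness-match a folder of WAV clips with pyloudnorm (ITU-R BS.1770).
# Assumes: pip install pyloudnorm soundfile; "dataset/" is a placeholder folder,
# and -23.0 LUFS is an example target -- pick whatever matches your Audacity setting.
from pathlib import Path

import pyloudnorm as pyln
import soundfile as sf

TARGET_LUFS = -23.0

for wav in sorted(Path("dataset").glob("*.wav")):
    data, rate = sf.read(wav)
    meter = pyln.Meter(rate)                      # BS.1770 loudness meter
    loudness = meter.integrated_loudness(data)    # measured loudness of the clip
    matched = pyln.normalize.loudness(data, loudness, TARGET_LUFS)
    sf.write(wav, matched, rate)
    print(f"{wav.name}: {loudness:.1f} LUFS -> {TARGET_LUFS} LUFS")
```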
## so-vits-svc

### Set Up the Environment

A virtual environment is essential for running multiple Python tools on one system. I used to use VMs and Docker, but now I find that anaconda is quicker and handier than the others.

Create a new environment for so-vits-svc, and activate it:

```
conda create -n so-vits-svc python=3.8
conda activate so-vits-svc
```

Then, install the requirements:

```
git clone https://github.com/svc-develop-team/so-vits-svc
cd so-vits-svc
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# for linux
pip install -r requirements.txt
# for windows
pip install -r requirements_win.txt

pip install --upgrade fastapi==0.84.0
pip install --upgrade gradio==3.41.2
pip install --upgrade pydantic==1.10.12
pip install fastapi uvicorn
```

### Initialization

Download the pre-trained models.

Into pretrain/:

```
wget https://huggingface.co/WitchHuntTV/checkpoint_best_legacy_500.pt/resolve/main/checkpoint_best_legacy_500.pt
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt
```

Into logs/44k/:

```
wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_D_320000.pth
wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_G_320000.pth
```

Into logs/44k/diffusion/:

```
wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/fix_pitch_add_vctk_600k/model_0.pt

# (Alternative)
wget https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt
# (Alternative)
wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/blob/main/hubertsoft_fix_pitch_add_vctk_500k/model_0.pt
```

Into pretrain/nsf_hifigan/:

```
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
```

### Dataset Preparation

Put all the prepared audio .wav files into dataset_raw/character, then run:

```
cd so-vits-svc
python resample.py --skip_loudnorm
python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
python preprocess_hubert_f0.py --use_diff
```

### Edit Configs

The file is located at configs/config.json:

- log_interval: the frequency of printing logs
- eval_interval: the frequency of saving checkpoints
- epochs: total steps
- keep_ckpts: number of saved checkpoints, 0 for unlimited
- half_type: fp32 in my case
- batch_size: the smaller the faster (rougher), the larger the slower (better)

Recommended batch_size per VRAM: 4 for 6G, 6 for 8G, 10 for 12G, 14 for 16G, 20 for 24G.

Keep the defaults for configs/diffusion.yaml.
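If you prefer not to hand-edit the JSON, the same keys can be patched with a short script. This is only a sketch: it assumes those keys sit under the "train" section of configs/config.json (check the generated file, as layouts vary between versions), and the values are examples for roughly 12 GB of VRAM:

```python
# Tweak so-vits-svc training settings programmatically.
# Assumption: log_interval/eval_interval/keep_ckpts/half_type/batch_size
# sit under the "train" section of configs/config.json, as in VITS-style configs.
import json
from pathlib import Path

cfg_path = Path("configs/config.json")
cfg = json.loads(cfg_path.read_text())

cfg["train"].update({
    "log_interval": 200,     # example: print logs every 200 steps
    "eval_interval": 800,    # example: save a checkpoint every 800 steps
    "keep_ckpts": 3,         # keep only the last 3 checkpoints (0 = unlimited)
    "half_type": "fp32",
    "batch_size": 10,        # roughly 12 GB of VRAM per the table above
})

cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
print(json.dumps(cfg["train"], indent=2))
```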
### Training

```
python cluster/train_cluster.py --gpu
python train_index.py -c configs/config.json
python train.py -c configs/config.json -m 44k
python train_diff.py -c configs/diffusion.yaml
```

On training steps:

- Use train.py to train the main model; usually 20k-30k steps are usable, and 50k and up is good enough. This can take a few days depending on GPU speed.
- Feel free to stop it with ctrl+c; training continues from the last checkpoint whenever you re-run python train.py -c configs/config.json -m 44k.
- Use train_diff.py to train the diffusion model; about 1/3 of the main model's steps is recommended.
- Be aware of over-training. Use tensorboard --logdir=./logs/44k to monitor the plots and see whether they go flat. Change the learning rate from 0.0001 to 0.00005 if necessary.

When done, share/transport these files for inference:

- configs/: config.json, diffusion.yaml
- logs/44k/: feature_and_index.pkl, kmeans_10000.pt, model_0.pt, G_xxxxx.pth

### Inference

It's time to try out the trained model. I prefer the webui for the convenience of tweaking parameters. But before firing it up, edit the following lines in webUI.py for LAN access:

```
os.system("start http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)
```

Run python webUI.py, then access ipaddress:7860 from a web browser.

The webui has no English localization, but Immersive Translate would be helpful.

Most parameters work well with their default values. Refer to this and this to make changes.

Upload these 5 files:

- the main model.pt and its config.json
- the diffusion model.pt and its diffusion.yaml
- either the cluster model (kmeans_10000.pt) for speaking or the feature retrieval (feature_and_index.pkl) for singing

The F0 predictor is for speaking only, not for singing; RMVPE is recommended when using it.

Pitch change is useful when singing a feminine song with a model trained on a masculine voice, or vice versa.

The clustering model/feature retrieval mixing ratio is the way of controlling the tone. Use 0.1 to get the clearest speech, and 0.9 to get the tone closest to the model.

Shallow diffusion steps should be set around 50; it enhances the result at 30-100 steps.

### Audio Editing

This procedure is optional, just for producing a better song.

I won't go into details here, since the audio editing software, or so-called DAW (digital audio workstation), that I use is non-free. I have no intention of advocating proprietary software, even though the entire industry is paywalled and closed-source.

Audacity supports multitrack editing, effects, and a lot more. It loads some advanced VST plugins as well. It's not hard to find tutorials on mastering songs with Audacity.

Typically, the mastering process covers mixing/balancing, EQ/compression, reverb, and imaging. The more advanced the tool, the easier the process. I'll definitely spend more time adopting Audacity for my mastering process in the future, and I recommend everyone do so.

## so-vits-svc-fork

This is a so-vits-svc fork with real-time support, and the models are compatible. It's easier to use but does not support the diffusion model. For dedicated real-time voice changing, a voice-changer is recommended.

### Installation

```
conda create -n so-vits-svc-fork python=3.10 pip
conda activate so-vits-svc-fork
git clone https://github.com/voicepaw/so-vits-svc-fork
cd so-vits-svc-fork
python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U so-vits-svc-fork
pip install click
sudo apt-get install libportaudio2
```

### Preparation

Put the dataset .wav files into so-vits-svc-fork/dataset_raw, then run:

```
svc pre-resample
svc pre-config
```

Edit batch_size in configs/44k/config.json. This fork uses a larger size than the original.

### Training

```
svc pre-hubert
svc train -t
svc train-cluster
```

### Inference

Use the GUI with svcg. This requires a local desktop environment.

Or use the CLI: svc vc for real time, and the following for generating files (see the batch sketch below):

```
svc infer -m "logs/44k/xxxxx.pth" -c "configs/config.json" raw/xxx.wav
```
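Since svc infer processes one file per invocation, a small wrapper can batch it over a folder. A minimal sketch reusing only the flags shown above; the model and config paths are placeholders for your own run:

```python
# Batch-run `svc infer` over every WAV in raw/ using the flags shown above.
# The model/config paths are placeholders -- point them at your own files.
import subprocess
from pathlib import Path

MODEL = "logs/44k/xxxxx.pth"        # placeholder checkpoint
CONFIG = "configs/config.json"      # as used in the single-file command above

for wav in sorted(Path("raw").glob("*.wav")):
    print(f"inferring {wav.name} ...")
    subprocess.run(
        ["svc", "infer", "-m", MODEL, "-c", CONFIG, str(wav)],
        check=True,
    )
```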
## DDSP-SVC

DDSP-SVC requires fewer hardware resources and runs faster than so-vits-svc. It supports both real-time and diffusion models (Diff-SVC).

```
conda create -n DDSP-SVC python=3.8
conda activate DDSP-SVC
git clone https://github.com/yxlllc/DDSP-SVC
cd DDSP-SVC
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

Refer to the Initialization section above for these two files:

- pretrain/rmvpe/model.pt
- pretrain/contentvec/checkpoint_best_legacy_500.pt

### Preparation

```
python draw.py
python preprocess.py -c configs/combsub.yaml
python preprocess.py -c configs/diffusion-new.yaml
```

Edit the files under configs/:

- batch_size: 32 (16 for diffusion)
- cache_all_data: false
- cache_device: 'cuda'
- cache_fp16: false

### Training

```
conda activate DDSP-SVC
python train.py -c configs/combsub.yaml
python train_diff.py -c configs/diffusion-new.yaml
tensorboard --logdir=exp
```

### Inference

It's recommended to use main_diff.py since it includes both the DDSP and diffusion models:

```
python main_diff.py -i "input.wav" -diff "model_xxxxxx.pt" -o "output.wav"
```

Real-time GUI for voice cloning:

```
python gui_diff.py
```

## Bert-vits2-V2.3

This is a TTS tool that is completely different from everything above. Using it, I have already created several audiobooks in my voice for my parents, and they really enjoy them. Instead of using the original, I used the fork by v3u for an easier setup.

### Initialization

```
conda create -n bert-vits2 python=3.9
conda activate bert-vits2
git clone https://github.com/v3ucn/Bert-vits2-V2.3.git
cd Bert-vits2-V2.3
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

Download the pre-trained models (Chinese, Japanese, and English are included):

```
wget -P slm/wavlm-base-plus/ https://huggingface.co/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
wget -P emotional/clap-htsat-fused/ https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
wget -P bert/chinese-roberta-wwm-ext-large/ https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/resolve/main/pytorch_model.bin
wget -P bert/bert-base-japanese-v3/ https://huggingface.co/cl-tohoku/bert-base-japanese-v3/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.generator.bin
wget -P bert/deberta-v2-large-japanese/ https://huggingface.co/ku-nlp/deberta-v2-large-japanese/resolve/main/pytorch_model.bin
```

Create a character model folder:

```
mkdir -p Data/xxx/models/
```

Download the base models:

```
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/DUR_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/D_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/G_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/WD_0.pth

# More options:
# https://openi.pcl.ac.cn/Stardust_minus/Bert-VITS2/modelmanage/model_filelist_tmpl?name=Bert-VITS2_2.3%E5%BA%95%E6%A8%A1
# https://huggingface.co/Erythrocyte/bert-vits2_base_model/tree/main
# https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/tree/main
```
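If wget is awkward on your platform, the same files can be pulled with huggingface_hub instead. A minimal sketch covering a few of the downloads above; the repo IDs and filenames are taken from those URLs, and Data/xxx/models is the same placeholder character folder:

```python
# Fetch a few of the pre-trained files above via huggingface_hub instead of wget.
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

downloads = [
    # (repo_id, filename, local_dir) -- taken from the wget URLs above
    ("microsoft/wavlm-base-plus", "pytorch_model.bin", "slm/wavlm-base-plus"),
    ("hfl/chinese-roberta-wwm-ext-large", "pytorch_model.bin", "bert/chinese-roberta-wwm-ext-large"),
    ("OedoSoldier/Bert-VITS2-2.3", "G_0.pth", "Data/xxx/models"),
]

for repo_id, filename, local_dir in downloads:
    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
    print("saved to", path)
```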
Edit train_ms.py by replacing every bfloat16 with float16.

Edit webui.py for LAN access:

```
webbrowser.open(f"start http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)
```

Edit Data/xxx/config.json for batch_size and spk2id.

### Preparation

The workflow is similar to the one in the previous section.

- Remove noise and silence, normalize, then put the un-sliced WAV file into Data/xxx/raw.
- Edit config.yml for dataset_path, num_workers, and keep_ckpts.
- Run python3 audio_slicer.py to slice the WAV file.
- Clean the dataset (Data/xxx/raw) by removing small files that are under 2 seconds.

### Transcription

Install whisper:

```
pip install git+https://github.com/openai/whisper.git
```

To turn off language auto-detection, set it to English only, and use the large model, edit short_audio_transcribe.py as below:

```
# set the spoken language to english
print('language: en')
lang = 'en'
options = whisper.DecodingOptions(language='en')
result = whisper.decode(model, mel, options)

# set to use the large model
parser.add_argument("--whisper_size", default="large")

# solve the error "Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000]
# to have 128 channels, but got 80 channels instead" when using the large model
mel = whisper.log_mel_spectrogram(audio, n_mels=128).to(model.device)
```

Run python3 short_audio_transcribe.py to start the transcription.

Re-sample the sliced dataset:

```
python3 resample.py --sr 44100 --in_dir ./Data/zizek/raw/ --out_dir ./Data/zizek/wavs/
```

Preprocess the transcription:

```
python3 preprocess_text.py --transcription-path ./Data/zizek/esd.list
```

Generate the BERT feature config:

```
python3 bert_gen.py --config-path ./Data/zizek/configs/config.json
```

### Training and Inference

Run python3 train_ms.py to start training.

Edit config.yml for the model path:

```
model: "models/G_20900.pth"
```

Run python3 webui.py to start the webui for inference.

## vits-simple-api

vits-simple-api is a web frontend for using trained models. I use it mainly for its long-text support, which the original project doesn't have.

```
git clone https://github.com/Artrajz/vits-simple-api
git pull https://github.com/Artrajz/vits-simple-api
cd vits-simple-api
conda create -n vits-simple-api python=3.10 pip
conda activate vits-simple-api && pip install -r requirements.txt
```

(Optional) Copy the pre-trained model files from Bert-vits2-V2.3/ to vits-simple-api/bert_vits2/.

Copy Bert-vits2-V2.3/Data/xxx/models/G_xxxxx.pth and Bert-vits2-V2.3/Data/xxx/config.json to vits-simple-api/Model/xxx/.

Edit config.py for MODEL_LIST and the default parameters as preferred.

Edit Model/xxx/config.json as below:

```
"data": {
  "training_files": "Data/train.list",
  "validation_files": "Data/val.list",
  "version": "2.3"
```

Check/edit model_list in config.yml as [xxx/G_xxxxx.pth, xxx/config.json].

Run python app.py.

### Tweaks

SDP Ratio for tone, Noise for randomness, Noise_W for pronunciation, and Length for speed. emotion and style are self-explanatory.

### Share Models

In its Hugging Face repo, there are a lot of VITS models shared by others. You can try them out first, and then download the desired models from Files.

The Genshin model is widely used in some content creation communities because of its high quality. It contains hundreds of characters, although only Chinese and Japanese are supported.

In another repo, there are a lot of Bert-vits2 models made from popular Chinese streamers and VTubers.

There are already projects making AI VTubers like this and this. I'm looking forward to seeing how this technology can change the industry in the near future.