A Deep Dive into Voice Cloning with SoftVC VITS and Bert-VITS2

Table of Contents

Prepare Dataset
    Extract from a Song
    UVR Workflows
    Preparation for Vocal Recording
    Cheapskate’s Audio Equipment
    Audacity Workflows
    audio-slicer
    Cleaning Dataset
    Match Loudness
so-vits-svc
    Set Up Environment
    Initialization
    Download pre-trained models
    Dataset Preparation
    Edit Configs
    Training
    Inference
    Audio Editing
so-vits-svc-fork
    Installation
    Preparation
    Training
    Inference
DDSP-SVC
    Preparation
    Training
    Inference
Bert-vits2-V2.3
    Initialization
    Preparation
    Transcription
    Training and Inference
vits-simple-api
Tweaks
Share Models

In the previous post, I tried a little bit of TTS Generation WebUI and found it interesting. So, I decided to train a usable model with my own voice.

This voice cloning project explores both SVC for Voice Changing and VITS for Text-to-Speech. There is no one tool that does all jobs.

I have tested several tools for this project. Many of the good guides, like this, this, and this, are in Chinese. So, I thought it would be useful to post my notes in English.

Although so-vits-svc has been archived for a few months, probably due to oppression, it is still the tool for the best result.

Other related tools such as so-vits-svc-fork, so-vits-svc-5.0, DDSP-SVC, and RVC provide either faster/lighter optimization, more features, or better interfaces.

But with enough time and resources, none of these alternatives can compete with the superior result generated by the original so-vits-svc.

For TTS, a new tool called Bert-VITS2 works fantastically and has already matured with its final release last month. It has some very different use cases, for example, audio content creation.

Prepare Dataset

The audio files of the dataset should be in WAV format, 44100 Hz, 16-bit, mono, ideally 1-2 hours in total.

Extract From a Song

Ultimate Vocal Remover is the easiest tool for this job. There is a thread that explains everything in detail.

UVR Workflows

1. Remove and extract Instrumental
   Model: VR - UVR(4_HP-Vocal-UVR)
   Settings: 512 - 10 - GPU
   Output: Instrumental and unclean vocal

2. Remove and extract background vocals.
   Model: VR - UVR(5_HP-Karaoke-UVR)
   Settings: 512 - 10 - GPU
   Output: background vocal and unclean main vocal

3. Remove reverb and noise.
   Model: VR - UVR-DeEcho-DeReverb & UVR-DeNoise
   Settings: 512 - 10 - GPU - No Other Only
   Output: clean main vocal

4. (Optional) Using RipX (non-free) to perform a manual fine-cleaning.

Preparation for Vocal Recording

It’s better to record in a treated room with a condenser microphone; otherwise, use a directional or dynamic microphone to reduce noise.

Cheapskate’s Audio Equipment

The very first time I got into music was during high school, with the blue Sennheiser MX500 and Koss Porta Pro. I still remember the first time I recorded a song, on a Sony VAIO with Cool Edit Pro.

Nowadays, I still resist spending a lot of money on audio hardware as an amateur because it is literally a money-sucking black hole.

Nonetheless, I really appreciate the reliability of this cheap production equipment.

The core part of my setup is a Behringer UCA202, and it’s perfect for my use cases. I bought it for $10 during a price drop.

It is a so-called “Audio Interface” but basically just a sound card with multiple ports. I used RCA to 3.5mm TRS cables for my headphones, a semi-open K240s for regular output, and a closed-back HD669/MDR7506 for monitor output.

All three mentioned headphones are under $100 at the normal price. And there are clones from Samson, Tascam, Knox Gear, and more out there for less than $50.

For the input device, I’m using a dynamic microphone because of the noise in my environment. It is an SM58 copy (Pyle) + a Tascam DR-05 recorder (as an amplifier). Other clones such as SL84c or wm58 would do it too.
I use an XLR to 3.5mm TRS cable to connect the microphone to the MIC/External-input of the recorder, and then use an AUX cable to connect between the line-out of the recorder and the input of the UCA202.

It’s not recommended to buy an “audio interface” and a dedicated amplifier to replicate my setup. A $10 C-Media USB sound card should be good enough. The Syba model that I owned is capable of “pre-amping” dynamic microphones directly and even some lower-end phantom-powered microphones.

The setup can go extremely cheap ($40~60), but with UCA202 and DR-05, the sound is much cleaner. And I really like the physical controls, versatility, and portability of my good old digital recorder.

Audacity Workflows

When I was getting paid as a designer, I was pretty happy with Audition. But for personal use on a fun project, Audacity is the way to avoid the chaotic evil of Adobe.

Noise Reduction
Dereverb
Truncate Silence
Normalize

audio-slicer

Use audio-slicer or audio-slicer (gui) to slice the audio file into small pieces for later use.

The default setting works great.

Cleaning Dataset

Remove the very short clips and re-slice those that are still over 10 seconds.

In the case of a large dataset, remove all clips shorter than 4 sec. In the case of a small dataset, remove only those under 2 sec.

If necessary, perform a manual inspection of every single file.

Match Loudness

Use Audacity again with Loudness Normalization; 0 dB should do it.

so-vits-svc

Set Up the Environment

A virtual environment is essential to run multiple Python tools inside one system. I used to use VMs and Docker, but now I find that anaconda is way quicker and handier than the others.

Create a new environment for so-vits-svc, and activate it.

conda create -n so-vits-svc python=3.8
conda activate so-vits-svc

Then, install requirements.

git clone https://github.com/svc-develop-team/so-vits-svc
cd so-vits-svc

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
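
Before installing the remaining requirements, it can be worth confirming that the CUDA build of PyTorch installed above actually sees the GPU. The snippet below is a minimal sketch and not part of so-vits-svc; `describe_torch_setup` is a hypothetical helper name, and it assumes nothing beyond `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
import importlib.util


def describe_torch_setup():
    """Summarise whether PyTorch is importable and whether it can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch  # imported lazily so the check also runs when torch is missing

    if torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA available on {torch.cuda.get_device_name(0)}"
    return f"torch {torch.__version__}: no CUDA device visible"


if __name__ == "__main__":
    print(describe_torch_setup())
```

If the message reports no CUDA device, the index URL or the GPU driver is the first thing to re-check before spending time on training.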

#for linux
pip install -r requirements.txt

#for windows
pip install -r requirements_win.txt
pip install --upgrade fastapi==0.84.0
pip install --upgrade gradio==3.41.2
pip install --upgrade pydantic==1.10.12
pip install fastapi uvicorn

Initialization

Download pre-trained models

pretrain/
wget https://huggingface.co/WitchHuntTV/checkpoint_best_legacy_500.pt/resolve/main/checkpoint_best_legacy_500.pt
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt

logs/44k/
wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_D_320000.pth
wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_G_320000.pth

logs/44k/diffusion/
wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/fix_pitch_add_vctk_600k/model_0.pt
(Alternative) wget https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt
(Alternative) wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/hubertsoft_fix_pitch_add_vctk_500k/model_0.pt

pretrain/nsf_hifigan/
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip

Dataset Preparation

Put all prepared audio .wav files into dataset_raw/character

cd so-vits-svc
python resample.py --skip_loudnorm
python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
python preprocess_hubert_f0.py --use_diff

Edit Configs

The file is located at configs/config.json

log_interval: the frequency of printing log
eval_interval: the frequency of saving checkpoints
epochs: total number of training epochs
keep_ckpts: number of saved checkpoints, 0 for unlimited
half_type: fp32 in my case
batch_size: the smaller the faster (rougher), the larger the slower (better)

Recommended batch_size per VRAM: 4=6G; 6=8G; 10=12G; 14=16G; 20=24G

Keep default for configs/diffusion.yaml

Training

python cluster/train_cluster.py --gpu
python train_index.py -c configs/config.json
python train.py -c configs/config.json -m 44k
python train_diff.py -c configs/diffusion.yaml

On training steps:

Use train.py to train the main model; usually, 20k-30k steps would be usable, and 50k and up would be good enough. This can take a few days depending on the GPU speed.

Feel free to stop it with ctrl+c; training can be resumed anytime by re-running python train.py -c configs/config.json -m 44k

Use train_diff.py to train the diffusion model; the recommended number of training steps is 1/3 of the main model's.

Be aware of over-training. Use tensorboard --logdir=./logs/44k to monitor the plots and see if they go flat.

Change the learning rate from 0.0001 to 0.00005 if necessary.

When done, share/transport these files for inference.

configs/
    config.json
    diffusion.yaml
logs/44k
    feature_and_index.pkl
    kmeans_10000.pt
    model_0.pt
    G_xxxxx.pth

Inference

It’s time to try out the trained model. I’d prefer the WebUI for the convenience of tweaking the parameters.

But before firing it up, edit the following lines in webUI.py for LAN access:

os.system("start http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)

Run python webUI.py; then access ipaddress:7860 from a web browser.

The webui has no English localization, but Immersive Translate would be helpful.

Most parameters would work well with the default value. Refer to this and this to make changes.

Upload these 5 files:

main model.pt and its config.json
diffusion model.pt and its diffusion.yaml
Either cluster model kmeans_10000.pt for speaking or feature retrieval feature_and_index.pkl for singing.

F0 predictor is for speaking only, not for singing. Recommend RMVPE when using.

Pitch change is useful when singing a feminine song using a model with a masculine voice, or vice versa.

Clustering model/feature retrieval mixing ratio is the way of controlling the tone. Use 0.1 to get the clearest speech, and use 0.9 to get the closest tone to the model.

Shallow diffusion steps should be set around 50; it enhances the result at 30-100 steps.

Audio Editing

This procedure is optional. Just for the production of a better song.

I won’t go into details about this since the audio editing software, or so-called DAW (digital audio workstation), that I’m using is non-free. I have no intention to advocate proprietary software even though the entire industry is paywalled and closed-source.

Audacity supports multitrack, effects, and a lot more. It does load some advanced VST plugins as well.

It’s not hard to find tutorials on mastering songs with Audacity.

Typically, the mastering process should be mixing/balancing, EQ/compressing, reverb, and imaging. The more advanced the tool is, the easier the process will be.

I’ll definitely spend more time on adopting Audacity for my mastering process in the future, and I recommend everyone do so.

so-vits-svc-fork

This is a so-vits-svc fork with real-time support, and the models are compatible. It is easier to use but does not support the Diffusion model. For dedicated real-time voice changing, a voice-changer is recommended.

Installation

conda create -n so-vits-svc-fork python=3.10 pip
conda activate so-vits-svc-fork

git clone https://github.com/voicepaw/so-vits-svc-fork
cd so-vits-svc-fork

python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U so-vits-svc-fork
pip install click
sudo apt-get install libportaudio2

Preparation

Put dataset .wav files into so-vits-svc-fork/dataset_raw

svc pre-resample
svc pre-config

Edit batch_size in configs/44k/config.json. The default value in this fork is larger than in the original.

Training

svc pre-hubert
svc train -t
svc train-cluster

Inference

Use the GUI with svcg. This requires a local desktop environment.

Or use the CLI with svc vc for real time, and the following for generating:

svc infer -m "logs/44k/xxxxx.pth" -c "configs/config.json" raw/xxx.wav

DDSP-SVC

DDSP-SVC requires fewer hardware resources and runs faster than so-vits-svc. It supports both real-time and diffusion models (Diff-SVC).

conda create -n DDSP-SVC python=3.8
conda activate DDSP-SVC

git clone https://github.com/yxlllc/DDSP-SVC
cd DDSP-SVC

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Refer to the Initialization section for the two files:

pretrain/rmvpe/model.pt
pretrain/contentvec/checkpoint_best_legacy_500.pt

Preparation

python draw.py
python preprocess.py -c configs/combsub.yaml
python preprocess.py -c configs/diffusion-new.yaml

Edit configs/

batch_size: 32  # 16 for diffusion
cache_all_data: false
cache_device: 'cuda'
cache_fp16: false

Training

conda activate DDSP-SVC
python train.py -c configs/combsub.yaml
python train_diff.py -c configs/diffusion-new.yaml

tensorboard --logdir=exp

Inference

It’s recommended to use main_diff.py since it includes both the DDSP and diffusion models.

python main_diff.py -i "input.wav" -diff "model_xxxxxx.pt" -o "output.wav"

Real-time GUI for voice cloning:

python gui_diff.py

Bert-vits2-V2.3

This is a TTS tool that is completely different from everything above. By using it, I have already created several audiobooks with my voice for my parents, and they really enjoy it.

Instead of using the original, I used the fork by v3u for an easier setup.

Initialization

conda create -n bert-vits2 python=3.9
conda activate bert-vits2

git clone https://github.com/v3ucn/Bert-vits2-V2.3.git
cd Bert-vits2-V2.3

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Download pre-trained models (includes Chinese, Japanese, and English):

wget -P slm/wavlm-base-plus/ https://huggingface.co/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
wget -P emotional/clap-htsat-fused/ https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
wget -P bert/chinese-roberta-wwm-ext-large/ https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/resolve/main/pytorch_model.bin
wget -P bert/bert-base-japanese-v3/ https://huggingface.co/cl-tohoku/bert-base-japanese-v3/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.generator.bin
wget -P bert/deberta-v2-large-japanese/ https://huggingface.co/ku-nlp/deberta-v2-large-japanese/resolve/main/pytorch_model.bin

Create a character model folder

mkdir -p Data/xxx/models/

Download base models:

wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/DUR_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/D_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/G_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/WD_0.pth
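
A download that fails halfway is easy to miss until training refuses to start, so a quick check of the model folder can save time. The following is a minimal sketch rather than part of Bert-VITS2: `missing_base_models` is a hypothetical helper, the four file names are taken from the wget commands above, and `Data/xxx/models` is the same placeholder path used in this guide.

```python
from pathlib import Path

# File names taken from the wget commands above.
BASE_MODELS = ("DUR_0.pth", "D_0.pth", "G_0.pth", "WD_0.pth")


def missing_base_models(model_dir):
    """Return the base model files that are absent or empty in model_dir."""
    folder = Path(model_dir)
    missing = []
    for name in BASE_MODELS:
        candidate = folder / name
        # A zero-byte file usually means the download was interrupted.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Data/xxx/models is the placeholder path used in this guide.
    missing = missing_base_models("Data/xxx/models")
    print("all base models present" if not missing else f"missing: {', '.join(missing)}")
```

The check only looks at presence and size; it does not verify that a file is a valid checkpoint.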

#More options
https://openi.pcl.ac.cn/Stardust_minus/Bert-VITS2/modelmanage/model_filelist_tmpl?name=Bert-VITS2_2.3%E5%BA%95%E6%A8%A1
https://huggingface.co/Erythrocyte/bert-vits2_base_model/tree/main
https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/tree/main

Edit train_ms.py by replacing all bfloat16 with float16

Edit webui.py for LAN access:

webbrowser.open(f"start http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)

Edit Data/xxx/config.json for batch_size and spk2id

Preparation

Similar workflow as in the previous section.

Remove noise and silence, normalize, then put the un-sliced WAV file into Data/xxx/raw.

Edit config.yml for dataset_path, num_workers, and keep_ckpts.

Run python3 audio_slicer.py to slice the WAV file.

Clean the dataset (Data/xxx/raw) by removing small files that are under 2 sec.

Transcription

Install whisper

pip install git+https://github.com/openai/whisper.git

To turn off language auto-detection, set it to English only, and use the large model; edit short_audio_transcribe.py as below:

    # set the spoken language to english
    print('language: en')
    lang = 'en'
    options = whisper.DecodingOptions(language='en')
    result = whisper.decode(model, mel, options)
	
    # set to use large model
    parser.add_argument("--whisper_size", default="large")

    #Solve error "Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000] to have 128 channels, but got 80 channels instead" while using large model
    mel = whisper.log_mel_spectrogram(audio, n_mels=128).to(model.device)

Run python3 short_audio_transcribe.py to start the transcription.

In the commands below, zizek is the name of my character folder; it takes the place of xxx.

Re-sample the sliced dataset:

python3 resample.py --sr 44100 --in_dir ./Data/zizek/raw/ --out_dir ./Data/zizek/wavs/

Preprocess transcription:

python3 preprocess_text.py --transcription-path ./Data/zizek/esd.list

Generate BERT feature config:

python3 bert_gen.py --config-path ./Data/zizek/configs/config.json

Training and Inference

Run python3 train_ms.py to start training

Edit config.yml for model path:

model: "models/G_20900.pth"

Run python3 webui.py to start webui for inference

vits-simple-api

vits-simple-api is a web frontend for using trained models. I use this mainly for its long text support, which the original project doesn’t have.

git clone https://github.com/Artrajz/vits-simple-api
cd vits-simple-api
git pull https://github.com/Artrajz/vits-simple-api

conda create -n vits-simple-api python=3.10 pip
conda activate vits-simple-api

pip install -r requirements.txt

(Optional) Copy pre-trained model files from Bert-vits2-V2.3/ to vits-simple-api/bert_vits2/

Copy Bert-vits2-V2.3/Data/xxx/models/G_xxxxx.pth and Bert-vits2-V2.3/Data/xxx/config.json to vits-simple-api/Model/xxx/

Edit config.py for MODEL_LIST and Default parameter as preferred

Edit Model/xxx/config.json as below:

"data": {
    "training_files": "Data/train.list",
    "validation_files": "Data/val.list",

  "version": "2.3"

Check/Edit model_list in config.yml as [xxx/G_xxxxx.pth, xxx/config.json]

Run python app.py

Tweaks

SDP Ratio for tone, Noise for randomness, Noise_W for pronunciation, and Length for speed. emotion and style are self-explanatory.

Share Models

In its Hugging Face repo, there are a lot of VITS models shared by others. You can try them out first, and then download the desired models from Files.

The Genshin model is widely used in some content creation communities because of its high quality. It contains hundreds of characters, although only Chinese and Japanese are supported.

In another repo, there are a lot of Bert-vits2 models that are made from popular Chinese streamers and VTubers.

There are already projects making AI VTubers like this and this. I’m looking forward to seeing how this technology can change the industry in the near future.

https://techshinobi.org/posts/voice-vits/

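
The dataset rules given earlier (WAV, 44100 Hz, 16-bit, mono; remove clips under 2 seconds for a small dataset or under 4 seconds for a large one; re-slice clips that are still over 10 seconds) can be checked with a short script before training. What follows is a minimal sketch using only Python's standard `wave` module. The function names `audit_clip` and `audit_dataset` are hypothetical, the script only reports and never deletes or modifies anything, and it assumes plain PCM WAV files, since `wave` raises an error on other encodings.

```python
import wave
from pathlib import Path

TARGET_RATE = 44100   # Hz, from the dataset requirements
TARGET_WIDTH = 2      # bytes per sample, i.e. 16-bit
TARGET_CHANNELS = 1   # mono


def audit_clip(path, min_sec=2.0, max_sec=10.0):
    """Return (action, format_ok, duration_sec) for one WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        format_ok = (
            rate == TARGET_RATE
            and wav.getsampwidth() == TARGET_WIDTH
            and wav.getnchannels() == TARGET_CHANNELS
        )
    if duration < min_sec:
        action = "remove"      # too short to be useful
    elif duration > max_sec:
        action = "reslice"     # still too long after slicing
    else:
        action = "keep"
    return action, format_ok, duration


def audit_dataset(folder, min_sec=2.0, max_sec=10.0):
    """Audit every .wav file in a folder; nothing is deleted or modified."""
    return {
        wav_path.name: audit_clip(wav_path, min_sec, max_sec)
        for wav_path in sorted(Path(folder).glob("*.wav"))
    }


if __name__ == "__main__":
    # dataset_raw/character is the folder used in the so-vits-svc section;
    # pass min_sec=4.0 for a large dataset.
    for name, (action, format_ok, duration) in audit_dataset("dataset_raw/character").items():
        note = "" if format_ok else " (not 44100 Hz / 16-bit / mono)"
        print(f"{name}: {duration:.1f}s -> {action}{note}")
```

Keeping the script read-only is deliberate: the final decision on borderline clips is still better made by listening to them.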