We’ve seen image inpainting, which aims to remove an undesirable object from a picture. These machine learning-based techniques do not simply remove the object: they also understand the picture and fill in the missing parts with what the background should look like. The recent advancements are incredible, and this inpainting task can be quite useful for many applications, like advertisements or improving your future Instagram posts. We also covered an even more challenging task: video inpainting, where the same process is applied to videos to remove objects or people. The challenge with videos is staying consistent from frame to frame, without any buggy artifacts. But now, what happens if we correctly remove a person from a movie and the sound is still there, unchanged? Well, we may hear a ghost and ruin all our work. This is where a task I never covered on my channel comes in: speech inpainting. You heard it right: researchers from Google just published a paper aiming at inpainting speech, and, as we will see, the results are quite impressive. Okay, we might rather hear than see the results, but you get the point. It can correct your grammar, your pronunciation, or even remove background noise; all things I definitely need to keep working on, or… simply use their new model… Listen to the examples in my video!

Watch the video

References

►Register to GTC22 for free (don't forget to leave a comment and subscribe to enter the giveaway, steps in the video!): https://nvda.ws/3upUQkF
►Read the full article: https://www.louisbouchard.ai/speech-inpainting-with-ai/
►Borsos, Z., Sharifi, M. and Tagliasacchi, M., 2022. SpeechPainter: Text-conditioned Speech Inpainting: https://arxiv.org/pdf/2202.07273.pdf
►Listen to all the examples: https://google-research.github.io/seanet/speechpainter/examples/
►Subscribe to the newsletter where the winner will be announced: https://www.louisbouchard.ai/newsletter/

Video Transcript

We've seen image inpainting, which aims to remove an undesirable object from a picture. These inpainting machine learning-based techniques do not simply remove the objects: they also understand the picture and fill the missing parts of the image with what the background should look like. As we saw, the recent advancements are incredible, just like the results.
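To make the task concrete before going further, here is a minimal sketch of image inpainting using OpenCV's classical, non-learned algorithm. The file name and the mask coordinates are placeholders, and the learning-based methods covered on this channel go much further, hallucinating plausible content rather than just propagating nearby pixels, but the input/output contract is the same: an image plus a mask of the region to fill.

```python
import cv2
import numpy as np

# "photo.jpg" is a placeholder path; the mask marks the object to remove
# (non-zero pixels = region to fill in).
img = cv2.imread("photo.jpg")
mask = np.zeros(img.shape[:2], dtype=np.uint8)
mask[100:200, 150:300] = 255  # hypothetical bounding box around the object

# Telea's algorithm propagates surrounding pixels into the hole; learned
# models instead synthesize what the background should look like.
result = cv2.inpaint(img, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite("inpainted.jpg", result)
```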
This inpainting task can be quite useful for many applications, like advertisements or improving your future Instagram posts. We also covered an even more challenging task: video inpainting, where the same process is applied to videos to remove objects or people. The challenge with videos comes with staying consistent from frame to frame, without any buggy artifacts. But now, what happens if we correctly remove a person from a movie and the sound is still there, unchanged? Well, we may hear a ghost and ruin all our work. This is where a task I never covered on my channel comes in: speech inpainting. You heard it right: researchers from Google just published a paper aiming at inpainting speech, and, as we will see, the results are quite impressive. Okay, we might rather hear than see the results, but you get the point. It can correct your grammar, pronunciation, or even background noise, all things I definitely need to keep working on, or… simply use their model. Be sure to stay until the end of the video to listen to the examples; you will be mind-blown. There's also a big surprise at the end of the video, which the thumbnail and title may have spoiled, and which you surely want to stay for. But first, a word from this episode's sponsor: Weights & Biases.

As most of you know, Weights & Biases is the tool you want to have for tracking your experiments and sharing them with your team. It just got even better: they just launched a new feature, which I already set up for myself, called Alerts. As its name says, Alerts can be used to notify you via Slack or email if your run has crashed, or when a custom trigger fires, such as your loss going to NaN or a step in your ML pipeline being reached. Alerts can be applied to all projects where you launch runs, including both personal and team projects, and of course, it's super straightforward to set up. I love that they are constantly adding new features, and this one is pretty cool in my opinion; it will save me a lot of time. Well, it may not help me get off my phone. If you work in the field, check them out with the first link below; you won't be disappointed.

Now, let's get into the most exciting part of the video: how these three researchers from Google created SpeechPainter, their speech inpainting model. To understand their approach, we must first define the goal of speech inpainting. Here, we want to take an audio clip and its transcript and inpaint a small section of the audio clip. You can see many examples here that we will play during the video, but you are free to also have a look yourself with the link in the description. The texts you see are the transcripts of the audio tracks, with the bold part being removed from the audio clip and inpainted by the network. Here's what the first one sounds like initially, with the gap: "george robertson just… isn't outrageous and just", and after being inpainted by the network: "george robertson described the plan as an outrageous and just…". Here are two examples where it was used to correct grammar and then pronunciation: "We swimmed in the river last weekend and the water was cold." becomes "We swam in the river last weekend and the water was cold.", and "Yesterday I bought a very nice wash." becomes "Yesterday I bought a very nice watch."

As you just heard, the model not only performs speech inpainting, but it does so while maintaining the speaker's identity and the recording environment, following the line of text. How cool is that? Now that we know what the model can do, how does it achieve that? As you suspect, this is pretty similar to image inpainting, where we replace missing pixels in an image; here, instead, we replace missing data in an audio track, following a specific transcript so the model knows what to say. Its goal is to fill the gap in the audio track following the text while imitating the person's voice and the overall atmosphere of the track, so it feels real.

Since image and speech inpainting are similar tasks, they use similar architectures. In this case, they use a model called Perceiver IO. It does the same as with an image, where you encode the image, extract the most useful information, perform modifications, and finally decode it to reconstruct another image with what you want to achieve. In the image inpainting example, the new image will simply be the same one but with some pixels changed. In this case, instead of pixels coming from an image, the Perceiver IO architecture can work with pretty much any type of data, including mel spectrograms, which are basically our voice prints, representing our audio track using frequencies.
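As a rough illustration of what that representation looks like in practice, here is a minimal sketch that computes a log-mel spectrogram with librosa and masks out a gap, mimicking the inpainting setup. The file path, sample rate, FFT parameters, and gap length are assumptions for the sketch, not the paper's exact settings.

```python
import numpy as np
import librosa

# Load speech at 16 kHz (a common rate for speech models; "speech.wav"
# is a placeholder path for any short spoken clip).
y, sr = librosa.load("speech.wav", sr=16000)

# 80-bin log-mel spectrogram: the "voice print" described above,
# representing the audio track by its frequency content over time.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (80, n_frames)

# Simulate the inpainting setup: blank out a random ~1-second gap
# (assumes the clip is longer than one second); this is the region
# the model would be asked to fill, given the transcript.
frames_per_sec = sr // 256
start = np.random.randint(0, log_mel.shape[1] - frames_per_sec)
masked = log_mel.copy()
masked[:, start:start + frames_per_sec] = log_mel.min()  # silence-like fill
```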
Then, this spectrogram and the text transcript are encoded, edited, and decoded to replace the gap in the spectrogram with what should appear. As you see, this is just like generating an image, and we use the same process as in image inpainting, but the input and output data are spectrograms, or basically images of our soundtrack. If you are interested in learning more about the Perceiver IO architecture, I'd strongly recommend watching Yannic Kilcher's video about it.

They train their model on a speech dataset, creating random gaps in the audio tracks and trying to fill in those gaps during training. Then, they use a GAN approach for training to further improve the realism of the results. Quickly, with GANs, there is the model we saw, called a generator, and another model, called a discriminator. The generator is trained to generate the new data, in our case the inpainted audio track. Simultaneously, the discriminator is fed samples from the training dataset along with generated samples, and tries to guess whether the sample seen was generated (called fake) or real, from the training set. Ideally, we'd want our discriminator to be right half of the time, so that it basically chooses randomly, meaning that our generated samples sound just like real ones. The discriminator then penalizes the generator model in order to make it sound more realistic. And voilà, you end up with a model that can take speech and its transcript to correct your grammar or pronunciation, or even fill in gaps following your voice and the track's atmosphere.

This is so cool: you just have to train this model once on a general dataset and then use it with your own audio tracks, as it should ideally be able to generalize and work quite well. Of course, there are some failure cases, but the results are pretty impressive, and you can listen to more examples on their project page linked below.
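Here is a toy, self-contained sketch of that adversarial training step in PyTorch. To be clear, this is not SpeechPainter itself (which builds on Perceiver IO and conditions on the full transcript); the tiny generator, discriminator, tensor shapes, and text embeddings below are all made up purely to illustrate the generator/discriminator dynamic described above.

```python
import torch
import torch.nn as nn

# Toy stand-ins: 80-bin mel spectrograms, 128 frames, batch of 8.
# These shapes are illustrative, not the paper's.
B, MELS, FRAMES = 8, 80, 128

class Generator(nn.Module):
    """Fills the masked region of a spectrogram, conditioned on text features."""
    def __init__(self, text_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MELS + text_dim, 256), nn.ReLU(),
            nn.Linear(256, MELS),
        )
    def forward(self, masked_spec, text_emb):
        # masked_spec: (B, FRAMES, MELS); text_emb: (B, text_dim)
        cond = text_emb.unsqueeze(1).expand(-1, masked_spec.size(1), -1)
        return self.net(torch.cat([masked_spec, cond], dim=-1))

class Discriminator(nn.Module):
    """Scores whether a spectrogram is real or generator-made."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MELS, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )
    def forward(self, spec):
        return self.net(spec).mean(dim=1)  # one logit per clip

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

# Stand-in batch: real spectrograms, a gap mask, and text embeddings
# (in practice these come from the dataset and a text encoder).
real = torch.randn(B, FRAMES, MELS)
mask = torch.zeros(B, FRAMES, 1)
mask[:, 40:70] = 1.0                        # the gap to fill
text = torch.randn(B, 64)
masked = real * (1 - mask)

# --- Discriminator step: push real -> "real" (1), generated -> "fake" (0) ---
fake = masked + G(masked, text) * mask      # only the gap is generated
d_loss = (bce(D(real), torch.ones(B, 1))
          + bce(D(fake.detach()), torch.zeros(B, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# --- Generator step: fool the discriminator into saying "real" ---
# (A real setup would also guard D's gradients and add reconstruction terms.)
g_loss = bce(D(fake), torch.ones(B, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

At the ideal equilibrium described above, the discriminator's outputs carry no signal (it is right only half the time), which is exactly when the generated audio is indistinguishable from real recordings.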
One last thing before ending this video, which you may have been waiting for since the intro: once again this year, NVIDIA GTC 2022 will run online, starting March 21st and continuing the rest of the week, and I will be running a giveaway in collaboration with NVIDIA for the event. There are only two steps to have a chance to win one of the rarest things on the market right now, a GPU, more precisely an NVIDIA RTX 3080 Ti. First, attend the event and send me a screenshot of one of the sessions on Twitter. Second, be sure to be subscribed to the channel and comment with your Twitter handle and what you are most excited about at the GTC event. GTC22 will focus on accelerated computing, AI, deep learning, data science, and much more, with amazing speakers. You can find more detail with the link below, but if you are interested in AI and its applications, this is one of the coolest events to attend for sure, and it's completely free for everyone using the link in the description below. Good luck to anyone participating in the giveaway, and I hope you are as excited as I am for GTC. I will draw the winners after the event, at the end of March, and share it in my newsletter. I hope you enjoyed this video, and thank you to everyone participating in the giveaway and listening to the full video.