We’ve seen image inpainting, which aims to remove an undesirable object from a picture.
These machine learning-based techniques do not simply remove the objects; they also understand the picture and fill in the missing parts of the image with what the background should look like.
The recent advancements are incredible, and this inpainting task can be quite useful for many applications like advertisements or improving your future Instagram posts. We also covered an even more challenging task: video inpainting, where the same process is applied to videos to remove objects or people.
The challenge with videos comes with staying consistent from frame to frame without any buggy artifacts. But now, what happens if we correctly remove a person from a movie and the sound is still there, unchanged? Well, we may hear a ghost and ruin all our work.
This is where a task I never covered on my channel comes in: speech inpainting. You heard it right, researchers from Google just published a paper aiming at inpainting speech, and, as we will see, the results are quite impressive.
Okay, we might rather hear than see the results, but you get the point. It can correct your grammar or pronunciation, or even remove background noise. All things I definitely need to keep working on, or… simply use their new model… Listen to the examples in my video!
Register for GTC22 for free (don't forget to leave a comment and subscribe to enter the giveaway, steps in the video!): https://nvda.ws/3upUQkF
►Read the full article: https://www.louisbouchard.ai/speech-inpainting-with-ai/
►Borsos, Z., Sharifi, M. and Tagliasacchi, M., 2022. SpeechPainter: Text-conditioned Speech Inpainting. https://arxiv.org/pdf/2202.07273.pdf
►Listen to all the examples: https://google-research.github.io/seanet/speechpainter/examples/
►Subscribe to the newsletter where the winner will be announced: https://www.louisbouchard.ai/newsletter/
00:00
[Music]
We've seen image inpainting, which aims to remove an undesirable object from a picture. These inpainting machine learning-based techniques do not simply remove the objects; they also understand the picture and fill in the missing parts of the image with what the background should look like. As we saw, the recent advancements are incredible, just like the results, and this inpainting task can be quite useful for many applications like advertisements or improving your future Instagram posts. We also covered an even more challenging task, video inpainting, where the same process is applied to videos to remove objects or people. The challenge with videos comes with staying consistent from frame to frame without any buggy artifacts. But now, what happens if we correctly remove a person from a movie and the sound is still there, unchanged? Well, we may hear a ghost and ruin all our work.
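Before we get to sound, here is a quick refresher on what mask-based image inpainting looks like in code. This is a minimal sketch using OpenCV's classical `cv2.inpaint` (a diffusion-style method, not the learning-based models discussed in the video); the file names are placeholders.

```python
import cv2

# "photo.png" and "mask.png" are placeholder file names: the mask is a
# binary image where white pixels mark the object to remove.
image = cv2.imread("photo.png")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Fill the masked region from the surrounding pixels. inpaintRadius is
# the neighborhood size considered around each missing pixel.
result = cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

cv2.imwrite("inpainted.png", result)
```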
00:50
This is where a task I never covered on my channel comes in: speech inpainting. You heard it right, researchers from Google just published a paper aiming at inpainting speech, and, as we will see, the results are quite impressive. Okay, we might rather hear than see the results, but you get the point. It can correct your grammar or pronunciation, or even remove background noise, all things I definitely need to keep working on, or… simply use their model. Be sure to stay until the end of the video to listen to the examples; you will be mind-blown. And there's also a big surprise at the end of the video that the thumbnail and title may have spoiled, which you surely want to stay for. But first, a word from this episode's sponsor, Weights & Biases.
01:34
As most of you know, Weights & Biases is the tool you want to have for tracking your experiments and sharing them with your team. It just got even better: they just launched a new feature that I already set up for myself, called Alerts. As its name says, Alerts can be used to notify you via Slack or email if your run has crashed, or when a custom trigger fires, such as your loss going to NaN or a step in your ML pipeline being reached. Alerts can be applied to all projects where you launch runs, including both personal and team projects, and of course, it's super straightforward to set up. I love that they are constantly adding new features, and this one is pretty cool in my opinion. It will save me a lot of time, though it may not help me get off my phone. If you work in the field, check them out with the first link below; you won't be disappointed.
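To give a concrete idea of how such an alert plugs into a training loop, here is a minimal sketch around the real `wandb.alert` call (it assumes you are logged in to W&B); the project name and the dummy training step are hypothetical placeholders.

```python
import math
import random

import wandb

run = wandb.init(project="speech-inpainting-demo")  # hypothetical project name

def train_step() -> float:
    """Stand-in for a real training step; occasionally returns NaN."""
    return random.choice([0.42, float("nan")])

for step in range(1000):
    loss = train_step()
    wandb.log({"loss": loss}, step=step)

    # Custom trigger: send a Slack/email alert when the loss becomes NaN.
    if math.isnan(loss):
        wandb.alert(title="Loss is NaN", text=f"Loss diverged at step {step}.")
        break
```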
02:22
Now, let's get into the most exciting part of the video: how these three researchers from Google created SpeechPainter, their speech inpainting model. To understand their approach, we must first define the goal of speech inpainting. Here, we want to take an audio clip and its transcript and inpaint a small section of the audio clip. You can see many examples here that we will play during the video, but you are free to also have a look yourself with the link in the description. The texts you see are the transcripts of the audio tracks, with the bold part being removed from the audio clip and inpainted by the network. Here's what the first one sounds like initially:
02:59
"George Robertson just... isn't outrageous and just."
03:04
And after being inpainted by the network: "George Robertson described the plan as an outrageous and just..."
03:11
Here are two examples where it was used to correct grammar and then pronunciation:
03:17
"We swimmed in the river last weekend and the water was cold."
"We swam in the river last weekend and the water was cold."
03:26
"Yesterday I bought a very nice wash."
03:30
"Yesterday I bought a very nice watch."
03:33
As you just heard, the model not only performs speech inpainting, but it does that while maintaining the speaker's identity and the recording environment, following the line of text. How cool is that? Now that we know what the model can do, how does it achieve that? As you suspect, this is pretty similar to image inpainting, where we replace missing pixels in an image; instead, we replace missing data in an audio track following a specific transcript, so the model knows what to say. Its goal is to fill the gap in the audio track following the text and imitating the person's voice and the overall atmosphere of the track, to feel real.
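To make the task setup concrete, here is a minimal sketch of how one might build a training example: a waveform with a span zeroed out, paired with the full transcript the model must follow. This is an illustrative simplification of the paper's setup; the sampling rate and gap boundaries are arbitrary placeholders.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sampling rate

def make_inpainting_example(waveform: np.ndarray, transcript: str,
                            gap_start_s: float, gap_end_s: float):
    """Zero out a span of the waveform; the model sees the masked audio
    plus the full transcript and must synthesize the missing span."""
    masked = waveform.copy()
    start = int(gap_start_s * SAMPLE_RATE)
    end = int(gap_end_s * SAMPLE_RATE)
    masked[start:end] = 0.0
    return masked, transcript

# Example: mask one second of a 4-second clip (random noise as a stand-in).
audio = np.random.randn(4 * SAMPLE_RATE).astype(np.float32)
masked_audio, text = make_inpainting_example(
    audio, "we swam in the river last weekend", gap_start_s=1.0, gap_end_s=2.0)
```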
04:09
Since image and speech inpainting are similar tasks, they will use similar architectures. In this case, they use a model called Perceiver IO. It will do the same as with an image, where you will encode the image, extract the most useful information, perform modifications, and finally decode it to reconstruct another image with what you want to achieve. In the image inpainting example, the new image will simply be the same but with some pixels changed. In this case, instead of pixels coming from an image, the Perceiver IO architecture can work with pretty much any type of data, including mel spectrograms, which are basically our voice prints, representing our audio track using frequencies. Then, this spectrogram and the text transcript are encoded, edited, and decoded to replace the gap in the spectrogram with what should appear. As you see, this is just like generating an image, and we use the same process as in image inpainting, but the output and input data are spectrograms, or basically images of our soundtrack. If you are interested in learning more about the Perceiver IO architecture, I'd strongly recommend watching Yannic Kilcher's video about it.
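For readers who have never computed one, here is a minimal sketch of turning a waveform into the kind of mel spectrogram "voice print" described above, using librosa; the file name is a placeholder.

```python
import librosa

# "clip.wav" is a placeholder file name.
waveform, sr = librosa.load("clip.wav", sr=16_000)

# Mel spectrogram: a time-frequency "image" of the audio, with
# frequencies spaced on the perceptually motivated mel scale.
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)  # log scale, as models typically use

print(log_mel.shape)  # (n_mels, n_frames): an image-like 2D array
```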
05:20
They train their model on a speech dataset, creating random gaps in the audio tracks and trying to fill in the gaps during training. Then, they used a GAN approach for training to further improve the realism of the results. Quickly, with GANs, there will be the model we saw, called a generator, and another model called a discriminator. The generator will be trained to generate the new data, in our case the inpainted audio track. Simultaneously, the discriminator will be fed samples from the training dataset and generated samples, and will try to guess if the sample seen was generated, called fake, or real, from the training set. Ideally, we'd want our discriminator to be right half of the time, so that it basically chooses randomly, meaning that our generated samples sound just like real ones. The discriminator will then penalize the generator model in order to make it sound more realistic. And voilà, you end up with a model that can take speech and its transcript to correct your grammar or pronunciation, or even fill in gaps following your voice and the track's atmosphere.
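To illustrate the generator/discriminator dynamic described above, here is a minimal, generic GAN training sketch in PyTorch. This is not the paper's actual model (SpeechPainter uses a Perceiver IO generator operating on spectrograms and text); the tiny MLPs and random data are placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-ins: in SpeechPainter the generator would be the Perceiver IO
# inpainting model and the real samples would come from the speech dataset.
generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
discriminator = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(8, 32)             # stand-in for real training samples
    fake = generator(torch.randn(8, 16))  # stand-in for inpainted output

    # Discriminator: label real samples 1 ("real"), generated ones 0 ("fake").
    d_loss = bce(discriminator(real), torch.ones(8, 1)) \
           + bce(discriminator(fake.detach()), torch.zeros(8, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: penalized until the discriminator predicts "real" on fakes,
    # i.e. until the discriminator is right only about half the time.
    g_loss = bce(discriminator(fake), torch.ones(8, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```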
06:29
This is so cool! You just have to train this model once on a general dataset and then use it with your own audio tracks, as it should ideally be able to generalize and work quite well. Of course, there are some failure cases, but the results are pretty impressive, and you can listen to more examples on their project page, linked below.
07:00
One last thing before ending this video, which you may have been waiting for since the intro. Once again this year, NVIDIA GTC 2022 will run online, starting March 21st and continuing the rest of the week, and I will be running a giveaway in collaboration with NVIDIA for the event. There are only two steps to have a chance to win one of the rarest things on the market right now, a GPU, more precisely an NVIDIA RTX 3080 Ti. First, attend the event and send me a screenshot of one of the sessions on Twitter. Second, be sure to be subscribed to the channel and comment with your Twitter handle and what you are most excited about at the GTC event. GTC 22 will focus on accelerated computing, AI, deep learning, data science, and much more, with amazing speakers. You can find more detail with the link below, but if you are interested in AI and its applications, this is one of the coolest events to attend for sure, and it's completely free for everyone using the link in the description below. Good luck to anyone participating in the giveaway, and I hope you are as excited as I am for GTC. I will draw the winners after the event at the end of March and share it in my newsletter. I hope you enjoyed this video, and thank you to everyone participating in the giveaway and listening to the full video.
08:22
[Music]