SpeechPainter: Text-Conditioned Speech Inpainting

by Louis Bouchard, February 26th, 2022
Too Long; Didn't Read

Machine learning-based inpainting techniques do not simply remove objects; they also understand the picture and fill in the missing parts of the image with what the background should look like. We also covered an even more challenging task, video inpainting, where the same process is applied to videos to remove objects or people. Now, researchers from Google have applied the same idea to audio: inpainting speech can correct your grammar or pronunciation, or even remove background noise.


We’ve seen image inpainting, which aims to remove an undesirable object from a picture.

These machine learning-based techniques do not simply remove objects; they also understand the picture and fill the missing parts of the image with what the background should look like.

The recent advancements are incredible, and this inpainting task can be quite useful for many applications like advertisements or improving your future Instagram post. We also covered an even more challenging task: video inpainting, where the same process is applied to videos to remove objects or people.

The challenge with videos comes with staying consistent from frame to frame without any buggy artifacts. But now, what happens if we correctly remove a person from a movie and the sound is still there, unchanged? Well, we may hear a ghost and ruin all our work.

This is where a task I never covered on my channel comes in: speech inpainting. You heard it right, researchers from Google just published a paper aiming at inpainting speech, and, as we will see, the results are quite impressive.

Okay, we might rather hear than see the results, but you get the point. It can correct your grammar or pronunciation, or even remove background noise: all things I definitely need to keep working on... or I could simply use their new model. Listen to the examples in my video!

Watch the video

References

►Register for GTC22 for free (don't forget to leave a comment and subscribe to enter the giveaway; steps in the video!): https://nvda.ws/3upUQkF
►Read the full article: https://www.louisbouchard.ai/speech-inpainting-with-ai/
►Borsos, Z., Sharifi, M. and Tagliasacchi, M., 2022. SpeechPainter: Text-conditioned Speech Inpainting. https://arxiv.org/pdf/2202.07273.pdf
►Listen to all the examples: https://google-research.github.io/seanet/speechpainter/examples/
►Subscribe to the newsletter where the winner will be announced: https://www.louisbouchard.ai/newsletter/

Video Transcript

We've seen image inpainting, which aims to remove an undesirable object from a picture. These inpainting machine learning-based techniques do not simply remove the objects; they also understand the picture and fill the missing parts of the image with what the background should look like. As we saw, the recent advancements are incredible, just like the results, and this inpainting task can be quite useful for many applications like advertisements or improving your future Instagram posts. We also covered an even more challenging task: video inpainting, where the same process is applied to videos to remove objects or people. The challenge with videos comes with staying consistent from frame to frame without any buggy artifacts. But now, what happens if we correctly remove a person from a movie and the sound is still there, unchanged? Well, we may hear a ghost and ruin all our work.

This is where a task I never covered on my channel comes in: speech inpainting. You heard it right: researchers from Google just published a paper aiming at inpainting speech, and, as we will see, the results are quite impressive. Okay, we might rather hear than see the results, but you get the point. It can correct your grammar, pronunciation, or even background noise; all things I definitely need to keep working on, or... simply use their model. Be sure to stay until the end of the video to listen to the examples; you will be mind-blown. There's also a big surprise at the end of the video, which the thumbnail and title may have spoiled, that you will surely want to stay for. But first, a word from this episode's sponsor: Weights & Biases.

As most of you know, Weights & Biases is the tool you want to have for tracking your experiments and sharing them with your team. It just got even better: they just launched a new feature, which I already set up for myself, called Alerts. As its name says, Alerts can notify you via Slack or email if your run has crashed, or when a custom trigger fires, such as your loss going to NaN or a step in your ML pipeline being reached. Alerts can be applied to all projects where you launch runs, including both personal and team projects, and of course, it's super straightforward to set up. I love that they are constantly adding new features, and this one is pretty cool in my opinion. It will save me a lot of time; well, it may not help me get off my phone. If you work in the field, check them out with the first link below; you won't be disappointed.

Now, let's get into the most exciting part of the video: how these three researchers from Google created SpeechPainter, their speech inpainting model. To understand their approach, we must first define the goal of speech inpainting. Here, we want to take an audio clip and its transcript and inpaint a small section of the audio clip. You can see many examples here that we will play during the video, but you are free to also have a look yourself with the link in the description. The texts you see are the transcripts of the audio tracks, with the bold part being removed from the audio clip and inpainted by the network.

Here's what the first one sounds like initially: "George Robertson just... isn't outrageous and just..." And after being inpainted by the network: "George Robertson described the plan as an outrageous and just..."

Here are two examples where it was used to correct grammar and then pronunciation:

"We swimmed in the river last weekend and the water was cold."
"We swam in the river last weekend and the water was cold."

"Yesterday I bought a very nice wash."
"Yesterday I bought a very nice watch."

As you just heard, the model not only performs speech inpainting, but it does that while maintaining the speaker's identity and recording environment, following the line of text. How cool is that? Now that we know what the model can do, how does it achieve it? As you suspect, this is pretty similar to image inpainting, where we replace missing pixels in an image. Instead, we replace missing data in an audio track, following a specific transcript so the model knows what to say. Its goal is to fill the gap in the audio track following the text and imitating the person's voice and the overall atmosphere of the track, so that it feels real.
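To make the setup concrete, here is a minimal sketch of the task in Python. Everything here is illustrative: `mask_gap` and the `inpaint` call are hypothetical names standing in for a trained SpeechPainter-style model, not the paper's actual API.

```python
import numpy as np

def mask_gap(audio: np.ndarray, sample_rate: int,
             gap_start_s: float, gap_end_s: float) -> np.ndarray:
    """Silence a time segment of a waveform, producing the gap to inpaint."""
    gapped = audio.copy()
    start = int(gap_start_s * sample_rate)
    end = int(gap_end_s * sample_rate)
    gapped[start:end] = 0.0  # the region the model must fill back in
    return gapped

# Hypothetical usage: the model gets the gapped audio plus the FULL transcript,
# so it knows what the missing words should be and whose voice to imitate.
#
#   audio, sr = load_wav("clip.wav")     # 1-D float waveform (hypothetical loader)
#   gapped = mask_gap(audio, sr, 1.2, 1.8)
#   restored = inpaint(gapped, transcript="we swam in the river last weekend")
```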

Since image and speech inpainting are similar tasks, they use similar architectures. In this case, they use a model called Perceiver IO. It does the same as with an image, where you encode the input, extract the most useful information, perform modifications, and finally decode it to reconstruct another image with what you want to achieve. In the image inpainting example, the new image will simply be the same but with some pixels changed. In this case, instead of pixels coming from an image, the Perceiver IO architecture can work with pretty much any type of data, including mel spectrograms, which are basically our voice prints, representing our audio track using frequencies. Then, this spectrogram and the text transcript are encoded, edited, and decoded to replace the gap in the spectrogram with what should appear. As you see, this is just like generating an image, and we use the same process as in image inpainting, but the input and output data are spectrograms, or basically images of our soundtrack. If you are interested in learning more about the Perceiver IO architecture, I'd strongly recommend watching Yannic Kilcher's video about it.
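If you have never met a mel spectrogram, here is a short sketch of how one is typically computed with librosa. The parameter values are common TTS-style defaults and an assumption on my part, not necessarily what the paper uses.

```python
import librosa
import numpy as np

# Turn speech into a mel spectrogram: a 2-D "image" of the audio, with
# mel-frequency bins on one axis and time frames on the other.
y, sr = librosa.load("clip.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)  # log scale is closer to how we hear

print(log_mel.shape)  # (80, num_frames): the "picture" the model edits
```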

They train their model on a speech dataset, creating random gaps in the audio tracks and trying to fill in the gaps during training. Then, they used a GAN approach for training to further improve the realism of the results.

Quickly: with GANs, there will be the model we saw, called a generator, and another model called a discriminator. The generator is trained to generate the new data, in our case the inpainted audio track. Simultaneously, the discriminator is fed samples from the training dataset as well as generated samples, and it tries to guess whether the sample it sees was generated (called fake) or real, i.e., from the training set. Ideally, we'd want our discriminator to be right half of the time, so that it basically chooses randomly, meaning that our generated samples sound just like real ones. The discriminator will then penalize the generator model in order to make it sound more realistic.
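Here is a bare-bones PyTorch sketch of that generator/discriminator dynamic. The `generator` and `discriminator` modules, the text conditioning, and the loss mix are all simplified assumptions on my part; SpeechPainter's actual adversarial setup is more involved.

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt,
             masked_spec, real_spec, text_embedding):
    # --- Discriminator: score real spectrograms high, generated ones low ---
    fake_spec = generator(masked_spec, text_embedding)
    d_real = discriminator(real_spec)
    d_fake = discriminator(fake_spec.detach())   # don't backprop into G here
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- Generator: fool the discriminator while staying close to the target ---
    d_fake = discriminator(fake_spec)
    g_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    g_rec = F.l1_loss(fake_spec, real_spec)      # keeps the fill on-text and on-voice
    g_loss = g_adv + g_rec
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

At equilibrium, the discriminator's real/fake guesses approach chance, which is exactly the "right half of the time" behavior described above.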

And voilà! You end up with a model that can take speech and its transcript to correct your grammar or pronunciation, or even fill in gaps following your voice and the track's atmosphere. This is so cool: you just have to train this model once on a general dataset and then use it with your own audio tracks, as it should ideally be able to generalize and work quite well. Of course, there are some failure cases, but the results are pretty impressive, and you can listen to more examples on their project page linked below.

One last thing before ending this video, which you may have been waiting for since the intro: once again this year, NVIDIA GTC 2022 will run online, starting March 21st and continuing for the rest of the week, and I will be running a giveaway in collaboration with NVIDIA for the event. There are only two steps to have a chance to win one of the rarest things on the market right now, a GPU; more precisely, an NVIDIA RTX 3080 Ti. First, attend the event and send me a screenshot of one of the sessions on Twitter. Second, be sure to be subscribed to the channel and comment with your Twitter handle and what you are most excited about for the GTC event. GTC 22 will focus on accelerated computing, AI, deep learning, data science, and much more, with amazing speakers. You can find more detail with the link below, but if you are interested in AI and its applications, this is one of the coolest events to attend for sure, and it's completely free for everyone using the link in the description below. Good luck to anyone participating in the giveaway, and I hope you are as excited as I am for GTC. I will draw the winners after the event at the end of March and share it in my newsletter. I hope you enjoyed this video, and thank you to everyone participating in the giveaway and listening to the full video.