Have you ever tuned in to a video or a TV show and the actors were completely inaudible, or the music was way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model, as well as a new dataset, tackling the task of separating the different sound sources in a soundtrack. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn up or down the audio track you want, for example to give more importance to the speech than to the music.
The problem here is isolating any independent sound source from a complex acoustic scene like a movie scene or a YouTube video where some sounds are not well balanced.
Sometimes you simply cannot hear some actors because of the music playing or explosions or other ambient sounds in the background.
Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear all the other actors correctly.
This is exactly what the researchers achieved. Learn more in the video!
►Read the full article:
https://www.louisbouchard.ai/isolate-voice-music-and-sound-effects-with-ai/
►Petermann, D., Wichern, G., Wang, Z.-Q., & Le Roux, J. (2021). The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks. https://arxiv.org/pdf/2110.09958.pdf
►Project page: https://cocktail-fork.github.io/
►DnR dataset: https://github.com/darius522/dnr-utils#overview
►My Newsletter (A new AI application explained weekly to your emails!):
https://www.louisbouchard.ai/newsletter/
Have you ever tuned in to a video or a TV show and the sound was like this, where the actors are completely inaudible, or something like this, where the music is way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model, as well as a new dataset, tackling this task of separating the different sound sources in a soundtrack. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn up or down the audio track you want to give more importance to the speech than the music. (Demo clip: "...and straight into the hot pan.")

The problem here is isolating any independent sound source from a complex acoustic scene, like a movie scene or a YouTube video where some sounds are not well balanced. Sometimes you simply cannot hear some actors because of the music playing, or explosions, or other ambient sounds in the background. Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear all the other actors correctly, as we just did. For someone who isn't a native English speaker, this will be incredibly useful when listening to videos with loud background music and actors or speakers with a strong accent I am not used to. Just imagine having these three sliders on a YouTube video to manually tweak them. How cool would that be?
(Demo clip: "You have a metal arm!") It could also be incredibly useful for translation or speech-to-speech applications, where we could just isolate the speaker to improve the task's results. Here, the researchers focused on the task of splitting a soundtrack into three categories: music, speech, and sound effects, three categories that are often found in movies or TV shows. They called this task the cocktail fork problem, and you can clearly see where they got the name from. I'll spoil the results for you: they are quite amazing, as we will hear in the next few seconds. But first, let's take a look at how they take a movie soundtrack and transform it into three independent soundtracks.

This is the architecture of the model. You can see the input mixture y, which is the complete soundtrack, at the top, and at the bottom our three output sources x, which, I repeat, are the speech, music, and other sound effects, separated.
The first step is to encode the soundtrack using a Fourier transform at different resolutions, called the STFT, or short-time Fourier transform. This means that the input, which is a soundtrack whose frequency content evolves over time, is first split into shorter segments. For example, here it is split with either 32, 64, or 256 millisecond windows. Then, we compute the Fourier transform on each of these shorter segments, advancing 8 milliseconds at a time for each window or segment. This gives the Fourier spectrum of each segment, analyzed at different segment sizes for the same soundtrack, allowing us to have both short-term and long-term information on the soundtrack, for example by emphasizing specific frequencies from the initial input if they appear more often in a longer segment. This information, initially represented in the time domain, is now replaced by the Fourier phase and magnitude components, or Fourier spectrum, which can be shown in a spectrogram similar to this one. Note that here we have only one overlapping segment length of 0.10 seconds, but it is the same thing in our case with three different segment sizes, also overlapping.
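To make this front end concrete, here is a minimal sketch of a multi-resolution STFT in PyTorch, assuming a 44.1 kHz mono soundtrack and the 32/64/256 ms windows with an 8 ms hop mentioned above (my own illustration, not the authors' code):

```python
# Minimal sketch of the multi-resolution STFT front end (assumed sample rate).
import torch

SAMPLE_RATE = 44100
WINDOW_MS = [32, 64, 256]   # the three window sizes mentioned in the video
HOP_MS = 8                  # the 8 ms hop shared by all resolutions

def multi_resolution_stft(waveform: torch.Tensor):
    """Return one complex spectrogram per window size for the same waveform."""
    hop = int(SAMPLE_RATE * HOP_MS / 1000)
    spectra = []
    for win_ms in WINDOW_MS:
        win = int(SAMPLE_RATE * win_ms / 1000)
        spec = torch.stft(
            waveform,
            n_fft=win,
            hop_length=hop,
            win_length=win,
            window=torch.hann_window(win),
            return_complex=True,   # keep magnitude and phase together
        )
        spectra.append(spec)       # shape: (freq_bins, frames); freq_bins differs per window
    return spectra

# Example: three spectrograms of a 10-second mono soundtrack
mix = torch.randn(SAMPLE_RATE * 10)
specs = multi_resolution_stft(mix)
```

Each window size produces a spectrogram with a different number of frequency bins but, thanks to the shared 8 ms hop, the same number of frames, which is what lets the next step line the resolutions up and combine them.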
Then, this transformed representation, simply containing more information about the soundtrack, is sent into a fully connected block to be projected to the same dimension for all branches. This transformation is the first module that is learned during training of the algorithm. We then average the results, as this is shown to improve the model's capacity to consider these multiple sources as a whole rather than independently. Here, the multiple sources are the transformed soundtracks obtained with the differently sized windows.
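The projection to a common dimension and the averaging across resolutions could then look like the following sketch, where the embedding size of 256 and the use of magnitude spectrograms are my own assumptions for illustration, not the paper's exact choices:

```python
# Sketch of the per-resolution fully connected block plus averaging (assumptions noted above).
import torch
import torch.nn as nn

SAMPLE_RATE = 44100
WINDOW_MS = (32, 64, 256)
EMBED_DIM = 256   # hypothetical shared embedding size

# Number of frequency bins produced by a one-sided STFT for each window size
N_BINS = [int(SAMPLE_RATE * ms / 1000) // 2 + 1 for ms in WINDOW_MS]

# One learned fully connected block per resolution, all mapping to EMBED_DIM
encoders = nn.ModuleList(nn.Linear(bins, EMBED_DIM) for bins in N_BINS)

def encode_and_average(specs):
    """specs: list of complex spectrograms, one per window size, each shaped
    (bins, frames) with a shared 8 ms hop so frame counts match.
    Returns averaged features shaped (frames, EMBED_DIM)."""
    feats = [enc(spec.abs().transpose(0, 1)) for enc, spec in zip(encoders, specs)]
    return torch.stack(feats).mean(dim=0)
```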
Don't give up yet; we just have a few steps left before hearing the final results. This averaged information is then sent into a bidirectional long short-term memory (BLSTM), which is a type of recurrent neural network, allowing the model to understand the inputs over time, just like a convolutional neural network understands images over space. If you are not familiar with recurrent neural networks, I invite you to watch the video I made introducing them. This is the second module that is trained. We average the results once again and finally send them to each of our three branches, which will extract the appropriate sounds for each category. Here, the decoder is again simply fully connected layers, as you can see on the right. They are responsible for extracting only the wanted information from our encoded representation. Of course, this is the third and last module that learns during training in order to achieve this, and all three modules are trained simultaneously. Finally, we just reverse the first step, taking the spectral data back into the time domain, and voilà, we have our final soundtrack divided into three categories.
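Putting the remaining pieces together, here is a minimal end-to-end sketch of this kind of three-stem separator, under my own simplifying assumptions (a single STFT resolution, magnitude masking, one BLSTM, one linear decoder per stem); it illustrates the structure described above rather than reproducing the authors' exact model:

```python
# Toy three-stem separator: encoder -> BLSTM -> per-stem decoders -> inverse STFT.
import torch
import torch.nn as nn

SOURCES = ("music", "speech", "sfx")   # the three stems
EMBED_DIM = 256                        # hypothetical shared feature size
N_FFT = 2822                           # 64 ms window at 44.1 kHz (middle resolution)
HOP = 352                              # 8 ms hop at 44.1 kHz
N_BINS = N_FFT // 2 + 1

class ThreeStemSeparator(nn.Module):
    def __init__(self):
        super().__init__()
        # Module 1: encoder (a single resolution here, for brevity)
        self.encoder = nn.Linear(N_BINS, EMBED_DIM)
        # Module 2: bidirectional LSTM modeling the features over time
        self.blstm = nn.LSTM(EMBED_DIM, EMBED_DIM, batch_first=True,
                             bidirectional=True)
        # Module 3: one fully connected decoder per stem, predicting a mask
        self.decoders = nn.ModuleDict({
            s: nn.Linear(2 * EMBED_DIM, N_BINS) for s in SOURCES
        })

    def forward(self, mix: torch.Tensor) -> dict:
        window = torch.hann_window(N_FFT)
        spec = torch.stft(mix, N_FFT, HOP, window=window, return_complex=True)
        mag = spec.abs().transpose(0, 1).unsqueeze(0)         # (1, frames, bins)
        feats, _ = self.blstm(self.encoder(mag))              # (1, frames, 2*EMBED_DIM)
        stems = {}
        for name, dec in self.decoders.items():
            mask = torch.sigmoid(dec(feats)).squeeze(0).transpose(0, 1)  # (bins, frames)
            # Apply the mask to the mixture spectrogram, keep the mixture phase,
            # then invert the STFT back to a waveform ("reverse the first step").
            stems[name] = torch.istft(spec * mask, N_FFT, HOP, window=window,
                                      length=mix.shape[-1])
        return stems

separator = ThreeStemSeparator()
mixture = torch.randn(44100 * 5)        # 5 seconds of audio as a stand-in
estimates = separator(mixture)          # dict with "music", "speech", "sfx" waveforms
```

In the model described in the video, the encoder works on all three STFT resolutions and the averaging steps tie them together, but the three learned modules and the final inverse transform play the same roles as in this sketch.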
As I said earlier in the video, this research allows you to turn up or down the volume of each category independently. But an example is always better than words, so let's quickly hear that on two different clips.

[Music]

(Demo clip: "Hi, Phil Swift here for Flex Tape, the super strong waterproof tape that can instantly patch, bond...")

As if it wasn't already cool enough, the separation also allows you to edit a specific audio track individually, to add some filters or reverb. (Demo clip: "...super strong, and once it's on, it holds on tight. And for emergency auto repairs..." "...even in the toughest conditions.")
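Conceptually, those three sliders are just per-stem gains applied before summing the separated tracks back together. A tiny sketch of that remixing step (my own illustration, reusing the hypothetical `estimates` dictionary from the sketch above):

```python
# Remix separated stems with user-chosen gains; 1.0 means "leave unchanged".
import torch

def remix(stems: dict, gains: dict) -> torch.Tensor:
    return sum(gains.get(name, 1.0) * audio for name, audio in stems.items())

# Example: keep the speech, lower the music, slightly lower the effects
louder_dialogue = remix(estimates, {"speech": 1.0, "music": 0.3, "sfx": 0.7})
```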
They also released a dataset for this new task by merging three separate datasets: one for speech, one for music, and another for sound effects. This way, they created soundtracks for which they already had the real, separated audio channels, and could train their model to replicate this ideal separation. Of course, the merging, or mixing, step wasn't as simple as it sounds. They had to make the final soundtracks as challenging as a real movie scene. This means that they had to apply transformations to the independent audio tracks to get a good blend that sounds realistic, in order to be able to train a model on this dataset and then use it in the real world.
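In spirit, building such a training example boils down to loading one clip per category, rescaling them so the blend sounds plausible, and summing them while keeping the originals as targets. A toy sketch of that idea (my own simplification; the gain range and the use of the soundfile library are assumptions, not the DnR recipe):

```python
# Toy mixture builder: three stems in, one mixture plus ground-truth targets out.
import numpy as np
import soundfile as sf

rng = np.random.default_rng(0)

def make_mixture(speech_path: str, music_path: str, sfx_path: str):
    """Build one synthetic soundtrack plus its ground-truth stems."""
    stems = {}
    for name, path in [("speech", speech_path), ("music", music_path), ("sfx", sfx_path)]:
        audio, _ = sf.read(path, dtype="float32")
        gain = rng.uniform(0.3, 1.0)     # assumed gain range for a plausible blend
        stems[name] = gain * audio
    n = min(len(a) for a in stems.values())
    stems = {k: v[:n] for k, v in stems.items()}   # trim to a common length
    mixture = sum(stems.values())
    return mixture, stems                          # model input and training targets
```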
I invite you to read their paper for more technical detail about their implementation and this new dataset they introduced, if you'd like to tackle this task as well. If you do so, please let me know and send me your progress; I'd love to see that, or rather, to hear that. Both are linked in the description below. Thank you very much for watching, for those of you who are still here, and huge thanks to Anthony Manilow, the most recent YouTube member supporting the videos!