date 2021-10-25
I explain Artificial Intelligence terms and news to non-experts.
Have you ever tuned in to a video or a TV show where the actors were completely inaudible, or the music was way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model, as well as a new dataset, tackling this task of identifying the right soundtrack. For example, if we take an audio clip where the music is way too loud, you can simply turn the audio track you want up or down, giving more importance to the speech than to the music.
The problem here is isolating any independent sound source from a complex acoustic scene, like a movie scene or a YouTube video, where some sounds are not well balanced.
Sometimes you simply cannot hear some actors because of the music playing or explosions or other ambient sounds in the background.
Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear all the other actors correctly.
This is exactly what the researchers achieved. Learn more in the video!
►Read the full article:
https://www.louisbouchard.ai/isolate-voice-music-and-sound-effects-with-ai/
►Petermann, D., Wichern, G., Wang, Z., & Le Roux, J. (2021). The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks. https://arxiv.org/pdf/2110.09958.pdf
►Project page: https://cocktail-fork.github.io/
►DnR dataset: https://github.com/darius522/dnr-utils#overview
Have you ever tuned in to a video or a TV show and the sound was like this, where the actors are completely inaudible, or something like this, where the music is way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model as well as a new dataset tackling this task of identifying the right soundtrack. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn up or down the audio track you want, to give more importance to the speech than the music.
[Clip audio: "and straight into the hot pans"]
The problem here is isolating any independent sound source from a complex acoustic scene, like a movie scene or a YouTube video, where some sounds are not well balanced. Sometimes you simply cannot hear some actors because of the music playing, or explosions, or other ambient sounds in the background. Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear all the other actors correctly, as we just did.
Coming from someone who isn't a native English speaker, this will be incredibly useful when listening to videos with loud background music and actors or speakers with a strong accent I am not used to. Just imagine having these three sliders in a YouTube video to manually tweak them. How cool would that be?
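The "three sliders" idea can be sketched in a few lines, assuming the separated speech, music, and sound-effect stems are already available as NumPy arrays. The `remix` helper, the gain values, and the sine-wave stand-ins below are hypothetical illustrations, not part of the researchers' code.

```python
import numpy as np

def remix(stems, gains):
    """Rebuild a soundtrack as a gain-weighted sum of separated stems."""
    mix = sum(gains.get(name, 1.0) * audio for name, audio in stems.items())
    peak = np.max(np.abs(mix))
    # Scale down only if the new balance would clip outside [-1, 1].
    return mix / peak if peak > 1.0 else mix

# Synthetic stand-ins for the three separated stems (one second at 16 kHz).
sr = 16000
t = np.arange(sr) / sr
stems = {
    "speech": 0.2 * np.sin(2 * np.pi * 220 * t),
    "music": 0.2 * np.sin(2 * np.pi * 440 * t),
    "sfx": 0.2 * np.sin(2 * np.pi * 880 * t),
}

# All sliders at 1.0 gives back the original balance.
original = remix(stems, {"speech": 1.0, "music": 1.0, "sfx": 1.0})
# Turn the music down and the speech up, leaving sound effects untouched.
speech_forward = remix(stems, {"speech": 1.5, "music": 0.3, "sfx": 1.0})
```

Each slider is just a multiplier on one stem, which is why a good separation makes this kind of rebalancing almost free.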
[Clip audio: "you have a metal arm"]
It could also be incredibly useful for translations or speech-to-speech applications, where we could just isolate the speaker to improve the task's results. Here, the researchers focused on the task of splitting a soundtrack into three categories: music, speech, and sound effects, three categories that are often seen in movies or TV shows. They called this task the cocktail fork problem, and you can clearly see where they got the name from. I'll spoil the results for you: they are quite amazing, as we will hear in the next few seconds.
But first, let's take a look at how they receive a movie soundtrack and transform it into three independent soundtracks. This is the architecture of the model. You can see the input mixture y, which is the complete soundtrack, at the top, and at the bottom all of our three output sources x, which, I repeat, are the speech, music, and other sound effects, separated.
The first step is to encode the soundtrack using a Fourier transform at different resolutions, called the STFT, or short-time Fourier transform. This means that the input, which is a soundtrack having frequencies over time, is first split into shorter segments. For example, here it is split with 32, 64, or 256 millisecond windows. Then we compute the Fourier transform on each of these shorter segments, moving forward 8 milliseconds at a time for each window or segment. This gives the Fourier spectrum of each segment, analyzed on different segment sizes for the same soundtrack, allowing us to have short-term and long-term information on the soundtrack by emphasizing specific frequencies from the initial input if they appear more often in a longer segment, for example. This information, initially represented in the time domain, is now replaced by the Fourier phase and magnitude components, or Fourier spectrum, which can be shown in a spectrogram similar to this one.
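The multi-resolution encoding described above can be sketched with plain NumPy. This is a minimal illustration, not the authors' implementation: the 32, 64, and 256 ms windows and the 8 ms step come from the video, while the 16 kHz sample rate, the Hann window, the centered zero-padding, and the function names are assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16000          # assumption for this sketch
WINDOWS_MS = (32, 64, 256)   # window sizes mentioned in the video
HOP_MS = 8                   # step between consecutive windows

def stft(signal, win_len, hop_len):
    """Slice the signal into overlapping windows and take the FFT of each."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[i * hop_len: i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, win_len // 2 + 1)

def multi_resolution_stft(signal, sr=SAMPLE_RATE):
    """Compute one complex spectrogram per window size, sharing the same hop."""
    hop = sr * HOP_MS // 1000
    specs = {}
    for ms in WINDOWS_MS:
        win = sr * ms // 1000
        # Pad half a window on each side so every resolution yields the
        # same number of frames and the branches stay aligned in time.
        padded = np.pad(signal, (win // 2, win // 2))
        specs[ms] = stft(padded, win, hop)
    return specs

# One second of a 440 Hz tone as a stand-in for a soundtrack.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
specs = multi_resolution_stft(np.sin(2 * np.pi * 440 * t))
magnitudes = {ms: np.abs(s) for ms, s in specs.items()}
phases = {ms: np.angle(s) for ms, s in specs.items()}
```

Longer windows give more frequency bins (finer frequency detail) for the same number of frames, while shorter windows localize events in time more precisely.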
Note that here we have only an overlapping segment of 0.10 seconds, but it is the same thing in our case, with three different segment sizes, also overlapping. Then this transformed representation, simply containing more information about the soundtrack, is sent into a fully connected block to be transformed into the same dimension for all branches. This transformation is the first module that is learned during training of the algorithm. We then average the results, as this is shown to improve the model's capacity to consider these multiple sources as a whole rather than independently. Here, the multiple sources are the transformed soundtrack using differently sized windows. Don't give up yet, we just have a few steps left before hearing the final results.
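The projection-and-average step can be illustrated as follows. This is a rough sketch under stated assumptions: random weights stand in for the learned fully connected layers, and the hidden size, the ReLU activation, and the frame and bin counts are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FRAMES = 126    # same number of frames for every resolution (assumed)
HIDDEN = 64       # shared feature size, a hypothetical value
BINS = {32: 257, 64: 513, 256: 2049}  # frequency bins per window size (assumed)

# Stand-ins for the magnitude spectrograms of the three resolutions.
mags = {ms: rng.random((N_FRAMES, n_bins)) for ms, n_bins in BINS.items()}

# One fully connected layer per resolution. In a real model these weights
# are learned; here they are random placeholders.
weights = {ms: rng.normal(0.0, 0.01, (n_bins, HIDDEN)) for ms, n_bins in BINS.items()}
biases = {ms: np.zeros(HIDDEN) for ms in BINS}

def encode(mags):
    """Project every resolution to the same size, then average across them."""
    projected = [
        np.maximum(mags[ms] @ weights[ms] + biases[ms], 0.0)  # linear + ReLU
        for ms in mags
    ]
    return np.mean(projected, axis=0)  # shape: (N_FRAMES, HIDDEN)

shared = encode(mags)
```

Because every branch ends up with the same shape, the average is well defined even though the three spectrograms have very different numbers of frequency bins.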
This averaged information is then sent into a bidirectional long short-term memory, which is a type of recurrent neural network allowing the model to understand the inputs over time, just like a convolutional neural network understands images over space. If you are not familiar with recurrent neural networks, I invite you to watch the video I made introducing them. This is the second module that is trained during training. We average the results once again and finally send them to each of our three branches, which will extract the appropriate sounds for the category. Here, the decoder is simply fully connected layers again, as you can see on the right.
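To give a feel for what "bidirectional" means in the recurrent step described above, here is a deliberately simplified sketch. It swaps the LSTM cell for a plain tanh recurrent cell to stay short, so it is not the model's actual layer; the sizes and random weights are placeholders.

```python
import numpy as np

def rnn_pass(x, w_in, w_rec):
    """Run a plain tanh recurrent cell over the frames of x, in order."""
    h = np.zeros(w_rec.shape[0])
    outputs = []
    for frame in x:
        h = np.tanh(frame @ w_in + h @ w_rec)
        outputs.append(h)
    return np.stack(outputs)

def bidirectional(x, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    forward = rnn_pass(x, *fwd)
    # Reverse time, run the second cell, then flip back to align the frames.
    backward = rnn_pass(x[::-1], *bwd)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
T, D, H = 6, 8, 4  # frames, input features, hidden units (hypothetical)
x = rng.normal(size=(T, D))
fwd = (rng.normal(scale=0.5, size=(D, H)), rng.normal(scale=0.5, size=(H, H)))
bwd = (rng.normal(scale=0.5, size=(D, H)), rng.normal(scale=0.5, size=(H, H)))

out = bidirectional(x, fwd, bwd)  # shape: (T, 2 * H)
```

The first half of each output row only depends on past frames, while the second half only depends on future frames, so every frame gets context from both directions.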
They will be responsible for extracting only the wanted information from our encoded information. Of course, this is the third and last module that learns during training in order to achieve this, and all three modules are trained simultaneously. Finally, we just reverse the first step, taking the spectrum data back into a time-domain signal, and voilà, we have our final soundtrack divided into three categories. As I said earlier in the video, this research allows you to turn up or down the volume of each category independently. But an example is always better than words, so let's quickly hear that on two different clips.
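As a rough illustration of the per-category extraction described above, one common way to pull sources out of a spectrogram is to predict a mask per source and multiply it with the mixture. The softmax masks below are an assumption made for this sketch, not a confirmed detail of the paper's output layer, and random scores stand in for the decoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FRAMES, N_BINS = 126, 257          # hypothetical spectrogram size
SOURCES = ("speech", "music", "sfx")

# Stand-ins: the mixture magnitude and the raw scores of the three decoders.
mixture_mag = rng.random((N_FRAMES, N_BINS))
scores = rng.normal(size=(len(SOURCES), N_FRAMES, N_BINS))

def softmax_masks(scores):
    """Turn per-source scores into masks that sum to 1 in every bin."""
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

masks = softmax_masks(scores)
# Each source keeps only its share of the energy in each time-frequency bin.
estimates = {name: masks[i] * mixture_mag for i, name in enumerate(SOURCES)}
```

Each estimated spectrogram would then be combined with phase information and passed through an inverse STFT to obtain the three waveforms.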
[Music]
[Clip audio: "Hi, Phil Swift here for Flex Tape, the super strong waterproof tape that could instantly patch bonds"]
As if it wasn't already cool enough, the separation also allows you to edit a specific soundtrack independently, for example to add some filters or reverb.
[Clip audio: "we're strong and once it's on it holds on tight and for emergency auto repairs it's correct even in the toughest conditions"]
They also released the dataset for this new task, built by merging three separate datasets: one for speech, one for music, and another for sound effects. This way, they created soundtracks for which they already had the real separated audio channels and could train their model to replicate this ideal separation. Of course, the merging or mixing step wasn't as simple as it sounds. They had to make the final soundtrack as challenging as a real movie scene. This means that they had to apply transformations to the independent audio tracks to get a good blend that sounds realistic, in order to be able to train a model on this dataset and then use it in the real world.
I invite you to read their paper for more technical detail about their implementation and this new dataset they introduced, if you'd like to tackle this task as well. If you do so, please let me know and send me your progress. I'd love to see that, or rather to hear that. Both are linked in the description below. Thank you very much for watching, for those of you who are still here, and huge thanks to Anthony Manilow, the most recent YouTube member supporting the videos.
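For readers who want to try the task, the dataset idea described in the video (mixing separate speech, music, and sound-effect clips so that the ground-truth stems are known) can be sketched like this. The random gain range and the `make_training_mixture` helper are hypothetical stand-ins for the transformations the authors applied, not the actual DnR pipeline.

```python
import numpy as np

def make_training_mixture(speech, music, sfx, rng, gain_db_range=(-6.0, 6.0)):
    """Mix three clips with random gains and keep the scaled stems as targets."""
    targets = {}
    for name, clip in (("speech", speech), ("music", music), ("sfx", sfx)):
        gain_db = rng.uniform(*gain_db_range)
        targets[name] = clip * 10 ** (gain_db / 20)  # decibels to linear gain
    mixture = targets["speech"] + targets["music"] + targets["sfx"]
    return mixture, targets

rng = np.random.default_rng(0)
n = 16000  # one second at an assumed 16 kHz sample rate
# Noise clips stand in for excerpts drawn from three separate datasets.
speech, music, sfx = (0.1 * rng.normal(size=n) for _ in range(3))

mixture, targets = make_training_mixture(speech, music, sfx, rng)
```

A model is then trained to take `mixture` as input and reproduce the three entries of `targets`, which is possible only because the stems were known before mixing.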