Have you ever tuned in to a video or a TV show and the actors were completely inaudible, or the music was way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model as well as a new dataset tackling the task of isolating the right sounds within a soundtrack. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn individual audio tracks up or down, for instance to give more importance to the speech than to the music. The problem here is isolating any independent sound source from a complex acoustic scene like a movie scene or a YouTube video where some sounds are not well balanced. Sometimes you simply cannot hear some actors because of the music playing, explosions, or other ambient sounds in the background. Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear the actors correctly. This is exactly what the researchers achieved. Learn more in the video!

References
►Read the full article: https://www.louisbouchard.ai/isolate-voice-music-and-sound-effects-with-ai/
►Petermann, D., Wichern, G., Wang, Z., & Le Roux, J. (2021). The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks. https://arxiv.org/pdf/2110.09958.pdf
►Project page: https://cocktail-fork.github.io/
►DnR dataset: https://github.com/darius522/dnr-utils#overview
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video transcript

00:01 Have you ever tuned in to a video or a TV show and the sound was like this, where the actors are completely inaudible, or something like this, where the music is way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model as well as a new dataset tackling this task of isolating the right sounds in a soundtrack. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn the audio track you want up or down to give more importance to the speech than to the music.

00:42 [demo clip plays: "...and straight into the hot pans..."]

00:46 The problem here is isolating any independent sound source from a complex acoustic scene like a movie scene or a YouTube video where some sounds are not well balanced. Sometimes you simply cannot hear some actors because of the music playing or explosions or other ambient sounds in the background. Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear all the actors correctly, as we just did. Coming from someone who isn't a native English speaker, this will be incredibly useful when listening to videos with loud background music and actors or speakers with a strong accent I am not used to. Just imagine having these three sliders in a YouTube video to manually tweak them. How cool would that be?

01:39 [demo clip plays: "...you have a metal arm..."]

01:41 It could also be incredibly useful for translation or speech-to-speech applications, where we could just isolate the speaker to improve the task's results.
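The "three sliders" idea mentioned above boils down to remixing the separated stems with independent gains. Here is a minimal sketch of that remixing step, assuming the three stems have already been separated and saved as WAV files of the same length and sample rate; the file names and gain values are hypothetical.

```python
# Remixing separated stems with independent volume "sliders" (linear gains).
import numpy as np
import soundfile as sf  # pip install soundfile

# Load the three separated stems (assumed to share the same length and sample rate).
speech, sr = sf.read("speech_stem.wav")
music, _ = sf.read("music_stem.wav")
effects, _ = sf.read("effects_stem.wav")

def remix(speech, music, effects, speech_gain=1.0, music_gain=1.0, fx_gain=1.0):
    """Recombine the stems as a weighted sum, one gain per category."""
    mix = speech_gain * speech + music_gain * music + fx_gain * effects
    # Avoid clipping if the boosted mix exceeds [-1, 1].
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

# Turn the music down and the dialogue slightly up, keep the effects untouched.
new_mix = remix(speech, music, effects, speech_gain=1.2, music_gain=0.4, fx_gain=1.0)
sf.write("remixed.wav", new_mix, sr)
```

Once the separation itself is solved, this weighted sum is all a "music down, dialogue up" slider would need.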
01:49 Here, the researchers focused on the task of splitting a soundtrack into three categories: music, speech, and sound effects, three categories that are often seen in movies or TV shows. They called this task the cocktail fork problem, and you can clearly see where they got the name from. And I'll spoil the results for you: they are quite amazing, as we will hear in the next few seconds. But first, let's take a look at how they take a movie soundtrack and transform it into three independent soundtracks.

02:18 This is the architecture of the model. You can see the input mixture y, which is the complete soundtrack, at the top, and at the bottom all of our three output sources x, which, I repeat, are the speech, music, and other sound effects, separated. The first step is to encode the soundtrack using a Fourier transform on different resolutions, called STFT, or short-time Fourier transform. This means that the input, which is a soundtrack having frequencies over time, is first split into shorter segments. For example, here it is split with either 32, 64, or 256 millisecond windows. Then, we compute the Fourier transform on each of these shorter segments, moving 8 milliseconds at a time for each window, or segment.

03:04 This will give the Fourier spectrum of each segment, analyzed on different segment sizes for the same soundtrack, allowing us to have short-term and long-term information on the soundtrack by emphasizing specific frequencies from the initial input if they appear more often in a longer segment, for example. This information, initially represented in the time domain, is now replaced by the Fourier phase and magnitude components, or Fourier spectrum, which can be shown in a spectrogram similar to this one. Note that here we only have one overlapping segment size of 0.10 seconds, but it is the same thing in our case with three different segment sizes, also overlapping.

03:44 Then, this transformed representation, simply containing more information about the soundtrack, is sent into a fully connected block to be transformed into the same dimension for all branches. This transformation is the first module which is learned during training of the algorithm. We then average the results, as it is shown to improve the model's capacity to consider these multiple sources as a whole rather than independently. Here, the multiple sources are the transformed soundtracks using differently sized windows. Don't give up yet, we just have a few steps left before hearing the final results.

04:19 This averaged information is then sent into a bidirectional long short-term memory, which is a type of recurrent neural network allowing the model to understand the inputs over time, just like a convolutional neural network understands images over space. If you are not familiar with recurrent neural networks, I invite you to watch the video I made introducing them. This is the second module that is trained during training. We average the results once again and finally send them to each of our three branches that will extract the appropriate sounds for the category. Here, the decoder is simply fully connected layers again, as you can see on the right.
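To make the walkthrough above more tangible, here is a rough PyTorch sketch of that pipeline: multi-resolution STFT encoders, averaging across branches, a bidirectional LSTM, and one fully connected mask decoder per source. The window sizes (32, 64, 256 ms) and 8 ms hop follow the description in the video; the hidden size, the masking on a single reference resolution, and the other simplifications are my own assumptions, not the authors' exact implementation.

```python
# Sketch: multi-resolution STFTs -> linear encoders -> average -> BiLSTM ->
# one mask decoder per source (speech, music, effects) -> inverse STFT.
import torch
import torch.nn as nn

class ThreeStemSeparator(nn.Module):
    def __init__(self, sample_rate=44100, hidden=256, n_sources=3):
        super().__init__()
        self.hop = int(0.008 * sample_rate)                       # 8 ms hop
        self.win_sizes = [int(s * sample_rate) for s in (0.032, 0.064, 0.256)]
        self.windows = nn.ParameterList(
            [nn.Parameter(torch.hann_window(w), requires_grad=False) for w in self.win_sizes]
        )
        # One fully connected encoder per resolution, all projecting to `hidden`.
        self.encoders = nn.ModuleList(
            [nn.Linear(w // 2 + 1, hidden) for w in self.win_sizes]
        )
        self.blstm = nn.LSTM(hidden, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # One mask decoder per source, predicting bins of the finest resolution.
        ref_bins = self.win_sizes[0] // 2 + 1
        self.decoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * hidden, ref_bins), nn.Sigmoid())
             for _ in range(n_sources)]
        )

    def forward(self, mixture):                                   # mixture: (batch, samples)
        specs = [torch.stft(mixture, n_fft=w, hop_length=self.hop, win_length=w,
                            window=win, return_complex=True, center=True)
                 for w, win in zip(self.win_sizes, self.windows)]
        # Encode each resolution's magnitude and align on the shortest time axis.
        n_frames = min(s.shape[-1] for s in specs)
        feats = [enc(s.abs()[..., :n_frames].transpose(1, 2))     # (batch, frames, hidden)
                 for enc, s in zip(self.encoders, specs)]
        avg = torch.stack(feats).mean(dim=0)                      # average the branches
        seq, _ = self.blstm(avg)                                  # temporal modelling
        ref = specs[0][..., :n_frames]                            # finest-resolution STFT
        sources = []
        for dec in self.decoders:
            mask = dec(seq).transpose(1, 2)                       # (batch, bins, frames)
            est = torch.istft(ref * mask, n_fft=self.win_sizes[0],
                              hop_length=self.hop, win_length=self.win_sizes[0],
                              window=self.windows[0], length=mixture.shape[-1])
            sources.append(est)
        return torch.stack(sources, dim=1)                        # (batch, 3, samples)

model = ThreeStemSeparator()
demo = torch.randn(1, 44100 * 3)                                  # 3 seconds of fake audio
speech, music, effects = model(demo)[0]
```

If I understand the paper correctly, their model decodes a mask at every resolution and combines the reconstructions, whereas this sketch masks only the finest resolution to stay short; the overall encode, average, sequence-model, decode flow is the same.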
04:56 They will be responsible for extracting only the wanted information from our encoded information. Of course, this is the third and last module that learns during training in order to achieve this, and all three modules are trained simultaneously. Finally, we just reverse the first step, taking the spectral data back into the time domain, and voilà, we have our final soundtrack divided into three categories. As I said earlier in the video, this research allows you to turn up or down the volume of each category independently, but an example is always better than words, so let's quickly hear that on two different clips.

05:35 [Music]

05:41 [demo clip plays: "Hi, Phil Swift here for Flex Tape, the super strong waterproof tape that can instantly patch, bond..."]

05:53 As if it wasn't already cool enough, the separation also allows you to edit a specific soundtrack independently, to add some filters or reverb.

06:00 [demo clip continues: "...strong, and once it's on, it holds on tight. And for emergency auto repairs... it works even in the toughest conditions."]

06:13 They also released the dataset for this new task by merging three separate datasets: one for speech, one for music, and another for sound effects. This way, they created soundtracks for which they already had the real, separated audio channels and could train their model to replicate this ideal separation. Of course, the merging, or mixing, step wasn't as simple as it sounds. They had to make the final soundtrack as challenging as a real movie scene. This means that they had to apply transformations to the independent audio tracks to get a good blend that sounds realistic, in order to be able to train a model on this dataset and then use it in the real world. I invite you to read their paper for more technical details about their implementation and this new dataset they introduced, if you'd like to tackle this task as well. If you do so, please let me know and send me your progress. I'd love to see that, or rather, to hear that. Both are linked in the description below.

07:07 Thank you very much for watching, for those of you who are still here, and huge thanks to Anthony Manilow, the most recent YouTube member supporting the videos!
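To illustrate the dataset construction described near the end of the video, here is a hedged sketch of how one DnR-style training example could be assembled: one speech clip, one music clip, and one sound-effects clip are rescaled to different loudness levels and summed into a mixture, while the scaled clips are kept as the ground-truth stems. The target levels and file names are illustrative, not the exact recipe from the paper.

```python
# Building one training example: mixture = speech + music + effects,
# with the scaled stems kept as separation targets.
import numpy as np
import soundfile as sf  # pip install soundfile

def rescale_to_db(audio, target_db):
    """Scale a clip so its RMS energy sits at roughly target_db dBFS."""
    rms = np.sqrt(np.mean(audio ** 2)) + 1e-8
    gain = 10 ** (target_db / 20) / rms
    return audio * gain

# Clips are assumed to be trimmed to the same length and sample rate.
speech, sr = sf.read("speech_clip.wav")
music, _ = sf.read("music_clip.wav")
effects, _ = sf.read("effects_clip.wav")

# Hypothetical relative levels: dialogue in front, music and effects behind it.
stems = {
    "speech": rescale_to_db(speech, -17.0),
    "music": rescale_to_db(music, -24.0),
    "sfx": rescale_to_db(effects, -21.0),
}
mixture = sum(stems.values())

# The mixture is the model input; the scaled stems are the ground truth.
sf.write("mixture.wav", mixture, sr)
for name, stem in stems.items():
    sf.write(f"{name}_target.wav", stem, sr)
```

A realistic pipeline would also randomize the levels, onsets, and clip combinations so the mixtures sound as challenging as a real movie scene, which is the point the video makes about the mixing step.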