They reconstruct sound using cameras and a laser beam on any vibrating surface, allowing them to isolate musical instruments, focus on a specific speaker, remove ambient noise, and many more amazing applications. TLDR: Watch the video to learn more and hear some crazy results!

References
►Read the full article: https://www.louisbouchard.ai/cvpr-2022-best-paper
►Sheinin, Mark and Chan, Dorian and O'Toole, Matthew and Narasimhan, Srinivasa G., 2022, Dual-Shutter Optical Vibration Sensing, Proc. IEEE CVPR.
►Project page: https://imaging.cs.cmu.edu/vibration/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

This year I had the chance to be at CVPR in person and attend the amazing best paper award presentation, with this fantastic paper I had to cover on the channel: Dual-Shutter Optical Vibration Sensing, by Mark Sheinin, Dorian Chan, Matthew O'Toole, and Srinivasa Narasimhan. In one sentence, they reconstruct sound using cameras and a laser beam on any vibrating surface, allowing them to isolate musical instruments, focus on a specific speaker, remove ambient noise, and many more amazing applications. Let's dive into how they achieve that and hear some crazy results! But first, allow me one minute of your time to introduce you to a fantastic company, the sponsor of this video: AssemblyAI.

AssemblyAI is a company that offers accurate APIs for speech-to-text and audio intelligence. You can use their APIs to automatically transcribe and understand audio and video data in just a few lines of code, and to automatically convert asynchronous and live audio streams into text, something extremely challenging to do that typically requires robust and costly models. Of course, it doesn't stop there: AssemblyAI will also process your audio data into informative feature representations, allowing you to easily add text-based features like summarization, content moderation, topic detection, and more, all in one. If you need to understand or transcribe audio or video data, try AssemblyAI with the first link below.

Let's start by listening to this example of what the method can achieve.

[Music]

You could clearly hear the two individual guitars in each audio track. This was made using not a recorded sound but a laser and two cameras equipped with rolling and global shutter sensors, respectively. It seems like tackling this task through vision makes it much easier than trying to split the audio tracks after recording. It also means we can record anything through glass and from any vibrating object. Here, they used their method on the speakers themselves to isolate the left and right speakers, whereas a microphone would automatically record both and blend the audio tracks.

[Music]

Typically, this kind of spy technology, called visual vibrometry, requires perfect lighting conditions and high-speed cameras that look like a camouflaged sniper rifle to capture high-speed vibrations of up to 63 kilohertz. Here, they achieve similar results with sensors built for only 60 and 130 hertz, and, even better, they can process multiple objects at once. Still, this is a very challenging task requiring a lot of engineering and great ideas to make it happen.

They do not simply record the instruments and send the video to a model that automatically creates and separates the audio. They first need to understand the laser signal they receive and process it correctly. They point a laser at the surface they want to listen to; the laser then bounces off the surface onto a focus plane. This focus plane is where we take our information from, not the instruments or objects themselves. So we analyze the tiny vibrations of the objects of interest through the laser response, creating a representation like this.

This two-dimensional laser response pattern caught by our cameras, called a speckle, is then processed both globally and locally using our two cameras. Our local camera, the rolling shutter camera, captures frames at only 60 fps, so it takes multiple pictures and rolls them along the y-axis to get a really noisy and inaccurate 63-kilohertz representation. This is where the global shutter camera becomes necessary: because of the randomness in the speckle imaging, due to the roughness of the object's surface and its movements, it basically takes a global screenshot of the same speckle image we used with our first camera and uses this new image as a reference frame to isolate only the relevant vibrations from the rolling shutter captures. In short, the rolling shutter camera samples the scene row by row at a high frequency, while the global shutter camera samples the entire scene at once to serve as a reference frame, and we repeat this process for the whole video.
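To make the sampling trick concrete, here is a minimal numpy sketch of the idea, a toy 1-D simulation rather than the authors' actual pipeline. All of its specifics are assumptions for illustration: the row count ROWS, the drift model, and collapsing each global-shutter frame into a single per-frame reference value (the real system compares 2-D speckle images). What it shows is why a 60 fps rolling shutter sensor can sample vibrations at rows × fps, here 1050 × 60 = 63,000 samples per second, and why the global-shutter reference is needed to cancel the speckle's slow wander.

```python
import numpy as np

# --- Toy parameters (hypothetical, chosen only for illustration) ---
FPS = 60            # rolling-shutter frame rate, as in the video
ROWS = 1050         # rows read out per frame (assumed value)
RATE = FPS * ROWS   # effective 1-D sampling rate: 1050 * 60 = 63,000 Hz

rng = np.random.default_rng(0)
duration = 0.5                              # seconds of simulated capture
t = np.arange(int(duration * RATE)) / RATE

# "Audio" driving the surface: a 440 Hz tone plus a quieter 1.2 kHz overtone.
audio = 0.7 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)

# Observed speckle displacement = audio vibration + slow random drift
# (object sway, laser wander) that would otherwise swamp the signal.
drift = np.cumsum(rng.standard_normal(t.size)) * 1e-3
speckle_shift = audio + drift

# Rolling shutter: row y of frame k is exposed at time (k*ROWS + y) / RATE,
# so reading the frames row by row samples speckle_shift at 63 kHz even
# though whole frames only arrive 60 times per second. In this 1-D toy the
# samples are read off directly; the real system estimates each row's shift
# by comparing speckle patterns.
rolling_samples = speckle_shift

# Global shutter: one full-speckle snapshot per frame, reduced here to a
# single reference value per frame and held constant for that frame's rows.
frame_refs = speckle_shift.reshape(-1, ROWS).mean(axis=1)
reference = np.repeat(frame_refs, ROWS)

# Subtracting the per-frame reference cancels the slow drift while leaving
# the audio-band vibration intact.
recovered = rolling_samples - reference

# Sanity check against the ground-truth audio (per-frame mean removed too).
truth = audio - np.repeat(audio.reshape(-1, ROWS).mean(axis=1), ROWS)
print("correlation with ground truth:", np.corrcoef(recovered, truth)[0, 1])
```

Running this prints a correlation close to 1.0: the per-frame reference removes almost all of the drift, which is exactly the role the global shutter camera plays here, while the paper's harder engineering lies in estimating those shifts from real 2-D speckle images.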
And voilà! This is how they are able to split sound from a recording, extract only a single instrument, remove ambient noise, or even reconstruct speech from the vibrations of a bag of chips: "Mary had a little lamb, its fleece was white as snow." Of course, this is just a simple overview of this great paper, and I strongly invite you to read it for more information. Congratulations to the authors for the honorable mention; I was glad to attend the event and see the presentation live. I'm super excited about the future publications this paper will motivate. I also invite you to double-check all the bags of chips you may leave near a window, or otherwise some people may listen to what you say! Thank you for watching the whole video, and let me know how you'd apply this technology and whether you see any potential risks or exciting use cases; I'd love to discuss these with you. And a special thanks to CVPR for inviting me to the event; it was really cool to be there in New Orleans with all the researchers and companies. I will see you next week with another amazing paper!