TLDR: They reconstruct sound using cameras and a laser beam pointed at any vibrating surface, which lets them isolate musical instruments, focus on a specific speaker, remove ambient noise, and enable many more amazing applications.
►Read the full article: https://www.louisbouchard.ai/cvpr-2022-best-paper/
►Sheinin, Mark and Chan, Dorian and O'Toole, Matthew and Narasimhan, Srinivasa G., 2022, Dual-Shutter Optical Vibration Sensing, Proc. IEEE CVPR.
►Project page: https://imaging.cs.cmu.edu/vibration/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
0:00
this year i had the chance to be at cvpr
0:02
in person and attend the amazing best
0:05
paper award presentation with this
0:07
fantastic paper i had to cover on the
0:09
channel called dual shutter optical
0:12
vibration sensing by mark sheinin dorian
0:15
chan matthew o'toole and srinivasa
0:18
narasimhan in one sentence they
0:21
reconstruct sound using cameras and a
0:23
laser beam on any vibrating surface
0:26
allowing them to isolate musical
0:28
instruments focus on a specific speaker
0:30
remove ambient noises and many more
0:33
amazing applications let's dive into how
0:35
they achieve that and hear some crazy
0:37
results but first allow me one minute of
0:40
your time to introduce you to a
0:41
fantastic company the sponsor of this
0:44
video assembly ai assembly ai is a
0:47
company that offers accurate apis for
0:49
speech to text and audio intelligence
0:52
you can use their apis to automatically
0:54
transcribe and understand audio and
0:56
video data in just a few lines of code
0:58
and automatically convert asynchronous
1:00
and live audio streams into text
1:03
something extremely challenging to do
1:05
and typically requiring robust and
1:07
costly models of course it doesn't stop
1:10
here assembly ai will also process your
1:12
audio data and have informative feature
1:15
representations allowing you to easily
1:17
add text-based features like
1:19
summarization content moderation topic
1:21
detection and more all in one if you
1:24
need to understand or transcribe audio
1:26
or video data try assembly ai with the
1:29
first link below
1:33
let's start by listening to this example
1:35
of what the method can achieve
1:38
[Music]
1:53
you could clearly hear the two
1:54
individual guitars in each audio track
1:57
this was made using not a recorded sound
2:00
but a laser and two cameras equipped
2:02
with rolling and global shutter sensors
2:05
respectively it seems like tackling this
2:08
task through vision makes it much easier
2:10
than trying to split the audio tracks
2:12
after recording it also means we can
2:15
record anything through glass and from
2:18
any vibrating objects here they used
2:21
their method on the speakers themselves
2:23
to isolate the left and right speakers
2:25
whereas a microphone will automatically
2:27
record both and blend the audio tracks
2:41
[Music]
2:45
typically this kind of spy technology
2:48
called visual vibrometry requires
2:51
perfect lighting conditions and
2:52
high-speed cameras that look like a
2:54
camouflaged sniper to capture high-speed
2:56
vibrations of up to 63 kilohertz here
3:00
they achieve similar results with
3:02
sensors built for only 60 and 130 hertz
3:06
and even better they can process
3:08
multiple objects at once still this is a
3:11
very challenging task requiring a lot of
3:13
engineering and great ideas to make it
3:16
happen they do not simply record the
3:18
instruments and send the video to a
3:20
model that automatically creates and
3:22
separates the audio they first need to
3:24
understand the laser they receive and
3:26
process it correctly they orient a laser
3:29
on the surface to listen to then this
3:32
laser bounces from the surface into a
3:34
focus plane this focus plane is where we
3:37
will take our information from not the
3:39
instruments or objects themselves so we
3:42
will analyze the tiny vibrations of the
3:44
objects of interest through the laser
3:46
response creating a representation like
3:49
this
3:50
this two-dimensional laser response
3:52
pattern caught by our cameras called a
3:54
speckle is then processed both globally
3:58
and locally using our two cameras our
4:01
local camera or the rolling shutter
4:03
camera will capture frames at only 60
4:06
fps but it exposes each picture row by row
4:08
along the y-axis which gives us a
4:11
really noisy and inaccurate 63 kilohertz
4:14
representation this is where the global
4:16
shutter camera is necessary because of
4:18
the randomness in the speckle imaging
4:21
due to the roughness of the object's
4:23
surface and its movements it will
4:25
basically take a global screenshot of
4:27
the same speckle image we used with our
4:29
first camera and use this new image as
4:32
a reference frame to isolate only
4:34
relevant vibrations from the rolling
4:37
shutter captures
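A rough back-of-the-envelope sketch of where a figure like 63 kilohertz can come from: if every row of a rolling shutter frame is treated as its own sample in time, the sampling rate is roughly the frame rate multiplied by the number of rows. The 60 fps figure is from the video. The row count of 1050 is a hypothetical value chosen only so the product lands on 63 kHz, and the function name is made up for illustration. The sketch also ignores any idle time between frames.

```python
def effective_sample_rate_hz(frames_per_second: int, rows_per_frame: int) -> int:
    """Idealized rolling shutter sampling rate.

    Each row is read out at a slightly different moment, so one frame
    yields rows_per_frame samples in time instead of a single one.
    """
    return frames_per_second * rows_per_frame


FRAME_RATE_FPS = 60     # rolling shutter frame rate mentioned in the video
ROWS_PER_FRAME = 1050   # assumption: picked so that 60 * 1050 = 63 000

sample_rate_hz = effective_sample_rate_hz(FRAME_RATE_FPS, ROWS_PER_FRAME)
print(f"{sample_rate_hz} Hz")  # 63000 Hz
```

The real sensor geometry and readout timing may differ, so treat the numbers as an illustration of the row-per-sample idea only.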
4:38
the rolling shutter camera will sample
4:40
the scene row by row with a high
4:42
frequency while the global shutter
4:44
camera will sample the entire scene at
4:47
once to serve as a reference frame and
4:49
we repeat this process for the whole
4:51
video
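To make the reference frame idea more concrete, here is a toy simulation. It is not the authors' algorithm, only a minimal sketch under simplifying assumptions: the speckle is replaced by random noise, the vibration only shifts each row horizontally by a whole number of pixels, and there is no sensor noise. The global shutter image plays the role of the reference, each rolling shutter row is a shifted copy of the matching reference row, and a circular cross-correlation recovers the shift for every row. All names (`row_shift`, `reference`, `rolling`) are hypothetical.

```python
import numpy as np


def row_shift(reference_row: np.ndarray, observed_row: np.ndarray) -> int:
    """Return the integer circular shift s with observed_row ~ np.roll(reference_row, s)."""
    # Circular cross-correlation computed through the FFT.
    spectrum = np.fft.fft(observed_row) * np.conj(np.fft.fft(reference_row))
    correlation = np.fft.ifft(spectrum).real
    peak = int(np.argmax(correlation))
    n = len(reference_row)
    # Map peaks in the upper half of the range to negative shifts.
    return peak if peak <= n // 2 else peak - n


rng = np.random.default_rng(0)
rows, width = 64, 256

# Stand-in for the speckle image seen by the global shutter camera.
reference = rng.standard_normal((rows, width))

# Assumed vibration: a sine wave that moves the pattern by a few pixels,
# sampled once per row because rows are exposed one after another.
true_shifts = np.round(5 * np.sin(2 * np.pi * np.arange(rows) / 16)).astype(int)

# Rolling shutter capture: every row sees the pattern at a different shift.
rolling = np.stack([np.roll(reference[r], true_shifts[r]) for r in range(rows)])

# One recovered sample of the vibration signal per row.
recovered = np.array([row_shift(reference[r], rolling[r]) for r in range(rows)])
print(np.array_equal(recovered, true_shifts))
```

In this idealized setting the recovered per-row shifts match the simulated vibration. Real speckle moves in two dimensions, by fractions of a pixel, and with noise, which is part of why the actual method is much more involved.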
4:52
and voila this is how they are able to
4:55
split sound from a recording extract
4:57
only a single instrument remove ambient
5:00
noise or even reconstruct speech from
5:02
the vibrations of a bag of chips
5:05
mary had a little lamb its fleece was
5:08
white as snow of course this is just a
5:10
simple overview of this great paper and
5:12
i strongly invite you to read it for
5:14
more information congratulations to the
5:16
authors for the honorable mention i
5:18
was glad to attend the event and see the
5:21
presentation live i'm super excited for
5:23
the future publications this paper will
5:25
motivate i also invite you to double
5:27
check all the bags of chips you may
5:29
leave near a window otherwise some
5:31
people may listen to what you say thank
5:34
you for watching the whole video and let
5:36
me know how you'd apply this technology
5:38
and if you see any potential risks or
5:40
exciting use cases i'd love to discuss
5:42
these with you and a special thanks to
5:45
cvpr for inviting me to the event it was
5:47
really cool to be there in new orleans
5:49
with all the researchers and companies i
5:52
will see you next week with another
amazing paper