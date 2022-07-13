I explain Artificial Intelligence terms and news to non-experts.
TLDR: They reconstruct sound using cameras and a laser beam on any vibrating surface, allowing them to isolate music instruments, focus on a specific speaker, remove ambient noises, and many more amazing applications.
►Read the full article: https://www.louisbouchard.ai/cvpr-2022-best-paper/
►Sheinin, Mark and Chan, Dorian and O'Toole, Matthew and Narasimhan,
Srinivasa G., 2022, Dual-Shutter Optical Vibration Sensing, Proc. IEEE
CVPR.
►Project page: https://imaging.cs.cmu.edu/vibration/
this year i had the chance to be at cvpr
in person and attend the amazing best
paper award presentation with this
fantastic paper i had to cover on the
channel called dual shutter optical
vibration sensing by mark sheinin dorian
chan matthew o'toole and srinivasa
narasimhan in one sentence they
reconstruct sound using cameras and a
laser beam on any vibrating surface
allowing them to isolate music
instruments focus on a specific speaker
remove ambient noises and many more
amazing applications let's dive into how
they achieve that and hear some crazy
results but first allow me one minute of
your time to introduce you to a
fantastic company the sponsor of this
video assembly ai assembly ai is a
company that offers accurate apis for
speech to text and audio intelligence
you can use their apis to automatically
transcribe and understand audio and
video data in just a few lines of code
and automatically convert asynchronous
and live audio streams into text
something extremely challenging to do
and typically requiring robust and
costly models of course it doesn't stop
here assembly ai will also process your
audio data and have informative feature
representations allowing you to easily
add text-based features like
summarization content moderation topic
detection and more all in one if you
need to understand or transcribe audio
or video data try assembly ai with the
first link below
let's start by listening to this example
of what the method can achieve
[Music]
you could clearly hear the two
individual guitars in each audio track
this was made using not a recorded sound
but a laser and two cameras equipped
with rolling and global shutter sensors
respectively it seems like tackling this
task through vision makes it much easier
than trying to split the audio tracks
after recording it also means we can
record anything through glass and from
any vibrating object here they used
their method on the speakers themselves
to isolate the left and right speakers
whereas a microphone will automatically
record both and blend the audio tracks
[Music]
typically this kind of spy technology
called visual vibrometry requires
perfect lighting conditions and
high-speed cameras that look like a
camouflaged sniper to capture high-speed
vibrations of up to 63 kilohertz here
they achieve similar results with
sensors built for only 60 and 130 hertz
and even better they can process
multiple objects at once still this is a
very challenging task requiring a lot of
engineering and great ideas to make it
happen they do not simply record the
instruments and send the video to a
model that automatically creates and
separates the audio they first need to
understand the laser they receive and
process it correctly they point a laser
at the surface they want to listen to then this
laser bounces from the surface into a
focus plane this focus plane is where we
will take our information from not the
instruments or objects themselves so we
will analyze the tiny vibrations of the
objects of interest through the laser
response creating a representation like
this
this two-dimensional laser response
pattern caught by our cameras called a
speckle is then processed both globally
and locally using our two cameras our
local camera or the rolling shutter
camera will capture frames at only 60
fps but it exposes each frame one row at
a time along the y-axis so every row is
a slightly later sample in time giving a
really noisy and inaccurate 63 kilohertz
representation this is where the global
shutter camera is necessary because of
the randomness in the speckled imaging
due to the roughness of the object's
surface and its movements it will
basically take a global screenshot of
the same speckle image we used with our
first camera and use this new image as
a reference frame to isolate only
relevant vibrations from the rolling
shutter captures
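To make the jump from 60 fps to 63 kilohertz concrete, here is a minimal sketch of the arithmetic, assuming every row of a rolling shutter frame counts as one sample in time. The row count of 1050 is a hypothetical value chosen only so the numbers line up with the figures quoted in the video; it is not taken from the paper.

```python
def effective_sample_rate(frame_rate_hz, rows_per_frame):
    """Row samples per second when each row of each frame is one time sample."""
    return frame_rate_hz * rows_per_frame


# 60 fps comes from the video; 1050 rows per frame is an assumed, illustrative value.
rate_hz = effective_sample_rate(60, 1050)
print(f"{rate_hz / 1000:.0f} kHz")
```

Under that assumption, a camera that only delivers 60 full frames per second still produces tens of thousands of row readings per second, which is why the row-by-row readout matters so much here.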
the rolling shutter camera will sample
the scene row by row with a high
frequency while the global shutter
camera will sample the entire scene at
once to serve as a reference frame and
we repeat this process for the whole
video
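The idea of comparing each rolling shutter row against a global shutter reference can be illustrated with a toy simulation. This is a simplified sketch, not the authors' algorithm: it assumes the speckle only shifts horizontally by whole pixels, uses a random pattern as a stand-in for real speckle, and the function name `estimate_row_shifts` is made up for this example.

```python
import numpy as np


def estimate_row_shifts(rolling_frame, reference_frame, max_shift=5):
    """For each row, find the integer shift that best aligns the reference row with it."""
    candidates = np.arange(-max_shift, max_shift + 1)
    estimates = np.empty(rolling_frame.shape[0], dtype=int)
    for r in range(rolling_frame.shape[0]):
        # Score every candidate shift by correlating the shifted reference row
        # with the row actually captured by the rolling shutter.
        scores = [np.dot(np.roll(reference_frame[r], s), rolling_frame[r])
                  for s in candidates]
        estimates[r] = candidates[int(np.argmax(scores))]
    return estimates


rng = np.random.default_rng(0)
rows, cols = 64, 256

# Stand-in for the global shutter snapshot of the speckle pattern.
reference = rng.standard_normal((rows, cols))

# Toy vibration: each row is read at a later time and sees a different shift.
true_shifts = np.round(3 * np.sin(2 * np.pi * np.arange(rows) / 16)).astype(int)
rolling = np.stack([np.roll(reference[r], true_shifts[r]) for r in range(rows)])

recovered = estimate_row_shifts(rolling, reference)
print(np.array_equal(recovered, true_shifts))
```

Each recovered shift is one sample of the vibration over time, so reading the shifts from top to bottom gives a short piece of the signal for every frame. The real system has to deal with two-dimensional, sub-pixel shifts and noise, which this sketch ignores.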
and voila this is how they are able to
split sound from a recording extract
only a single instrument remove ambient
noise or even reconstruct speech from
the vibrations of a bag of chips
mary had a little lamb its fleece was
white as snow of course this is just a
simple overview of this great paper and
i strongly invite you to read it for
more information congratulations to the
authors for the honorable mention i
was glad to attend the event and see the
presentation live i'm super excited to
see the future publications this paper will
motivate i also invite you to double
check all the bags of chips you may
leave near a window otherwise some
people may listen to what you say thank
you for watching the whole video and let
me know how you'd apply this technology
and if you see any potential risks or
exciting use cases i'd love to discuss
these with you and a special thanks to
cvpr for inviting me to the event it was
really cool to be there in new orleans
with all the researchers and companies i
will see you next week with another
amazing paper