CVPR 2022 Best Paper Honorable Mention: Dual-Shutter Optical Vibration Sensing

Written by whatsai | Published 2022/07/13

TLDR: They reconstruct sound using cameras and a laser beam on any vibrating surface, allowing them to isolate music instruments, focus on a specific speaker, remove ambient noises, and many more amazing applications.

Watch the video to learn more and hear some crazy results!

References

►Read the full article: https://www.louisbouchard.ai/cvpr-2022-best-paper/
►Sheinin, Mark and Chan, Dorian and O'Toole, Matthew and Narasimhan, Srinivasa G., 2022, Dual-Shutter Optical Vibration Sensing, Proc. IEEE CVPR.
►Project page: https://imaging.cs.cmu.edu/vibration/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

0:00

this year i had the chance to be at cvpr

0:02

in person and attend the amazing best

0:05

paper award presentation with this

0:07

fantastic paper i had to cover on the

0:09

channel called dual shutter optical

0:12

vibration sensing by mark sheinin dorian

0:15

chan matthew o'toole and srinivasa

0:18

narasimhan in one sentence they

0:21

reconstruct sound using cameras and a

0:23

laser beam on any vibrating surface

0:26

allowing them to isolate music

0:28

instruments focus on a specific speaker

0:30

remove ambient noises and many more

0:33

amazing applications let's dive into how

0:35

they achieve that and hear some crazy

0:37

results but first allow me one minute of

0:40

your time to introduce you to a

0:41

fantastic company the sponsor of this

0:44

video assembly ai assembly ai is a

0:47

company that offers accurate apis for

0:49

speech to text and audio intelligence

0:52

you can use their apis to automatically

0:54

transcribe and understand audio and

0:56

video data in just a few lines of code

0:58

and automatically convert asynchronous

1:00

and live audio streams into text

1:03

something extremely challenging to do

1:05

and typically requiring robust and

1:07

costly models of course it doesn't stop

1:10

here assembly ai will also process your

1:12

audio data and have informative feature

1:15

representations allowing you to easily

1:17

add text-based features like

1:19

summarization content moderation topic

1:21

detection and more all in one if you

1:24

need to understand or transcribe audio

1:26

or video data try assembly ai with the

1:29

first link below

1:33

let's start by listening to this example

1:35

of what the method can achieve

1:38

[Music]

1:53

you could clearly hear the two

1:54

individual guitars in each audio track

1:57

this was made using not a recorded sound

2:00

but a laser and two cameras equipped

2:02

with rolling and global shutter sensors

2:05

respectively it seems like tackling this

2:08

task through vision makes it much easier

2:10

than trying to split the audio tracks

2:12

after recording it also means we can

2:15

record anything through glass and from

2:18

any vibrating objects here they used

2:21

their method on the speakers themselves

2:23

to isolate the left and right speakers

2:25

whereas a microphone will automatically

2:27

record both and blend the audio tracks

2:41

[Music]

2:45

typically this kind of spy technology

2:48

called visual vibrometry requires

2:51

perfect lighting conditions and

2:52

high-speed cameras that look like a

2:54

camouflaged sniper to capture high-speed

2:56

vibrations of up to 63 kilohertz here

3:00

they achieve similar results with

3:02

sensors built for only 60 and 130 hertz

3:06

and even better they can process

3:08

multiple objects at once still this is a

3:11

very challenging task requiring a lot of

3:13

engineering and great ideas to make it

3:16

happen they do not simply record the

3:18

instruments and send the video to a

3:20

model that automatically creates and

3:22

separates the audio they first need to

3:24

understand the laser they receive and

3:26

process it correctly they orient a laser

3:29

on the surface to listen to then this

3:32

laser bounces from the surface into a

3:34

focus plane this focus plane is where we

3:37

will take our information from not the

3:39

instruments or objects themselves so we

3:42

will analyze the tiny vibrations of the

3:44

objects of interest through the laser

3:46

response creating a representation like

3:49

this

3:50

this two-dimensional laser response

3:52

pattern caught by our cameras called a

3:54

speckle is then processed both globally

3:58

and locally using our two cameras our

4:01

local camera or the rolling shutter

4:03

camera will capture frames at only 60

4:06

fps so it will take multiple pictures

4:08

and roll them on the y-axis to get a

4:11

really noisy and inaccurate 63 kilohertz

4:14

representation this is where the global

4:16

shutter camera is necessary because of

4:18

the randomness in the speckle imaging

4:21

due to the roughness of the object's

4:23

surface and its movements it will

4:25

basically take a global screenshot of

4:27

the same speckle image we used with our

4:29

first camera and use this new image as

4:32

a reference frame to isolate only

4:34

relevant vibrations from the rolling

4:37

shutter captures
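The reference-frame idea described above can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified version that assumes the speckle only shifts horizontally by whole pixels, and it compares each rolling-shutter row against the same row of a global-shutter reference using circular cross-correlation. The function name `row_shifts` and the synthetic data are hypothetical, chosen only for illustration.

```python
import numpy as np


def row_shifts(rolling: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Estimate an integer horizontal shift for each row of `rolling`
    relative to the same row of `reference` (circular cross-correlation)."""
    # Remove each row's mean so the correlation peak reflects texture, not brightness.
    a = rolling - rolling.mean(axis=1, keepdims=True)
    b = reference - reference.mean(axis=1, keepdims=True)
    # Cross-correlation along the columns, computed with the FFT.
    spectrum = np.fft.fft(a, axis=1) * np.conj(np.fft.fft(b, axis=1))
    corr = np.fft.ifft(spectrum, axis=1).real
    peak = corr.argmax(axis=1)
    width = rolling.shape[1]
    # Peaks in the upper half of the row correspond to negative shifts.
    return np.where(peak > width // 2, peak - width, peak)


# Synthetic demo: a random "speckle" reference and a rolling-shutter image
# whose rows are shifted by a slow sinusoid (a stand-in for a vibration).
rng = np.random.default_rng(0)
reference = rng.random((64, 128))
rows = np.arange(64)
true_shifts = np.round(3 * np.sin(2 * np.pi * rows / 16)).astype(int)
rolling = np.stack([np.roll(reference[r], true_shifts[r]) for r in rows])

estimated = row_shifts(rolling, reference)
print(np.array_equal(estimated, true_shifts))
```

Because every row is exposed at a slightly different time, the sequence of per-row shifts acts as a time series of the surface motion. A real system would need sub-pixel, two-dimensional shift estimates and careful alignment between the two cameras, which this sketch leaves out.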

4:38

the rolling shutter camera will sample

4:40

the scene row by row with a high

4:42

frequency while the global shutter

4:44

camera will sample the entire scene at

4:47

once to serve as a reference frame and

4:49

we repeat this process for the whole

4:51

video
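A quick back-of-the-envelope calculation shows how row-by-row sampling turns a 60 fps camera into a much faster vibration sensor. The row count below is a hypothetical value picked so that the arithmetic lands on the 63 kilohertz figure mentioned in the video; it is not taken from the paper, and the sketch assumes rows are read out evenly over the whole frame period with no idle time between frames.

```python
def rolling_shutter_sample_rate(fps: float, rows_per_frame: int) -> float:
    """Row samples per second, assuming rows are read out evenly over the frame period."""
    return fps * rows_per_frame


# Hypothetical sensor: 60 frames per second, 1050 rows read out per frame period.
sample_rate_hz = rolling_shutter_sample_rate(60.0, 1050)
# Highest vibration frequency representable without aliasing (Nyquist limit).
nyquist_hz = sample_rate_hz / 2.0

print(f"row sampling rate: {sample_rate_hz:.0f} Hz")
print(f"Nyquist limit: {nyquist_hz:.0f} Hz")
```

Under these assumptions each row is one time sample, so the effective rate is the frame rate multiplied by the number of rows, while the global shutter camera only needs to supply one reference image per frame.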

4:52

and voila this is how they are able to

4:55

split sound from a recording extract

4:57

only a single instrument remove ambient

5:00

noise or even reconstruct speech from

5:02

the vibrations of a bag of chips

5:05

mary had a little lamb its fleece was

5:08

white as snow of course this is just a

5:10

simple overview of this great paper and

5:12

i strongly invite you to read it for

5:14

more information congratulations to the

5:16

authors for the honorable mention i

5:18

was glad to attend the event and see the

5:21

presentation live i'm super excited for

5:23

the future publications this paper will

5:25

motivate i also invite you to double

5:27

check all the bags of chips you may

5:29

leave near a window or otherwise some

5:31

people may listen to what you say thank

5:34

you for watching the whole video and let

5:36

me know how you'd apply this technology

5:38

and if you see any potential risks or

5:40

exciting use cases i'd love to discuss

5:42

these with you and a special thanks to

5:45

cvpr for inviting me to the event it was

5:47

really cool to be there in new orleans

5:49

with all the researchers and companies i

5:52

will see you next week with another

amazing paper

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.

Published by HackerNoon on 2022/07/13