
What Did AI Bring to Computer Vision?

by Louis Bouchard, May 5th, 2021

Too Long; Didn't Read

In this video, I openly share everything about deep nets for computer vision applications, their successes, and the limitations we have yet to address. I explain Artificial Intelligence terms and news to non-experts. The video walks through how convolutional neural networks process images, from filters and feature maps to classification, and discusses the paper "Deep Nets: What Have They Ever Done for Vision?" by Yuille and Liu, covering the strengths of deep nets, their dependence on data, and the biases and failure modes that follow.

In this video, I will openly share everything about deep nets for computer vision applications, their successes, and the limitations we have yet to address.

Watch the video

References

Read the article: https://www.louisbouchard.ai/ai-in-computer-vision/

Yuille, A.L., and Liu, C., 2021. Deep nets: What have they ever done for vision?. International Journal of Computer Vision, 129(3), pp.781–802, https://arxiv.org/abs/1805.04025.

Video Transcript

If you clicked on this video, you are certainly interested in computer vision applications like image classification, image segmentation, object detection, and more complex tasks like face recognition, image generation, or even style transfer applications. As you may already know, with the growing power of our computers, most of these applications are now realized using similar deep neural networks, what we often refer to as artificial intelligence models. There are of course some differences between the deep nets used in these different vision applications, but as of now they all use the same basis of convolutions, introduced in 1989 by Yann LeCun. The major difference here is our computation power, coming from the recent advancements of GPUs.

To quickly go over the architecture: as the name says, convolution is a process where an original image or video frame, which is our input in a computer vision application, is convolved using filters that detect important small features of an image, such as edges. The network will autonomously learn filter values that detect important features to match the output we want to have, such as the object's name for a specific image sent as input in a classification task. These filters are usually 3x3 or 5x5 pixel squares, allowing them to detect the direction of the edges, left, right, up, or down, just like you can see in this image.

The process of convolution makes a dot product between the filter and the pixels it faces. It's basically just a sum of all the filter values multiplied with the values of the image's pixels at the corresponding positions. Then it moves to the right and does it again, convolving the whole image. Once it's done, these convolved features give us the output of the first convolution layer, which we call a feature map. We repeat this process with many other filters, giving us multiple feature maps, one for each filter used in the convolution process.
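
To make this sliding dot product concrete, here is a minimal NumPy sketch (my own illustration, not code from the video or the paper; the function name `convolve2d` and the hand-crafted filter are just for demonstration). A 3x3 vertical-edge filter is slid over a tiny grayscale image, and at each position the filter values are multiplied with the pixels underneath and summed, producing one value of the feature map:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' convolution as used in deep nets (strictly, cross-correlation):
    slide the kernel over the image and take a dot product at every position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h, out_w = ih - kh + 1, iw - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # multiply and sum
    return feature_map

# A hand-crafted 3x3 filter that responds to vertical edges;
# a trained network learns filter values like these on its own.
vertical_edge = np.array([[1, 0, -1],
                          [2, 0, -2],
                          [1, 0, -1]], dtype=float)

# Toy 6x6 "image": bright on the left, dark on the right -> one vertical edge.
image = np.zeros((6, 6))
image[:, :3] = 1.0

print(convolve2d(image, vertical_edge))  # strong responses along the edge only
```

A real convolution layer does exactly this, just with many filters at once, across all input channels, and with learned filter values.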

Having more than one feature map gives us more information about the image, and especially more information that we can learn during training, since these filters are what we aim to learn for our task. These feature maps are all sent into the next layer as input to produce many other, smaller-sized feature maps. The deeper we get into the network, the smaller these feature maps get, because of the nature of convolutions, and the more general the information in these feature maps becomes, until we reach the end of the network with extremely general information about what the image contains, spread across our many feature maps. This is used for classification, or to build a latent code representing the information present in the image which, in the case of a GAN architecture, is used to generate a new image based on this code, which we refer to as encoded information.
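
You can see this shrinking directly by tracing tensor shapes through a few stacked convolution layers. Here is a minimal sketch assuming PyTorch (the video doesn't name a framework, and the channel counts and strides below are arbitrary): stride-2 convolutions halve the spatial size at each layer while the number of feature maps grows.

```python
import torch
import torch.nn as nn

# Three convolution layers; stride 2 halves the spatial size each time,
# while the number of feature maps (filters) grows: 3 -> 16 -> 32 -> 64.
layers = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

x = torch.randn(1, 3, 64, 64)  # one RGB image, 64x64 pixels
for layer in layers:
    x = layer(x)
    if isinstance(layer, nn.Conv2d):
        print(tuple(x.shape))
# (1, 16, 32, 32) -> (1, 32, 16, 16) -> (1, 64, 8, 8):
# fewer pixels per map, more maps, each carrying more general information.
```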

In the example of image classification, simply put, we can see that at the end of the network these small feature maps contain information about the presence of each possible class, telling you whether it's a dog, a cat, a person, etc. Of course, this is super simplified and there are other steps, but I feel like this is an accurate summary of what's going on inside a deep convolutional neural network.
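
As a rough sketch of that last step (again assuming PyTorch, with a made-up three-class problem of dog, cat, and person), the final small feature maps are averaged down to one value each and fed to a linear layer that outputs one score per class:

```python
import torch
import torch.nn as nn

# Pretend these are the final feature maps of a network:
# a batch of one image, 64 maps of size 8x8.
feature_maps = torch.randn(1, 64, 8, 8)

pool = nn.AdaptiveAvgPool2d(1)   # average each 8x8 map down to a single value
classifier = nn.Linear(64, 3)    # three hypothetical classes: dog, cat, person

pooled = pool(feature_maps).flatten(1)       # shape (1, 64)
class_scores = classifier(pooled)            # shape (1, 3)
probabilities = class_scores.softmax(dim=1)  # one probability per class, summing to 1
print(probabilities)
```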

If you've been following my channel and posts, you know that deep neural networks have proved to be extremely powerful again and again, but they also have weaknesses, weaknesses that we should not try to hide. As with all things in life, deep nets have strengths and weaknesses. While the strengths are widely shared, the latter are often omitted or even discarded by companies, and ultimately by some researchers. This paper by Alan Yuille and Chenxi Liu aims to openly share everything about deep nets for vision applications, their successes, and the limitations we have to address.

Moreover, just like with our brain, we still do not fully understand their inner workings, which makes the use of deep nets even more limited, since we cannot maximize their strengths and limit their weaknesses. As the paper puts it, it's like a road map that tells you where cars can drive but doesn't tell you when or where cars are actually driving. This is another point they discuss in their paper, namely, what is the future of computer vision algorithms?

As you may be thinking, one way to improve computer vision applications is to understand our own visual system better, starting with our brain, which is why neuroscience is such an important field for AI. Indeed, current deep nets are surprisingly different from our own vision system. Firstly, humans can learn from a very small number of examples by exploiting our memory and the knowledge we have already acquired. We can also exploit our understanding of the world and its physical properties to make deductions, something a deep net cannot do. In 1999, Gopnik et al. explained that babies are more like tiny scientists who understand the world by performing experiments and seeking causal explanations for phenomena, rather than simply receiving stimuli from images like current deep nets do.

Also, we humans are much more robust, as we can easily identify an object from any viewpoint, whatever texture it has, whatever occlusions it may encounter, and in novel contexts. As a concrete example, just visualize the annoying CAPTCHA you always have to fill in when logging into a website. This CAPTCHA is used to detect bots, since they are awful when there are occlusions like this. As you can see here, the deep net got fooled by all the examples because of the jungle context and the fact that a monkey is not typically holding a guitar. This happens because such an image is certainly not in the training dataset. Of course, this exact situation might not happen very often in real life, but I will show some more concrete and relatable examples that have already happened later on in the video.

Deep nets also have strengths that we must highlight. They can outperform us at face recognition tasks, since humans, until recently, were not used to seeing more than a few thousand people in their whole lifetime. But this strength of deep nets also comes with a limitation: the faces need to be straight, centered, clear, without any occlusions, etc. Indeed, the algorithm could not recognize your best friend at the Halloween party disguised as Harry Potter, with only glasses and a lightning bolt on the forehead, whereas you would instantly recognize him and say, "Whoa, that's not very original, it looks like you just put glasses on."

Similarly, such algorithms are extremely precise radiologists: if all the settings are similar to what they have been seeing during their training, they will outperform any human. This is mainly because even the most expert radiologists have only seen a fairly small number of CT scans in their lives. As the authors suggest, the superiority of algorithms may also come from the fact that they are doing a low-priority task for humans. For example, a computer vision app on your phone can identify the hundreds of plants in your garden much better than most of us watching this video can, but a plant expert will surely outperform it, and all of us together as well.

But again, this strength comes with a huge problem related to the data the algorithm needs in order to be this powerful. As they mention, and as we often see on Twitter and in article titles, there are biases due to the datasets these deep nets are trained on, since an algorithm is only as good as the dataset it is evaluated on and the performance measures used. This dataset limitation comes with the price that these deep neural networks are much less general-purpose, flexible, and adaptive than our own visual system.

They are less general-purpose and flexible in the sense that, contrary to our visual system, where we automatically perform edge detection, binocular stereo, semantic segmentation, object classification, scene classification, and 3D depth estimation, deep nets can only be trained to achieve one of these tasks. Indeed, simply by looking around, your vision system automatically achieves all these tasks with extreme precision, whereas deep nets have difficulty achieving similar precision on even one of them. And even if this seems effortless to us, half of our neurons are at work processing the information and analyzing what's going on. We are still far from mimicking our vision system, even with the current depth of our networks.

But is that really the goal of our algorithms? Would it be better to just use them as a tool to compensate for our weaknesses? I couldn't say, but I am sure that we want to address the deep nets' limitations that can cause serious consequences, rather than omitting them. I will show some concrete examples of such consequences just after introducing these limitations, but if you are too intrigued, you can skip right to them following the timestamps under the video and come back to the explanation afterwards.

Indeed, the lack of precision by deep nets we previously mentioned arises mainly because of the disparity between the data we use to train our algorithms and what they see in real life. As you know, an algorithm needs to see a lot of data to iteratively improve at the task it is trained for; this data is often referred to as the training dataset. The disparity between the training dataset and the real world is a problem because the real world is too complicated to be accurately represented in a single dataset, which is why deep nets are less adaptive than our vision system.

In the paper, they call this the combinatorial complexity explosion of natural images. The combinatorial complexity comes from the multitude of possible variations within a natural image, like the camera pose, lighting, texture, material, background, the position of the objects, etc. Biases can appear at any of these levels of complexity that the dataset is missing. You can see how even these large datasets now seem very small due to all these factors: with only, let's say, 13 of these different parameters, and allowing only 1,000 different values for each of them, we quickly jump to 1000^13, or 10^39, different images to represent only a single object.
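
The arithmetic behind that number is simple to check (13 parameters with 1,000 values each are just the illustrative figures used in the video):

```python
# 13 independent sources of variation, 1,000 possible values for each one.
parameters = 13
values_per_parameter = 1_000

combinations = values_per_parameter ** parameters
print(f"{combinations:.0e} images for a single object")  # 1e+39
```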

Current datasets only cover a handful of these multitudes of possible variations for each object, thus missing most of the real-world situations the model will encounter in production.

It's also worth mentioning that since the variety of images is very limited, the network may find shortcuts for detecting some objects, as we saw previously with the monkey, where it was detecting a human instead of a monkey because of the guitar in front of it. Similarly, you can see that it's detecting a bird here instead of a guitar, probably because the model has never seen a guitar against a jungle background. This is called overfitting to the background context, where the algorithm does not focus on the right thing and instead finds a pattern in the images themselves rather than in the object of interest.

Also, these datasets are all built from images taken by photographers, meaning that they only cover specific angles and poses that do not transfer to all the orientation possibilities of the real world. Currently, we use benchmarks with the most complex datasets possible to compare and rate the current algorithms, benchmarks which, if you recall, are very incomplete compared to the real world. Nonetheless, we are often happy with 99% accuracy for a task on such benchmarks.

Firstly, the problem is that this one-percent error is determined on a benchmark dataset, meaning that it's similar to our training dataset in the sense that it doesn't represent the richness of natural images. This is normal, because it's impossible to represent the real world in just a bunch of images; it's just way too complicated, and there are too many possible situations. So the benchmarks we use to decide whether or not our models are ready to be deployed in a real-world application are not really accurate at determining how well they will actually perform.

Which leads to the second problem: how will it actually perform in the real world? Let's say the benchmark dataset is huge, most cases are covered, and we really do have 99% accuracy. What are the consequences of the one percent of cases where the algorithm fails in the real world?
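
To get a feel for what that one percent means once a system is deployed at scale, here is a back-of-the-envelope sketch (the daily volume is a made-up number for illustration, not taken from the video or the paper):

```python
# Hypothetical deployment volume -- purely illustrative.
error_rate = 0.01                # the "missing" 1% of benchmark accuracy
predictions_per_day = 1_000_000  # e.g. frames, scans, or queries handled each day

failures_per_day = error_rate * predictions_per_day
print(f"{failures_per_day:,.0f} failing predictions per day")  # 10,000
```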

This number will be reflected in misdiagnoses, accidents, financial mistakes or, even worse, deaths. Such a case could be a self-driving car on a heavy rainy day, with the rain heavily affecting the depth sensors used by the vehicle and causing it to fail many depth estimations. Would you trust your life to this partially blind robot taxi?

I don't think I would. Similarly, would you trust a self-driving car at night to avoid driving over pedestrians or cyclists when even you would have difficulty seeing them? These kinds of life-threatening situations are so broad that it's almost impossible for them all to be represented in the training dataset.

And of course, here I use extreme examples from the most relatable applications, but you can just imagine how harmful this could be when the perfectly trained and tested algorithm misclassifies your CT scan, leading to a misdiagnosis, just because your hospital has different settings on its scanner, or because you didn't drink enough water, or did anything else that differs from the training data. Any of this could lead to a major problem in real life, even if the benchmark used to test it says it's perfect.

Also, as has already happened, this can lead to people from underrepresented demographics being unfairly treated by these algorithms, or even worse. This is why I argue that, as long as they are this dependent on data, we must focus on the tasks where the algorithms help us, not on the ones where they replace us.

This brings us to the two questions they highlight. One: how can we efficiently test these algorithms to ensure that they work on these enormous datasets if we can only test them on a finite subset? And two: how can we train algorithms on finite-size datasets so that they can perform well on the truly enormous datasets required to capture the combinatorial complexity of the real world?

In the paper, they suggest rethinking our methods for benchmarking performance and evaluating vision algorithms, and I agree entirely, especially now that most applications are made for real-life users instead of only academic competitions. It's crucial to move beyond these academic evaluation metrics and create more appropriate evaluation tools. We also have to accept that data bias exists and that it can cause real-world problems. Of course, we need to learn to reduce these biases, but also to accept them: biases are inevitable due to the combinatorial complexity of the real world, which cannot yet be realistically represented in a single dataset of images.

14:48

yet thus focusing our attention without

14:51

any play of words with transformers

14:53

on better algorithms that can learn to

14:55

be fair

14:56

even when trained on such incomplete

14:58

data sets

14:59

rather than having bigger and bigger

15:01

models trying to represent the most data

15:04

possible

Even if it may look like it, this paper was not a criticism of current approaches. Instead, it's an opinion piece motivated by discussions with other researchers in several disciplines. As they state, "we stress that the views expressed in the paper are our own and do not necessarily reflect those of the computer vision community." But I must say, this was a very interesting read, and my views are quite similar.

They also discuss many important innovations that happened over the last 40 years in computer vision, which are definitely worth reading about. As always, the link to the paper is in the description below. To end on a more positive note, we are nearly a decade into the revolution of deep neural networks that started in 2012 with AlexNet and the ImageNet competition.

Since then, there has been immense progress in our computation power and in deep net architectures, like the use of batch normalization, residual connections, and more recently self-attention. Researchers will undoubtedly keep improving the architecture of deep nets, but we should not forget that there are other ways to achieve intelligent models than going deeper and using more data. Of course, these ways are yet to be discovered.

If this story of deep neural networks sounds interesting to you, I made a video about one of the most interesting architectures, along with a short historical review of deep nets. I'm sure you'll love it. Thank you for watching!

Also published on: https://www.louisbouchard.me/ai-in-computer-vision/