
What Did AI Bring to Computer Vision?


Louis Bouchard (@whatsai)

I explain Artificial Intelligence terms and news to non-experts.

In this video, I will openly share everything about deep nets for computer vision applications, their successes, and the limitations we have yet to address.

Watch the video

References

Read the article: https://www.louisbouchard.ai/ai-in-computer-vision/

Yuille, A.L. and Liu, C., 2021. Deep Nets: What Have They Ever Done for Vision? International Journal of Computer Vision, 129(3), pp. 781–802. https://arxiv.org/abs/1805.04025

Video Transcript

If you clicked on this video, you are certainly interested in computer vision applications like image classification, image segmentation, and object detection, and in more complex tasks like face recognition, image generation, or even style transfer applications.

As you may already know, with the growing power of our computers, most of these applications are now being realized using similar deep neural networks, what we often refer to as artificial intelligence models.

There are, of course, some differences between the deep nets used in these different vision applications, but as of now, they all use the same basis of convolutions, introduced in 1989 by Yann LeCun. The major difference here is our computation power, coming from the recent advancements of GPUs.

To quickly go over the architecture: as the name says, convolution is a process where an original image or video frame, which is our input in a computer vision application, is convolved using filters that detect important small features of an image, such as edges. The network will autonomously learn filter values that detect important features, to match the output we want to have, such as the object's name in a specific image sent as input for a classification task. These filters are usually 3x3 or 5x5 pixel squares, allowing them to detect the direction of the edges: left, right, up, or down, just like you can see in this image.

The process of convolution makes a dot product between the filter and the pixels it faces. It's basically just a sum of all the filter values multiplied with the values of the image's pixels at the corresponding positions. Then it moves to the right and does it again, convolving the whole image. Once it's done, these convolved features give us the output of the first convolution layer, an output we call a feature map.
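If you prefer code, here is a minimal NumPy sketch of that sliding dot product, using a classic Sobel kernel as the edge-detecting filter (the kernel values and the toy image are my own illustration, not from the video):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image and take a dot product at each
    position (what deep learning libraries call a 'convolution')."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# A classic 3x3 vertical-edge filter (a Sobel kernel): strong responses
# where pixel intensity changes from left to right.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy 6x6 "image": dark on the left, bright on the right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

feature_map = convolve2d(image, sobel_x)
print(feature_map)  # non-zero only around the vertical edge at column 3
```

In a real deep net, the filter values start out random and are learned during training, but the sliding-window arithmetic is exactly this.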

We repeat this process with many other filters, giving us multiple feature maps: one for each filter used in the convolution process. Having more than one feature map gives us more information about the image, and especially more information that we can learn during training, since these filters are what we aim to learn for our task.

These feature maps are all sent into the next layer as input to produce many other, smaller feature maps. The deeper we get into the network, the smaller these feature maps get, because of the nature of convolutions, and the more general the information in these feature maps becomes, until we reach the end of the network with extremely general information about what the image contains, composed of many feature maps. This is used for classification, or to build a latent code representing the information present in the image in the case of a GAN architecture, which generates a new image based on this code, which we refer to as encoded information.

In the example of image classification, simply put, we can see that at the end of the network, these small feature maps contain information about the presence of each possible class, telling you whether it's a dog, a cat, a person, etc. Of course, this is super simplified, and there are other steps, but I feel like this is an accurate summary of what's going on inside a deep convolutional neural network.
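To make this concrete, here is a minimal PyTorch sketch of such a classifier. The layer sizes, the 32x32 input, and the 10 classes are illustrative choices of mine, not something from the video or the paper:

```python
import torch
import torch.nn as nn

# A tiny convolutional classifier. Feature maps shrink as we go deeper
# (via pooling) while the number of filters grows, ending in a small,
# very general representation that is mapped to one score per class.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x32x32 -> 16 maps of 32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16 maps of 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 32 maps of 16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32 maps of 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)        # stacked convolutions -> feature maps
        x = x.flatten(start_dim=1)  # flatten the maps into one vector
        return self.classifier(x)   # one score per possible class

logits = TinyConvNet()(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```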

If you've been following my channel and posts, you know that deep neural networks have proved to be extremely powerful again and again, but they also have weaknesses, weaknesses that we should not try to hide. As with all things in life, deep nets have strengths and weaknesses. While the strengths are widely shared, the latter are often omitted or even discarded by companies, and ultimately by some researchers.

This paper by Alan Yuille and Chenxi Liu aims to openly share everything about deep nets for vision applications: their successes and the limitations we have to address. Moreover, just like with our brain, we still do not fully understand their inner workings, which makes the use of deep nets even more limited, since we cannot maximize their strengths and limit their weaknesses. As stated by O. Hobart, "It's like a road map that tells you where cars can drive, but doesn't tell you when or where cars are actually driving." This is another point they discuss in their paper, namely: what is the future of computer vision algorithms?

As you may be thinking, one way to improve computer vision applications is to understand our own visual system better, starting with our brain, which is why neuroscience is such an important field for AI.

Indeed, current deep nets are surprisingly different from our own vision system. Firstly, humans can learn from a very small number of examples by exploiting our memory and the knowledge we have already acquired. We can also exploit our understanding of the world and its physical properties to make deductions, something a deep net cannot do. In 1999, Gopnik et al. explained that babies are more like tiny scientists who understand the world by performing experiments and seeking causal explanations for phenomena, rather than simply receiving stimuli from images like current deep nets do.

Also, we humans are much more robust, as we can easily identify an object from any viewpoint, whatever texture it has, whatever occlusions it may encounter, and in novel contexts. As a concrete example, just visualize the annoying captcha you always have to fill in when logging into a website: captchas are used to detect bots, since bots are awful when there are occlusions like these. As you can see here, the deep net got fooled by all the examples because of the jungle context and the fact that a monkey is not typically holding a guitar. This happens because such a situation is certainly not in the training dataset. Of course, this exact situation might not happen very often in real life, but I will show some more concrete examples that are more relatable and that have already happened, later on in the video.

Deep nets also have strengths that we must highlight. They can outperform us at face recognition tasks, since humans, until recently, were not used to seeing more than a few thousand people in their whole lifetime. But this strength of deep nets also comes with a limitation: the faces need to be straight, centered, clear, without any occlusions, etc. Indeed, the algorithm could not recognize your best friend at the Halloween party disguised as Harry Potter, wearing only glasses and a lightning bolt on the forehead, whereas you would instantly recognize him and say, "Whoa, that's not very original, it looks like you just put glasses on."

Similarly, such algorithms are extremely precise radiologists: if all the settings are similar to what they saw during their training, they will outperform any human. This is mainly because even the most expert radiologists have only seen a fairly small number of CT scans in their lives. As the authors suggest, the superiority of algorithms may also be because they are doing a low-priority task for humans. For example, a computer vision app on your phone can identify the hundreds of plants in your garden much better than most of us watching the video can, but a plant expert will surely outperform it, and all of us together as well.

But again, this strength comes with a huge problem related to the data the algorithm needs in order to be this powerful. As they mention, and as we often see on Twitter and in article titles, there are biases due to the dataset these deep nets are trained on. Since an algorithm is only as good as the dataset it is evaluated on and the performance measures used, this dataset limitation comes with the price that these deep neural networks are much less general-purpose, flexible, and adaptive than our own visual system.

They are less general-purpose and flexible in the sense that, contrary to our visual system, where we automatically perform edge detection, binocular stereo, semantic segmentation, object classification, scene classification, and 3D depth estimation, deep nets can only be trained to achieve one of these tasks. Indeed, simply by looking around, your vision system automatically achieves all these tasks with extreme precision, where deep nets have difficulty achieving similar precision on even one of them. But even if this seems effortless to us, half of our neurons are at work processing the information and analyzing what's going on. We are still far from mimicking our vision system, even with the current depth of our networks.

But is that really the goal of our algorithms? Would it be better to just use them as tools to compensate for our weaknesses? I couldn't say, but I am sure that we want to address the deep nets' limitations that can cause serious consequences, rather than omitting them. I will show some concrete examples of such consequences just after introducing these limitations, but if you are too intrigued, you can skip right to them following the timestamps under the video and come back to the explanation afterwards.

Indeed, the lack of precision by deep nets we previously mentioned arises mainly because of the disparity between the data we use to train our algorithms and what they see in real life. As you know, an algorithm needs to see a lot of data to iteratively improve at the task it is trained for; this data is often referred to as a training dataset. This disparity between the training dataset and the real world is a problem because the real world is too complicated to be accurately represented in a single dataset, which is why deep nets are less adaptive than our vision system.

In the paper, they call this the combinatorial complexity explosion of natural images. The combinatorial complexity comes from the multitude of possible variations within a natural image, like the camera pose, lighting, texture, material, background, the position of the objects, etc. Biases can appear at any of these levels of complexity that the dataset is missing. You can see how these large datasets now seem very small due to all these factors: considering only, let's say, 13 of these different parameters, and allowing only 1,000 different values for each of them, we quickly jump to 1,000^13, or 10^39, different images to represent only a single object. The current datasets cover only a handful of these multitudes of possible variations for each object, thus missing most real-world situations that a model will encounter in production.
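A quick back-of-the-envelope calculation makes the scale obvious (the 13 parameters and 1,000 values are the illustrative numbers from the video; the ImageNet comparison is my own addition, using its approximate full size):

```python
# With 13 independent nuisance parameters (pose, lighting, texture, ...)
# and 1,000 possible values each, the number of distinct images of a
# single object is already astronomical.
num_parameters = 13
values_per_parameter = 1_000

combinations = values_per_parameter ** num_parameters
print(f"{combinations:.0e}")  # 1e+39 distinct images for ONE object

# For scale: ImageNet, one of the largest labeled datasets, has roughly
# 1.4e7 images in total, across all of its objects.
imagenet_size = 14_000_000
print(combinations // imagenet_size)  # the dataset covers a vanishing fraction
```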

It's also worth mentioning that since the variety of images is very limited, the network may find shortcuts to detecting some objects, as we saw previously with the monkey, where it was detecting a human instead of a monkey because of the guitar in front of it. Similarly, you can see that it's detecting a bird here instead of a guitar, probably because the model has never seen a guitar with a jungle background. This is called overfitting to the background context, where the algorithm does not focus on the right thing and instead finds a pattern in the images themselves rather than in the object of interest.

Also, these datasets are all built from images taken by photographers, meaning that they only cover specific angles and poses that do not transfer to all the orientation possibilities of the real world.

Currently, we use benchmarks with the most complex datasets possible to compare the current algorithms and rate them, which, if you recall, are very incomplete compared to the real world. Nonetheless, we are often happy with 99% accuracy for a task on such benchmarks. Firstly, the problem is that this one-percent error is determined on a benchmark dataset, meaning that it's similar to our training dataset in that it doesn't represent the richness of natural images. That's normal, because it's impossible to represent the real world in just a bunch of images; it's just way too complicated, and there are too many possible situations. So these benchmarks we use to test our models, to determine whether or not they are ready to be deployed in a real-world application, are not really accurate at determining how well they will actually perform.

Which leads to the second problem: how it will actually perform in the real world. Let's say that the benchmark dataset is huge, most cases are covered, and we really have 99% accuracy. What are the consequences of the one percent of cases where the algorithm fails in the real world? This number will be represented in misdiagnoses, accidents, financial mistakes, or even worse, deaths.
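To see how fast that one percent adds up, here is a toy calculation; the traffic numbers are hypothetical, purely to make the error rate concrete:

```python
# Hypothetical deployment: a model serving 100,000 predictions per day
# with a benchmark accuracy of 99%.
daily_predictions = 100_000
benchmark_accuracy = 0.99

daily_failures = daily_predictions * (1 - benchmark_accuracy)
print(f"{daily_failures:.0f} failing cases per day")  # 1000 per day
print(f"{daily_failures * 365:.0f} per year")         # 365000 per year
```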

Such a case could be a self-driving car on a heavy rainy day, where the rain heavily affects the depth sensors used by the vehicle, causing it to fail many depth estimations. Would you trust your life to this partially blind robot taxi? I don't think I would. Similarly, would you trust a self-driving car at night to avoid driving over pedestrians or cyclists when even you had difficulty seeing them? These kinds of life-threatening situations are so broad that it's almost impossible for them all to be represented in the training dataset.

Of course, here I use extreme examples from the most relatable applications, but you can just imagine how harmful this could be when a perfectly trained and tested algorithm misclassifies your CT scan, leading to a misdiagnosis, just because your hospital uses different settings on its scanner, or because you didn't drink enough water, or anything else that would differ from the training data. Any of it could lead to a major problem in real life, even if the benchmark used to test the model says it's perfect.

Also, as has already happened, this can lead to people in underrepresented demographics being unfairly treated by these algorithms, or even worse. This is why I argue that we must focus on the tasks where the algorithms help us, and not where they replace us, as long as they are this dependent on data.

This brings us to the two questions they highlight. One: how can we efficiently test these algorithms to ensure that they work on these enormous datasets, if we can only test them on a finite subset? And two: how can we train algorithms on finite-size datasets so that they can perform well on the truly enormous datasets required to capture the combinatorial complexity of the real world?

In the paper, they suggest rethinking our methods for benchmarking performance and evaluating vision algorithms, and I agree entirely, especially now that most applications are made for real-life users instead of only academic competitions. It's crucial to get out of these academic evaluation metrics and create more appropriate evaluation tools.

We also have to accept that data bias exists and that it can cause real-world problems. Of course, we need to learn to reduce these biases, but also to accept them: biases are inevitable due to the combinatorial complexity of the real world, which cannot yet be realistically represented in a single dataset of images. Thus, we should focus our attention (no play on words with transformers intended) on better algorithms that can learn to be fair even when trained on such incomplete datasets, rather than on bigger and bigger models trying to represent the most data possible.

Even if it may look like it, this paper was not a criticism of current approaches. Instead, it's an opinion piece motivated by discussions with other researchers in several disciplines. As they state, "We stress that views expressed in the paper are our own and do not necessarily reflect those of the computer vision community." But I must say, this was a very interesting read, and my views are quite similar.

They also discuss many important innovations that happened over the last 40 years in computer vision, which are definitely worth reading about. As always, the link to the paper is in the description below.

To end on a more positive note: we are nearly a decade into the revolution of deep neural networks that started in 2012 with AlexNet and the ImageNet competition. Since then, there has been immense progress in our computation power and in deep net architectures, like the use of batch normalization, residual connections, and, more recently, self-attention. Researchers will undoubtedly keep improving the architecture of deep nets, but we shall not forget that there are other ways to achieve intelligent models than going deeper and using more data. Of course, these ways are yet to be discovered.
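As a reminder of what two of those advances look like in practice, here is a minimal PyTorch sketch of a residual block with batch normalization, in the style popularized by ResNet; the channel count is an arbitrary illustration, not tied to any specific model:

```python
import torch
import torch.nn as nn

# Batch normalization stabilizes training, and the skip connection
# ("x + ..." below) lets gradients flow through very deep stacks.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.body(x))  # residual (skip) connection

out = ResidualBlock(16)(torch.randn(1, 16, 8, 8))
print(out.shape)  # torch.Size([1, 16, 8, 8]) -- shape is preserved
```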

If this story of deep neural networks sounds interesting to you, I made a video about one of the most interesting architectures, along with a short historical review of deep nets. I'm sure you'll love it. Thank you for watching!

Also published on: https://www.louisbouchard.me/ai-in-computer-vision/
