In this video, I will openly share everything about deep nets for computer vision applications, their successes, and the limitations we have yet to address.
Watch the video
References
Read the article: https://www.louisbouchard.ai/ai-in-computer-vision/
Yuille, A.L., and Liu, C., 2021. Deep nets: What have they ever done for vision? International Journal of Computer Vision, 129(3), pp.781–802, https://arxiv.org/abs/1805.04025.
Video Transcript
00:00
if you clicked on this video you are
00:01
certainly interested in computer vision
00:04
applications
00:04
like image classification image
00:06
segmentation object detection
00:08
and more complex tasks like face
00:10
recognition image generation or even
00:12
style transfer applications
00:14
as you may already know with the growing
00:16
power of our computers
00:17
most of these applications are now being
00:19
realized using similar deep neural
00:21
networks
00:22
what we often refer to as artificial
00:25
intelligence models
00:26
there are of course some differences
00:28
between the deep nets used in these
00:30
different vision applications
00:31
but as of now they all use the same
00:34
basis of convolutions
00:35
introduced in 1989 by yann lecun the
00:38
major difference here
00:40
is our computation power coming from the
00:42
recent advancements of gpus
00:44
to quickly go over the architecture as
00:46
the name says convolution
00:48
is a process where an original image or
00:50
video frame which is
00:51
our input in computer vision
00:53
applications is convolved
00:55
using filters that detect important
00:57
small features of an image
00:59
such as edges the network will
01:01
autonomously learn
01:02
filter values that detect important
01:04
features to match the output we want to
01:06
have
01:07
such as the object's name in a specific
01:09
image sent as input
01:11
for a classification task these filters
01:13
are usually of size
01:15
3x3 or 5x5 pixel squares allowing them
01:18
to detect the direction of the edges
01:20
left right up or down just like you can
01:23
see in this image
01:24
the process of convolution makes a dot
01:26
product between the filter and the
01:28
pixels it faces
01:29
it's basically just a sum of all the
01:31
filter pixels multiplied with the values
01:34
of the image's pixels
01:35
at the corresponding positions then it
01:38
goes to the right and does it again
01:40
convolving the whole image once it's
01:42
done these convolved features
01:44
give us the output of the first
01:46
convolution layer and we call this
01:48
output a feature map
01:49
we repeat this process with many other
01:51
filters giving us multiple feature maps
01:54
one for each filter used in the
01:56
convolution process
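The sliding dot product described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a deep learning library implements it: it assumes a stride of 1 and no padding, and the function name `convolve2d`, the tiny 5x5 `image`, and the two hand-written filters are hypothetical. In a real network the filter values are learned during training rather than written by hand. Like most deep learning code, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the filter and the patch of pixels it covers.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += kernel[di][dj] * image[i + di][j + dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 image: dark (0) on the left, bright (1) on the right,
# so it contains a single vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Two hand-written 3x3 filters (assumed values, for illustration only).
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

# One feature map per filter, as described in the transcript.
feature_maps = [convolve2d(image, k) for k in (vertical_edge, horizontal_edge)]

for fmap in feature_maps:
    print(fmap)
```

The first feature map responds where the filter overlaps the vertical edge, while the second stays at zero because the image has no horizontal edge.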
01:57
having more than one feature map gives
01:59
us more information about the image
02:01
and especially more information that we
02:03
can learn during training
02:04
since these filters are what we aim to
02:06
learn for our task
02:08
these feature maps are all sent into the
02:10
next layer
02:11
as input to produce many other smaller
02:14
sized
02:14
feature maps again the deeper we get
02:17
into the network the smaller these
02:18
feature maps get
02:20
because of the nature of convolutions
02:21
and the more general the information of
02:23
these feature maps becomes
02:25
until it reaches the end of the network
02:27
with extremely general information
02:29
about what the image contains disposed
02:31
over our many feature maps
02:33
which is used for classification or to
02:35
build a latent code
02:37
to represent information present in the
02:39
image in the case of a gan architecture
02:41
to generate a new image
02:43
based on this code which we refer to as
02:45
encoded information
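How quickly the feature maps shrink can be estimated with the standard output-size formula for a convolution, floor((n + 2p - k) / s) + 1, where n is the input size, k the filter size, p the padding and s the stride. The sketch below is only an illustration: the 224-pixel input, the four layers, and the stride and padding values are assumptions, not taken from the video. An unpadded 3x3 filter on its own trims only two pixels per layer, so in practice most of the shrinking comes from strides or pooling layers.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size of a feature map after one convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed setup: a 224x224 input passed through four 3x3 layers with stride 2.
size = 224
sizes = [size]
for _ in range(4):
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

print(sizes)

# Without stride or padding, a 3x3 filter on a 5x5 image gives a 3x3 feature map.
print(conv_output_size(5, kernel=3))
```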
02:47
in the example of image classification
02:49
simply put
02:50
we can see that at the end of the
02:52
network these small feature maps contain
02:54
the information about the presence of
02:56
each possible class telling you whether
02:58
it's a dog a cat
03:00
a person etc of course this is super
03:03
simplified
03:04
and there are other steps but i feel
03:06
like this is an accurate summary of
03:07
what's going on
03:08
inside a deep convolutional neural
03:10
network
03:11
if you've been following my channel and
03:13
posts you know that deep neural networks
03:15
proved
03:16
to be extremely powerful again and again
03:18
but they also have
03:20
weaknesses
03:21
that we should not try to hide as with
03:24
all things in life
03:25
deep nets have strengths and weaknesses
03:27
while strengths are widely shared
03:30
the latter are often omitted or even
03:32
discarded by companies
03:34
and ultimately by some researchers this
03:36
paper
03:37
by alan yuille and chenxi liu aims to
03:40
openly share
03:41
everything about deep nets for vision
03:43
applications their successes and the
03:45
limitations we have to address
03:47
moreover just like for our brain we
03:50
still do not fully understand their
03:52
inner workings
03:53
which makes the use of deep nets even
03:55
more limited since we cannot maximize
03:57
their strength
03:58
and limit weaknesses as stated by o
04:00
hobert
04:01
it's like a road map that tells you
04:02
where cars can drive but doesn't tell
04:04
you when or where
04:06
cars are actually driving this is
04:08
another point they discuss
04:09
in their paper namely what is the future
04:12
of computer vision algorithms
04:14
as you may be thinking one way to
04:16
improve computer vision applications is
04:18
to understand our own visual system
04:20
better starting with our brain which is
04:23
why
04:24
neuroscience is such an important field
04:26
for ai
04:27
indeed current deep nets are
04:29
surprisingly different than our own
04:31
vision system
04:32
firstly humans can learn from very small
04:35
numbers of examples
04:37
by exploiting our memory and the
04:39
knowledge we already acquired
04:40
we can also exploit our understanding of
04:43
the world and its physical properties to
04:45
make
04:45
deductions something that a deep net
04:47
cannot do in 1999
04:50
gopnik et al explained that babies are
04:52
more like tiny scientists
04:54
who understand the world by performing
04:56
experiments
04:57
and seeking causal explanations for
05:00
phenomena rather than
05:01
simply receiving stimulus from images
05:03
like current
05:04
deep nets do also we humans
05:07
are much more robust as we can easily
05:10
identify an object from any viewpoint
05:12
texture it has occlusions it may
05:14
encounter and novel context
05:16
as a concrete example you can just
05:18
visualize the annoying captcha you
05:20
always have to fill in
05:22
when logging into a website this captcha
05:24
is used to detect bots since they are
05:26
awful
05:27
when there are occlusions like this as
05:30
you can see here
05:31
the deep net got fooled by all the
05:33
examples because of the jungle context
05:35
and the fact that a monkey
05:37
is not typically holding a guitar this
05:39
happens because it's certainly not in
05:41
the training data set
05:43
of course this exact situation might not
05:45
happen very often
05:46
in real life but i will show some more
05:48
concrete examples
05:49
that are more relatable and that already
05:52
happened later on
05:53
in the video deep nets also have
05:55
strengths that we must highlight
05:57
they can outperform us for face
05:59
recognition tasks since
06:00
humans are not used to until recently
06:03
seeing more than a few thousand
06:05
people
06:05
in their whole lifetime but this
06:07
strength of deep nets also comes with a
06:09
limitation
06:10
where these faces need to be straight
06:12
centered clear
06:14
without any occlusions etc indeed
06:17
the algorithm could not recognize your
06:19
best friend at the halloween party
06:21
disguised as harry potter having only
06:23
glasses and a lightning bolt on the
06:25
forehead
06:26
where you would instantly recognize him
06:28
and say
06:29
whoa that's not very original it looks
06:31
like you just put glasses on
06:33
similarly such algorithms are extremely
06:36
precise radiologists
06:37
if all the settings are similar to what
06:39
they have been seeing
06:40
during their training they will
06:42
outperform any human
06:43
this is mainly because even the most
06:45
expert radiologists have only seen
06:48
a fairly small number of ct scans in
06:50
their lives as they suggest
06:51
the superiority of algorithms may also
06:54
be because they are doing a low priority
06:56
task
06:57
for humans for example a computer vision
06:59
app on your phone can identify the
07:02
hundreds of plants in your garden much
07:04
better than
07:05
most of us watching the video can but a
07:08
plant expert
07:08
will surely outperform it and all of us
07:11
together as well
07:12
but again this strength comes with a
07:14
huge problem
07:15
related to the data the algorithm needs
07:17
in order to be this powerful
07:19
as they mentioned and as we often see on
07:21
twitter and article titles
07:23
there are biases due to the data set
07:25
these deep nets are trained on
07:28
since an algorithm is only as good as
07:31
the data set it is evaluated on
07:33
and the performance measures used this
07:35
dataset limitation
07:37
comes with the price that these deep
07:39
neural networks are much
07:40
less general purpose flexible and
07:43
adaptive
07:44
than our own visual system they are less
07:46
general purpose
07:47
and flexible in the way that contrary to
07:50
our visual system
07:51
where we automatically perform edge
07:53
detection binocular stereo
07:55
semantic segmentation object
07:57
classification scene classification and
07:59
3d
08:00
depth estimation deep nets can only be
08:02
trained to achieve
08:03
one of these tasks indeed simply by
08:05
looking around
08:06
your vision system automatically
08:08
achieves all these tasks with extreme
08:10
precision
08:10
where deep nets have difficulty
08:12
achieving similar precision on one of
08:14
them
08:14
but even if this seems effortless to us
08:17
half of our neurons are at work
08:19
processing the information
08:20
and analyzing what's going on we are
08:23
still
08:23
far from mimicking our vision system
08:26
even with the current depth of our
08:27
networks
08:28
but is that really the goal of our
08:30
algorithms would it be better to just
08:33
use them as a tool to improve our
08:35
weaknesses i couldn't say
08:36
but i am sure that we want to address
08:39
the deep nets' limitations
08:40
that can cause serious consequences
08:43
rather than omitting them
08:44
i will show some concrete examples of
08:47
such consequences
08:48
just after introducing these limitations
08:50
but if you are too intrigued you can
08:52
skip
08:52
right to it following the timestamps
08:54
under the video and come back to the
08:56
explanation
08:57
afterwards indeed the lack of precision
08:59
we previously mentioned
09:01
by deep nets arises mainly because of the
09:03
disparity between the data we use to
09:05
train our algorithm
09:06
and what it sees in real life as you
09:09
know an algorithm needs to see a lot of
09:11
data
09:11
to iteratively improve at the task it is
09:14
trained for
09:14
this data is often referred to as a
09:17
training data set
09:18
this data disparity between the training
09:20
data set and the real
09:22
world is a problem because the real
09:23
world is too complicated to accurately
09:25
be represented
09:27
in a single data set which is why deep
09:29
nets are less adaptive than our vision
09:31
system
09:32
in the paper they call this the
09:34
combinatorial complexity explosion of
09:36
natural images the combinatorial
09:38
complexity
09:39
comes from the multitude of possible
09:41
variations within a natural image
09:43
like the camera pose lighting texture
09:46
material
09:47
background the position of the objects
09:49
etc biases can appear at any of these
09:52
levels of complexity
09:53
the data set is missing you can see how
09:56
these large data sets now seem
09:57
very small due to all these factors
10:00
considering that having only
10:02
let's say 13 of these different
10:04
parameters and we allow only
10:06
1000 different values for each of them
10:08
we quickly jump to this number of
10:10
different images
10:11
to represent only a single object the
10:14
current data sets only cover a handful
10:16
of these multitudes of possible
10:18
variations for each object
10:20
thus missing most real-world situations
10:22
that it will encounter in production
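The number mentioned above follows directly from the arithmetic: 13 parameters with 1000 possible values each give 1000 to the power 13, which is 10 to the power 39, possible images of a single object. The short sketch below works this out and compares it with a data set size. The figure of 10 million images is a hypothetical assumption chosen for illustration, not a number from the paper or the video.

```python
# Values taken from the transcript.
num_parameters = 13          # e.g. camera pose, lighting, texture, background...
values_per_parameter = 1000

# Every combination of parameter values is a different image of the same object.
num_images = values_per_parameter ** num_parameters
print(num_images == 10 ** 39)

# Hypothetical data set size (an assumption for illustration only).
dataset_size = 10_000_000

# Fraction of the possible variations of one object that such a data set
# could cover, even if every one of its images showed that object.
coverage = dataset_size / num_images
print(f"{coverage:.0e}")
```

Even under these generous assumptions, the data set covers a vanishingly small fraction of the possible variations.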
10:24
it's also worth mentioning that since
10:26
the variety of images is very limited
10:28
the network may find shortcuts to
10:31
detecting some objects as we saw
10:33
previously with the monkey where it was
10:34
detecting a human instead of a monkey
10:37
because of the guitar in front of it
10:39
similarly you can see that it's
10:40
detecting a bird here
10:42
instead of a guitar probably because the
10:44
model has never
10:45
seen a guitar with a jungle background
10:48
this is called
10:49
overfitting to the background context
10:51
where the algorithm does not focus on
10:53
the right thing
10:54
and instead finds a pattern in the
10:56
images themselves rather than on the
10:58
object of interest
11:00
also these data sets are all built from
11:03
images taken by photographers
11:04
meaning that they only cover specific
11:06
angles and poses that do not transfer to
11:09
all orientation possibilities in the
11:11
real world
11:12
currently we use benchmarks with the
11:14
most complex data sets possible to
11:16
compare the current algorithms and rate
11:18
them
11:19
which if you recall are very incomplete
11:21
compared to the real world
11:23
nonetheless we are often happy with 99 percent
11:26
accuracy for a task on such benchmarks
11:29
firstly the problem is that this
11:31
one percent error is determined on a
11:32
benchmark data set
11:34
meaning that it's similar to our
11:35
training data set in the way that it
11:37
doesn't
11:37
represent the richness of natural images
11:40
it's normal because it's impossible to
11:42
represent the real world in just a bunch
11:44
of images
11:45
it's just way too complicated and there
11:47
are too many situations possible
11:49
these benchmarks we use to test our
11:52
algorithms to determine whether or not
11:53
they are ready to be deployed in the
11:55
real world application are not really
11:57
accurate to determine how well it will
11:59
actually perform
12:00
which leads to the second problem that
12:02
is how it will actually perform
12:04
in the real world let's say that the
12:06
benchmark data set is huge
12:08
and most cases are covered and we really
12:11
have 99 percent
12:12
accuracy what are the consequences of
12:14
the one percent of cases where the
12:16
algorithm fails in the real world
12:19
this number will be represented in
12:21
misdiagnosis
12:22
accidents financial mistakes or even
12:25
worse
12:25
death such cases could be a self-driving
12:28
car
12:28
during a heavy rainy day heavily
12:30
affecting the depth sensors
12:32
used by the vehicle causing it to fail
12:34
many depth estimations
12:36
would you trust your life to this
12:37
partially blind robot taxi
12:40
i don't think i would similarly would
12:42
you trust a self-driving car at night to
12:44
avoid
12:44
driving over pedestrians or cyclists
12:47
where even yourself had difficulty
12:49
seeing them
12:50
these kinds of life-threatening
12:51
situations are so broad
12:53
that it's almost impossible that they
12:55
are all represented in the training data
12:57
set
12:57
and of course here i use extreme
12:59
examples of the most relatable
13:01
application
13:02
but you can just imagine how harmful
13:04
this could be
13:05
when the perfectly trained and tested
13:07
algorithm misclassifies your ct scan
13:09
leading to misdiagnosis just because
13:12
your hospital has different settings in
13:13
their scanner or because you didn't
13:15
drink enough water
13:16
or dye anything that would be different
13:19
from your training data
13:20
could lead to a major problem in real
13:22
life even if the benchmark
13:24
used to test it says it's perfect also
13:27
as it already happened
13:28
this can lead to people in
13:30
underrepresented demographics being
13:32
unfairly treated by these algorithms
13:34
and even worse this is why i argue that
13:37
we must focus on the tasks
13:38
where the algorithms help us and not
13:41
where they replace
13:42
us as long as they are that dependent on
13:44
data
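One way to see how a good benchmark score can hide the unfair treatment of underrepresented groups mentioned above is to break accuracy down per group instead of reporting a single number. The sketch below is a toy illustration under stated assumptions: the function name `accuracy_by_group`, the group names "a" and "b", and the 100 records are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Accuracy per group for (group, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in records:
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy benchmark: 98 samples from group "a", only 2 from group "b".
records = [("a", 1, 1)] * 98 + [("b", 1, 0), ("b", 1, 1)]

# The single headline number looks excellent...
overall = sum(truth == pred for _, truth, pred in records) / len(records)
print(overall)

# ...but the breakdown shows the model is wrong half the time on group "b".
per_group = accuracy_by_group(records)
print(per_group)
```

Reporting the per-group numbers next to the overall score is a small design choice, but it makes this kind of failure visible before deployment rather than after.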
13:45
this brings us to the two questions they
13:47
highlight how can we efficiently test
13:49
these algorithms to ensure that they
13:51
work on these enormous data sets
13:53
if we can only test them on a finite
13:55
subset and two
13:57
how can we train algorithms on finite
13:59
size data sets so that they can perform
14:01
well
14:02
on the truly enormous datasets required
14:04
to capture the combinatorial complexity
14:07
of the real world
14:08
in the paper they suggest to rethink our
14:11
methods for benchmarking performance
14:13
and evaluating vision algorithms and i
14:15
agree entirely
14:17
especially now where most applications
14:19
are made for real life users instead of
14:21
only academic competitions
14:23
it's crucial to get out of these
14:24
academic evaluation metrics
14:26
and create more appropriate evaluation
14:28
tools we also have to accept that data
14:31
bias exists
14:32
and that it can cause real world
14:34
problems of course we need to learn to
14:36
reduce these biases
14:38
but also to accept them biases are
14:40
inevitable due to the combinatorial
14:42
complexity of the real world
14:44
that cannot be realistically represented
14:46
in a single data set of images
14:48
yet thus focusing our attention without
14:51
any play of words with transformers
14:53
on better algorithms that can learn to
14:55
be fair
14:56
even when trained on such incomplete
14:58
data sets
14:59
rather than having bigger and bigger
15:01
models trying to represent the most data
15:04
possible
15:05
even if it may look like it this paper
15:07
was not a criticism of current
15:08
approaches
15:09
instead it's an opinion piece motivated
15:11
by discussions with other researchers in
15:14
several disciplines
15:15
as they state we stress that views
15:17
expressed in the paper
15:18
are our own and do not necessarily
15:20
reflect
15:21
those of the computer vision community
15:23
but i must say
15:24
this was a very interesting read and my
15:27
views are quite similar
15:28
they also discuss many important
15:30
innovations that happened over the last 40
15:32
years in computer vision
15:34
that is definitely worth reading as
15:36
always the link to the paper
15:38
is in the description below to end on a
15:40
more positive note we are nearly
15:42
a decade into the revolution of deep
15:44
neural networks that started in 2012
15:47
with alexnet and the imagenet
15:49
competition since then
15:51
there has been immense progress on our
15:53
computation power
15:54
and the deep net architectures like the
15:56
use of batch normalization
15:58
residual connections and more recently
16:00
self-attention
16:01
researchers will undoubtedly improve the
16:03
architecture of deep nets but we shall
16:05
not forget that there are other ways to
16:07
achieve intelligent models than going
16:09
deeper and using more data of course
16:12
these ways are yet to be discovered
16:14
if this story of deep neural networks
16:16
sounds interesting to you
16:18
i made a video of one of the most
16:19
interesting architectures
16:21
along with a short historical review of
16:23
deep nets i'm sure you'll love it
16:25
thank you for watching
Also published on: https://www.louisbouchard.me/ai-in-computer-vision/
