In this video, I will openly share everything about deep nets for computer vision applications, their successes, and the limitations we have yet to address.

Watch the video

References

Read the article: https://www.louisbouchard.ai/ai-in-computer-vision/
Yuille, A.L. and Liu, C., 2021. Deep nets: What have they ever done for vision? International Journal of Computer Vision, 129(3), pp.781–802. https://arxiv.org/abs/1805.04025

Video Transcript

If you clicked on this video, you are certainly interested in computer vision applications like image classification, image segmentation, object detection, and more complex tasks like face recognition, image generation, or even style transfer. As you may already know, with the growing power of our computers, most of these applications are now realized using similar deep neural networks, what we often refer to as artificial intelligence models. There are of course some differences between the deep nets used in these different vision applications, but as of now they all share the same basis of convolutions, introduced in 1989 by Yann LeCun. The major difference today is our computation power, coming from the recent advancements in GPUs.

To quickly go over the architecture: as the name says, convolution is a process where an original image or video frame, which is our input in a computer vision application, is convolved using filters that detect important small features of an image, such as edges. The network will autonomously learn filter values that detect the features needed to produce the output we want, such as the name of the object in a specific image sent as input for a classification task. These filters are usually 3x3 or 5x5 pixel squares, allowing them to detect the orientation of edges: left, right, up, or down, just like you can see in this image. The process of convolution takes a dot product between the filter and the pixels it faces: it's basically just a sum of all the filter values multiplied with the values of the image's pixels at the corresponding positions. Then the filter moves to the right and does it again, convolving the whole image. Once it's done, these convolved features give us the output of the first convolution layer; we call this output a feature map. We repeat this process with many other filters, giving us multiple feature maps, one for each filter used in the convolution process. Having more than one feature map gives us more information about the image, and especially more information that we can learn during training, since these filters are what we aim to learn for our task. These feature maps are all sent into the next layer as input to produce many other, smaller feature maps. The deeper we get into the network, the smaller these feature maps get, because of the nature of convolutions, and the more general the information in these feature maps becomes, until we reach the end of the network with extremely general information about what the image contains, spread across the many final feature maps.
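Here is a minimal sketch of that sliding dot product in plain NumPy. This is my own illustration rather than code from the video, and the hand-picked Sobel-like kernel simply stands in for the filter values a real network would learn during training:

```python
# A minimal sketch of the convolution step described above, using plain NumPy.
# The 3x3 kernel below is a hypothetical vertical-edge detector; a real deep net
# would learn these filter values instead of having them hand-picked.
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` and take the dot product at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))    # the feature map is slightly smaller
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# Toy 6x6 "image": dark on the left, bright on the right.
image = np.array([[0, 0, 0, 9, 9, 9]] * 6, dtype=float)

# Hypothetical vertical-edge filter (Sobel-like).
vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=float)

feature_map = convolve2d(image, vertical_edge)
print(feature_map)   # large values where the left-to-right edge sits, zeros elsewhere
```

Applying many such filters gives many feature maps, and repeating the operation layer after layer is what shrinks the maps and makes their content more and more abstract.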
These final feature maps are used for classification, or to build a latent code representing the information present in the image; in the case of a GAN architecture, this code, which we refer to as encoded information, is used to generate a new image. In the example of image classification, simply put, we can see that at the end of the network these small feature maps contain information about the presence of each possible class, telling you whether it's a dog, a cat, a person, etc. Of course, this is super simplified and there are other steps, but I feel like this is an accurate summary of what's going on inside a deep convolutional neural network.

If you've been following my channel and posts, you know that deep neural networks have proved to be extremely powerful again and again, but they also have weaknesses, weaknesses that we should not try to hide. As with all things in life, deep nets have strengths and weaknesses. While the strengths are widely shared, the weaknesses are often omitted or even discarded by companies, and ultimately by some researchers. This paper, by Alan Yuille and Chenxi Liu, aims to openly share everything about deep nets for vision applications: their successes and the limitations we have to address. Moreover, just like for our brain, we still do not fully understand their inner workings, which makes the use of deep nets even more limited, since we cannot maximize their strengths and limit their weaknesses. As Oliver Hobert put it, it's like a road map that tells you where cars can drive but doesn't tell you when or where cars are actually driving. This is another point they discuss in their paper, namely: what is the future of computer vision algorithms?

As you may be thinking, one way to improve computer vision applications is to understand our own visual system better, starting with our brain, which is why neuroscience is such an important field for AI. Indeed, current deep nets are surprisingly different from our own vision system. Firstly, humans can learn from a very small number of examples by exploiting our memory and the knowledge we have already acquired. We can also exploit our understanding of the world and its physical properties to make deductions, something a deep net cannot do. In 1999, Gopnik et al. explained that babies are more like tiny scientists who understand the world by performing experiments and seeking causal explanations for phenomena, rather than simply receiving stimuli from images like current deep nets do. Also, we humans are much more robust: we can easily identify an object from any viewpoint, whatever texture it has, whatever occlusions it encounters, and in novel contexts. As a concrete example, just think of the annoying captchas you always have to fill in when logging into a website. These captchas are used to detect bots, since bots are awful at handling occlusions like these. As you can see here, the deep net got fooled by all the examples because of the jungle context and the fact that a monkey is not typically holding a guitar. This happens because such a situation is almost certainly not in the training dataset.
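If you want to reproduce this kind of failure yourself, here is a minimal sketch (my own illustration, not code from the video or the paper) of how you could probe a pretrained ImageNet classifier with an unusual image such as a monkey holding a guitar in a jungle. The filename is hypothetical, and the snippet assumes PyTorch with torchvision 0.13 or newer:

```python
# Probe a pretrained ImageNet classifier with an out-of-context image.
# "monkey_with_guitar.jpg" is a hypothetical local file.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()            # resize, crop, normalize as the model expects

img = Image.open("monkey_with_guitar.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)         # shape: (1, 3, 224, 224)

with torch.no_grad():
    probs = model(batch).softmax(dim=1)[0]

# Print the top-5 predicted classes; unusual context often pushes the true
# class far down the list.
top5 = probs.topk(5)
for p, idx in zip(top5.values, top5.indices):
    print(f"{weights.meta['categories'][int(idx)]}: {p.item():.2%}")
```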
Of course, this exact situation might not happen very often in real life, but I will show some more concrete and relatable examples, ones that have already happened, later in the video. Deep nets also have strengths that we must highlight. They can outperform us at face recognition tasks, since, until recently, humans were not used to seeing more than a few thousand people in their whole lifetime. But this strength of deep nets also comes with a limitation: the faces need to be straight, centered, clear, without any occlusions, etc. Indeed, the algorithm could not recognize your best friend at the Halloween party disguised as Harry Potter, wearing only glasses and a lightning bolt on the forehead, whereas you would instantly recognize him and say, "Whoa, that's not very original, it looks like you just put glasses on."

Similarly, such algorithms are extremely precise radiologists: if all the settings are similar to what they saw during training, they will outperform any human. This is mainly because even the most expert radiologists have only seen a fairly small number of CT scans in their lives. As the authors suggest, the superiority of algorithms may also come from the fact that they are doing a low-priority task for humans. For example, a computer vision app on your phone can identify the hundreds of plants in your garden much better than most of us watching this video can, but a plant expert will surely outperform it, and all of us together as well. Again, though, this strength comes with a huge problem related to the data the algorithm needs in order to be this powerful. As they mention, and as we often see on Twitter and in article titles, there are biases due to the datasets these deep nets are trained on, since an algorithm is only as good as the dataset it is evaluated on and the performance measures used.

This dataset limitation comes at the price that these deep neural networks are much less general-purpose, flexible, and adaptive than our own visual system. They are less general-purpose and flexible in the sense that, contrary to our visual system, which automatically performs edge detection, binocular stereo, semantic segmentation, object classification, scene classification, and 3D depth estimation, deep nets can only be trained to achieve one of these tasks. Indeed, simply by looking around, your vision system automatically achieves all these tasks with extreme precision, where deep nets have difficulty achieving similar precision on even one of them. And even if this seems effortless to us, half of our neurons are at work processing the information and analyzing what's going on. We are still far from mimicking our vision system, even with the current depth of our networks. But is that really the goal of our algorithms? Would it be better to just use them as tools that compensate for our weaknesses? I couldn't say, but I am sure that we want to address the deep nets' limitations that can cause serious consequences, rather than omitting them.
I will show some concrete examples of such consequences just after introducing these limitations, but if you are too intrigued, you can skip right to them using the timestamps under the video and come back to the explanation afterwards. Indeed, the lack of precision of deep nets we previously mentioned arises mainly because of the disparity between the data we use to train our algorithm and what it sees in real life. As you know, an algorithm needs to see a lot of data to iteratively improve at the task it is trained for. This data is often referred to as the training dataset. This disparity between the training dataset and the real world is a problem because the real world is too complicated to be accurately represented in a single dataset, which is why deep nets are less adaptive than our vision system. In the paper, they call this the combinatorial complexity explosion of natural images. The combinatorial complexity comes from the multitude of possible variations within a natural image, like the camera pose, lighting, texture, material, background, the position of the objects, etc. Biases can appear at any of these levels of complexity that the dataset is missing. You can see how these large datasets now seem very small due to all these factors: considering only, let's say, 13 of these different parameters, and allowing only 1,000 different values for each of them, we quickly jump to an astronomical number of different images needed to represent just a single object (a quick calculation below makes this concrete). Current datasets only cover a handful of these multitudes of possible variations for each object, thus missing most real-world situations the model will encounter in production.

It's also worth mentioning that since the variety of images is very limited, the network may find shortcuts for detecting some objects, as we saw previously with the monkey, where it detected a human instead of a monkey because of the guitar in front of it. Similarly, you can see that it detects a bird here instead of a guitar, probably because the model has never seen a guitar with a jungle background. This is called overfitting to the background context, where the algorithm does not focus on the right thing and instead finds a pattern in the images themselves rather than in the object of interest. Also, these datasets are all built from images taken by photographers, meaning that they only cover specific angles and poses that do not transfer to all the orientation possibilities of the real world.

Currently, we use benchmarks with the most complex datasets possible to compare the current algorithms and rate them, benchmarks which, if you recall, are very incomplete compared to the real world. Nonetheless, we are often happy with 99% accuracy for a task on such benchmarks. Firstly, the problem is that this one percent error is determined on a benchmark dataset, meaning that it is similar to our training dataset in the sense that it doesn't represent the richness of natural images. That's normal, because it's impossible to represent the real world in just a bunch of images.
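Coming back to the combinatorial explosion mentioned a moment ago, here is the quick back-of-the-envelope calculation. The 13 parameters and 1,000 values per parameter come from the video; the ImageNet figure is an approximate order of magnitude I am adding for scale:

```python
# Back-of-the-envelope: combinations of 13 imaging parameters
# (camera pose, lighting, texture, material, background, ...) with 1,000 values each.
num_parameters = 13
values_per_parameter = 1_000

combinations = values_per_parameter ** num_parameters
print(f"{combinations:.2e}")    # 1.00e+39 possible renderings of a single object

# For scale: ImageNet's classification training set is on the order of a million images.
imagenet_scale = 1_300_000
print(f"fraction covered if every image were unique: {imagenet_scale / combinations:.1e}")
```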
It's just way too complicated, and there are too many possible situations. The benchmarks we use to test our models, to determine whether or not they are ready to be deployed in a real-world application, are not really accurate at predicting how well they will actually perform. Which leads to the second problem: how the model will actually perform in the real world. Let's say the benchmark dataset is huge, most cases are covered, and we really have 99% accuracy. What are the consequences of the one percent of cases where the algorithm fails in the real world? This number will translate into misdiagnoses, accidents, financial mistakes, or even worse, deaths. Such a case could be a self-driving car on a heavy rainy day, with the rain severely affecting the depth sensors used by the vehicle and causing it to fail many depth estimations. Would you trust your life to this partially blind robotaxi? I don't think I would. Similarly, would you trust a self-driving car at night to avoid driving over pedestrians or cyclists, when even you would have difficulty seeing them? These kinds of life-threatening situations are so varied that it's almost impossible for them all to be represented in the training dataset. And of course, these are extreme examples from the most relatable applications, but you can imagine how harmful this could be when a perfectly trained and tested algorithm misclassifies your CT scan, leading to a misdiagnosis, just because your hospital uses different scanner settings, or because you didn't drink enough water, or anything else that differs from the training data. It could lead to a major problem in real life even if the benchmark used to test it says it's perfect. Also, as has already happened, this can lead to people from underrepresented demographics being unfairly treated by these algorithms, or even worse. This is why I argue that, as long as they are this dependent on data, we must focus on tasks where the algorithms help us, not on tasks where they replace us.

This brings us to the two questions the authors highlight: one, how can we efficiently test these algorithms to ensure they work on these enormous datasets if we can only test them on a finite subset, and two, how can we train algorithms on finite-size datasets so that they perform well on the truly enormous datasets required to capture the combinatorial complexity of the real world? In the paper, they suggest rethinking our methods for benchmarking performance and evaluating vision algorithms, and I agree entirely, especially now that most applications are made for real-life users instead of only academic competitions. It's crucial to move beyond these academic evaluation metrics and create more appropriate evaluation tools. We also have to accept that data bias exists and that it can cause real-world problems. Of course, we need to learn to reduce these biases, but also to accept them.
Biases are inevitable due to the combinatorial complexity of the real world, which cannot yet be realistically represented in a single dataset of images. We should thus focus our attention, without any play on words with transformers, on better algorithms that can learn to be fair even when trained on such incomplete datasets, rather than on bigger and bigger models trying to represent the most data possible. Even if it may look like it, this paper is not a criticism of current approaches. Instead, it's an opinion piece motivated by discussions with other researchers in several disciplines. As they state, "we stress that views expressed in the paper are our own and do not necessarily reflect those of the computer vision community." But I must say, this was a very interesting read, and my views are quite similar. They also discuss many important innovations that have happened over the last 40 years in computer vision, which is definitely worth reading. As always, the link to the paper is in the description below.

To end on a more positive note, we are nearly a decade into the revolution of deep neural networks that started in 2012 with AlexNet and the ImageNet competition. Since then, there has been immense progress in our computation power and in deep net architectures, such as the use of batch normalization, residual connections, and, more recently, self-attention. Researchers will undoubtedly keep improving the architecture of deep nets, but we should not forget that there may be other ways to achieve intelligent models than going deeper and using more data. Of course, these ways are yet to be discovered. If this story of deep neural networks sounds interesting to you, I made a video about one of the most interesting architectures, along with a short historical review of deep nets. I'm sure you'll love it. Thank you for watching!

Also published on: https://www.louisbouchard.me/ai-in-computer-vision/