In this video, I will openly share everything about deep nets for computer vision applications, their successes, and the limitations we have yet to address.

Watch the video

References

Read the article: https://www.louisbouchard.ai/ai-in-computer-vision/
Yuille, A.L. and Liu, C., 2021. Deep nets: What have they ever done for vision? International Journal of Computer Vision, 129(3), pp.781–802. https://arxiv.org/abs/1805.04025

Video Transcript

If you clicked on this video, you are certainly interested in computer vision applications like image classification, image segmentation, object detection, and more complex tasks like face recognition, image generation, or even style transfer. As you may already know, with the growing power of our computers, most of these applications are now realized using similar deep neural networks, what we often refer to as artificial intelligence models. There are of course some differences between the deep nets used in these different vision applications, but as of now they all share the same basis of convolutions, introduced in 1989 by Yann LeCun. The major difference today is our computation power, coming from the recent advancements in GPUs.

To quickly go over the architecture: as the name says, convolution is a process where an original image or video frame, which is our input in a computer vision application, is convolved using filters that detect important small features of an image, such as edges. The network will autonomously learn filter values that detect the features needed to produce the output we want, such as the name of the object in a specific image sent as input for a classification task. These filters are usually 3x3 or 5x5 pixel squares, allowing them to detect the orientation of edges: left, right, up, or down, just like you can see in this image. The process of convolution takes a dot product between the filter and the pixels it faces: it's basically just a sum of all the filter values multiplied with the values of the image's pixels at the corresponding positions. Then the filter moves to the right and does it again, convolving the whole image. Once it's done, these convolved features give us the output of the first convolution layer; we call this output a feature map. We repeat this process with many other filters, giving us multiple feature maps, one for each filter used in the convolution process. Having more than one feature map gives us more information about the image, and especially more information that we can learn during training, since these filters are what we aim to learn for our task. These feature maps are all sent into the next layer as input to produce many other, smaller feature maps. The deeper we get into the network, the smaller these feature maps get, because of the nature of convolutions, and the more general the information in these feature maps becomes, until we reach the end of the network with extremely general information about what the image contains, spread across the many final feature maps.
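Here is a minimal sketch of that sliding dot product in plain NumPy. This is my own illustration rather than code from the video, and the hand-picked Sobel-like kernel simply stands in for the filter values a real network would learn during training:

```python
# A minimal sketch of the convolution step described above, using plain NumPy.
# The 3x3 kernel below is a hypothetical vertical-edge detector; a real deep net
# would learn these filter values instead of having them hand-picked.
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` and take the dot product at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))    # the feature map is slightly smaller
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# Toy 6x6 "image": dark on the left, bright on the right.
image = np.array([[0, 0, 0, 9, 9, 9]] * 6, dtype=float)

# Hypothetical vertical-edge filter (Sobel-like).
vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=float)

feature_map = convolve2d(image, vertical_edge)
print(feature_map)   # large values where the left-to-right edge sits, zeros elsewhere
```

Applying many such filters gives many feature maps, and repeating the operation layer after layer is what shrinks the maps and makes their content more and more abstract.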
These final feature maps are used for classification, or to build a latent code representing the information present in the image; in the case of a GAN architecture, this code, which we refer to as encoded information, is used to generate a new image. In the example of image classification, simply put, we can see that at the end of the network these small feature maps contain information about the presence of each possible class, telling you whether it's a dog, a cat, a person, etc. Of course, this is super simplified and there are other steps, but I feel like this is an accurate summary of what's going on inside a deep convolutional neural network.

If you've been following my channel and posts, you know that deep neural networks have proved to be extremely powerful again and again, but they also have weaknesses, weaknesses that we should not try to hide. As with all things in life, deep nets have strengths and weaknesses. While the strengths are widely shared, the weaknesses are often omitted or even discarded by companies, and ultimately by some researchers. This paper, by Alan Yuille and Chenxi Liu, aims to openly share everything about deep nets for vision applications: their successes and the limitations we have to address. Moreover, just like for our brain, we still do not fully understand their inner workings, which makes the use of deep nets even more limited, since we cannot maximize their strengths and limit their weaknesses. As Oliver Hobert put it, it's like a road map that tells you where cars can drive but doesn't tell you when or where cars are actually driving. This is another point they discuss in their paper, namely: what is the future of computer vision algorithms?

As you may be thinking, one way to improve computer vision applications is to understand our own visual system better, starting with our brain, which is why neuroscience is such an important field for AI. Indeed, current deep nets are surprisingly different from our own vision system. Firstly, humans can learn from a very small number of examples by exploiting our memory and the knowledge we have already acquired. We can also exploit our understanding of the world and its physical properties to make deductions, something a deep net cannot do. In 1999, Gopnik et al. explained that babies are more like tiny scientists who understand the world by performing experiments and seeking causal explanations for phenomena, rather than simply receiving stimuli from images like current deep nets do. Also, we humans are much more robust: we can easily identify an object from any viewpoint, whatever texture it has, whatever occlusions it encounters, and in novel contexts. As a concrete example, just think of the annoying captchas you always have to fill in when logging into a website. These captchas are used to detect bots, since bots are awful at handling occlusions like these. As you can see here, the deep net got fooled by all the examples because of the jungle context and the fact that a monkey is not typically holding a guitar. This happens because such a situation is almost certainly not in the training dataset.
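If you want to reproduce this kind of failure yourself, here is a minimal sketch (my own illustration, not code from the video or the paper) of how you could probe a pretrained ImageNet classifier with an unusual image such as a monkey holding a guitar in a jungle. The filename is hypothetical, and the snippet assumes PyTorch with torchvision 0.13 or newer:

```python
# Probe a pretrained ImageNet classifier with an out-of-context image.
# "monkey_with_guitar.jpg" is a hypothetical local file.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()            # resize, crop, normalize as the model expects

img = Image.open("monkey_with_guitar.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)         # shape: (1, 3, 224, 224)

with torch.no_grad():
    probs = model(batch).softmax(dim=1)[0]

# Print the top-5 predicted classes; unusual context often pushes the true
# class far down the list.
top5 = probs.topk(5)
for p, idx in zip(top5.values, top5.indices):
    print(f"{weights.meta['categories'][int(idx)]}: {p.item():.2%}")
```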
Of course, this exact situation might not happen very often in real life, but I will show some more concrete and relatable examples, ones that have already happened, later in the video. Deep nets also have strengths that we must highlight. They can outperform us at face recognition tasks, since, until recently, humans were not used to seeing more than a few thousand people in their whole lifetime. But this strength of deep nets also comes with a limitation: the faces need to be straight, centered, clear, without any occlusions, etc. Indeed, the algorithm could not recognize your best friend at the Halloween party disguised as Harry Potter, wearing only glasses and a lightning bolt on the forehead, whereas you would instantly recognize him and say, "Whoa, that's not very original, it looks like you just put glasses on."

Similarly, such algorithms are extremely precise radiologists: if all the settings are similar to what they saw during training, they will outperform any human. This is mainly because even the most expert radiologists have only seen a fairly small number of CT scans in their lives. As the authors suggest, the superiority of algorithms may also come from the fact that they are doing a low-priority task for humans. For example, a computer vision app on your phone can identify the hundreds of plants in your garden much better than most of us watching this video can, but a plant expert will surely outperform it, and all of us together as well. Again, though, this strength comes with a huge problem related to the data the algorithm needs in order to be this powerful. As they mention, and as we often see on Twitter and in article titles, there are biases due to the datasets these deep nets are trained on, since an algorithm is only as good as the dataset it is evaluated on and the performance measures used.

This dataset limitation comes at the price that these deep neural networks are much less general-purpose, flexible, and adaptive than our own visual system. They are less general-purpose and flexible in the sense that, contrary to our visual system, which automatically performs edge detection, binocular stereo, semantic segmentation, object classification, scene classification, and 3D depth estimation, deep nets can only be trained to achieve one of these tasks. Indeed, simply by looking around, your vision system automatically achieves all these tasks with extreme precision, where deep nets have difficulty achieving similar precision on even one of them. And even if this seems effortless to us, half of our neurons are at work processing the information and analyzing what's going on. We are still far from mimicking our vision system, even with the current depth of our networks. But is that really the goal of our algorithms? Would it be better to just use them as tools that compensate for our weaknesses? I couldn't say, but I am sure that we want to address the deep nets' limitations that can cause serious consequences, rather than omitting them.
I will show some concrete examples of such consequences just after introducing these limitations, but if you are too intrigued, you can skip right to them using the timestamps under the video and come back to the explanation afterwards. Indeed, the lack of precision of deep nets we previously mentioned arises mainly because of the disparity between the data we use to train our algorithm and what it sees in real life. As you know, an algorithm needs to see a lot of data to iteratively improve at the task it is trained for. This data is often referred to as the training dataset. This disparity between the training dataset and the real world is a problem because the real world is too complicated to be accurately represented in a single dataset, which is why deep nets are less adaptive than our vision system. In the paper, they call this the combinatorial complexity explosion of natural images. The combinatorial complexity comes from the multitude of possible variations within a natural image, like the camera pose, lighting, texture, material, background, the position of the objects, etc. Biases can appear at any of these levels of complexity that the dataset is missing. You can see how these large datasets now seem very small due to all these factors: considering only, let's say, 13 of these different parameters, and allowing only 1,000 different values for each of them, we quickly jump to an astronomical number of different images needed to represent just a single object (a quick calculation below makes this concrete). Current datasets only cover a handful of these multitudes of possible variations for each object, thus missing most real-world situations the model will encounter in production.

It's also worth mentioning that since the variety of images is very limited, the network may find shortcuts for detecting some objects, as we saw previously with the monkey, where it detected a human instead of a monkey because of the guitar in front of it. Similarly, you can see that it detects a bird here instead of a guitar, probably because the model has never seen a guitar with a jungle background. This is called overfitting to the background context, where the algorithm does not focus on the right thing and instead finds a pattern in the images themselves rather than in the object of interest. Also, these datasets are all built from images taken by photographers, meaning that they only cover specific angles and poses that do not transfer to all the orientation possibilities of the real world.

Currently, we use benchmarks with the most complex datasets possible to compare the current algorithms and rate them, benchmarks which, if you recall, are very incomplete compared to the real world. Nonetheless, we are often happy with 99% accuracy for a task on such benchmarks. Firstly, the problem is that this one percent error is determined on a benchmark dataset, meaning that it is similar to our training dataset in the sense that it doesn't represent the richness of natural images. That's normal, because it's impossible to represent the real world in just a bunch of images.
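Coming back to the combinatorial explosion mentioned a moment ago, here is the quick back-of-the-envelope calculation. The 13 parameters and 1,000 values per parameter come from the video; the ImageNet figure is an approximate order of magnitude I am adding for scale:

```python
# Back-of-the-envelope: combinations of 13 imaging parameters
# (camera pose, lighting, texture, material, background, ...) with 1,000 values each.
num_parameters = 13
values_per_parameter = 1_000

combinations = values_per_parameter ** num_parameters
print(f"{combinations:.2e}")    # 1.00e+39 possible renderings of a single object

# For scale: ImageNet's classification training set is on the order of a million images.
imagenet_scale = 1_300_000
print(f"fraction covered if every image were unique: {imagenet_scale / combinations:.1e}")
```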
It's just way too complicated, and there are too many possible situations. The benchmarks we use to test our models, to determine whether or not they are ready to be deployed in a real-world application, are not really accurate at predicting how well they will actually perform. Which leads to the second problem: how the model will actually perform in the real world. Let's say the benchmark dataset is huge, most cases are covered, and we really have 99% accuracy. What are the consequences of the one percent of cases where the algorithm fails in the real world? This number will translate into misdiagnoses, accidents, financial mistakes, or even worse, deaths. Such a case could be a self-driving car on a heavy rainy day, with the rain severely affecting the depth sensors used by the vehicle and causing it to fail many depth estimations. Would you trust your life to this partially blind robotaxi? I don't think I would. Similarly, would you trust a self-driving car at night to avoid driving over pedestrians or cyclists, when even you would have difficulty seeing them? These kinds of life-threatening situations are so varied that it's almost impossible for them all to be represented in the training dataset. And of course, these are extreme examples from the most relatable applications, but you can imagine how harmful this could be when a perfectly trained and tested algorithm misclassifies your CT scan, leading to a misdiagnosis, just because your hospital uses different scanner settings, or because you didn't drink enough water, or anything else that differs from the training data. It could lead to a major problem in real life even if the benchmark used to test it says it's perfect. Also, as has already happened, this can lead to people from underrepresented demographics being unfairly treated by these algorithms, or even worse. This is why I argue that, as long as they are this dependent on data, we must focus on tasks where the algorithms help us, not on tasks where they replace us.

This brings us to the two questions the authors highlight: one, how can we efficiently test these algorithms to ensure they work on these enormous datasets if we can only test them on a finite subset, and two, how can we train algorithms on finite-size datasets so that they perform well on the truly enormous datasets required to capture the combinatorial complexity of the real world? In the paper, they suggest rethinking our methods for benchmarking performance and evaluating vision algorithms, and I agree entirely, especially now that most applications are made for real-life users instead of only academic competitions. It's crucial to move beyond these academic evaluation metrics and create more appropriate evaluation tools. We also have to accept that data bias exists and that it can cause real-world problems. Of course, we need to learn to reduce these biases, but also to accept them.
Biases are inevitable due to the combinatorial complexity of the real world, which cannot yet be realistically represented in a single dataset of images. We should thus focus our attention, without any play on words with transformers, on better algorithms that can learn to be fair even when trained on such incomplete datasets, rather than on bigger and bigger models trying to represent the most data possible. Even if it may look like it, this paper is not a criticism of current approaches. Instead, it's an opinion piece motivated by discussions with other researchers in several disciplines. As they state, "we stress that views expressed in the paper are our own and do not necessarily reflect those of the computer vision community." But I must say, this was a very interesting read, and my views are quite similar. They also discuss many important innovations that have happened over the last 40 years in computer vision, which is definitely worth reading. As always, the link to the paper is in the description below.

To end on a more positive note, we are nearly a decade into the revolution of deep neural networks that started in 2012 with AlexNet and the ImageNet competition. Since then, there has been immense progress in our computation power and in deep net architectures, such as the use of batch normalization, residual connections, and, more recently, self-attention. Researchers will undoubtedly keep improving the architecture of deep nets, but we should not forget that there may be other ways to achieve intelligent models than going deeper and using more data. Of course, these ways are yet to be discovered. If this story of deep neural networks sounds interesting to you, I made a video about one of the most interesting architectures, along with a short historical review of deep nets. I'm sure you'll love it. Thank you for watching!

Also published on: https://www.louisbouchard.me/ai-in-computer-vision/