You can find me on Twitter @bhutanisanyam1 and connect with me on LinkedIn here. Here and here are two articles on my Learning Path to Self-Driving Cars.
You can find the Lecture 1 Notes here. Lecture 2 Notes can be found here, Lecture 3 Notes here, and Lecture 5 Notes here.
These are the Lecture 4 notes for the MIT 6.S094: Deep Learning for Self-Driving Cars course (2018), taught by Lex Fridman.
All Images are from the Lecture Slides.
Computer Vision, as of today, is Deep Learning: the majority of the successes in our understanding of images utilise Neural Networks.
Raw Sensory data: For the machine, images are in the form of numbers.
The images, in the form of 1-channel or 3-channel numerical arrays, are taken as input by the NN; the output is produced by regressing a value or by classifying the image into various categories.
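To make this concrete, here is a minimal NumPy sketch of what an image looks like to the machine, using a hypothetical random 32x32 RGB image:

```python
import numpy as np

# A hypothetical 32x32 RGB image: to the machine, just a 3D array of numbers.
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(image.shape)  # (32, 32, 3): height, width, channels
print(image[0, 0])  # the three channel intensities of the top-left pixel
```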
We must be careful about our assumptions for what is easy and hard with Perception.
Human Vision Vs Computer Vision.
The structure of the visual cortex is layered. As information is passed from our eyes to the brain, higher and higher-order representations are formed. This is the inspiration behind Deep NNs for images: higher and higher representations are formed through the layers. The early layers take in the raw pixels and find edges; further layers find more abstract features by connecting those edges; finally, higher-order semantic meaning is found.
There is a bin for each class, and each bin holds many example images. Task: bin a new image into one of these classes.
Famous Datasets:
CIFAR-10: one of the simplest datasets, containing 10 categories; it is commonly used to explore CNNs.
Trivial Example:
Compare images by subtracting their pixel-intensity matrices and summing the element-wise differences (absolute for L1, squared for L2). If the sum is high, the images are different. With this methodology we get 35% accuracy using the L2 difference and 38% with the L1 difference, both better than the random-guess accuracy of 10%.
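A minimal sketch of this pixel-difference comparison, assuming images are NumPy arrays (function names here are hypothetical):

```python
import numpy as np

def l1_distance(a, b):
    # Sum of absolute pixel-wise differences (cast to int to avoid uint8 overflow).
    return np.sum(np.abs(a.astype(np.int32) - b.astype(np.int32)))

def l2_distance(a, b):
    # Root of summed squared pixel-wise differences.
    return np.sqrt(np.sum((a.astype(np.int32) - b.astype(np.int32)) ** 2))

def nearest_neighbor(test_image, train_images, train_labels, dist=l1_distance):
    # Copy the label of the single closest training image.
    distances = [dist(test_image, t) for t in train_images]
    return train_labels[int(np.argmin(distances))]
```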
K-Nearest Neighbours: instead of finding the single image closest to our query, we find the k closest images and let their classes vote. We vary k (from 1–5) and see how that changes the results.
With k=7, we achieve 30% accuracy. Human-level accuracy is 95%. With CNNs, we get 97.75% accuracy.
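A sketch of the k-NN vote under the same assumptions (L1 distance, NumPy arrays; names hypothetical):

```python
from collections import Counter
import numpy as np

def knn_predict(test_image, train_images, train_labels, k=7):
    # L1 distance from the test image to every training image.
    distances = [np.sum(np.abs(test_image.astype(np.int32) - t.astype(np.int32)))
                 for t in train_images]
    # Indices of the k closest training images, then a majority vote on labels.
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```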
When the network is tasked with learning a complex task over large data and a large number of objects, CNNs work efficiently.
Trick, ‘Spatial Invariance’: an object in the top-left corner of an image is the same as that object in the bottom-right corner. So we learn the same features across the whole image.
Convolution operation: unlike the fully connected layers, a 3rd dimension of depth is present. The blocks take 3D input volumes and produce 3D output volumes.
They take a slice of the image, a ‘window’, and slide it across the image. They apply the same weights to each slice/window of the image to generate outputs. We can make many such filters.
Parameters in each of these filters are shared (if a feature is useful in one place, it’s useful everywhere), which reduces the parameter count significantly and re-uses spatial features; the sketch below makes the savings concrete.
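A rough PyTorch comparison of the parameter counts (layer sizes are illustrative, not from the lecture):

```python
import torch.nn as nn

# Fully connected layer mapping a flattened 32x32x3 image to 10 outputs.
fc = nn.Linear(32 * 32 * 3, 10)

# Convolutional layer: 16 filters of size 3x3 slid across the 3-channel image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print(num_params(fc))    # 30730 weights, none shared
print(num_params(conv))  # 448 = (3*3*3 + 1) * 16, re-used at every image location
```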
Example: Convolution
Spanning images: Pooling. Pooling takes the output of a convolutional operation and reduces its resolution by condensing the information, for example keeping only the maximum value in each window (Max-Pooling).
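For example, a 2x2 Max-Pool in PyTorch halves the resolution while keeping the strongest activation in each window (shapes illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)               # batch, channels, height, width
pool = nn.MaxPool2d(kernel_size=2, stride=2)

y = pool(x)
print(y.shape)  # torch.Size([1, 16, 16, 16]): each 2x2 window condensed to its max
```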
ImageNet Case Study
Task: classification on one of the largest datasets of images: 14M+ images, 21k+ categories, with many sub-classes.
GoogLeNet 2014: Inception Modules were introduced.
- Idea: different-sized convolutions provide different value for the network, so perform several convolutions in parallel and concatenate the results (a simplified sketch follows this list).
- Smaller convolutions: features that are very local/high-res in texture.
- Larger convolutions: higher/more abstract features.
- Result: fewer parameters and better performance.
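A stripped-down PyTorch sketch of the parallel-convolutions-then-concatenate idea (the real Inception module also uses 1x1 bottlenecks and a pooling branch; channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    # Different-sized convolutions over the same input, concatenated on channels.
    def __init__(self, in_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 16, kernel_size=1)             # very local
        self.conv3 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)  # mid-scale
        self.conv5 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)  # more abstract

    def forward(self, x):
        return torch.cat([self.conv1(x), self.conv3(x), self.conv5(x)], dim=1)

out = NaiveInception(3)(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 48, 32, 32])
```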
ResNet 2015: Inspiration: network depth increases representation power (doesn’t always hold). ‘Residual Blocks’ allow creating much deeper networks.
Residual Block (sketched below):
- Repeat a simple network block, similar to RNNs.
- Pass the input along without transformation, alongside the ability to learn weights.
- Every layer takes in the output of the previous layer plus the raw, untransformed input, to learn something new.
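A minimal residual block sketch in PyTorch (real ResNet blocks also include batch normalization):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # A small learned transformation whose output is added to the raw input.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # pass the input along untransformed, then add
```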
SENet 2017: Squeeze-and-Excitation Network (sketched below).
- Added a parameter to each channel of a convolutional block so that the network can adaptively adjust the weighting of each channel based on each feature map/input.
- Trick: allow the network to learn the weighting of each individual channel.
- Note: this is applicable to any architecture, since it simply parametrises which filters to emphasise based on the content.
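A minimal squeeze-and-excitation sketch in PyTorch (the reduction factor of 16 follows the paper’s default; everything else is illustrative):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    # Learn one weighting per channel and rescale the feature maps by it.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # global average per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                        # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                             # adaptively re-weight each channel
```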
ILSVRC Challenge: evaluation for classification is based on the top-5 guesses. Human error is 5.1%; surpassed by networks in 2015.
Capsule Networks:
- Inspiration: consider what assumptions are made by the network and what information is thrown away.
- CNNs, due to their spatial invariance, throw away the hierarchy between simple and complex objects.
- Future challenge: design NNs that preserve this hierarchy and handle rotational variation.
Note: CNNs produce a pixel-level heat map of activations based on convolutions.
Scene understanding
Use-Cases:
- Precise boundaries of objects matter in medical applications and in driving.
- In driving, to mark the exact boundaries of the environment and ‘fuse’ this with data from other sensors, combining the semantic knowledge with 3D location in the real world.
FCN 2014:
- Repurposed ImageNet-pretrained nets.
- Replaced the fully connected layers with decoders that upsample the image to produce a heat map (a rough sketch follows this list).
- Skip connections are used to improve the coarseness of the upsampling.
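A rough PyTorch sketch of such a decoder head (channel counts and the 21-class output are illustrative; the real FCN also fuses skip connections from earlier layers):

```python
import torch.nn as nn

# Replaces the classifier's fully connected layers: upsample coarse features
# back toward input resolution and emit one heat map per class.
decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),  # 2x upsample
    nn.ReLU(),
    nn.ConvTranspose2d(256, 64, kernel_size=4, stride=2, padding=1),   # 2x upsample
    nn.ReLU(),
    nn.Conv2d(64, 21, kernel_size=1),  # per-pixel scores for 21 hypothetical classes
)
```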
SegNet 2015:
- Applied this to the driving context.
Dilated Convolutions 2015:
- The convolution and pooling operations reduce the resolution significantly.
- ‘Gridding’ the kernel maintains the local high-res textures while still capturing the spatial windows necessary (see the sketch below).
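In PyTorch this is just the dilation argument; a sketch (channel counts illustrative):

```python
import torch.nn as nn

# Standard 3x3 convolution: 3x3 receptive field.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated 3x3 convolution: the same 9 weights spread on a grid with gaps,
# covering a 5x5 area without extra parameters or loss of resolution.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
```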
DeepLab v1, v2 2016:
- Added Conditional Random Fields (CRFs): post-processing that smooths the segmentation by looking at the underlying image intensities.
Key Aspects of Segmentation
ResNet-DUC 2017:
Hybrid Dilated Convolution: the dilation of the convolution is spread apart from input to output.
FlowNet
The methods discussed so far disregard temporal dynamics, which are relevant in robotics.
Optical flow produces, for each pixel, the direction in which it moved and the magnitude of the movement. This allows us to take information detected in the first frame and propagate it forward.
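As an illustration (not the FlowNet method itself), OpenCV’s classical Farneback dense flow returns exactly this per-pixel displacement; the frame filenames here are hypothetical:

```python
import cv2

prev = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

# flow[y, x] = (dx, dy): where each pixel moved between the two frames.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Convert to magnitude and direction of movement per pixel.
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
```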
Challenge: Segmentation of images through time.
FlowNet 2 2016:
Process:
- Stacking networks as an approach.
- The ordering of the training dataset matters.
Use the output of the network to help propagate the information better. Can we figure out ways to use temporal information?
You can find me on Twitter @bhutanisanyam1 and connect with me on LinkedIn here. Here and here are two articles on my Learning Path to Self-Driving Cars.
Subscribe to my Newsletter for a weekly curated list of Deep Learning and Computer Vision articles.