You can find my code here. It might be more up to date than the article ;).
This is the second article in the self-driving cars series. If you want to know why I’m sharing this and more about my journey, please read this.
Classify traffic signs using a simple convolutional neural network.
Imagine you need to build a program that recognizes written digits.
This is a 5. But it could arguably be a 3.
What cutoffs or rules would you use to tell a 3 from a 5?
Instead of trying to handpick all the rules and build a very complicated program, researchers decided to show a computer thousands of examples and let it learn to solve the problem from experience. This is the beginning of machine learning.
One of the main problems with machine learning is feature extraction. Even though we show the computer thousands of examples, we still need to tell it which features to focus on. For complex problems, this is not good enough.
Deep learning models circumvent that. They learn by themselves what features they should focus on.
For the sake of brevity, I’m not going to dive into the mathematical explanations of how deep learning works. It took me around 20 hours to understand the concepts and use them. Instead, I’ll try to explain the intuition behind deep learning, and I’ll post some videos and lectures I used if you want to go deeper. No pun intended.
As humans, recognizing an object seems like a pretty simple task. There is hardly any effort involved on our part, at least not consciously. But there is actually a lot of work done by our brain before we can really understand what we’re looking at.
In the late 1950s, David Hubel and Torsten Wiesel, two famous neurophysiologists, performed experiments on cats to show how the neurons in the visual cortex work.
For one, they showed that nearby cells process information from nearby visual fields, forming a topographical map. Moreover, their work determined that neurons with similar functions are organized into columns, tiny computational machines that relay information to a higher region of the brain, where a visual image is progressively formed.
The brain basically combines low-level features, such as basic shapes and curves, and builds more complex shapes out of them.
A convolutional neural network works in a similar way. It first identifies low-level features, then learns to recognize and combine them into more complicated patterns. These different levels of features come from different layers of the network.
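To make the idea of a low-level feature concrete, here is a minimal, dependency-free sketch of what a single convolutional filter computes. The kernel is a hand-picked vertical-edge detector chosen for illustration; in a real network the kernel weights are learned during training, and deep learning libraries run this same sliding-window operation (strictly speaking a cross-correlation) far more efficiently.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both 2D lists) without padding and return
    the map of weighted sums. Like CNN libraries, this does not flip the kernel
    (strictly speaking it is a cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[y + i][x + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# Hand-picked (not learned) vertical-edge detector: responds when the right
# side of the window is brighter than the left side.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# Toy 4x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
print(conv2d_valid(image, VERTICAL_EDGE))  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The output is non-zero only where the window straddles the dark/bright boundary: the filter "fires" on the edge, which is the kind of low-level feature the first layers of a network pick up.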
Deep Learning is a fascinating field and I hope I gave you a clear enough introduction. I encourage you to watch the wonderful Stanford class about the subject.
If you prefer reading, I’d recommend Goodfellow, Bengio, and Courville’s book, Deep Learning.
Detecting and classifying traffic signs is a problem we must solve if we want self-driving cars.
The dataset we will be using is the German Traffic Sign Recognition Benchmark (GTSRB), which is available online.
It contains more than 50,000 images in total, divided into 43 different classes: speed limits, dangerous curves, slippery road… Here are some of them.
This dataset was used in a competition a few years ago. The winning entry correctly classified 99.46% of the signs, while human performance was measured at 98.84%. Yes, the machine was more accurate than the human, as it was better at handling the most difficult cases, such as a blurry image of a speed limit sign that could be mistaken for a different speed limit.
Before we start building our deep learning network, let’s analyze the data.
Here is the distribution of the different classes:
As you can see, the number of images varies a lot between classes. We are going to generate additional data to balance the classes and reduce the bias the network could develop towards the most frequent ones. This also gives the network more data to learn from.
An easy way to do that is to take the images and rotate them by a few degrees. Adding copies rotated by 5, -5, 10, and -10 degrees turns every original image into five images, so we can already increase the size of some classes fivefold.
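Measuring the imbalance only takes a label count. The sketch below assumes the labels are integer class IDs from 0 to 42 held in a plain Python list; the 500-image threshold is just the minimum per-class count this article ends up with after augmentation, used here for illustration.

```python
from collections import Counter

NUM_CLASSES = 43  # number of traffic-sign classes in the dataset


def class_distribution(labels, num_classes=NUM_CLASSES):
    """Return a list where entry c is the number of images with label c."""
    counts = Counter(labels)
    return [counts[c] for c in range(num_classes)]  # Counter gives 0 for missing keys


def underrepresented_classes(labels, threshold=500, num_classes=NUM_CLASSES):
    """Return the class IDs with fewer than `threshold` images.
    The default threshold of 500 is an illustrative choice."""
    dist = class_distribution(labels, num_classes)
    return [c for c, n in enumerate(dist) if n < threshold]


# Toy example with 3 classes instead of 43.
toy_labels = [0] * 600 + [1] * 120 + [2] * 450
print(class_distribution(toy_labels, num_classes=3))        # [600, 120, 450]
print(underrepresented_classes(toy_labels, num_classes=3))  # [1, 2]
```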
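Here is a minimal sketch of that augmentation, assuming each image is a 2D list of pixel values (a grayscale image; a colour image would be rotated one channel at a time). It uses a hand-written nearest-neighbour rotation about the image centre so that it runs without dependencies; this is not necessarily how the original code does it, and in practice a library routine such as `scipy.ndimage.rotate` would normally be used. Pixels that rotate in from outside the frame are filled with 0.

```python
import math


def rotate_nearest(image, degrees):
    """Rotate a 2D image (list of rows) about its centre by `degrees`, using
    nearest-neighbour sampling. Pixels whose source falls outside the original
    frame are set to 0. The output keeps the input size.
    Assumption: a single-channel (grayscale) image."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel that lands on (y, x).
            dx, dy = x - cx, y - cy
            src_x = int(round(cos_t * dx + sin_t * dy + cx))
            src_y = int(round(-sin_t * dx + cos_t * dy + cy))
            if 0 <= src_x < w and 0 <= src_y < h:
                out[y][x] = image[src_y][src_x]
    return out


def augment_with_rotations(images, labels, angles=(5, -5, 10, -10), classes=None):
    """Return the original images plus one rotated copy per angle.
    If `classes` is given, only images with those labels are augmented."""
    out_images, out_labels = list(images), list(labels)
    for img, label in zip(images, labels):
        if classes is not None and label not in classes:
            continue
        for angle in angles:
            out_images.append(rotate_nearest(img, angle))
            out_labels.append(label)
    return out_images, out_labels


img = [[r * 4 + c for c in range(4)] for r in range(4)]
imgs, labs = augment_with_rotations([img, img], [0, 1], classes={0})
print(len(imgs), labs)  # 6 [0, 1, 0, 0, 0, 0]
```

Each augmented image yields the original plus four rotated copies, hence the fivefold increase. Restricting `classes` to the under-represented ones raises their counts without inflating the classes that are already large.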
This is the distribution we get after this operation.
The data is more balanced, and each class has at least 500 images.
This part of the article describes the network architecture in detail. If you are not familiar with deep learning, you can skip to the results section.
The first layer is a convolutional layer with a 3×3 patch size, a stride of 1, SAME padding and a depth of 64.
The second and third layers are fully connected layers with a width of 512.
The final layer is a fully connected layer with a width of 43 (the number of classes).
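To get a feel for the size of this network, here is a small sketch that computes each layer's parameter count (weights plus biases). It is plain arithmetic, not the original TensorFlow code. The article doesn't state the input size or mention pooling, so the sketch assumes 32×32 RGB inputs and no pooling between the convolution and the first fully connected layer; change `input_shape` if your images differ. With SAME padding, the convolution output has spatial size ceil(input / stride), so a stride of 1 keeps the input's height and width.

```python
import math


def same_output_size(size, stride):
    """Spatial output size of a SAME-padded convolution (TensorFlow convention)."""
    return math.ceil(size / stride)


def layer_parameter_counts(input_shape=(32, 32, 3), patch=3, stride=1, depth=64,
                           fc_widths=(512, 512), num_classes=43):
    """Return (layer name, weights + biases) for conv -> fc -> fc -> logits.
    Assumptions: 32x32x3 input by default, and no pooling, so the conv output
    is flattened straight into the first fully connected layer."""
    h, w, channels = input_shape
    counts = [("conv", patch * patch * channels * depth + depth)]
    fan_in = same_output_size(h, stride) * same_output_size(w, stride) * depth
    for i, width in enumerate(fc_widths, start=1):
        counts.append((f"fc{i}", fan_in * width + width))
        fan_in = width
    counts.append(("logits", fan_in * num_classes + num_classes))
    return counts


layers = layer_parameter_counts()
for name, n in layers:
    print(f"{name:8s}{n:>12,}")
print("total", sum(n for _, n in layers))  # total 33841451
```

Under these assumptions the model has about 33.8 million parameters, and over 99% of them sit in the first fully connected layer, because the whole 32×32×64 convolution output is flattened into it.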
I used the Adam optimizer with its default parameters, as it is a popular choice that usually works well without much tuning.
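For reference, here is what Adam's update does, written for a single scalar parameter. The default values shown (learning rate 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8) are the ones recommended in the original Adam paper; most libraries use the same or very similar defaults. The quadratic objective is only a toy for illustration.

```python
import math


def adam_minimize(grad, x0, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a single scalar parameter.
    `grad` maps the current parameter value to its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def quad_grad(x):
    """Gradient of the toy objective f(x) = (x - 3)^2."""
    return 2 * (x - 3)


print(adam_minimize(quad_grad, 0.0, 1))     # about 0.001: the first step has size lr
print(adam_minimize(quad_grad, 0.0, 1000))  # moving towards 3, by less than lr per step
```

Because each update divides the running mean of the gradient by the square root of the running mean of its square, the step size stays on the order of the learning rate whatever the scale of the gradient, which is part of why the defaults often work without tuning.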
I used a batch size of 250 and 100 training epochs.
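With a batch size of 250, each epoch is split into ceil(N / 250) updates, where N is the number of training images. Below is a minimal sketch of the usual shuffled mini-batch loop. It only yields index batches, so it assumes the images and labels are stored in parallel sequences that can be indexed with them; the `seed` parameter is a hypothetical addition for reproducibility.

```python
import math
import random


def iterate_minibatches(num_samples, batch_size=250, epochs=100, seed=0):
    """Yield (epoch, list of sample indices) for shuffled mini-batches.
    Every sample appears exactly once per epoch; the last batch of an epoch
    may be smaller than batch_size. `seed` is a hypothetical parameter."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for epoch in range(epochs):
        rng.shuffle(indices)
        for start in range(0, num_samples, batch_size):
            yield epoch, indices[start:start + batch_size]


def updates_per_run(num_samples, batch_size=250, epochs=100):
    """Number of optimizer steps for the whole training run."""
    return epochs * math.ceil(num_samples / batch_size)


batches = list(iterate_minibatches(1000, epochs=2))
print(len(batches), updates_per_run(1000, epochs=2))  # 8 8
```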
I tried adding more convolutional layers, but they didn’t improve the results and increased the computation time by a lot. They didn’t feel necessary, as there is little variation in the position and framing of the signs in our pictures: most of them are already centered and cropped around the sign.
I used a medium-sized network because the signs are overall pretty simple in shape and color, but wide fully connected layers because there are still some variations in the signs’ shapes, colors and overall appearance.
I didn’t train for more than 100 epochs, as the accuracy wasn’t improving after that. I also wanted to keep the network light so that training wouldn’t take too much time on AWS, and the results were satisfying as is.
Up until this point, I only allowed myself to look at the validation accuracy. Once it was above 99% and I was happy with the result, I computed the accuracy on the test set.
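Here I picked 100 epochs by watching the accuracy plateau. A common way to automate that observation is early stopping with a "patience" window. The sketch below is that swapped-in technique, not what my training code does, and the `patience` and `min_delta` values are illustrative assumptions.

```python
def stop_epoch(val_accuracies, patience=5, min_delta=0.0):
    """Early stopping (illustrative): return the 1-based epoch at which training
    would stop, i.e. once the best validation accuracy has not improved by more
    than `min_delta` for `patience` consecutive epochs. Returns the last epoch
    if that never happens."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best + min_delta:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_accuracies)


history = [0.80, 0.90, 0.95, 0.97, 0.97, 0.96, 0.97, 0.98]  # made-up accuracies
print(stop_epoch(history, patience=3))  # 7
```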
Final validation accuracy: 0.9938
Final test accuracy: 0.9110
The drop of roughly 8 percentage points between the validation and test sets (0.9938 - 0.9110 = 0.0828) suggests that the model is overfitting.
Here are some things I could try to avoid that:
A look at the confusion matrix also shows where most of the errors occur:
As you can see, the majority of errors are located in the top left. These are all speed limit signs. The model seems able to detect that a sign is a speed limit, but it has trouble telling the different values apart.
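Computing a confusion matrix and listing the most frequent mistakes only needs the true and predicted labels. The sketch below assumes both are sequences of integer class IDs; rows are true classes and columns are predicted classes. In the GTSRB numbering the speed-limit signs have the lowest class IDs, which is why their confusions show up in the top-left corner.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts the images of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def most_confused(matrix, k=5):
    """Return up to k (true, predicted, count) triples for the most frequent
    off-diagonal entries, i.e. the most common mistakes. Ties keep row-major order."""
    n = len(matrix)
    errors = [(matrix[t][p], t, p)
              for t in range(n) for p in range(n)
              if t != p and matrix[t][p] > 0]
    errors.sort(key=lambda e: e[0], reverse=True)  # stable sort by count only
    return [(t, p, count) for count, t, p in errors[:k]]


# Toy example with 3 classes.
cm = confusion_matrix([0, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 2], num_classes=3)
print(cm)                 # [[1, 1, 0], [2, 1, 0], [0, 0, 1]]
print(most_confused(cm))  # [(1, 0, 2), (0, 1, 1)]
```

Running `most_confused` on the test predictions gives the (true, predicted) pairs worth inspecting first, such as one speed limit being read as another.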
I tested some random signs downloaded from the internet to see how the model would react:
The last two signs were not in our dataset.
The first three signs were recognized correctly with a confidence above 99%.
The 4th sign was (obviously) misclassified as a No entry sign, with a confidence of 80%.
The 5th sign was (obviously) misclassified as a speed limit, with a confidence of 99%. The model should do better than that.
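Confidence figures like these are usually the softmax probabilities computed from the network's 43 output scores (logits). Here is a minimal sketch of that computation and of a top-k readout; the logits in the example are made up for a 3-class toy case. The probabilities always sum to 1 across the known classes, so a sign that is not in the dataset still has to be assigned to one of them, sometimes with very high confidence, as happened with the 5th sign.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1.
    Subtracting the max first avoids overflow in math.exp."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k=3):
    """Return the k most likely (class_id, probability) pairs."""
    return sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]


probs = softmax([2.0, 9.0, 1.0])  # made-up logits for a 3-class toy example
print(top_k(probs, k=2))
```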
It took me around 20 hours to go from zero knowledge of deep learning to implementing a simple small network, and a few more hours to get satisfying results.
Deep learning is really impressive. I can’t wait to see how much I can still learn and progress in this domain.
Next time we’ll talk about behavioral cloning. We’ll drive a car in a simulator, and see how well our deep learning model can learn from that to drive by itself. Sounds exciting, right?! Stay tuned!
Also, please feel free to comment, ask questions, and give advice.