1st post of the 5-minute paper’s series. About the series: This blog post marking the start of series (Hopefully) will be a walkthrough of the ideas shared in the paper and a few thoughts by me. Bag of Tricks for Image Classification with Convolutional Neural Networks The paper discusses a few tricks and does an analysis of their individual as well as combined contributions to training a few of the recent models. CNN Context: While we keep noticing a push in the State Of The Art accuracy for Image classification models and even though Deep networks have surpassed human level accuracy at the task of Image Classification; these breakthroughs haven’t solely been because of the architectures or just because the Neural net models keep getting “deeper” Learning Many of the improvements have been due to little “tricks” that usually go unpublished or are not extremely highlighted. The goal of the authors is to share these tricks along with extensive experiments based on each of the “trick”. These tricks have two aims: Enabling better accuracy:Tweaks to the NN models allow better accuracies. Enabling faster training.For ex: Utilisation of Tensor Cores on the new RTX cards using FP16 compute. Tricks The authors discuss a few tricks, lets go over them: **Linear scaling learning rate:**When training a Neural Network, we feed the images into our GPU in batches, as the memory permits. These are drawn by random in SGD and have some variance to them.As we increase the batch size, our batch to batch variance of images reduces, this allows more aggressive learning rate settings.  Learning rate is set using the formula: 0.1 x (Batch_size)/256 **Learning Rate Warmup:**When we train a Neural Net, we set the Learning Rate which determines how “fast” does the model learn or how aggressively do we penalize the predictions from it.As we start training our Neural Network, we can expect the weights to have more randomness. If we set an aggressive learning rate, that might cause the network to diverge.  To fix this, Learning rate warmup is used which allows the learning rate to be set to lower values initially and later increased.  This is done as follows: initially learning rate is set to ~0 and then increased for the “initial batches” as per the formula: Current epoch* (Learning Rate)/(Num_of_Initial_Epochs) is a hyperparameter, the number that we choose to keep until the model is “warmed up” Our final learning rate that we set for the NN. Num of initial epochs: Learning Rate: **Zero Y:**ResNets are composed of multiple Residual blocks. These can be denoted for simplicity as x+ block(x)These are usually followed by a batch normalization layers which are further scale transformed as Y*(x)+bias. (Y and bias are “learned” as we train the network) The paper suggests initializing for the layers at the end of ResNet blocks, Y should be initialized to 0. This reduces the number of learnable parameters initially. Regularisation is generally applied to all learnable parameters however, this often leads to overfitting.The authors suggest applying Regularisation just to the convolutional and fully connected layers, Biases and batch-norm parameters are not regularised. No Bias Decay: **Mixed Precision Training:**Mixed Precision training allows us to use Tensor Cores of the recent GPU(s) which are much faster, and thereby allow us to leverage fast FP16 operations and FP32 training for calculating weights or gradients, to allow “unsafe” operations to be done in Full Precision. For More details on Mixed Precision Training, I would suggest reading this writeup. The paper compares certain tweaks made to the ResNet architectures.  These include tweaking the strides inside of the Convolution layer, inspired by recent architectures. I’ll skip the details of this part since these require slightly detailed knowledge of strides and convolutions.  In general, the argument is tweaking the existing architectures also provides a boost to accuracy. Model Tweaks: As we train our model, we can expect it to get better with epochs (Assuming everything works nicely), as our loss values decrease, we’re closing into a “minima”. At this point, we’d like to decrease our learning rate. Instead of using Step decay where the learning rate is decayed in “steps” at fixed batch numbers, the paper suggests using a cosine scheduled learning rate as follows:  Learning Rate (Cosine_Decayed) = Initially set Learning rate * 0.5 * ( 1 + cos( Current_Batch * (Pi))/Total_Num_Batches)) Cosine Learning Rate: This suggests tweaking the final the final layers and the probability distribution for the predictions to allow the network to generalize better. I’ve skipped the details of how this is implemented. Label Smoothing: **Knowledge Distillation:**This approach “Distills” the knowledge from another pre-trained “teacher” model into our current student network.  This is done by adding a distillation loss that penalizes our student network based on how much its predictions are off when compared to the “teacher model”. Results: The paper shows the most promise in ImageNet, using the ResNet 50 model which shows the highest improvement. There is an extensive study of how useful or “hurtful” each of these individual tricks is towards a task. Finally, there is a comparison of using transfer learning on two tasks: Object Detection on the Pascal VOC dataset: Semantic Segmentation Conclusion & Personal thoughts: This is one of the first papers to embrace the little tricks that are usually not in the spotlight or not given as much importance in papers. The paper does a deep-dive into all of the ideas, showing extensive comparisons. As a little exercise, I believe it should be a good experiment to try a few of these tricks onto a personal target task and document the improvements. I’d also love to read more about such tricks in future papers or have a detailed section dedicated to such approaches. If you found this interesting and would like to be a part of My Learning Path , you can find me on Twitter here . If you’re interested in reading about Deep Learning and Computer Vision news, you can check out my newsletter here . If you’re interested in reading a few best advice from Machine Learning Heroes: Practitioners, Researchers, and Kagglers. Please click here

Initialized

Target

Twitter

“An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models”: Paper…

Interview with Kaggle Grandmaster, Lead Data Scientist at Dbrain: Artur Kunzin

Connect with me on Twitter

Nominated for 2022 - HackerNoon Contributor of the Year - Deep Learning

Nominated for 2022 - HackerNoon Contributor of the Year - Machine Learning

Nominated for 2022 - HackerNoon Contributor of the Year - Computer Science

Too Long; Didn't Read

“Bag of Tricks for Image Classification with Convolutional Neural Networks”: Paper Discussion

“Bag of Tricks for Image Classification with Convolutional Neural Networks”: Paper Discussion

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A Full Time ML Role, 1 Million Blog Views, 10k Podcast Downloads: A Community Taught ML Engineer

The Noonification: Use This 7-Step McKinsey Framework to Solve Any Problem (1/10/2023)

The Noonification: A Taxonomy of Inclusiveness (1/11/2024)

The Noonification: What is the InfiniteNature-Zero AI Model? (11/19/2022)

10 Ways AI Has Changed Our Lives

100 Days of AI, Day 8: Experimenting With Microsoft's Semantic Kernel Using GPT-4

A Full Time ML Role, 1 Million Blog Views, 10k Podcast Downloads: A Community Taught ML Engineer

The Noonification: Use This 7-Step McKinsey Framework to Solve Any Problem (1/10/2023)

The Noonification: A Taxonomy of Inclusiveness (1/11/2024)

The Noonification: What is the InfiniteNature-Zero AI Model? (11/19/2022)

10 Ways AI Has Changed Our Lives

100 Days of AI, Day 8: Experimenting With Microsoft's Semantic Kernel Using GPT-4

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps