Transformer-based vision models are steadily evolving and are reported to be as good as convolutional models on classification, segmentation and action recognition tasks. Convolutional models are still the more popular choice for vision tasks, and we have a whole array of them. This blog delves into the Swin Transformer vision model, presented by the Microsoft Research team at the International Conference on Computer Vision (ICCV) 2021, and benchmarks its performance against several SOTA convolution-based models on a dog breed image classification task.
Will transformer-based models become the next big thing in computer vision? With transformers already a successful solution for language tasks, will they unify the various AI subfields and offer powerful solutions to more complex problems? So I rolled up my sleeves to evaluate how good they are, starting with the classification task.
The myriad of dog breeds with subtle differences in their physical appearance makes it challenging for veterinarians, animal shelter staff, dog owners and potential dog owners to identify the right breed, which they need to do in order to provide appropriate training and treatment and to meet the dog's nutritional needs. The data is sourced from the Stanford Dogs Dataset, which contains ~20K images of 120 dog breeds from around the world. This data has been split almost equally into train and test sets for the Kaggle competition Dog Breed Identification.
The objective is to build a dog breed identification system capable of correctly identifying breeds with minimal data, even breeds that look very similar. This is a multi-class classification task: for every image, the model has to predict a probability for each of the 120 breeds, and the breed with the highest probability is the most probable breed of the dog in the image.
Though there is no class imbalance, the data may be insufficient to train a neural network from scratch. Image augmentation using random image perturbations, together with pre-trained models, can circumvent this problem, as sketched below.
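As an illustration, a minimal augmentation pipeline built from Keras preprocessing layers could look like the following. The exact set of perturbations used for this problem may differ, and image_batch is a placeholder for a batch of training images.

import tensorflow as tf
from tensorflow.keras import layers

#illustrative augmentation pipeline built from random perturbations
augment = tf.keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
])

#image_batch: a (batch, height, width, 3) tensor of training images
augmented_batch = augment(image_batch, training=True)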
Top 5 breeds with the most images are scottish_deerhound, maltese_dog, afghan_hound, entlebucher and bernese_mountain_dog. Bottom 5 breeds with the least images are golden_retriever, brabancon_griffon, komondor, eskimo_dog and briard.
A quick analysis of the spatial dimensions of the training images is done to understand the distribution of image heights, widths and aspect ratios.
Images with very low (<0.5) or very high (>2.3) aspect ratios are treated as anomalies; an aspect ratio of about 1.5 is considered good. A sketch of this analysis follows.
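A minimal sketch of such an analysis, assuming image_paths holds the file paths of the training images:

import pandas as pd
from PIL import Image

records = []
for path in image_paths:
    with Image.open(path) as img:
        w, h = img.size
        records.append({'path': path, 'width': w, 'height': h, 'aspect_ratio': w / h})

df = pd.DataFrame(records)
#flag images with extreme aspect ratios as anomalies
anomalies = df[(df['aspect_ratio'] < 0.5) | (df['aspect_ratio'] > 2.3)]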
While analysing the various dog breeds, the following pairs of breeds were generally found to look alike.
This architecture is based on "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", developed by the Microsoft Research team. The paper describes an improved ViT architecture that produces a hierarchical representation of feature maps and reduces the computational complexity of the self-attention mechanism from quadratic to linear. It is shown to give results comparable to SOTA convolutional networks such as EfficientNet on the ImageNet classification problem.
The building blocks of this architecture are explained in the notes below:
In NLP, the tokens, which are the processing elements of a model, are the words in a sentence, and one token corresponds to one word. ViTs (Vision Transformers) treat image patches as tokens, where each patch is a partition of the image consisting of a group of neighbouring pixels. The size of one token in any ViT is patch_height * patch_width * number_of_channels, and the patch dimensions determine how many patches, or tokens, a single image yields. If the image size (H*W*3) is 224*224*3 pixels and the patch size is 4*4, we get 224/4 * 224/4 = 56*56 patches, i.e. 3136 tokens per image, and each token is of size 4*4*3 = 48 dimensions (48 pixel values). So the input to the architecture for this image consists of 3136 tokens, each of size 48. A quick way to verify these numbers is sketched below.
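The following sketch uses tf.image.extract_patches (not the Swin implementation itself) to reproduce the token count and token size for a 224*224*3 image with 4*4 patches:

import tensorflow as tf

image = tf.random.normal((1, 224, 224, 3))          #one 224x224 RGB image
#cut the image into non-overlapping 4x4 patches
patches = tf.image.extract_patches(images=image,
                                   sizes=[1, 4, 4, 1],
                                   strides=[1, 4, 4, 1],
                                   rates=[1, 1, 1, 1],
                                   padding='VALID')
print(patches.shape)                                #(1, 56, 56, 48) -> 56*56 = 3136 tokens
tokens = tf.reshape(patches, (1, -1, 48))           #(1, 3136, 48)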
The underlying mechanism of the Swin Transformer is analogous to that of any CNN-based architecture, where the spatial dimensions of the feature maps decrease while the number of channels increases. At every stage of its hierarchy, the Swin Transformer likewise reduces the number of image patches (tokens) while increasing the token dimension. Keeping this mechanism in mind makes the architecture easier to understand.
At every stage of the architecture, the number of tokens decreases while the token size increases. For a 224*224 input, the token grid of Swin-T shrinks from 56*56 tokens of dimension C in stage 1 to 28*28 (2C), 14*14 (4C) and 7*7 (8C) in the subsequent stages.
Apart from the patch partitioner, the Swin-T architecture is made up of 3 other building blocks - linear embedding, the Swin Transformer block and patch merging. These building blocks are repeated and process the feature maps in a hierarchical fashion.
The 3136 tokens of 48 dimensions each from the patch partitioner are fed to a feed-forward layer that embeds each 48-dimensional token into a feature vector of size 'C'. 'C' acts as the capacity of the transformer, and the Swin architecture has 4 variants based on it: Swin-T and Swin-S use C = 96, Swin-B uses C = 128 and Swin-L uses C = 192.
Image patching and linear embedding are jointly implemented as a single convolution whose kernel size and stride are equal to the patch size, and whose number of output channels is 'C'.
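A minimal sketch of this idea, assuming a 4x4 patch size and C = 96 (the Swin-T setting):

import tensorflow as tf
from tensorflow.keras.layers import Conv2D

patch_size, C = 4, 96
images = tf.random.normal((1, 224, 224, 3))

#a convolution with kernel_size == strides == patch_size cuts the image into
#non-overlapping patches and linearly projects each 48-value patch to C dimensions
patch_embed = Conv2D(filters=C, kernel_size=patch_size, strides=patch_size)
tokens = patch_embed(images)                        #(1, 56, 56, 96)
tokens = tf.reshape(tokens, (1, -1, C))             #(1, 3136, 96)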
The Swin Transformer block differs from the standard transformer block in the ViT architecture: in Swin Transformers, the standard multi-head self-attention (MSA) module is replaced by a window-based MSA module (W-MSA or SW-MSA, described below), while the other layers of the block are kept the same.
Stage 1 consists of 2 Swin Transformer blocks (refer to the image), where the first block uses the Window MSA (W-MSA) module and the second uses the Shifted Window MSA (SW-MSA) module. Within each block, the inputs and outputs of the W-MSA and SW-MSA layers pass through normalization layers, and the result is then fed to a 2-layer feed-forward network with Gaussian Error Linear Unit (GELU) activation. There are residual connections within each block and between the 2 blocks.
Window MSA (W-MSA) and Shifted Window MSA (SW-MSA) modules
The standard attention layer in ViT is a global one: it computes the attention of each patch with every other patch in the image, leading to complexity that is quadratic in the number of patches. This does not scale well to high-resolution images.
The self-attention in the W-MSA or SW-MSA module is a local one that is calculated only between patches within the same window of the image, not across windows.
Windows are larger partitions of the image, where each window comprises M*M patches. Replacing global self-attention with local self-attention reduces the computational complexity from quadratic to linear in the number of patches.
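For reference, the Swin paper quantifies this difference. For a feature map of $h \times w$ patches with embedding dimension $C$ and window size $M$ (7 by default), the complexities of the two attention schemes are:

\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2 C
\Omega(\mathrm{W\text{-}MSA}) = 4hwC^2 + 2M^2 hwC

The first term is identical in both; the second term drops from quadratic in the number of patches $hw$ to linear, since $M$ is a fixed constant.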
The key difference between the W-MSA and SW-MSA attention modules is in how the windows over the image are configured.
In the W-MSA module, a regular window partitioning strategy is followed: the image is evenly partitioned into non-overlapping windows starting from the top-left pixel, and each window contains M*M (M²) patches.
In the SW-MSA module, the window configuration is shifted relative to that of the W-MSA layer by displacing the windows by (M/2, M/2) patches from the regular partitioning.
Since attention in W-MSA is restricted to within each window, the shifted windows enable cross-window attention and thereby recover the benefits of global attention. This works because the boundaries of window W1 in the W-MSA layer are shared with windows W2, W4 and W5 in the SW-MSA layer, so global attention happens indirectly via local attention on shifted windows.
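To make the window configurations concrete, here is a simplified sketch (not the tfswin implementation) of partitioning a patch grid into M x M windows and producing the shifted configuration via a cyclic shift; the real implementation additionally masks attention across the wrapped-around window boundaries. It assumes the stage-1 Swin-T grid of 56 x 56 patches with C = 96 and M = 7.

import tensorflow as tf

def window_partition(x, window_size):
    #split a (B, H, W, C) patch grid into non-overlapping windows of
    #window_size x window_size patches -> (num_windows * B, M*M, C)
    B, H, W, C = x.shape
    M = window_size
    x = tf.reshape(x, (B, H // M, M, W // M, M, C))
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))
    return tf.reshape(x, (-1, M * M, C))

x = tf.random.normal((1, 56, 56, 96))               #stage-1 patch grid of Swin-T
#regular windows (W-MSA)
windows = window_partition(x, window_size=7)        #(64, 49, 96): 8x8 windows of 7x7 patches
#shifted windows (SW-MSA): cyclically shift the grid by M//2 = 3 before partitioning
shifted = tf.roll(x, shift=(-3, -3), axis=(1, 2))
shifted_windows = window_partition(shifted, window_size=7)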
The patch merging layer reduces the number of tokens as the network gets deeper. The first patch merging layer concatenates the features of each group of 2*2 neighbouring patches and applies a linear layer to the concatenated features.
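A simplified sketch of the idea (the official implementation differs in details such as normalization and how the 2*2 neighbours are gathered):

import tensorflow as tf
from tensorflow.keras.layers import Dense

def patch_merging(x, out_dim):
    #concatenate the features of each 2x2 group of neighbouring patches (C -> 4C)
    #and project them down with a linear layer, halving each spatial dimension
    B, H, W, C = x.shape
    x = tf.reshape(x, (B, H // 2, 2, W // 2, 2, C))
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))
    x = tf.reshape(x, (B, H // 2, W // 2, 4 * C))
    return Dense(out_dim)(x)

x = tf.random.normal((1, 56, 56, 96))
merged = patch_merging(x, out_dim=192)              #(1, 28, 28, 192): a quarter of the tokens, double the dimension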
The tfswin package on PyPI provides pretrained TF-Keras variants of the Swin Transformer and is built on top of the official PyTorch implementation. Its code is available on GitHub. tfswin is used here to train on the dog breed images.
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.models import Model
from tfswin import SwinTransformerBase224, preprocess_input

def build_model1(swintransformer):
    tf.keras.backend.clear_session()
    inputs = Input(shape=(resize_height, resize_width, 3))
    #preprocess the inputs into the format expected by the Swin backbone
    outputs = Lambda(preprocess_input)(inputs)
    outputs = swintransformer(outputs)
    outputs = Dense(num_classes, activation='softmax')(outputs)
    swin_model = Model(inputs=inputs, outputs=outputs)
    return swin_model

#build the model
swintransformer = SwinTransformerBase224(include_top=False, pooling='avg')
swin_model1 = build_model1(swintransformer)

#set the layers of the pretrained model as non-trainable
for layer in swin_model1.layers[2].layers:
    layer.trainable = False

#compile the model
swin_model1.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])
#Logloss of the test set using various ResNet variants
+------------+---------------+-------------------------+----------+
| Model Name | Retrained     | Top Layers Replacement  | Log_Loss |
+------------+---------------+-------------------------+----------+
| ResNet50   | None          | ConvBlock_FC_Output     | 0.96463  |
| ResNet50   | None          | GlobalAvgPooling_Output | 0.58147  |
| ResNet50   | last 4 layers | ConvBlock_FC_Output     | 2.10158  |
| ResNet50   | last 4 layers | GlobalAvgPooling_Output | 0.57019  |
+------------+---------------+-------------------------+----------+
Code corresponding to the ResNet50 model with the lowest log loss:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

def build_model():
    tf.keras.backend.clear_session()
    inputs = Input(shape=(resize_height, resize_width, 3))
    #apply the ResNet preprocess_input method to convert input images to those expected by ResNet
    processed_inputs = preprocess_input(inputs)
    #pretrained ResNet model (pooling='avg' takes care of the global average pooling of the ResNet features)
    base_model = ResNet50(weights="imagenet", include_top=False, pooling='avg')(processed_inputs)
    #output layer
    output = Dense(units=num_classes, activation='softmax', name='Output')(base_model)
    resnet_model = Model(inputs=inputs, outputs=output)
    return resnet_model

#build the model
resnet_model = build_model()

#set the layers of the ResNet pretrained model as non-trainable, except its last 4 layers, which are re-trained on this data
for layer in resnet_model.layers[3].layers[:-4]:
    layer.trainable = False

#compile the model
resnet_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])
print(resnet_model.summary())

history = resnet_model.fit(train_ds,
                           epochs=50,
                           validation_data=val_ds,
                           callbacks=callbacks_list)
#Logloss of the standalone model variants
+----------------------------+-------------+
| Model Name                 | Log_Loss    |
+----------------------------+-------------+
| EfficientNetV2M            | 0.28347     |
| Inception ResNet           | 0.28623     |
| NasNetLarge                | 0.33285     |
| Xception                   | 0.34187     |
| Inception_V3               | 0.54297     |
| EfficientNetV2M_GlobalAveg | 0.50423     |
| InceptionV3_GlobalAveg     | 0.46402     |
+----------------------------+-------------+
#Logloss of the ensembled model variants
+--------------------------------------------------------------------------+-----------+
| Model Name                                                               | Log_Loss  |
+--------------------------------------------------------------------------+-----------+
| Ensemble1 - EfficientNet, InceptionResNet, NasNet, Xception              | 0.17363   |
| Ensemble2 - EfficientNet, InceptionResNet, NasNet, Xception, InceptionV3 | 0.16914   |
| Ensemble3 - Ensemble2 with 50% dropout                                   | 0.16678   |
| Ensemble4 - Ensemble of various EfficientNet architectures               | 0.16519   |
+--------------------------------------------------------------------------+-----------+
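The ensembles above combine the outputs of the individual models. A common way to do this is to average the predicted probabilities; the sketch below shows this, assuming models is a list of the trained Keras models (each with its preprocessing built in, as in the code above) and test_ds yields test image batches. Whether this matches the exact ensembling scheme used for each ensemble here is an assumption.

import numpy as np

#models: list of trained Keras models (hypothetical); test_ds: dataset of test image batches
all_probs = [model.predict(test_ds) for model in models]    #each array: (num_images, 120)
ensemble_probs = np.mean(all_probs, axis=0)                 #average the softmax outputs
predicted_breed = np.argmax(ensemble_probs, axis=1)         #most probable breed per image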
Each of these models expects a different input format, and Keras provides a dedicated preprocessing function for each of them.
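For example, the Keras applications module exposes one preprocess_input per architecture family; the aliases below are illustrative names for the models benchmarked above:

from tensorflow.keras.applications.efficientnet_v2 import preprocess_input as effnetv2_preprocess
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input as inception_resnet_preprocess
from tensorflow.keras.applications.nasnet import preprocess_input as nasnet_preprocess
from tensorflow.keras.applications.xception import preprocess_input as xception_preprocess
from tensorflow.keras.applications.inception_v3 import preprocess_input as inception_v3_preprocess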
Benchmarking Outcome
+----------------------------------+------------+----------------------+----------+
| Model Name                       | Parameters | Train time (seconds) | Log_Loss |
+----------------------------------+------------+----------------------+----------+
| EfficientNet_ConvBlock_Output    | 54.7M      | ~260s                | 0.28347  |
| InceptionResNet_ConvBlock_Output | 56.1M      | ~260s                | 0.28623  |
| NASNetLarge_ConvBlock_Output     | 89.6M      | ~330s                | 0.33285  |
| XCeption_ConvBlock_Output        | 23.3M      | ~240s                | 0.34187  |
| InceptionV3_ConvBlock_Output     | 24.2M      | ~225s                | 0.54297  |
| EfficientNet_GlobalAvg           | 53.3M      | ~260s                | 0.50423  |
| InceptionV3_GlobalAvg            | 22M        | ~215s                | 0.46402  |
| swin_base224                     | 86.8M      | ~550s                | 0.47289  |
| swin_base384                     | 87M        | ~600s                | 0.41902  |
| swin_large384                    | 195M       | ~1000s               | 0.42207  |
+----------------------------------+------------+----------------------+----------+
The Swin Transformers performed better than all of the ResNet50 variants and the InceptionV3 models.
The log-loss of the Swin Transformers on this data is slightly higher than that of the InceptionResNet, EfficientNet, Xception and NasNetLarge models when their outputs are further processed by a convolutional block followed by max pooling.
Swin, however, performs on par with the EfficientNet model when their globally average-pooled outputs are passed directly to the output layer.
Swin models are larger than any of the convolutional models considered here, which will reduce system throughput and increase latency.
This study proved useful in understanding the application of transformer-based models for computer vision.
You can find the notebooks on my GitHub and the GUI for this use case here.
https://www.kaggle.com/competitions/dog-breed-identification
https://arxiv.org/pdf/2103.14030.pdf