In a couple of minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
As a bonus, make sure you stay until the end of the video for a giveaway sponsored by NVIDIA GTC!
►My Newsletter (subscribe here to have a chance to win!): http://eepurl.com/huGLT5
►Register for the GTC event: https://www.nvidia.com/en-us/gtc/?ncid=ref-crea-331503
►DLI courses: https://www.nvidia.com/en-us/training/
►Paper: Liu, Z., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, 2021, https://arxiv.org/abs/2103.14030v1
This video is about what is most probably the next generation of neural networks for all computer
vision applications: The transformer architecture.
You've certainly already heard about this architecture in the field of natural language
processing, or NLP, mainly with GPT-3, which made a lot of noise in 2020.
Transformers can be used as a general-purpose backbone for many different applications and
not only NLP.
In a couple of minutes, you will know how this transformer architecture can be applied
to computer vision with a new paper called the Swin Transformer by Ze Liu et al. from
Before diving into the paper, I just wanted to tell you to stay until the end of the video,
where I will talk about my newsletter I just created and the next free NVIDIA GTC EVENT
happening in two weeks.
You should definitely stay, or skip right to it using the timeline I provide
as usual, because I will be hosting a giveaway in collaboration with NVIDIA GTC!
This video may be less flashy than usual as it doesn't really show the actual results
of a precise application.
Instead, the researchers showed how to adapt the transformer architecture from text inputs
to images, surpassing computer vision state-of-the-art convolutional neural networks, which is much
more exciting than a small accuracy improvement, in my opinion!
And of course, they are providing the code for you to implement yourself!
The link is in the description.
But why are we trying to replace convolutional neural networks for computer vision applications?
This is because transformers can efficiently use a lot more memory and are much more powerful
when it comes to complex tasks.
This is, of course, provided that you have the data to train them.
Transformers also use the attention mechanism introduced with the 2017 paper Attention is
all you need.
Attention allows the transformer architecture to compute in a parallelized manner.
It can simultaneously extract all the information we need from the input and its inter-relations,
unlike CNNs.
CNNs are much more localized, using small filters to compress the information towards
a general answer.
While this architecture is powerful for general classification tasks, it does not have the
spatial information necessary for many tasks like instance recognition.
This is because convolutions don't consider relations between distant pixels.
In the case of NLP, a classical type of input is a sentence and an image in a computer vision
To quickly introduce the concept of attention, let's take a simple NLP example: sending a
sentence into a transformer network to translate it.
In this case, attention is basically measuring how each word in the input sentence is associated
with each word in the translated output sentence.
Similarly, there is also what we call self-attention that could be seen as a measurement of a specific
word's effect on all other words of the same sentence.
This same process can be applied to images calculating the attention of patches of the
images and their relations to each other, as we will discuss further in the video.
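To make the idea of self-attention a bit more concrete, here is a minimal sketch in plain Python. It is not the paper's code: the three toy "word" vectors are made up, and each token is used directly as its own query, key, and value, whereas a real transformer learns separate projections for those.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(tokens):
    """Scaled dot-product self-attention without learned projections.

    Assumption for brevity: each token is its own query, key, and value.
    Returns the attention weights and the attended output vectors.
    """
    dim = len(tokens[0])
    weights, outputs = [], []
    for query in tokens:
        # How strongly does this token relate to every token in the sentence?
        scores = [dot(query, key) / math.sqrt(dim) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of all tokens.
        outputs.append([sum(w * tok[i] for w, tok in zip(row, tokens))
                        for i in range(dim)])
    return weights, outputs

# Hypothetical 2-dimensional embeddings for a three-word sentence.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token of the same sentence, itself included. For images, the tokens would simply be patches instead of words.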
Now that we know transformers are very interesting, there is still a problem when it comes to
computer vision applications.
Indeed, just like the popular saying "a picture is worth a thousand words," pictures contain
much more information than sentences, so we have to adapt the basic transformer's architecture
to process images effectively.
This is what this paper is all about.
The problem is that the computational complexity of the transformer's self-attention is quadratic
in image size,
which makes the computation time and memory needs explode.
Instead, the researchers replaced this quadratic computational complexity with one that is linear
in image size.
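As a rough illustration of quadratic versus linear, the sketch below evaluates the two cost formulas given in the paper for global self-attention and for window-based self-attention. The function names are mine, and the values used (96 channels, 7 x 7 windows, 56 patches per side) are just example settings.

```python
def global_attention_cost(h, w, channels):
    """Approximate cost of global self-attention over h * w patches.

    The second term grows with the square of the number of patches.
    """
    n = h * w
    return 4 * n * channels ** 2 + 2 * n ** 2 * channels

def window_attention_cost(h, w, channels, window=7):
    """Approximate cost when attention is restricted to window x window groups.

    With a fixed window size, every term grows linearly with h * w.
    """
    n = h * w
    return 4 * n * channels ** 2 + 2 * window ** 2 * n * channels

for side in (56, 112, 224):
    g = global_attention_cost(side, side, 96)
    l = window_attention_cost(side, side, 96)
    print(f"{side}x{side} patches: global={g:.3e} windowed={l:.3e} ratio={g / l:.1f}")
```

Doubling the side length multiplies the number of patches by four. The windowed cost also grows by exactly four, while the global cost grows far faster.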
The process to achieve this is quite simple.
At first, like most computer vision tasks, an RGB image is sent to the network.
This image is split into patches, and each patch is treated as a token.
And these tokens' features are the RGB values of the pixels themselves.
To compare with NLP, you can see the overall image as the sentence, and each patch
as a word of that sentence.
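Here is a small sketch of that patch-splitting step, assuming 4 x 4 patches as in the paper and a made-up 8 x 8 toy image. The function name is hypothetical, and the linear embedding layer that the paper applies to each token afterward is left out.

```python
def split_into_patches(image, patch=4):
    """Split an H x W image of (r, g, b) pixels into a grid of patch tokens.

    Each token is the concatenation of the raw RGB values of its pixels,
    so it has patch * patch * 3 features.
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    grid = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            token = [channel
                     for y in range(top, top + patch)
                     for x in range(left, left + patch)
                     for channel in image[y][x]]
            row.append(token)
        grid.append(row)
    return grid

# Toy image: the "red" channel stores the row, the "green" channel the column.
image = [[(y, x, 0) for x in range(8)] for y in range(8)]
tokens = split_into_patches(image)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # 2 2 48
```

The 8 x 8 image becomes a 2 x 2 grid of tokens, each carrying 48 raw RGB values.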
Self-attention is applied within groups of neighboring patches, here referred to as windows.
Then, the windows are shifted, resulting in a new window configuration to apply self-attention
This allows the creation of connections between windows while maintaining the computation
efficiency of this windowed architecture.
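The shifting idea can be sketched by simply labeling which window each patch falls into, before and after the shift. This is only an illustration with hypothetical helper names, using an 8 x 8 grid of patches and 4 x 4 windows shifted by half a window. The paper describes a more efficient batched variant based on a cyclic shift with masking, which this sketch does not reproduce.

```python
def window_index(row, col, window, shift=0):
    """Return the (row, col) label of the window containing a patch.

    shift=0 is the regular partition. A positive shift moves every
    window boundary by that many patches, creating smaller windows
    along the borders of the grid.
    """
    offset = window - shift if shift else 0
    return ((row + offset) // window, (col + offset) // window)

def group_patches(size, window, shift=0):
    """Group all patch coordinates of a size x size grid by window label."""
    groups = {}
    for row in range(size):
        for col in range(size):
            key = window_index(row, col, window, shift)
            groups.setdefault(key, []).append((row, col))
    return groups

regular = group_patches(8, 4)
shifted = group_patches(8, 4, shift=2)
print(len(regular), len(shifted))  # 4 9

# Two diagonal neighbors that sit on opposite sides of a window border.
a, b = (3, 3), (4, 4)
print(window_index(*a, 4) == window_index(*b, 4))                    # False
print(window_index(*a, 4, shift=2) == window_index(*b, 4, shift=2))  # True
```

Patches that could not attend to each other in the regular layout end up in the same window after the shift, which is how information crosses window borders from one layer to the next.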
This is very interesting when compared with convolutional neural networks as it allows
long-range pixel relations to appear.
This was only for the first stage.
The second stage is very similar but concatenates the features of each group of two by two neighboring
patches, downsampling the resolution by a factor of two.
This procedure is repeated in Stages 3 and 4, producing the same feature map resolutions
as typical convolutional networks like ResNet and VGG.
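The merging step between stages can be sketched as follows. The function name and the toy 8 x 8 grid with 2 features per patch are assumptions, and so is the order in which the four neighbors are concatenated. In the paper, a linear layer then reduces the concatenated features from 4 times to 2 times the original size, and that layer is omitted here.

```python
def merge_patches(grid):
    """Concatenate the features of each 2 x 2 group of neighboring patches.

    Halves the resolution along both axes and, without the linear
    reduction layer, quadruples the feature count per patch.
    """
    rows, cols = len(grid), len(grid[0])
    assert rows % 2 == 0 and cols % 2 == 0, "grid must have even dimensions"
    merged = []
    for r in range(0, rows, 2):
        merged_row = []
        for c in range(0, cols, 2):
            merged_row.append(grid[r][c] + grid[r][c + 1]
                              + grid[r + 1][c] + grid[r + 1][c + 1])
        merged.append(merged_row)
    return merged

# Toy Stage 1 output: an 8 x 8 grid of patches with 2 features each.
stage = [[[float(r), float(c)] for c in range(8)] for r in range(8)]
for name in ("Stage 2", "Stage 3", "Stage 4"):
    stage = merge_patches(stage)
    print(name, len(stage), "x", len(stage[0]), "patches,",
          len(stage[0][0]), "features")
```

The resolution is halved at every stage, which gives the same kind of hierarchical feature maps as a typical convolutional backbone.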
You may say that this is highly similar to a convolutional architecture and filters using
Well, yes and no.
The power of convolutions is that the filters use fixed weights globally, enabling the translation-invariance
property of convolution, making it a powerful generalizer.
In self-attention, the weights are not fixed globally.
Instead, they rely on the local context itself.
Thus, self-attention takes into account each pixel, but also its relation to the other
Also, their shifted window technique allows long-range pixel relations to appear.
Unfortunately, these long-range relations only appear between neighboring windows,
so very long-range relations are lost. This shows that there is still room for improvement
of the transformer architecture when it comes to computer vision.
As they state in the paper, "It is our belief that a unified architecture
across computer vision and natural language processing could benefit
both fields, since it would facilitate joint modeling of visual and textual signals and
the modeling knowledge from both domains can be more deeply shared."
And I completely agree.
I think using a similar architecture for both NLP and computer vision could significantly
accelerate the research process.
Of course, transformers are still highly data-dependent, and nobody can say whether or not they will
be the future of either NLP or computer vision.
Still, this is undoubtedly a significant step forward for both fields!
Now that you've stayed this far, let's talk about an awesome upcoming event for our field:
So what is GTC2021?
It is a weeklong event offering over 1,500 talks from AI leaders like Yoshua Bengio,
Yann LeCun, Geoffrey Hinton, and many more!
The conference will start on April 12 with a keynote from the CEO of NVIDIA, where he
will be hosting the three AI pioneers I just mentioned.
This will be amazing!
It is an official NVIDIA conference for AI innovators, technologists, and creatives.
The conference covers many exciting topics,
such as automotive, healthcare, data science, energy, deep learning, education, and much
You don't want to miss out!
Oh, and did I forget to mention that the registration is completely free this year?
So sign up right now and watch it with me.
The link is in the description!
What's even cooler is that NVIDIA provided me 5 Deep Learning Institute credits that
you can use for an online, self-paced course of your choice worth around $30 each!
The deep learning institute offers hands-on training in AI for developers, data scientists,
students, and researchers to get practical experience powered by GPUs in the cloud!
I think it's an awesome platform to learn, and it is super cool that they are offering
credits to give away. Don't miss out on this opportunity!
To participate in this giveaway, you need to mention your favorite moment from the GTC
keynote on April 12 at 8:30 am Pacific Time using the hashtag #GTCWithMe and tagging me
(@whats_ai) on LinkedIn or Twitter!
I will also be live-streaming the event on my channel to watch it together and discuss
it in the chat.
Stay tuned for that, and please let me know what you think of the conference afterward!
NVIDIA also provided me with two extra codes to give away to the ones subscribing to my
This newsletter is about sharing only ONE paper each week.
There will be a video, an article, the code, and the paper itself.
I will also add some of the projects I am working on, guides to learning machine learning,
and other exciting news!
It's the first link in the description, and I will draw the winners just after the GTC
Finally, just a final word, as I wanted to personally thank the four recent YouTube
Huge thanks to you ebykova, Tonia Spight-Sokoya, Hello Paperspace, and Martin Petrovski, for
your support and everyone watching the videos!
See you in the next one!