In a couple of minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
As a bonus, make sure you stay until the end of the video for a giveaway sponsored by NVIDIA GTC!
►My Newsletter (subscribe here to have a chance to win!): http://eepurl.com/huGLT5
►Register for the GTC event: https://www.nvidia.com/en-us/gtc/?ncid=ref-crea-331503
►DLI courses: https://www.nvidia.com/en-us/training/
►Paper: Liu, Z., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, 2021, https://arxiv.org/abs/2103.14030v1
►Code: https://github.com/microsoft/Swin-Transformer
This video is about what is most probably the next generation of neural networks for all computer vision applications: the transformer architecture. You've certainly already heard about this architecture in the field of natural language processing, or NLP, mainly with GPT-3, which made a lot of noise in 2020. Transformers can be used as a general-purpose backbone for many different applications, not only NLP.
In a couple of minutes, you will know how this transformer architecture can be applied to computer vision with a new paper called the Swin Transformer, by Ze Liu et al. from Microsoft Research.
Before diving into the paper, I just wanted to tell you to stay until the end of the video, where I will talk about the newsletter I just created and the next free NVIDIA GTC event happening in two weeks. You should definitely stay, or skip right to it with the timeline as usual, because I will be hosting a giveaway in collaboration with NVIDIA GTC!
This video may be less flashy than usual, as it doesn't really show the results of a specific application. Instead, the researchers show how to adapt the transformer architecture from text inputs to images, surpassing state-of-the-art convolutional neural networks in computer vision, which is much more exciting than a small accuracy improvement, in my opinion! And of course, they provide the code for you to try it yourself! The link is in the description.
But why are we trying to replace convolutional neural networks for computer vision applications? This is because transformers can make efficient use of a lot more memory and compute, and are much more powerful when it comes to complex tasks, provided, of course, that you have the data to train them.
Transformers also use the attention mechanism introduced in the 2017 paper "Attention Is All You Need." Attention allows the transformer architecture to compute in a parallelized manner: it can simultaneously extract all the information we need from the input and the relations within it. CNNs, in comparison, are much more localized, using small filters to compress the information towards a general answer. While this architecture is powerful for general classification tasks, it lacks the spatial information necessary for many tasks like instance recognition. This is because convolutions don't consider the relations between distant pixels.
In NLP, a typical input is a sentence; in computer vision, it is an image. To quickly introduce the concept of attention, let's take a simple NLP example: sending a sentence into a transformer network to translate it. In this case, attention basically measures how each word of the input sentence is associated with each word of the output, translated sentence. Similarly, there is also what we call self-attention, which can be seen as a measurement of a specific word's effect on all the other words of the same sentence. This same process can be applied to images by calculating the attention between patches of the image and their relations to each other, as we will discuss further in the video.
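To make this more concrete, here is a minimal sketch of scaled dot-product self-attention over a handful of token embeddings. The dimensions and the random embeddings are made up for illustration; this is not the paper's code, just the generic mechanism.

```python
import torch
import torch.nn.functional as F

# Toy example: 5 tokens (words or image patches), each a 16-dim embedding.
# The numbers are arbitrary, just to show the mechanism.
tokens = torch.randn(5, 16)

# Learned projections in a real model; random here for illustration.
d = 16
W_q, W_k, W_v = (torch.randn(16, d) for _ in range(3))

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# Each token is compared with every other token: a 5x5 score matrix.
scores = Q @ K.T / d ** 0.5
weights = F.softmax(scores, dim=-1)  # how much each token attends to the others

# New token representations: weighted mixtures of all the value vectors.
out = weights @ V
print(weights.shape, out.shape)  # torch.Size([5, 5]) torch.Size([5, 16])
```

The important part is that score matrix: every token is related to every other token in one shot, which is what makes the computation so parallel, but also what becomes expensive when there are many tokens.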
Now that we know transformers are very interesting, there is still a problem when it comes to computer vision applications. Indeed, just like the popular saying "a picture is worth a thousand words," pictures contain much more information than sentences, so we have to adapt the basic transformer architecture to process images effectively. This is what this paper is all about. The issue is that the computational complexity of self-attention is quadratic in the image size, which makes the computation time and memory needs explode. Instead, the researchers replaced this quadratic computational complexity with a complexity that is linear in the image size.
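To see why this matters, here is a back-of-the-envelope comparison of how many token pairs attention has to score when it is computed globally versus inside fixed-size local windows. The patch and window sizes below follow the paper's defaults (4x4-pixel patches, 7x7-patch windows), but the script itself is just my illustration, not the authors' code.

```python
# Rough count of attention score pairs for a square image,
# using 4x4-pixel patches and 7x7-patch windows (the paper's defaults).
def attention_pairs(image_size, patch=4, window=7):
    n = (image_size // patch) ** 2          # number of patch tokens
    global_pairs = n * n                    # every token attends to every token
    tokens_per_window = window * window
    num_windows = n // tokens_per_window
    windowed_pairs = num_windows * tokens_per_window ** 2  # attention stays local
    return n, global_pairs, windowed_pairs

for size in (224, 448, 896):
    n, g, w = attention_pairs(size)
    print(f"{size}px -> {n} tokens, global: {g:,} pairs, windowed: {w:,} pairs")
```

The global count grows quadratically with the number of tokens, while the windowed count grows linearly, which is exactly the point of the paper.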
The process to achieve this is quite simple. At first, like in most computer vision tasks, an RGB image is sent to the network. This image is split into patches, and each patch is treated as a token, with its features being the raw RGB values of its pixels. To compare with NLP, you can see the overall image as the sentence and each patch as a word of that sentence.
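Here is a small sketch of that patch-splitting step, assuming the paper's 4x4-pixel patches on a 224x224 RGB image; the point is the tensor shapes, not any particular implementation.

```python
import torch

# A dummy 224x224 RGB image (batch of 1). Values are random, just for shapes.
image = torch.randn(1, 3, 224, 224)

patch = 4  # 4x4-pixel patches, as in the paper
B, C, H, W = image.shape

# Cut the image into non-overlapping 4x4 patches and flatten each one:
# every patch becomes one "token" of 4*4*3 = 48 raw RGB features.
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 56, 56, 4, 4)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

print(tokens.shape)  # torch.Size([1, 3136, 48]) -> 56x56 tokens, 48 features each
```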
Self-attention is then applied within local windows, where a window is simply a group of neighboring patches. Then, the windows are shifted, producing a new window configuration, and self-attention is applied again. This creates connections between neighboring windows while maintaining the computational efficiency of this windowed architecture.
This is very interesting when compared with convolutional neural networks, as it allows long-range pixel relations to appear.
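Below is a minimal sketch of the window partitioning and the shift between two consecutive blocks, following the cyclic-shift idea with torch.roll. The window size of 7 matches the paper, but the shapes are arbitrary, and I leave out the attention computation itself and the masking of the wrapped-around regions.

```python
import torch

window = 7  # window size in patches, as in the paper
# A dummy grid of 56x56 patch tokens with 96 features each (a stage-1-like shape).
x = torch.randn(1, 56, 56, 96)

def partition_windows(feat, w):
    """Group the patch grid into non-overlapping w x w windows of tokens."""
    B, H, W, C = feat.shape
    feat = feat.reshape(B, H // w, w, W // w, w, C)
    return feat.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

# Regular windows: self-attention would be computed inside each 7x7 window.
windows = partition_windows(x, window)                 # (64, 49, 96)

# Shifted windows: cyclically shift the grid by half a window before partitioning,
# so the next block's windows straddle the previous ones and connect them.
shifted = torch.roll(x, shifts=(-(window // 2), -(window // 2)), dims=(1, 2))
shifted_windows = partition_windows(shifted, window)   # (64, 49, 96)

print(windows.shape, shifted_windows.shape)
```

In the real model, the regions that wrap around after the roll are masked so they do not attend to each other, which keeps the trick both correct and cheap.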
This was only for the first stage. The second stage is very similar, but it concatenates the features of each group of 2x2 neighboring patches, downsampling the resolution by a factor of two. This procedure is repeated twice, in Stages 3 and 4, producing the same feature map resolutions as those of typical convolutional networks like ResNets and VGG.
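Here is a sketch of that patch-merging step between stages: each group of 2x2 neighboring tokens is concatenated into one token, halving the resolution and multiplying the feature dimension by four (the paper then applies a linear layer to reduce it, which I leave out here). The shapes are illustrative.

```python
import torch

# A dummy stage-1 output: 56x56 patch tokens with 96 features each.
x = torch.randn(1, 56, 56, 96)

# Concatenate the features of every 2x2 group of neighboring tokens.
top_left     = x[:, 0::2, 0::2, :]
top_right    = x[:, 0::2, 1::2, :]
bottom_left  = x[:, 1::2, 0::2, :]
bottom_right = x[:, 1::2, 1::2, :]
merged = torch.cat([top_left, top_right, bottom_left, bottom_right], dim=-1)

print(merged.shape)  # torch.Size([1, 28, 28, 384]) -> half the resolution, 4x the features
```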
You may say that this is highly similar to a convolutional architecture, with filters using dot products. Well, yes and no. The power of convolutions is that the filters use fixed weights globally, enabling the translation-invariance property of convolutions and making them powerful generalizers. In self-attention, the weights are not fixed globally. Instead, they depend on the local context itself. Thus, self-attention takes into account each pixel, but also its relation to all the other pixels.
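A quick way to see the difference: a convolution applies the same learned kernel everywhere, while attention weights are recomputed from the input itself. The toy snippet below just contrasts the two with made-up shapes; it is not meant to mirror either architecture exactly.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 8, 8)  # a small dummy feature map: 16 channels, 8x8 positions

# Convolution: one fixed 3x3 kernel, reused at every spatial location.
kernel = torch.randn(16, 16, 3, 3)
conv_out = F.conv2d(x, kernel, padding=1)        # (1, 16, 8, 8)

# Self-attention: the mixing weights come from the input, so they change with it.
tokens = x.flatten(2).transpose(1, 2)            # (1, 64, 16) -> one token per position
scores = tokens @ tokens.transpose(1, 2) / 16 ** 0.5
weights = F.softmax(scores, dim=-1)              # (1, 64, 64), input-dependent
attn_out = weights @ tokens                      # (1, 64, 16)

print(conv_out.shape, attn_out.shape)
```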
Also, their shifted window technique allows longer-range pixel relations to appear. Unfortunately, these relations only appear between neighboring windows, so very long-range relations are lost, showing that there is still room for improvement in the transformer architecture when it comes to computer vision.
As they state in the paper, "It is our belief that a unified architecture across computer vision and natural language processing could benefit both fields, since it would facilitate joint modeling of visual and textual signals and the modeling knowledge from both domains can be more deeply shared." And I completely agree.
I think using a similar architecture for both NLP and computer vision could significantly accelerate the research process. Of course, transformers are still highly data-dependent, and nobody can say whether or not they will be the future of either NLP or computer vision. Still, this is undoubtedly a significant step forward for both fields!
Now that you've stayed this far, let's talk about an awesome upcoming event for our field: GTC. So what is GTC 2021? It is a week-long event offering over 1,500 talks from AI leaders like Yoshua Bengio, Yann LeCun, Geoffrey Hinton, and many more! The conference will start on April 12 with a keynote from the CEO of NVIDIA, where he will be hosting the three AI pioneers I just mentioned. This will be amazing!
It is an official NVIDIA conference for AI innovators, technologists, and creatives. The talks cover many exciting topics, such as automotive, healthcare, data science, energy, deep learning, education, and much more. You don't want to miss it! Oh, and did I forget to mention that registration is completely free this year? So sign up right now and watch it with me. The link is in the description!
What's even cooler is that NVIDIA provided me with 5 Deep Learning Institute credits that you can use for an online, self-paced course of your choice, each worth around $30! The Deep Learning Institute offers hands-on training in AI for developers, data scientists, students, and researchers to get practical experience, powered by GPUs in the cloud! I think it's an awesome platform to learn on, and it is super cool that they are offering credits to give away. Don't miss out on this opportunity!
To participate in this giveaway, you need to mention your favorite moment from the GTC keynote on April 12 at 8:30 am Pacific time, using the hashtag #GTCWithMe and tagging me (@whats_ai) on LinkedIn or Twitter! I will also be live-streaming the event on my channel so we can watch it together and discuss it in the chat. Stay tuned for that, and please let me know what you think of the conference afterward!
NVIDIA also provided me with two extra codes to give away to those who subscribe to my newsletter! This newsletter is about sharing only ONE paper each week. There will be a video, an article, the code, and the paper itself. I will also add some of the projects I am working on, guides to learning machine learning, and other exciting news! It's the first link in the description, and I will draw the winners just after the GTC event!
Finally, just a final word, as I wanted to personally thank the four recent YouTube members! Huge thanks to ebykova, Tonia Spight-Sokoya, Hello Paperspace, and Martin Petrovski for your support, and to everyone watching the videos! See you in the next one!