
Will Transformers Replace CNNs in Computer Vision?

Louis Bouchard (@whatsai)

I explain Artificial Intelligence terms and news to non-experts.

In a couple of minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.

As a bonus, make sure you stay until the end of the video for a giveaway sponsored by NVIDIA GTC!

References

►My Newsletter (subscribe here to have a chance to win!): http://eepurl.com/huGLT5

►Register to the GTC event: https://www.nvidia.com/en-us/gtc/?ncid=ref-crea-331503

►DLI courses: https://www.nvidia.com/en-us/training/

►Paper: Liu, Z. et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", 2021, https://arxiv.org/abs/2103.14030v1

►Code: https://github.com/microsoft/Swin-Transformer

Video transcript

This video is about what is most probably the next generation of neural networks for all computer vision applications: the transformer architecture. You've certainly already heard about this architecture in the field of natural language processing, or NLP, mainly with GPT-3, which made a lot of noise in 2020. Transformers can be used as a general-purpose backbone for many different applications, not only NLP.

In a couple of minutes, you will know how this transformer architecture can be applied to computer vision with a new paper called the Swin Transformer by Ze Liu et al. from Microsoft Research.

Before diving into the paper, I just wanted to tell you to stay until the end of the video, where I will talk about the newsletter I just created and the next free NVIDIA GTC event happening in two weeks. You should definitely stay (or skip right to it, as I will provide the timeline as usual), because I will be hosting a giveaway in collaboration with NVIDIA GTC!

This video may be less flashy than usual, as it doesn't really show the results of one specific application. Instead, the researchers showed how to adapt the transformer architecture from text inputs to images, surpassing state-of-the-art convolutional neural networks in computer vision, which is much more exciting than a small accuracy improvement, in my opinion! And of course, they are providing the code for you to try yourself; the link is in the description.

But why are we trying to replace convolutional neural networks in computer vision applications? Because transformers can efficiently use a lot more memory and are much more powerful when it comes to complex tasks, provided, of course, that you have enough data to train them. Transformers also use the attention mechanism introduced in the 2017 paper "Attention Is All You Need."

Attention allows the transformer architecture to compute in a parallelized manner: it can simultaneously extract all the information we need from the input and the relations within it. CNNs, in comparison, are much more localized, using small filters to compress the information towards a general answer. While this architecture is powerful for general classification tasks, it loses spatial information that is necessary for many tasks such as instance recognition, because convolutions don't consider relations between distant pixels.

In NLP, the classical type of input is a sentence; in computer vision, it is an image. To quickly introduce the concept of attention, let's take a simple NLP example: sending a sentence into a transformer network to translate it. In this case, attention is basically measuring how each word in the input sentence is associated with each word in the translated output sentence. Similarly, there is also what we call self-attention, which can be seen as a measure of a specific word's effect on all the other words of the same sentence. This same process can be applied to images, calculating the attention between patches of the image and their relations to each other, as we will discuss further in the video.
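To make this more concrete, here is a minimal sketch of scaled dot-product self-attention over a sequence of tokens, which can be words or image patches. The names, shapes, and random projection matrices are purely illustrative and not the paper's implementation.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of tokens.

    x: (num_tokens, dim) -- token features (word embeddings or image patches).
    w_q, w_k, w_v: (dim, dim) -- learned projection matrices (random here).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # queries, keys, values
    scores = q @ k.transpose(0, 1) / (q.shape[-1] ** 0.5)   # token-to-token similarities
    weights = scores.softmax(dim=-1)                        # each row sums to 1
    return weights @ v                                      # every token becomes a weighted mix of all tokens

# Toy example: 6 tokens with 16-dimensional features.
dim = 16
x = torch.randn(6, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([6, 16])
```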

Now that we know transformers are very interesting, there is still a problem when it comes to computer vision applications. Indeed, just like the popular saying "a picture is worth a thousand words," pictures contain much more information than sentences, so we have to adapt the basic transformer architecture to process images effectively. This is what this paper is all about. The problem is that the computational complexity of self-attention is quadratic in image size, making the computation time and memory needs explode. Instead, the researchers replaced this quadratic complexity with a complexity that is linear in image size.
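To get a feel for the difference, here is a back-of-the-envelope count, assuming (illustratively) a 224×224 image split into 4×4-pixel patches and windows of 7×7 patches, as in the paper's default configuration. The exact constants don't matter; what matters is how the two costs grow with image size.

```python
# Rough count of pairwise attention scores (not FLOPs), just to compare growth.
patches_per_side = 224 // 4          # 4x4-pixel patches -> 56x56 = 3,136 patch tokens
num_patches = patches_per_side ** 2

global_pairs = num_patches ** 2      # every patch attends to every patch: ~9.8 million pairs

window = 7                           # each window contains 7x7 = 49 patches
pairs_per_window = (window * window) ** 2
num_windows = num_patches // (window * window)
windowed_pairs = num_windows * pairs_per_window   # ~154 thousand pairs

print(global_pairs, windowed_pairs)  # 9834496 153664
```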

The process to achieve this is quite simple. At first, like in most computer vision tasks, an RGB image is sent to the network. This image is split into patches, and each patch is treated as a token, with the RGB values of its pixels as features. To compare with NLP, you can see the overall image as the sentence and each patch as a word of that sentence.
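Here is a minimal sketch of this patch-partition step, assuming a 4×4-pixel patch size as in the paper. The tensor names and shapes are illustrative, not the official implementation.

```python
import torch

def partition_patches(image, patch_size=4):
    """Split an RGB image into non-overlapping patches and flatten each patch into a token.

    image: (channels, height, width), e.g. (3, 224, 224).
    Returns: (num_patches, patch_size * patch_size * channels) raw RGB token features.
    """
    c, h, w = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # patches: (c, h // patch_size, w // patch_size, patch_size, patch_size)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

tokens = partition_patches(torch.rand(3, 224, 224))
print(tokens.shape)  # torch.Size([3136, 48]) -- 56x56 patches, 4*4*3 = 48 RGB values each
```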

Self-attention is then applied within each window, a window being a group of neighboring patches. Then, the windows are shifted, resulting in a new window configuration in which self-attention is applied again. This creates connections between windows while maintaining the computational efficiency of this windowed architecture. This is very interesting when compared with convolutional neural networks, as it allows long-range pixel relations to appear.
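Here is a rough sketch of the window partition and the cyclic shift between two consecutive blocks. The window size and feature dimensions are placeholders, and the real model also masks attention across the wrapped-around borders, which is omitted here for brevity.

```python
import torch

def window_partition(feat, window=7):
    """Group a (height, width, dim) feature map into non-overlapping windows of window x window tokens."""
    h, w, d = feat.shape
    feat = feat.reshape(h // window, window, w // window, window, d)
    return feat.permute(0, 2, 1, 3, 4).reshape(-1, window * window, d)  # (num_windows, tokens, dim)

h = w = 56
feat = torch.randn(h, w, 96)  # 56x56 patch tokens with 96-dim features

# Block 1: regular windows -- self-attention runs independently inside each window.
windows = window_partition(feat)

# Block 2: shift the whole map by half a window, then partition again.
# Patches that sat at a window border now share a window with their former neighbors.
shifted = torch.roll(feat, shifts=(-3, -3), dims=(0, 1))
shifted_windows = window_partition(shifted)

print(windows.shape, shifted_windows.shape)  # torch.Size([64, 49, 96]) twice
```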

This was only for the first stage. The second stage is very similar, but it first concatenates the features of each group of two-by-two neighboring patches, downsampling the resolution by a factor of two. This procedure is repeated in Stages 3 and 4, producing feature map resolutions similar to those of typical convolutional networks such as ResNet and VGG.
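A minimal sketch of this patch-merging step is below. In the paper, the concatenated features are also passed through a linear layer that reduces the dimension, shown here as a final projection; all shapes are illustrative.

```python
import torch

def merge_patches(feat):
    """Concatenate each 2x2 group of neighboring patch features, halving the spatial resolution.

    feat: (height, width, dim) -> (height // 2, width // 2, 4 * dim)
    """
    top_left = feat[0::2, 0::2]
    bottom_left = feat[1::2, 0::2]
    top_right = feat[0::2, 1::2]
    bottom_right = feat[1::2, 1::2]
    return torch.cat([top_left, bottom_left, top_right, bottom_right], dim=-1)

feat = torch.randn(56, 56, 96)
merged = merge_patches(feat)              # (28, 28, 384)
reduce = torch.nn.Linear(4 * 96, 2 * 96)  # linear reduction applied after merging
print(reduce(merged).shape)               # torch.Size([28, 28, 192])
```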

You may say that this is highly similar to a convolutional architecture, with filters using dot products. Well, yes and no. The power of convolutions is that the filters use the same fixed weights everywhere, enabling the translation-invariance property of convolution and making it a powerful generalizer. In self-attention, the weights are not fixed globally; instead, they depend on the local context itself. Thus, self-attention takes into account each pixel, but also its relations to the other pixels. Also, the shifted window technique allows long-range pixel relations to appear. Unfortunately, these long-range relations only appear between neighboring windows, so very long-range relations are lost, showing that there is still room for improvement of the transformer architecture when it comes to computer vision.

As they state in the paper, "It is our belief that a unified architecture across computer vision and natural language processing could benefit both fields, since it would facilitate joint modeling of visual and textual signals and the modeling knowledge from both domains can be more deeply shared." And I completely agree. I think using a similar architecture for both NLP and computer vision could significantly accelerate the research process. Of course, transformers are still highly data-dependent, and nobody can say whether or not they will be the future of either NLP or computer vision. Still, this is undoubtedly a significant step forward for both fields!

Now that you've stayed this far, let's talk about an awesome upcoming event for our field: GTC. So what is GTC 2021? It is a week-long event offering over 1,500 talks from AI leaders like Yoshua Bengio, Yann LeCun, Geoffrey Hinton, and many more! The conference will start on April 12 with a keynote from the CEO of NVIDIA, where he will be hosting the three AI pioneers I just mentioned. This will be amazing! It is an official NVIDIA conference for AI innovators, technologists, and creatives, covering many exciting topics such as automotive, healthcare, data science, energy, deep learning, education, and much more. You don't want to miss it! Oh, and did I forget to mention that registration is completely free this year? So sign up right now and watch it with me. The link is in the description!

07:23

What's even cooler is that NVIDIA provided me 5 Deep Learning Institute credits that

07:28

you can use for an online, self-paced course of your choice worth around 30$ each!

07:34

The deep learning institute offers hands-on training in AI for developers, data scientists,

07:40

students, and researchers to get practical experience powered by GPUs in the cloud!

07:45

I think it's an awesome platform to learn, and it is super cool that they are offering

07:49

credits to give away, don't miss out on this opportunity!

To participate in this giveaway, you need to mention your favorite moment from the GTC keynote on April 12 at 8:30 am Pacific Time, using the hashtag #GTCWithMe and tagging me (@whats_ai) on LinkedIn or Twitter! I will also be live-streaming the event on my channel so we can watch it together and discuss it in the chat. Stay tuned for that, and please let me know what you think of the conference afterward!

NVIDIA also provided me with two extra codes to give away to those who subscribe to my newsletter! This newsletter is about sharing only ONE paper each week: there will be a video, an article, the code, and the paper itself. I will also add some of the projects I am working on, guides to learning machine learning, and other exciting news! It's the first link in the description, and I will draw the winners just after the GTC event!

Finally, one last word: I wanted to personally thank the four recent YouTube members! Huge thanks to ebykova, Tonia Spight-Sokoya, Hello Paperspace, and Martin Petrovski for your support, and to everyone watching the videos! See you in the next one!
