
Top 10 Computer Vision Papers of 2021: HackerNoon Edition

by Louis Bouchard, January 2nd, 2022

While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects of AI were highlighted this year, such as ethics, bias, governance, transparency, and much more.

Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful about which technologies we choose to apply.

"Science cannot tell us what we ought to do, only what we can do."
- Jean-Paul Sartre, Being and Nothingness

Here are the 10 most interesting computer vision research papers of the year that I shared here on HackerNoon, in case you missed any of them. In short, it is a curated list of the latest breakthroughs in AI and Data Science, ordered by release date, each with a clear video explanation, a link to the HN article, and code (if applicable). Enjoy the read!

The full list is also available in a GitHub repository if you prefer it that way.

Here's a short video showcasing all the papers on this list!

[1] A. Ramesh et al., Zero-shot text-to-image generation, 2021. arXiv:2102.12092

OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.
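
For intuition, the recipe is essentially next-token prediction: the caption's text tokens and a grid of discrete image tokens are concatenated into one sequence, and a GPT-style transformer learns to predict each token from the previous ones. Here is a minimal PyTorch sketch of that idea; the vocabulary sizes, sequence lengths, and tiny backbone are illustrative assumptions, nothing like OpenAI's actual configuration.

```python
import torch
import torch.nn as nn

# Toy sizes: the real model uses far larger vocabularies, sequences, and depth.
text_vocab, image_vocab = 16384, 8192
text_len, image_len = 64, 256            # e.g. a 16x16 grid of discrete image tokens
d_model = 512

embed = nn.Embedding(text_vocab + image_vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
to_logits = nn.Linear(d_model, text_vocab + image_vocab)

# One training example: caption tokens followed by image tokens.
tokens = torch.randint(0, text_vocab + image_vocab, (1, text_len + image_len))
seq_len = tokens.size(1)
# Causal mask so each position only attends to earlier tokens (autoregressive).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

hidden = backbone(embed(tokens), mask=causal_mask)
logits = to_logits(hidden)               # train with cross-entropy against tokens shifted by one
print(logits.shape)                      # torch.Size([1, 320, 24576])
```

At inference, the caption is fixed and the image tokens are sampled one by one, then decoded back to pixels by the discrete VAE that produced them.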

Code and additional explanation in the HN article!

[2] Taming Transformers for High-Resolution Image Synthesis, Esser et al., 2020.

TL;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically guided, high-quality image synthesis.
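
Concretely, the method works in two stages: a convolutional VQGAN compresses the image into a small grid of discrete codebook indices (trained with a GAN loss for sharp reconstructions), and a transformer then models those indices autoregressively. The toy sketch below only shows the tokenization step of the first stage; the sizes and modules are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyVQEncoder(nn.Module):
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        # A small CNN downsamples a 256x256 image to a 16x16 grid of features.
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, img):
        z = self.conv(img)                                   # (B, dim, 16, 16)
        flat = z.flatten(2).transpose(1, 2)                  # (B, 256, dim)
        # Nearest codebook entry: each location becomes one discrete token.
        diffs = flat.unsqueeze(2) - self.codebook.weight[None, None]
        return diffs.pow(2).sum(-1).argmin(-1)               # (B, 256) token indices

# Stage 2 (not shown): a GPT-style transformer is trained to predict these
# 256 tokens autoregressively, and the VQGAN decoder maps sampled tokens
# back to a high-resolution image.
tokens = ToyVQEncoder()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256])
```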

Code and additional explanation in the HN article!

[3] Liu, Z. et al., 2021, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, arXiv preprint https://arxiv.org/abs/2103.14030v1

Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
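
The core trick is to compute self-attention inside small local windows (shifted from layer to layer so information can flow between them) rather than across the entire image, which keeps the cost roughly linear in image size. Below is a minimal PyTorch sketch of plain window-based attention; the window shifting, relative position bias, and the paper's exact dimensions are left out.

```python
import torch
import torch.nn as nn

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

class WindowAttention(nn.Module):
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # Attention is computed inside each local window, so the cost grows
        # linearly with image size instead of quadratically.
        windows = window_partition(x, self.window_size)
        out, _ = self.attn(windows, windows, windows)
        return out

# Toy usage: an 8x8 feature map with 96 channels, split into 4x4 windows.
feats = torch.randn(1, 8, 8, 96)
print(WindowAttention(96, 3, 4)(feats).shape)  # (4, 16, 96): 4 windows of 16 tokens each
```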

Code and additional explanation in the HN article!

[BONUS] Yuille, A.L., and Liu, C., 2021. Deep nets: What have they ever done for vision?. International Journal of Computer Vision, 129(3), pp.781–802, https://arxiv.org/abs/1805.04025.

"I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address."

Code and additional explanation in the HN article!

[4] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image, https://arxiv.org/pdf/2012.09855.pdf

The next step for view synthesis: Perpetual View Generation, where the goal is to take an image, fly into it, and explore the landscape!

Code and additional explanation in the HN article!

[5] Pandey et al., 2021, Total Relighting: Learning to Relight Portraits for Background Replacement, doi: 10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf.

Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture but have it look realistic? If you've already tried that, you know it isn't simple. You can't just take a picture of yourself in your home and swap the background for a beach. It just looks bad and unrealistic. Anyone will say "that's photoshopped" in a second. For movies and professional videos, you need perfect lighting and artists to reproduce a high-quality image, and that's super expensive. There's no way you can do that with your own pictures. Or can you?

Code and additional explanation in the HN article!

[6] Holynski, Aleksander, et al. “Animating Pictures with Eulerian Motion Fields.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still, creating amazing-looking looping videos...
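
Roughly, the method predicts a single static ("Eulerian") motion field for the picture and generates every frame by repeatedly warping along that field. The toy loop below illustrates that repeated-warping idea with a hand-made constant flow; the actual paper warps deep features in both directions and blends them so the loop closes seamlessly.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img by flow: each output pixel samples from (x + flow_x, y + flow_y)."""
    _, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1   # normalize to [-1, 1] for grid_sample
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)   # (1, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

frame = torch.rand(1, 3, 64, 64)                   # stand-in for the input picture (or its features)
flow = torch.zeros(1, 2, 64, 64)
flow[:, 0] = 1.5                                   # constant horizontal motion, 1.5 px per frame

frames = []
for _ in range(30):                                # every frame re-applies the *same* static field
    frame = warp(frame, flow)
    frames.append(frame)
```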

Code and additional explanation in the HN article!

[7] Michael Niemeyer and Andreas Geiger (2021), "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields", published in CVPR 2021.

Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!

Code and additional explanation in the HN article!

[8] Stepan Tulyakov*, Daniel Gehrig*, Stamatios Georgoulis, Julius Erbach, Mathias Gehrig, Yuanyou Li, Davide Scaramuzza, TimeLens: Event-based Video Frame Interpolation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 2021, http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf

TimeLens can understand the movement of the particles in between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that neither our smartphones nor any other model could reach before!

Code and additional explanation in the HN article!

[9] a) CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
b) StyleCLIPDraw: Schaldenbrand, P., Liu, Z. and Oh, J., 2021. StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis.

Have you ever dreamed of taking the style of a picture, like a cool TikTok drawing style, and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text, and you can try it right now with this new method and their Google Colab notebook available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and the algorithm will generate a new picture out of it! The results are extremely impressive, especially if you consider that they were made from a single line of text. Such a big step forward!
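
Under the hood, both methods optimize a drawing so that CLIP scores it as similar to your text prompt (StyleCLIPDraw adds a style loss against the reference picture on top). The sketch below shows only the CLIP-guidance part, and it optimizes raw pixels instead of the differentiable vector strokes the papers use; treat it as a simplified stand-in. It assumes OpenAI's CLIP package is installed (pip install git+https://github.com/openai/CLIP.git).

```python
import torch
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for a simple optimization loop

prompt = clip.tokenize(["a watercolor painting of a fox"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(prompt)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Start from noise and nudge the pixels toward the prompt.
# (CLIP's usual input normalization is skipped here for brevity.)
canvas = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([canvas], lr=0.05)

for step in range(200):
    img_feat = model.encode_image(canvas.clamp(0, 1))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()   # maximize cosine similarity with the text
    opt.zero_grad()
    loss.backward()
    opt.step()
```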

Code and additional explanation in the HN article!

[10] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B. and Lin, D., 2021. CityNeRF: Building NeRF at City Scale.

The model is called CityNeRF and grows from NeRF, which I previously covered on my channel. NeRF is one of the first models using radiance fields and machine learning to construct 3D models out of images. But NeRF is not that efficient and works for a single scale. Here, CityNeRF is applied to satellite and ground-level images at the same time to produce various 3D model scales for any viewpoint. In simple words, they bring NeRF to city-scale. But how?
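
For context, the NeRF machinery that CityNeRF builds on comes down to one operation: an MLP predicts a density and a color at points sampled along each camera ray, and those samples are alpha-composited into the pixel's color. Here is a minimal NumPy sketch of that volume-rendering step (with random values standing in for the MLP's outputs), not the paper's implementation.

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """densities: (N,), colors: (N, 3), deltas: (N,) spacing between samples along the ray."""
    alphas = 1.0 - np.exp(-densities * deltas)                               # opacity of each segment
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # light surviving so far
    weights = alphas * transmittance
    return (weights[:, None] * colors).sum(axis=0)                           # final RGB for this ray

# Toy ray with 64 samples; in NeRF these values come from querying the MLP
# at 3D points along the ray, and the loss compares the result to the real pixel.
n = 64
rgb = render_ray(np.random.rand(n), np.random.rand(n, 3), np.full(n, 0.02))
print(rgb)
```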

Code and additional explanation in the HN article!

[BONUS] Mokady, R., Hertz, A. and Bermano, A.H., 2021. ClipCap: CLIP Prefix for Image Captioning. https://arxiv.org/abs/2111.09734

We've seen AI generate images from other images using GANs. Then, there were models able to generate images of questionable quality from text. In early 2021, DALL-E was published, beating all previous attempts to generate images from text input using CLIP, a model that links images with text, as a guide. A very similar task, called image captioning, may sound really simple but is, in fact, just as complex. It is the ability of a machine to generate a natural description of an image. It's easy to simply tag the objects you see in the image, but it is quite another challenge to understand what's happening in a single 2-dimensional picture, and this new model does it extremely well...
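
ClipCap's wiring, in a nutshell: the CLIP image embedding is mapped by a small network into a handful of "prefix" embeddings, and a (mostly frozen) GPT-2 continues that prefix into a caption. The sketch below uses an untrained linear layer as the mapper and greedy decoding purely to show the plumbing; the prefix length and mapper are illustrative assumptions, and it requires the transformers library.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

clip_dim, prefix_len, gpt_dim = 512, 10, gpt2.config.n_embd
mapper = nn.Linear(clip_dim, prefix_len * gpt_dim)  # trained in the paper; random here

clip_embedding = torch.randn(1, clip_dim)           # would come from CLIP's image encoder
prefix = mapper(clip_embedding).view(1, prefix_len, gpt_dim)

# Greedy decoding conditioned only on the visual prefix.
generated = []
embeds = prefix
with torch.no_grad():
    for _ in range(20):
        logits = gpt2(inputs_embeds=embeds).logits[:, -1]
        next_id = logits.argmax(-1)
        generated.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_embed], dim=1)

print(tok.decode(generated))  # gibberish until the mapper is actually trained
```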

Code and additional explanation in the HN article!

Please tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!