Ten Trending Academic Papers on the Future of Computer Vision

Written by himanshu_ragtah | Published 2019/07/30
Tech Story Tags: future-of-computer-vision | computer-vision | academic-paper | deep-fakes | facial-recognition | cvpr | hackernoon-top-story

TLDR: Below is a list of the top 10 papers everyone was talking about at CVPR 2019, covering DeepFakes, facial recognition, reconstruction, and more. Highlights include learning individual styles of conversational gesture and synthesizing a corresponding video of the speaker; learning full-body neural avatars, i.e. deep networks that produce full-body renderings of a person for varying body pose and camera position; and a novel face detection network with three contributions addressing key aspects of face detection, including better feature learning.

If you couldn’t make it to CVPR 2019, no worries. Below is a list of top 10 papers everyone was talking about, covering DeepFakes, Facial Recognition, Reconstruction, & more.
---------------------------------------------------------

1. Learning Individual Styles of Conversational Gesture

TLDR: Given audio speech input, they generate plausible gestures to go along with the sound and synthesize a corresponding video of the speaker.
Model/Architecture used: A speech-to-gesture translation model. A convolutional audio encoder downsamples the 2D spectrogram and transforms it into a 1D signal. The translation model, G, then predicts a corresponding temporal stack of 2D poses. L1 regression to the ground-truth poses provides a training signal, while an adversarial discriminator, D, ensures that the predicted motion is both temporally coherent and in the style of the speaker (a code sketch of this objective appears below).
Model accuracy: The researchers qualitatively compare speech-to-gesture translation results to the baselines and the ground-truth gesture sequences; the tables in the paper show lower loss and higher PCK (percentage of correct keypoints) for the new model.
Datasets used: A speaker-specific gesture dataset built by querying YouTube; in total, there are 144 hours of video. The data is split into 80% train, 10% validation, and 10% test sets, such that each source video appears in only one set.
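
Below is a minimal, PyTorch-style sketch of this training objective, assuming a generator G that maps encoded audio features to a temporal stack of 2D poses and a discriminator D over pose sequences; the hinge-style adversarial term, module names and shapes are illustrative, not the authors' code.

```
import torch
import torch.nn.functional as F

def generator_loss(G, D, audio_feats, gt_poses, lambda_adv=1.0):
    """L1 regression to ground-truth poses plus an adversarial term
    that encourages temporally coherent, speaker-style motion."""
    pred_poses = G(audio_feats)                 # (B, T, num_keypoints, 2)
    l1 = F.l1_loss(pred_poses, gt_poses)        # regression to ground truth
    adv = -D(pred_poses).mean()                 # try to fool the discriminator
    return l1 + lambda_adv * adv

def discriminator_loss(D, real_poses, fake_poses):
    """Discriminator learns to separate real gesture sequences from predicted ones."""
    real = F.relu(1.0 - D(real_poses)).mean()              # hinge loss on real sequences
    fake = F.relu(1.0 + D(fake_poses.detach())).mean()     # hinge loss on generated sequences
    return real + fake
```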
---------------------------------------------------------

2. Textured Neural Avatars

TLDR: The researchers present a system for learning full-body neural avatars, i.e. deep networks that produce full-body renderings of a person for varying body pose and camera position: a neural, free-viewpoint rendering of human avatars that does not require reconstructing geometry.
Model/Architecture used: Overview of the textured neural avatar system: the input pose is defined as a stack of "bone" rasterizations (one bone per channel). The input is processed by a fully-convolutional network (generator) to produce a stack of body-part assignment maps and a stack of body-part coordinate maps. These stacks are then used to sample the learned body texture maps at the locations prescribed by the coordinate stack, with the weights prescribed by the assignment stack, to produce the RGB image (a sampling sketch appears below). The last map in the assignment stack corresponds to the background probability. During learning, the mask and the RGB image are compared with ground truth, and the resulting losses are backpropagated through the sampling operation into the fully-convolutional network and the texture maps, updating both.
Model accuracy: Outperforms the other compared methods in terms of structural similarity (SSIM) but underperforms V2V (video-to-video translation) in terms of Fréchet Inception Distance (FID).
Datasets used:
- 2 subsets from the CMU Panoptic dataset collection
- the researchers' own multi-view sequences of three subjects, captured with a rig of seven cameras spanning approximately 30 degrees
- 2 short monocular sequences from another paper and a YouTube video
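
A rough sketch of the texture-sampling step described above, assuming per-part UV coordinate maps, soft part-assignment weights and learned per-part textures; grid_sample performs the differentiable lookup, and all names and shapes here are illustrative rather than the authors' implementation.

```
import torch
import torch.nn.functional as F

def render_from_texture(part_coords, part_weights, textures):
    """
    part_coords:  (P, H, W, 2)   per-part UV coordinates in [-1, 1]
    part_weights: (P, H, W)      soft body-part assignment maps
    textures:     (P, 3, Ht, Wt) learned RGB texture per body part
    Returns a (3, H, W) RGB rendering as a weighted sum of sampled textures.
    """
    # differentiable texture lookup at the predicted coordinates
    sampled = F.grid_sample(textures, part_coords, align_corners=False)  # (P, 3, H, W)
    # weight each part's contribution by its assignment map and sum over parts
    rgb = (sampled * part_weights.unsqueeze(1)).sum(dim=0)               # (3, H, W)
    return rgb
```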
---------------------------------------------------------

3. DSFD: Dual Shot Face Detector

TLDR: They propose a novel face detection network with three contributions that address three key aspects of face detection: better feature learning, progressive loss design, and anchor assign-based data augmentation.
Model/Architecture used: The DSFD framework uses a Feature Enhance Module on top of a feedforward VGG/ResNet architecture to generate enhanced features from the original features, along with two loss layers: a first-shot progressive anchor loss (PAL) on the original features and a second-shot PAL on the enhanced features (a loose sketch of the feature-enhancement idea is below).
Model accuracy: Extensive experiments on the popular WIDER FACE and FDDB benchmarks demonstrate the superiority of DSFD (Dual Shot Face Detector) over state-of-the-art face detectors such as PyramidBox and SRN.
Datasets used:  WIDER FACE and FDDB
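
The exact Feature Enhance Module is best taken from the paper; the following is only a loose sketch of the general idea, fusing a feature level with its up-sampled neighbour and enlarging the receptive field with dilated convolutions, with arbitrarily chosen layer sizes.

```
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhanceSketch(nn.Module):
    """Loose sketch: fuse a feature level with the up-sampled next level,
    then enlarge the receptive field with parallel dilated convolutions."""
    def __init__(self, channels=256):
        super().__init__()
        self.cur_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.up_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 3))

    def forward(self, cur, higher):
        # bring the coarser level to the current resolution and fuse element-wise
        up = F.interpolate(self.up_conv(higher), size=cur.shape[-2:], mode="nearest")
        fused = self.cur_conv(cur) * up
        # parallel dilated convolutions, summed to keep the channel count
        return sum(conv(fused) for conv in self.dilated)
```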
---------------------------------------------------------

4. GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction

TLDR: The proposed deep fitting approach can reconstruct high-quality texture and geometry from a single image with precise identity recovery. The reconstructions shown in the paper are represented by a vector of 700 floating-point values and rendered without any special effects (the depicted texture is reconstructed by the model, and none of the features are taken directly from the image).
Model/Architecture used: A 3D face reconstruction is rendered by a differentiable renderer. Cost functions are formulated mainly in terms of identity features from a pretrained face recognition network, and they are optimized by flowing the error all the way back to the latent parameters with gradient descent (a simplified fitting loop is sketched below). The end-to-end differentiable architecture makes it possible to use computationally cheap and reliable first-order derivatives for optimization, and thus to employ deep networks as a generator (i.e., a statistical model) or as a cost function.
Model accuracy: Accuracy results for the meshes on the MICC dataset using point-to-plane distance: the table reports the mean error and standard deviation, which are lowest for the proposed model.
Datasets used: MoFA-Test, MICC, Labeled Faces in the Wild (LFW), BAM
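
A heavily simplified sketch of the fitting loop, assuming a differentiable generator/renderer render(latent) that produces an image from the latent parameters and a frozen, pretrained face-recognition network id_net; the real system optimizes several losses, and only the identity term is shown here.

```
import torch
import torch.nn.functional as F

def fit_to_image(render, id_net, target_img, latent_dim=700, steps=200, lr=0.05):
    """Gradient-descent fitting of latent face parameters so that the rendered
    face matches the target photo in identity-feature space."""
    latent = torch.zeros(1, latent_dim, requires_grad=True)
    target_feat = id_net(target_img).detach()       # identity embedding of the photo
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        rendered = render(latent)                   # differentiable rendering
        feat = id_net(rendered)
        loss = 1.0 - F.cosine_similarity(feat, target_feat).mean()
        opt.zero_grad()
        loss.backward()                             # error flows back to the latent parameters
        opt.step()
    return latent.detach()
```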
---------------------------------------------------------

5. DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images

TLDR: DeepFashion2 provides a new benchmark for detection, pose estimation, segmentation, and re-identification of clothing images.
Model/Architecture used: Match R-CNN contains three main components: a feature extraction network (FN), a perception network (PN), and a match network (MN).
Model accuracy: Match R-CNN achieves a top-20 accuracy of less than 0.7 even with ground-truth bounding boxes provided, indicating that the retrieval benchmark is challenging (a sketch of how top-k retrieval accuracy is computed appears below).
Datasets used: The DeepFashion2 dataset contains 491K diverse images of 13 popular clothing categories from both commercial shopping stores and consumers
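
For context on the retrieval metric quoted above, here is a small, generic sketch of how top-k retrieval accuracy over embedding similarities can be computed; it is an illustration, not the Match R-CNN evaluation code.

```
import torch

def top_k_accuracy(query_emb, gallery_emb, query_labels, gallery_labels, k=20):
    """Fraction of queries whose correct item appears among the k most
    similar gallery embeddings (cosine similarity)."""
    q = torch.nn.functional.normalize(query_emb, dim=1)
    g = torch.nn.functional.normalize(gallery_emb, dim=1)
    sims = q @ g.t()                                   # (num_queries, num_gallery)
    topk = sims.topk(k, dim=1).indices                 # indices of the k nearest gallery items
    hits = (gallery_labels[topk] == query_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```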
---------------------------------------------------------

6. Inverse Cooking: Recipe Generation from Food Images

TLDR: Facebook researchers use AI to generate recipes from food images 
Model/Architecture used: A recipe generation model: image features are extracted with an image encoder; ingredients are predicted by the ingredient decoder and encoded into ingredient embeddings with the ingredient encoder; the cooking instruction decoder then generates a recipe title and a sequence of cooking steps by attending to the image embeddings, the ingredient embeddings, and previously predicted words.
Model accuracy: A user study demonstrates the superiority of their system over state-of-the-art image-to-recipe retrieval approaches; it outperforms both the human baseline and retrieval-based systems, obtaining an F1 of 49.08% (a good F1 score means low false positives and low false negatives; see the short example below).
Datasets used: They evaluate the whole system on the large-scale Recipe1M dataset
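
An F1 score like the one above is typically computed over predicted versus ground-truth ingredient sets; as a quick illustration (not the authors' evaluation code):

```
def ingredient_f1(predicted, ground_truth):
    """F1 between a predicted ingredient set and the ground-truth set:
    the harmonic mean of precision and recall."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    if not predicted or not ground_truth:
        return 0.0
    tp = len(predicted & ground_truth)          # correctly predicted ingredients
    precision = tp / len(predicted)
    recall = tp / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two of three predicted ingredients are correct
print(ingredient_f1({"flour", "sugar", "salt"}, {"flour", "sugar", "butter", "eggs"}))
```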
---------------------------------------------------------

7. ArcFace: Additive Angular Margin Loss for Deep Face Recognition

TLDR: ArcFace obtains more discriminative deep features and shows state-of-the-art performance on the MegaFace Challenge in a reproducible way.
Model/Architecture used: To enhance intra-class compactness and inter-class discrepancy, they propose the Additive Angular Margin Loss (ArcFace), which inserts a geodesic distance margin between the sample and the class centres, enhancing the discriminative power of the face recognition model (a compact sketch of the loss is below).
Model accuracy: Comprehensive experiments reported in the paper demonstrate that ArcFace consistently outperforms the state of the art.
Datasets used: They employ CASIA, VGGFace2, MS1MV2 and DeepGlint-Face (including MS1M-DeepGlint and Asian-DeepGlint) as training data in order to conduct a fair comparison with other methods. Other datasets used: LFW, CFP-FP, AgeDB-30, CPLFW, CALFW, YTF, MegaFace, IJB-B, IJB-C, Trillion-Pairs, iQIYI-VID
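
A compact sketch of the additive angular margin idea, assuming L2-normalized embeddings and class-centre weights; the margin m and scale s are the usual hyperparameters, and this is an illustration rather than the reference implementation.

```
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, weight, labels, s=64.0, m=0.5):
    """Cosine logits with an additive angular margin on the target class:
    cos(theta + m) for the ground-truth class, cos(theta) elsewhere."""
    emb = F.normalize(embeddings, dim=1)             # unit-length features
    w = F.normalize(weight, dim=1)                   # unit-length class centres
    cos = emb @ w.t()                                # (B, num_classes) = cos(theta)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=w.shape[0]).bool()
    cos_margin = torch.where(target, torch.cos(theta + m), cos)
    return s * cos_margin                            # feed these logits to cross-entropy

# usage: loss = F.cross_entropy(arcface_logits(emb, W, y), y)
```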
---------------------------------------------------------

8. Fast Online Object Tracking and Segmentation: A Unifying Approach

TLDR: The method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task.
Model/Architecture used: SiamMask aims at the intersection between the tasks of visual tracking and video object segmentation to achieve high practical convenience. Like conventional object trackers, it relies on a simple bounding-box initialisation and operates online. Unlike state-of-the-art trackers such as ECO, however, SiamMask is able to produce binary segmentation masks, which can describe the target object more accurately. SiamMask has two variants: a three-branch architecture and a two-branch architecture (see the paper for details, and the loss sketch below).
Model accuracy: Qualitative results of SiamMask for both VOT (visual object tracking) and DAVIS (Densely Annotated VIdeo Segmentation) sequences are shown in the paper. Despite its high speed, SiamMask produces accurate segmentation masks even in the presence of distractors.
Datasets used: VOT2016, VOT-2018, DAVIS-2016, DAVIS-2017 and YouTube-VOS
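
The key idea, adding a binary segmentation term to the Siamese tracker's offline loss, can be sketched roughly as follows; the branch outputs and the loss weight are illustrative (the full three-branch variant also regresses bounding boxes), not SiamMask's actual code.

```
import torch.nn.functional as F

def siammask_loss(score_logits, score_targets, mask_logits, mask_targets, lambda_mask=32.0):
    """Multi-task objective in the spirit of SiamMask: classification of the
    candidate windows plus per-pixel binary segmentation of the target.
    Targets are float tensors in [0, 1]."""
    cls_loss = F.binary_cross_entropy_with_logits(score_logits, score_targets)
    seg_loss = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    return cls_loss + lambda_mask * seg_loss
```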
---------------------------------------------------------

9. Revealing Scenes by Inverting Structure from Motion Reconstructions 

TLDR: A team of scientists at Microsoft and academic collaborators reconstruct color images of a scene from its structure-from-motion (SfM) point cloud.
Model/Architecture used: Their method is based on a cascaded U-Net that takes as input a 2D multichannel image of the points rendered from a specific viewpoint, containing point depth and, optionally, color and SIFT descriptors, and outputs a color image of the scene from that viewpoint.
The network has three sub-networks: VisibNet, CoarseNet and RefineNet. The input is a multichannel 2D array (a sketch of how such an input can be assembled is below); the paper explores network variants where the inputs are different subsets of depth, color, and SIFT descriptors. The three sub-networks have similar architectures: U-Nets with encoder and decoder layers and symmetric skip connections, with extra layers at the end of the decoder to help with high-dimensional inputs.
Model accuracy: The paper demonstrates that surprisingly high-quality images can be reconstructed from the limited amount of information stored along with sparse 3D point cloud models.
Datasets used: Trained on 700+ indoor and outdoor SfM reconstructions generated from 500K+ multi-view images taken from the NYU2 and MegaDepth datasets.
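
A simplified sketch of how such a multichannel input could be assembled: projected 2D point locations are splatted into an image whose channels hold depth plus any per-point attributes (color, SIFT descriptor), keeping the nearest point per pixel. This illustrates the input format only, not the paper's rendering code.

```
import numpy as np

def splat_points(xy, depth, attrs, height, width):
    """
    xy:    (N, 2) integer pixel coordinates of projected 3D points
    depth: (N,)   point depths from the chosen viewpoint
    attrs: (N, C) per-point attributes (e.g. RGB color, SIFT descriptor)
    Returns an (H, W, 1 + C) image: a depth channel plus attribute channels,
    keeping only the nearest point wherever several project to the same pixel.
    """
    img = np.zeros((height, width, 1 + attrs.shape[1]), dtype=np.float32)
    zbuf = np.full((height, width), np.inf, dtype=np.float32)
    for (x, y), z, a in zip(xy, depth, attrs):
        if 0 <= x < width and 0 <= y < height and z < zbuf[y, x]:
            zbuf[y, x] = z          # z-buffer keeps the closest point per pixel
            img[y, x, 0] = z
            img[y, x, 1:] = a
    return img
```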
---------------------------------------------------------

10. Semantic Image Synthesis with Spatially-Adaptive Normalization 

TLDR:  Turning Doodles into Stunning, Photorealistic Landscapes! NVIDIA research harnesses generative adversarial networks to create highly realistic scenes. Artists can use paintbrush and paint bucket tools to design their own landscapes with labels like river, rock and cloud
Model/Architecture used: 
In SPADE, the mask is first projected onto an embedding space, and then convolved to produce the modulation parameters γ and β. Unlike prior conditional normalization methods, γ and β are not vectors, but tensors with spatial dimensions. The produced γ and β are multiplied and added to the normalized activation element-wise.
In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations; the generator itself is a series of SPADE residual blocks with upsampling layers (a minimal sketch of one SPADE layer follows below).
Model accuracy: The architecture achieves better performance with a smaller number of parameters by removing the downsampling layers of leading image-to-image translation networks, and it successfully generates realistic images in diverse scenes ranging from animals to sports activities.
Dataset used: COCO-Stuff, ADE20K, Cityscapes, Flickr Landscape
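
A minimal sketch of a spatially-adaptive normalization layer as described above: the segmentation mask is resized and convolved into per-pixel γ and β maps that modulate the normalized activations. Layer sizes are illustrative.

```
import torch.nn as nn
import torch.nn.functional as F

class SPADESketch(nn.Module):
    """Normalize activations, then modulate them with gamma/beta maps
    predicted from the (resized) segmentation mask."""
    def __init__(self, num_features, mask_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)   # parameter-free normalization
        self.shared = nn.Sequential(
            nn.Conv2d(mask_channels, hidden, kernel_size=3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, x, mask):
        # resize the mask to the activation resolution, then predict modulation maps
        mask = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
        h = self.shared(mask)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```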

Written by himanshu_ragtah | Ex - Tesla. Kairos Society Fellow. SpaceX OpenLoop. TEDx speaker.
Published by HackerNoon on 2019/07/30