If you couldn’t make it to CVPR 2019, no worries. Below is a list of the top 10 papers everyone was talking about, covering DeepFakes, facial recognition, reconstruction, and more.

---------------------------------------------------------

1. Learning Individual Styles of Conversational Gesture

Full Paper: https://www.catalyzex.com/paper/arxiv:1906.04160

TLDR: Given audio speech input, the model generates plausible gestures to go along with the sound and synthesizes a corresponding video of the speaker.

Model/Architecture used: A speech-to-gesture translation model. A convolutional audio encoder downsamples the 2D spectrogram and transforms it into a 1D signal. The translation model, G, then predicts a corresponding temporal stack of 2D poses. L1 regression to the ground-truth poses provides a training signal, while an adversarial discriminator, D, ensures that the predicted motion is both temporally coherent and in the style of the speaker.

Model accuracy: The researchers qualitatively compare speech-to-gesture translation results to the baselines and the ground-truth gesture sequences; the tables presented by the authors show lower loss and higher PCK for the new model.

Datasets used: A speaker-specific gesture dataset collected by querying YouTube; in total, there are 144 hours of video. The data is split into 80% train, 10% validation, and 10% test sets, such that each source video appears in only one set.

---------------------------------------------------------

2. Textured Neural Avatars

Full Paper: https://www.catalyzex.com/paper/arxiv:1905.08776

TLDR: The researchers present a system for learning full-body neural avatars, i.e. deep networks that produce full-body renderings of a person for varying body pose and camera position: a neural free-viewpoint rendering of human avatars without reconstructing geometry.

Model/Architecture used: The input pose is defined as a stack of "bone" rasterizations (one bone per channel). The input is processed by a fully convolutional network (the generator) to produce a stack of body-part assignment maps and a stack of body-part coordinate maps. These stacks are then used to sample the body texture maps at the locations prescribed by the coordinate stack, with the weights prescribed by the assignment stack, to produce the RGB image. In addition, the last map of the assignment stack corresponds to the background probability. During learning, the mask and the RGB image are compared with the ground truth, and the resulting losses are backpropagated through the sampling operation into the fully convolutional network and onto the texture, updating both.

Model accuracy: Outperforms the two baselines in terms of structured self-similarity (SSIM) and underperforms V2V in terms of Frechet Inception Distance (FID).

Datasets used:
- Two subsets from the CMU Panoptic dataset collection.
- The authors' own multi-view sequences of three subjects, captured with a rig of seven cameras spanning approximately 30 degrees.
- Two short monocular sequences from another paper and a YouTube video.

---------------------------------------------------------

3. DSFD: Dual Shot Face Detector

Full Paper: https://www.catalyzex.com/paper/arxiv:1810.10220

TLDR: The authors propose a novel face detection network with three contributions that address three key aspects of face detection: better feature learning, progressive loss design, and anchor-assignment-based data augmentation.
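To make the dual-shot idea concrete, here is a minimal, hypothetical PyTorch sketch (not the paper's code): one detection head (the "first shot") is supervised on the original backbone features, and a second head (the "second shot") on features refined by a small feature-enhance module, with both losses summed. The module names, channel sizes, and the plain cross-entropy loss are illustrative simplifications.

```python
# Hypothetical, simplified sketch of the dual-shot idea (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhanceModule(nn.Module):
    """Toy stand-in for a feature-enhance module: two dilated convs refine a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)

    def forward(self, x):
        return F.relu(self.conv2(F.relu(self.conv1(x))))

class DualShotHead(nn.Module):
    """Per-location face/background scores for both shots (anchors and box regression omitted)."""
    def __init__(self, channels, num_anchors=1):
        super().__init__()
        self.fem = FeatureEnhanceModule(channels)
        self.first_shot = nn.Conv2d(channels, 2 * num_anchors, 3, padding=1)
        self.second_shot = nn.Conv2d(channels, 2 * num_anchors, 3, padding=1)

    def forward(self, feats):
        logits1 = self.first_shot(feats)             # first shot: original backbone features
        logits2 = self.second_shot(self.fem(feats))  # second shot: enhanced features
        return logits1, logits2

if __name__ == "__main__":
    feats = torch.randn(2, 64, 40, 40)             # pretend VGG/ResNet feature map
    labels = torch.randint(0, 2, (2, 40, 40))      # pretend per-location face labels
    head = DualShotHead(64)
    l1, l2 = head(feats)
    loss = F.cross_entropy(l1, labels) + F.cross_entropy(l2, labels)  # supervise both shots
    print(loss.item())
```

In the real detector, each shot would predict anchor classifications and box regressions at multiple feature-pyramid levels; this toy version only illustrates how both the original and enhanced features receive a loss.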
Model/Architecture used: The DSFD framework uses a Feature Enhance Module on top of a feedforward VGG/ResNet architecture to generate enhanced features from the original features, along with two loss layers: a first-shot PAL for the original features and a second-shot PAL for the enhanced features.

Model accuracy: Extensive experiments on the popular WIDER FACE and FDDB benchmarks demonstrate the superiority of DSFD (Dual Shot Face Detector) over state-of-the-art face detectors (e.g., PyramidBox and SRN).

Datasets used: WIDER FACE and FDDB.

---------------------------------------------------------

4. GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction

Full Paper: https://www.catalyzex.com/paper/arxiv:1902.05978

TLDR: The proposed deep fitting approach can reconstruct high-quality texture and geometry from a single image with precise identity recovery. The reconstructions in the paper are represented by a vector of 700 floating-point values and rendered without any special effects (the depicted texture is reconstructed by the model, and none of the features are taken directly from the image).

Model/Architecture used: A 3D face reconstruction is rendered by a differentiable renderer. Cost functions are formulated mainly in terms of identity features from a pretrained face recognition network and are optimized by flowing the error all the way back to the latent parameters with gradient descent. The end-to-end differentiable architecture enables computationally cheap and reliable first-order derivatives, making it possible to employ deep networks as a generator (i.e., a statistical model) or as a cost function.

Model accuracy: Accuracy results for the meshes on the MICC dataset, measured by point-to-plane distance, report the mean error (Mean) and standard deviation (Std.); both are lowest for the proposed model.

Datasets used: MoFA-Test, MICC, Labelled Faces in the Wild (LFW), and the BAM dataset.

---------------------------------------------------------

5. DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images

Full Paper: https://www.catalyzex.com/paper/arxiv:1901.07973

TLDR: DeepFashion2 provides a new benchmark for detection, pose estimation, segmentation, and re-identification of clothing images.

Model/Architecture used: Match R-CNN contains three main components: a feature extraction network (FN), a perception network (PN), and a match network (MN).

Model accuracy: Match R-CNN achieves a top-20 accuracy of less than 0.7 even with ground-truth bounding boxes provided, indicating that the retrieval benchmark is challenging.

Datasets used: The DeepFashion2 dataset, which contains 491K diverse images of 13 popular clothing categories from both commercial shopping stores and consumers.

---------------------------------------------------------

6. Inverse Cooking: Recipe Generation from Food Images

Full Paper: https://www.catalyzex.com/paper/arxiv:1812.06164

TLDR: Facebook researchers use AI to generate recipes from food images.

Model/Architecture used: A recipe generation model. Image features are extracted with an image encoder. Ingredients are predicted by an ingredient decoder and encoded into ingredient embeddings with an ingredient encoder. The cooking-instruction decoder then generates a recipe title and a sequence of cooking steps by attending to the image embeddings, the ingredient embeddings, and previously predicted words.
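As a rough illustration of that pipeline, here is a minimal, hypothetical PyTorch sketch (toy sizes, not the authors' code): a small CNN produces image tokens, an ingredient head predicts ingredients from pooled image features, the predicted ingredients are embedded, and a transformer decoder attends to both the image and ingredient embeddings to emit recipe tokens. The names, dimensions, and top-5 ingredient selection are assumptions for illustration only.

```python
# Minimal, hypothetical sketch of an inverse-cooking-style pipeline (toy sizes).
import torch
import torch.nn as nn

VOCAB, N_INGR, D = 1000, 200, 128   # toy vocabulary, ingredient count, embedding size

class ToyImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, D, kernel_size=7, stride=4)
    def forward(self, img):                      # (B, 3, H, W)
        f = self.conv(img)                       # (B, D, h, w)
        return f.flatten(2).transpose(1, 2)      # (B, h*w, D) image "tokens"

class ToyRecipeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_enc = ToyImageEncoder()
        self.ingr_head = nn.Linear(D, N_INGR)    # multi-label ingredient prediction
        self.ingr_emb = nn.Embedding(N_INGR, D)  # ingredient encoder
        self.word_emb = nn.Embedding(VOCAB, D)
        layer = nn.TransformerDecoderLayer(d_model=D, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(D, VOCAB)

    def forward(self, img, recipe_tokens):
        img_feats = self.img_enc(img)                        # (B, T_img, D)
        ingr_logits = self.ingr_head(img_feats.mean(dim=1))  # (B, N_INGR)
        ingr_ids = ingr_logits.topk(5, dim=-1).indices       # keep 5 most likely ingredients
        memory = torch.cat([img_feats, self.ingr_emb(ingr_ids)], dim=1)
        words = self.word_emb(recipe_tokens)                 # (B, T_txt, D)
        # decoder attends to image + ingredient embeddings
        # (causal masking and teacher forcing omitted for brevity)
        hidden = self.decoder(words, memory)
        return self.out(hidden), ingr_logits

model = ToyRecipeModel()
logits, ingredients = model(torch.randn(2, 3, 64, 64), torch.randint(0, VOCAB, (2, 12)))
print(logits.shape, ingredients.shape)   # (2, 12, 1000) (2, 200)
```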
Model accuracy: The user-study results demonstrate the superiority of the system over state-of-the-art image-to-recipe retrieval approaches; it outperforms both the human baseline and retrieval-based systems, obtaining an F1 of 49.08% (a good F1 score means both low false positives and low false negatives).

Datasets used: The whole system is evaluated on the large-scale Recipe1M dataset.

---------------------------------------------------------

7. ArcFace: Additive Angular Margin Loss for Deep Face Recognition

Full Paper: https://www.catalyzex.com/paper/arxiv:1801.07698

TLDR: ArcFace obtains more discriminative deep features and shows state-of-the-art performance on the MegaFace Challenge in a reproducible way.

Model/Architecture used: To enhance intra-class compactness and inter-class discrepancy, the authors propose the Additive Angular Margin Loss (ArcFace), which inserts a geodesic distance margin between the sample and the class centres. This is done to enhance the discriminative power of the face recognition model.

Model accuracy: Comprehensive experiments demonstrate that ArcFace consistently outperforms the state of the art.

Datasets used: CASIA, VGGFace2, MS1MV2, and DeepGlint-Face (including MS1M-DeepGlint and Asian-DeepGlint) as training data, in order to conduct a fair comparison with other methods. Other datasets used: LFW, CFP-FP, AgeDB-30, CPLFW, CALFW, YTF, MegaFace, IJB-B, IJB-C, Trillion-Pairs, and iQIYI-VID.

---------------------------------------------------------

8. Fast Online Object Tracking and Segmentation: A Unifying Approach

Full Paper: https://www.catalyzex.com/paper/arxiv:1812.05050

TLDR: The method, dubbed SiamMask, improves the offline training procedure of popular fully convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task.

Model/Architecture used: SiamMask aims at the intersection of visual tracking and video object segmentation to achieve high practical convenience. Like conventional object trackers, it relies on a simple bounding-box initialisation and operates online. Unlike state-of-the-art trackers such as ECO, SiamMask is able to produce binary segmentation masks, which can describe the target object more accurately. SiamMask has two variants: a three-branch architecture and a two-branch architecture (see the paper for details).

Model accuracy: Qualitative results of SiamMask on both VOT (Visual Object Tracking) and DAVIS (Densely Annotated VIdeo Segmentation) sequences are shown in the paper. Despite its high speed, SiamMask produces accurate segmentation masks even in the presence of distractors.

Datasets used: VOT2016, VOT-2018, DAVIS-2016, DAVIS-2017, and YouTube-VOS.

---------------------------------------------------------

9. Revealing Scenes by Inverting Structure from Motion Reconstructions

Full Paper: https://www.catalyzex.com/paper/arxiv:1904.03303

TLDR: A team of scientists at Microsoft and academic collaborators reconstruct color images of a scene from its SfM point cloud.

Model/Architecture used: The method is based on a cascaded U-Net that takes as input a 2D multi-channel image of the points rendered from a specific viewpoint, containing point depth and optionally color and SIFT descriptors, and outputs a color image of the scene from that viewpoint. The network has three sub-networks: VisibNet, CoarseNet, and RefineNet. The input to the network is a multi-dimensional nD array; a toy sketch of this rendering step is shown below.
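Below is a hypothetical sketch (not the paper's code) of that rendering step: each 3D point is projected into a chosen camera, a z-buffer keeps the nearest point per pixel, and its depth, RGB color, and SIFT descriptor are written into separate channels of a single multi-channel image. The image size, intrinsics, and channel layout are illustrative assumptions.

```python
# Hypothetical sketch: rasterize an SfM point cloud into a multi-channel 2D input image.
import numpy as np

def render_point_cloud(points, colors, sift, K, img_size=(256, 256)):
    """points: (N,3) camera-space xyz; colors: (N,3); sift: (N,128); K: (3,3) intrinsics."""
    H, W = img_size
    image = np.zeros((H, W, 1 + 3 + 128), dtype=np.float32)  # depth | RGB | SIFT channels
    zbuf = np.full((H, W), np.inf, dtype=np.float32)

    proj = (K @ points.T).T                       # pinhole projection
    u = (proj[:, 0] / proj[:, 2]).round().astype(int)
    v = (proj[:, 1] / proj[:, 2]).round().astype(int)
    z = points[:, 2]

    for i in range(points.shape[0]):
        if z[i] <= 0 or not (0 <= u[i] < W and 0 <= v[i] < H):
            continue                              # behind the camera or outside the frame
        if z[i] < zbuf[v[i], u[i]]:               # keep only the nearest point per pixel
            zbuf[v[i], u[i]] = z[i]
            image[v[i], u[i]] = np.concatenate(([z[i]], colors[i], sift[i]))
    return image

# toy usage: 5000 random points in front of a simple camera
pts = np.random.randn(5000, 3) + np.array([0, 0, 5.0])
img = render_point_cloud(pts, np.random.rand(5000, 3), np.random.rand(5000, 128),
                         K=np.array([[128, 0, 128], [0, 128, 128], [0, 0, 1.0]]))
print(img.shape)  # (256, 256, 132)
```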
The paper explores network variants where the inputs are different subsets of depth, color, and SIFT descriptors. The three sub-networks have similar architectures: they are U-Nets with encoder and decoder layers connected by symmetric skip connections, and the extra layers at the end of the decoders help handle the high-dimensional inputs.

Model accuracy: The paper demonstrates that surprisingly high-quality images can be reconstructed from the limited amount of information stored along with sparse 3D point cloud models.

Datasets used: Trained on 700+ indoor and outdoor SfM reconstructions generated from 500k+ multi-view images taken from the NYU2 and MegaDepth datasets.

---------------------------------------------------------

10. Semantic Image Synthesis with Spatially-Adaptive Normalization

Full Paper: https://www.profillic.com/paper/arxiv:1903.07291

TLDR: Turning doodles into stunning, photorealistic landscapes! NVIDIA research harnesses generative adversarial networks to create highly realistic scenes. Artists can use paintbrush and paint-bucket tools to design their own landscapes with labels like river, rock, and cloud.

Model/Architecture used: In SPADE, the segmentation mask is first projected onto an embedding space and then convolved to produce the modulation parameters γ and β. Unlike prior conditional normalization methods, γ and β are not vectors but tensors with spatial dimensions. The produced γ and β are multiplied and added to the normalized activation element-wise. In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations: each residual block is built around SPADE, and the generator contains a series of SPADE residual blocks with upsampling layers (a toy sketch of one SPADE layer appears after this entry).

Model accuracy: The architecture achieves better performance with a smaller number of parameters by removing the downsampling layers of leading image-to-image translation networks, and the method successfully generates realistic images in diverse scenes ranging from animals to sports activities.

Datasets used: COCO-Stuff, ADE20K, Cityscapes, and Flickr Landscape.
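To make the SPADE modulation concrete, here is a minimal, hypothetical PyTorch sketch of one SPADE layer (illustrative, not NVIDIA's released code): the activation is normalized without learned affine parameters, the segmentation mask is embedded by a small conv, and two further convs produce spatial γ and β maps that scale and shift the normalized activation element-wise. The layer sizes and the (1 + γ) scaling convention are assumptions here.

```python
# Hypothetical, simplified SPADE-style normalization layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, channels, num_labels, hidden=64):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)   # normalization without learned affine
        self.embed = nn.Conv2d(num_labels, hidden, 3, padding=1)
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, segmap):
        # resize the one-hot segmentation map to the activation's resolution
        segmap = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        h = F.relu(self.embed(segmap))
        gamma, beta = self.gamma(h), self.beta(h)             # spatial modulation tensors
        return self.norm(x) * (1 + gamma) + beta

x = torch.randn(2, 128, 32, 32)                    # generator activations
seg = F.one_hot(torch.randint(0, 20, (2, 64, 64)), 20).permute(0, 3, 1, 2).float()
print(SPADE(128, num_labels=20)(x, seg).shape)     # torch.Size([2, 128, 32, 32])
```

Because γ and β are produced per pixel from the mask, the layer can apply different modulation to "river" pixels than to "rock" pixels, which is the core idea behind the spatially-adaptive normalization described above.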