Today, we are going to discuss a paper proposed by researchers from four institutions, one of which is ByteDance AI Lab (known for its TikTok app). They give us a new method termed Sparse R-CNN (not to be confused with the Sparse R-CNN that applies sparse convolutions to 3D computer vision tasks) that achieves near state-of-the-art performance in object detection and uses completely sparse and learnable bounding box generation.

Let’s start with a short overview of existing detection methods.

Dense method

One of the widely used pipelines is the one-stage detector, which directly predicts the label and location of anchor boxes densely covering spatial positions, scales, and aspect ratios in a single-shot way. Examples are SSD and YOLO.

Let’s consider the YOLO algorithm. Ultimately, it aims to predict the class of an object in the image and the bounding box specifying the object’s location. Each bounding box can be described using four descriptors: the center of the bounding box (bx, by), its width (bw), its height (bh), and a value c corresponding to the class of the object (such as car, traffic lights, etc.). In addition, we have to predict the value pc, which is the probability that there is an object in the bounding box.

It is a dense method because it does not search for interesting regions in the given image that could potentially contain an object. Instead, YOLO splits the image into cells, using a 19×19 grid. In general, though, a one-stage detector could produce W×H cells, one per pixel. Each cell is responsible for predicting k bounding boxes (for this example, k is chosen as 5). Therefore, we arrive at a large number of W×H×k bounding boxes for one image.

For each cell in the grid and each bbox, the values mentioned above are produced (source)

Dense-To-Sparse Method

There are two-stage detectors that piggyback on dense proposals generated by an RPN (region proposal network), like the one the Faster R-CNN paper proposed. These detectors have dominated modern object detection for years.
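Dense candidates are also what an RPN starts from, so it helps to see how quickly W×H×k grows. The snippet below is only a rough sketch of that arithmetic: the function names are made up for illustration, and encoding the class as one score per class (with 80 classes, as in COCO) is an assumption rather than something taken from the YOLO description above.

```python
def dense_box_count(grid_w: int, grid_h: int, boxes_per_cell: int) -> int:
    """Number of candidate boxes a dense detector scores for one image."""
    return grid_w * grid_h * boxes_per_cell


def values_per_image(grid_w: int, grid_h: int, boxes_per_cell: int,
                     num_classes: int) -> int:
    """Total numbers predicted for one image.

    Assumption: each box carries pc, bx, by, bw, bh plus one score per class.
    """
    per_box = 5 + num_classes
    return dense_box_count(grid_w, grid_h, boxes_per_cell) * per_box


# The 19x19 grid with k = 5 boxes per cell from the example above.
yolo_boxes = dense_box_count(19, 19, 5)
# Assumed class count of 80 (the number of COCO object categories).
yolo_values = values_per_image(19, 19, 5, 80)

print(yolo_boxes)   # 1805
print(yolo_values)  # 153425
```

Even on this coarse grid the detector scores 1805 boxes per image, and a one-cell-per-pixel grid would push that number into the millions.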
Faster R-CNN uses the RPN: it “obtains a sparse set of foreground proposal boxes from dense region candidates, and then refines the location of each proposal” and predicts its specific category.

Two-stage detector architecture (source)

Proposals are obtained in a similar way as in one-stage detectors, but instead of predicting the class of the object directly, the first stage predicts an objectness probability. After that, the second stage predicts classes for the bounding boxes that remain after filtering by objectness and overlap score.

Sparse Method

The authors of this paper present their new Sparse R-CNN as an extension of the existing object detector paradigm, which runs from thoroughly dense to dense-to-sparse, with a new step that leads to thoroughly sparse.

Model architectures comparison (source)

In the reviewed paper, the RPN is avoided and replaced with a small set of proposal boxes (e.g. 100 per image). These boxes are obtained using the learnable proposal boxes part and the proposal features part of the network. The former provides 4 values (x, y, h, w) per proposal, and the latter provides a latent representation vector of length 256 describing the contents of each bbox. The learned proposal boxes serve as a reasonable statistic from which refining steps are performed afterward, and the learned proposal features are used to introduce an attention mechanism. This mechanism is very similar to the one used in the DETR paper. These manipulations are performed inside the dynamic instance interactive head, which we will cover in the following section.

Proposed Model Features

As the name of the paper implies, this model is end-to-end. The architecture is elegant. It consists of an FPN-based backbone that extracts features from images, the learnable proposal boxes and proposal features mentioned above, and a dynamic instance interactive head, which is the main architectural contribution of this paper.
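To make the idea of a fixed, image-independent proposal set more concrete, here is a minimal sketch in plain Python. It is not the authors’ implementation: the class and function names are hypothetical, the (cx, cy, w, h) ordering and [0, 1] normalization are assumptions made for the example, and starting every box as the whole image is just one possible initialization chosen for illustration. In a real model both tables would be trainable parameters updated by backpropagation.

```python
import random
from dataclasses import dataclass, field

NUM_PROPOSALS = 100  # from the article: e.g. 100 proposal boxes per image
FEATURE_DIM = 256    # from the article: proposal feature vector of length 256

_rng = random.Random(0)  # fixed seed so the sketch is reproducible


def _init_boxes():
    # Assumption: every box starts as the whole image, written as a
    # normalized (cx, cy, w, h) tuple.
    return [[0.5, 0.5, 1.0, 1.0] for _ in range(NUM_PROPOSALS)]


def _init_features():
    # Assumption: features start as small random vectors.
    return [[_rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
            for _ in range(NUM_PROPOSALS)]


@dataclass
class LearnableProposals:
    """Hypothetical container: one box and one feature vector per proposal."""
    boxes: list = field(default_factory=_init_boxes)
    features: list = field(default_factory=_init_features)


def to_pixel_corners(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)


proposals = LearnableProposals()
# The same stored boxes are reused for every image; only the scaling differs.
first_box_small = to_pixel_corners(proposals.boxes[0], 640, 480)
first_box_large = to_pixel_corners(proposals.boxes[0], 1280, 720)
```

The point of the sketch is that nothing here depends on the image content: the proposals are stored values that are only scaled to the image size, which is what lets them replace the RPN.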
Dynamic instance interactive head

Given N proposal boxes, Sparse R-CNN first utilizes the RoIAlign operation to extract features from the backbone for each region defined by the proposal bounding boxes. Each RoI feature “is fed into its own exclusive head for object location and classification, where each head is conditioned on specific learnable proposal feature”.

Dynamic module (source)

Proposal features are used to generate the weights of the convolutions; in the image above they are labeled “Params”. The RoI feature is processed by this generated convolution to obtain the final feature. In this way, those “bins with most foreground information make effect on final object location and classification”. A self-attention module is embedded into the dynamic head to reason about the relations between objects, and it affects the predictions made through this convolution.

Main Result

The authors provide several comparison tables that show the performance of the new method. Sparse R-CNN is compared to RetinaNet, Faster R-CNN, and DETR in two variations, with ResNet-50 and ResNet-101 backbones.

Model performance (source)

Here we can see that Sparse R-CNN outperforms RetinaNet and Faster R-CNN with both R50 and R101, but it performs quite similarly to DETR-based architectures. According to the authors, the DETR model is in practice a dense-to-sparse model because it utilizes a sparse set of object “queries, to interact with global(dense) image feature”. This is where the novelty of the paper lies in comparison with DETR.

Qualitative analysis (source)

In that image, you can see the qualitative result of model inference on the COCO dataset. The first column shows the learned proposal boxes, which are the same starting point for any new image. In the next columns, you can see the final bboxes that were refined from the proposals. They differ depending on the stage of the iterative architecture.

Show me the code!
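Before pointing to the authors’ repository, here is a toy sketch of the dynamic interaction described in the section on the dynamic head: a proposal feature generates the weights that are then applied to its own RoI feature. It is a simplified illustration in plain Python, not the authors’ code. The function names, the tiny dimensions, and the choice of two generated layers with ReLU are assumptions, and the normalization and output layers a real head would need are left out.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(m):
    """Element-wise ReLU on a matrix stored as a list of rows."""
    return [[max(0.0, v) for v in row] for row in m]


def dynamic_interaction(roi_feature, proposal_feature, gen_weights, c, d):
    """Filter one RoI feature with weights generated from its proposal feature.

    roi_feature:      (S*S) x C matrix, one row per RoI bin
    proposal_feature: vector of length C
    gen_weights:      C x (2*C*D) matrix of the (assumed) parameter generator
    """
    # The proposal feature produces the "Params": 2*C*D numbers in total.
    params = matmul([proposal_feature], gen_weights)[0]
    split = c * d
    w1 = [params[i * d:(i + 1) * d] for i in range(c)]                  # C x D
    w2 = [params[split + i * c:split + (i + 1) * c] for i in range(d)]  # D x C
    hidden = relu(matmul(roi_feature, w1))  # (S*S) x D
    return relu(matmul(hidden, w2))         # (S*S) x C


# Toy sizes for illustration only (the article uses C = 256).
C, D, S = 4, 2, 2
rng = random.Random(0)
gen = [[rng.uniform(-1, 1) for _ in range(2 * C * D)] for _ in range(C)]
roi = [[rng.uniform(0, 1) for _ in range(C)] for _ in range(S * S)]
prop = [rng.uniform(-1, 1) for _ in range(C)]

out = dynamic_interaction(roi, prop, gen, C, D)
zero_out = dynamic_interaction(roi, [0.0] * C, gen, C, D)
```

Because the weights come from the proposal feature, two proposals looking at the same region can produce different outputs, and an all-zero proposal feature switches the RoI off entirely.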
In conclusion, I would like to say that in 2020 we saw a lot of papers that apply transformers to images. Transformers have proved their worth in NLP, and now they are gradually entering the scene of image processing. This paper shows us that, using transformers, it is possible to create fast one-stage detectors whose quality is comparable to that of the best two-stage detectors available today.

You can find all implementation details in the authors’ code, which is based on FAIR’s DETR and detectron2 codebases: https://github.com/PeizeSun/SparseR-CNN

References

[1] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. https://arxiv.org/abs/1506.01497
[2] YOLO Algorithm and YOLO Object Detection: An Introduction. https://appsilon.com/object-detection-yolo-algorithm/
[3] Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. https://arxiv.org/abs/2011.12450