Counting objects in images is one of the fundamental tasks in computer vision, with many practical applications.
The task of counting objects is relatively easy for us humans, but it can be challenging for a computer vision algorithm, especially when instances of an object vary significantly in shape, color, texture, or size. When a problem is complex from an algorithmic point of view but simple from a human point of view, machine learning methods can be the answer.
Currently, deep learning (DL) methods provide state-of-the-art performance in digital image processing. However, they require a lot of annotated data, which is usually time-consuming to collect and prone to labelling errors.
A common way to count objects using DL is to first detect them with a convolutional neural network, e.g. GCNet [1], and then count all found instances. This is effective but requires bounding box annotations, as presented in Fig. 1a, which are hard to obtain.
Fig. 1a: COCO dataset [2]
To overcome this issue, alternative approaches leverage point-like annotations of object positions (see Fig. 1b), which are much cheaper to collect.
Fig. 1b: Mall dataset [3-6]
In this article we describe our studies on counting objects in images with fully convolutional networks (FCNs) trained on data with point-like annotations. The following sections present the models we used, their implementation, the datasets considered, and the results we obtained.
We decided to start with the approach described in [7]. The main idea is to count objects indirectly by estimating a density map. The first step is to prepare training samples so that for every image there is a corresponding density map. Let's consider the example shown in Fig. 2.
Fig. 2a: An example image
Fig. 2b: A label (head positions)
The image presented in Fig. 2a is annotated with points at the positions of pedestrians' heads (Fig. 2b). A density map is obtained by convolving this annotation with a Gaussian kernel (normalized so that integrating the map gives the number of objects). The density map for the example above is presented in Fig. 3.
Fig. 3: A density map generated with a Gaussian filter
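To make this concrete, below is a minimal sketch of how such a density map can be generated from point annotations. The function name and the sigma value are illustrative, not the exact settings we used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def generate_density_map(points, image_shape, sigma=5.0):
    """Place a unit spike at every annotated point and blur with a Gaussian.

    points      -- iterable of (y, x) object positions
    image_shape -- (height, width) of the annotated image
    sigma       -- Gaussian standard deviation (a hypothetical default)
    """
    density = np.zeros(image_shape, dtype=np.float32)
    for y, x in points:
        density[int(y), int(x)] += 1.0
    # gaussian_filter uses a normalized kernel, so the integral of the
    # result still equals the number of annotated objects.
    return gaussian_filter(density, sigma=sigma)
```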
Now, the goal is to train a fully convolutional network to map an image to a density map, which can later be integrated to get the number of objects. So far, we have considered two FCN architectures: U-Net [8] and the Fully Convolutional Regression Network (FCRN) [7].
U-Net is a widely used FCN for image segmentation, very often applied to biomedical data. It has an autoencoder-like structure (see Fig. 4). An input image is processed by a block of convolutional layers, followed by a pooling layer (downsampling). This procedure is repeated several times on the outputs of subsequent blocks, as shown on the left side of Fig. 4. In this way the network encodes (and compresses) the key features of the input image. The second part of U-Net is symmetric, but the pooling layers are replaced with upsampling, so that the output dimensions match the size of the input image. The information from higher-resolution layers in the downsampling part is passed to the corresponding layers in the upsampling part, which allows the network to reuse learned higher-level features to decode the contracted layers more precisely.
Fig. 4: U-Net architecture
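A minimal one-level PyTorch sketch of this idea follows; real U-Nets repeat the down/up blocks several times and the names and sizes here are illustrative, not our exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, the typical U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """One encoder level, one decoder level, and a skip connection."""
    def __init__(self, filters=64):
        super().__init__()
        self.encode = conv_block(3, filters)
        self.pool = nn.MaxPool2d(2)             # downsampling
        self.bottom = conv_block(filters, filters)
        self.up = nn.Upsample(scale_factor=2)   # upsampling
        # Concatenating the skip connection doubles the channel count.
        self.decode = conv_block(2 * filters, filters)
        self.head = nn.Conv2d(filters, 1, kernel_size=1)  # 1-channel density map

    def forward(self, x):
        skip = self.encode(x)                      # high-resolution features
        x = self.up(self.bottom(self.pool(skip)))  # contract, then upsample back
        x = torch.cat([skip, x], dim=1)            # reuse encoder features
        return self.head(self.decode(x))
```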
The Fully Convolutional Regression Network (FCRN) was proposed in [7]. The architecture is very similar to U-Net; the main difference is that the information from higher-resolution layers in the downsampling part is not passed directly to the corresponding layers in the upsampling part. The paper proposes two networks, FCRN-A and FCRN-B, which differ in how aggressively they downsample: FCRN-A performs pooling after every convolutional layer, while FCRN-B does so after every second layer.
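For contrast, here is a rough FCRN-A-style sketch (again illustrative, not the exact layer configuration from [7]): pooling follows every convolution, there are no skip connections, and the filter count grows with depth.

```python
import torch.nn as nn

def conv_relu(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

class TinyFCRNA(nn.Module):
    """FCRN-A-style network: pool after every conv, no skip connections."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_relu(3, 32),    nn.MaxPool2d(2),
            conv_relu(32, 64),   nn.MaxPool2d(2),
            conv_relu(64, 128),  nn.MaxPool2d(2),
            conv_relu(128, 128), nn.Upsample(scale_factor=2),
            conv_relu(128, 64),  nn.Upsample(scale_factor=2),
            conv_relu(64, 32),   nn.Upsample(scale_factor=2),
            nn.Conv2d(32, 1, 1),  # 1-channel density map
        )

    def forward(self, x):
        return self.net(x)
```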
Our implementation can be found in our public repository. It is based on the code from Weidi Xie's GitHub, but uses PyTorch instead of Keras. Please feel free to use it for your research on object counting in images.
Currently, U-Net and FCRN-A are implemented. They both use three downsampling and three upsampling convolutional blocks with a fixed 3×3 filter size. By default there are two convolutional layers in each block for U-Net, and one for FCRN-A. For U-Net we keep a constant number of filters in all convolutional layers, while for FCRN-A we increase this number in every subsequent layer to compensate for the loss of higher-resolution information during pooling (which, unlike in U-Net, is not passed directly to the upsampling part).
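Conceptually, training reduces to a plain regression loop: the network's output density map is regressed onto the ground-truth map, commonly with a pixel-wise MSE loss. A hedged sketch (the loader contents and device handling are illustrative, not our repository's exact code):

```python
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device):
    """Regress predicted density maps onto ground-truth maps with MSE."""
    criterion = nn.MSELoss()
    model.train()
    for images, targets in loader:  # loader yields (image, density map) pairs
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)  # pixel-wise regression loss
        loss.backward()
        optimizer.step()
```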
We considered three datasets in our study. They are all annotated with point-like object positions, so we could use them directly to generate density maps for all images and test the method described above. Before training, we preprocess them to a common format and store them in HDF5 files.
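A minimal sketch of this preprocessing step with h5py is shown below; the dataset names ('images', 'labels') are illustrative, not necessarily the ones used in our repository.

```python
import h5py

def save_to_hdf5(path, images, density_maps):
    """Store preprocessed images and labels in a single HDF5 file.

    images       -- float32 array of shape (N, H, W, C)
    density_maps -- float32 array of shape (N, H, W)
    """
    with h5py.File(path, 'w') as f:
        f.create_dataset('images', data=images, compression='gzip')
        f.create_dataset('labels', data=density_maps, compression='gzip')
```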
The fluorescent cells (FC) dataset was generated by the Visual Geometry Group (VGG) with a computational framework from [9], which simulates fluorescence microscopy images of bacterial cells. It can be downloaded from the VGG website. An example image along with the generated density map is presented in Fig. 5.
Fig. 5: An example image and corresponding density map (FC dataset)
UCSD pedestrian dataset
The UCSD dataset [10] contains videos of pedestrians recorded on walkways of the University of California San Diego campus. It is widely used for various problems, such as counting, motion segmentation, and analysis of pedestrian behaviour. It can be downloaded from the Statistical Visual Computing Lab website. An example image along with the generated density map is presented in Fig. 6.
Fig. 6: An example image and corresponding density map (UCSD dataset)
The Mall dataset [3-6] was created for crowd counting and profiling. It contains video recorded by a publicly available webcam. Every frame is annotated with the head positions of all pedestrians. An example image along with the generated density map is presented in Figs. 2 and 3.
The method of counting objects in images by integrating an estimated density map has already been applied to both the fluorescent cells and UCSD datasets [11]. We chose the Mall dataset as our test dataset for the method.
Results
As stated above, two models were tested: U-Net and FCRN-A. Using U-Net we were able to achieve more accurate results, so below we present our findings obtained with this architecture.
The table below summarizes our results for each dataset, listing the minimum and maximum numbers of objects in the validation sets and the mean absolute error (MAE) we obtained.
We used the standard definition of MAE:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |t_i - p_i|$$

where $t_i$ is the true and $p_i$ the predicted number of objects for the $i$-th sample, and $N$ is the number of validation samples.
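In code, both the count-by-integration step and the metric itself are essentially one-liners; a sketch with NumPy:

```python
import numpy as np

def count_from_density_map(density_map):
    """Integrate (sum) the estimated density map to get the object count."""
    return float(density_map.sum())

def mean_absolute_error(true_counts, predicted_counts):
    """MAE over validation samples, matching the definition above."""
    t = np.asarray(true_counts, dtype=np.float64)
    p = np.asarray(predicted_counts, dtype=np.float64)
    return float(np.abs(t - p).mean())
```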
The scatter plots of true ($t_i$) vs. predicted ($p_i$) numbers of objects for each validation sample are presented in Figs. 7-9. As expected, the model handles the relatively simple fluorescent cells dataset well, despite the high number of objects in a single image. However, the deviation is much higher when counting pedestrians. This may be because it is difficult even for a human labeller to decide whether a person standing behind a plant, or barely visible around a corner, should be counted.
This is just the beginning of our research on object counting. We look forward to conducting more experiments, including trying out different architectures and methods.
Fig. 7: True vs. predicted counts on the fluorescent cells validation dataset
Fig. 8: True vs. predicted counts on the UCSD validation dataset
Fig. 9: True vs. predicted counts on the Mall validation dataset
References
Previously published at https://neurosys.com/article/objects-counting-by-estimating-a-density-map-with-convolutional-neural-networks/