Authors:
(1) Juan F. Montesinos, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {[email protected]};
(2) Olga Slizovskaia, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {[email protected]};
(3) Gloria Haro, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {[email protected]}.
Solos[1] was designed to have the same categories as the URMP [1] dataset, so that URMP can be used as a testing dataset in a real-world scenario. In this way, we aim to establish a standard way of evaluating the performance of source separation algorithms that avoids the use of mix-and-separate in testing. Solos consists of 755 recordings distributed amongst 13 categories, as shown in Figure 1, with an average of 58 recordings per category and an average duration of 5:16 min. Interestingly, for 8 out of 13 categories the median resolution is HD, despite the dataset being gathered from YouTube. Per-category statistics can be found in Table I. These recordings were gathered by querying YouTube using the tags solo and auditions in several languages, such as English, Spanish, French, Italian, Chinese or Russian.
A. OpenPose Skeletons
Solos is not only a set of recordings. Apart from the video identifiers, we also provide: i) body and hand skeletons estimated by OpenPose [33] in each frame of each recording and ii) timestamps indicating useful parts. OpenPose is a system capable of predicting body and hand skeletons using two different neural networks. To do so, it predicts a confidence map of the belief that a specific body part may be located at any given pixel, as well as part affinity fields, which encode the degree of association between different body parts. Finally, it predicts 2D skeletons and per-joint confidence via greedy inference. In practice, the body skeleton is estimated with a first network. Then, the positions of the wrists in the body skeleton are used to estimate the position of both hands. A second neural network obtains the skeleton of each hand independently. Note that since each body part is estimated independently, OpenPose makes no assumptions about the limbs to find. It just calculates the most likely skeleton given the confidence maps and part affinity fields. The whole process is carried out frame-wise. This leads to slight flickering and mispredictions between frames.
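To make the role of the confidence maps concrete, the following minimal sketch reads one location and one confidence value per joint from a stack of per-joint maps by taking the peak of each map. It is a single-person simplification for illustration only: it ignores part affinity fields and the greedy matching step, and it is not OpenPose's code or API. The function name and the `(J, H, W)` array layout are assumptions.

```python
import numpy as np

def joints_from_confidence_maps(maps):
    """Pick the peak of each confidence map (hypothetical helper).

    maps: array of shape (J, H, W), one confidence map per joint.
    Returns an array of shape (J, 3) holding x, y and confidence per joint.
    """
    maps = np.asarray(maps, dtype=float)
    n_joints, height, width = maps.shape
    # Index of the maximum of each flattened map.
    flat_idx = maps.reshape(n_joints, -1).argmax(axis=1)
    # Convert flat indices back to row (y) and column (x) positions.
    ys, xs = np.unravel_index(flat_idx, (height, width))
    # The value at the peak serves as the per-joint confidence.
    conf = maps[np.arange(n_joints), ys, xs]
    return np.stack([xs, ys, conf], axis=1)
```

Because each map is processed on its own, nothing in this sketch ties one joint to another, which mirrors the remark that body parts are estimated independently.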
B. Timestamps estimation and skeleton refinement
OpenPose maps mispredicted joints to the origin of coordinates. We empirically found that such a big jump in the position of a joint induces noise. Using interpolated coordinates helps to address this problem.
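The interpolation idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes skeletons are stored as a `(T, J, 2)` array of x, y coordinates over `T` frames, treats a joint at exactly (0, 0) as mispredicted, and uses linear interpolation over time, which the text does not specify. The function name is hypothetical.

```python
import numpy as np

def interpolate_missing_joints(track):
    """Replace joints mapped to the origin with interpolated coordinates.

    track: array of shape (T, J, 2) with x, y per joint and frame.
    A joint at exactly (0, 0) is treated as mispredicted (assumption).
    Returns a new array; the input is left unchanged.
    """
    track = np.asarray(track, dtype=float)
    out = track.copy()
    frames = np.arange(track.shape[0])
    for j in range(track.shape[1]):
        # Frames in which this joint sits at the origin.
        missing = np.all(track[:, j, :] == 0.0, axis=1)
        if missing.all() or not missing.any():
            continue  # nothing to interpolate from, or nothing to fix
        for c in range(2):
            # Linear interpolation in time; frames before the first or after
            # the last valid detection take the nearest valid value.
            out[missing, j, c] = np.interp(
                frames[missing], frames[~missing], track[~missing, j, c]
            )
    return out
```

For example, a joint detected at (10, 20) in frame 0 and at (40, 50) in frame 3, and mapped to the origin in frames 1 and 2, is filled in with (20, 30) and (30, 40). A joint that is never detected is left at the origin, since there is nothing to interpolate from.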
This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.
[1] Dataset available at https://juanfmontesinos.github.io/Solos/