Video Instance Matting: V-HIM60 Dataset and Difficulty Levels

Written by instancing | Published 2025/12/19
Tech Story Tags: deep-learning | robustness-testing | binary-mask-guidance | instance-matting-synthesis | ihim50k-dataset | video-instance-matting | ihim50k | v-him2k5-training-set

TL;DR: MaGGIe introduces the IHIM50K image dataset and the V-HIM60 video benchmark. These sets test model robustness against varying mask quality and instance overlap.

Abstract and 1. Introduction

  2. Related Works

  3. MaGGIe

    3.1. Efficient Masked Guided Instance Matting

    3.2. Feature-Matte Temporal Consistency

  4. Instance Matting Datasets

    4.1. Image Instance Matting and 4.2. Video Instance Matting

  5. Experiments

    5.1. Pre-training on image data

    5.2. Training on video data

  6. Discussion and References

Supplementary Material

  7. Architecture details

  8. Image matting

    8.1. Dataset generation and preparation

    8.2. Training details

    8.3. Quantitative details

    8.4. More qualitative results on natural images

  9. Video matting

    9.1. Dataset generation

    9.2. Training details

    9.3. Quantitative details

    9.4. More qualitative results

4.1. Image Instance Matting

We derived the Image Human Instance Matting 50K (IHIM50K) training dataset from HHM50K [50], which features multiple human subjects per image. The dataset contains 49,737 synthesized images with 2-5 instances each, created by compositing human foregrounds onto random backgrounds; the binary guidance masks are obtained by perturbing the ground-truth alpha mattes. For benchmarking, we used HIM2K [49] and additionally created the Mask HIM2K (M-HIM2K) set to test robustness against the varying mask quality produced by off-the-shelf instance segmentation models (as shown in Fig. 3). Details on the generation process are available in the supplementary material.
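
To make the synthesis step concrete, here is a minimal sketch (not the authors' released pipeline): it composites one human foreground onto a random background and derives a coarse binary guidance mask by binarizing the ground-truth alpha matte and randomly dilating or eroding it. Function names, thresholds, and kernel sizes are assumptions.

```python
# Illustrative sketch of IHIM50K-style synthesis; parameters are assumptions.
import cv2
import numpy as np

def composite_instance(fg_rgb, alpha, bg_rgb):
    """Alpha-composite one foreground instance over a background.
    All inputs are uint8 arrays; alpha is a single-channel matte in [0, 255]."""
    a = alpha[..., None].astype(np.float32) / 255.0
    return (a * fg_rgb + (1.0 - a) * bg_rgb).astype(np.uint8)

def make_guidance_mask(alpha, thresh=128, max_kernel=15):
    """Binarize the alpha matte, then randomly dilate or erode it to
    imitate the imperfect masks produced by segmentation models."""
    binary = (alpha >= thresh).astype(np.uint8)
    k = np.random.randint(3, max_kernel) | 1  # random odd kernel size
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    if np.random.rand() < 0.5:
        binary = cv2.dilate(binary, kernel)
    else:
        binary = cv2.erode(binary, kernel)
    return binary

# Usage: given an (fg, alpha) pair and a background of the same size,
# image = composite_instance(fg, alpha, bg)
# mask  = make_guidance_mask(alpha)
```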

4.2. Video Instance Matting

Our video instance matting dataset, synthesized from VM108 [57], VideoMatte240K [33], and CRGNN [52], comprises two subsets: V-HIM2K5 for training and V-HIM60 for testing. We categorized the dataset into three difficulty levels based on the degree of instance overlap; Table 2 summarizes the synthesized datasets. Training masks are generated by applying dilation and erosion to binarized alpha mattes, while testing masks are produced by XMem [8]. Further details on dataset synthesis and the difficulty levels are provided in the supplementary material.
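
As an illustration of how an overlap score could drive the difficulty split, here is a small sketch under our own assumptions (the paper's exact criterion and thresholds are in the supplementary material): it averages pairwise IoU between binarized instance mattes over the frames of a clip and buckets the result into three levels, labeled easy/medium/hard here for illustration.

```python
# Hypothetical overlap-based difficulty scoring; thresholds are assumptions.
from itertools import combinations
import numpy as np

def frame_overlap(alphas, thresh=0.5):
    """Mean pairwise IoU between instance masks in one frame.
    `alphas` has shape (num_instances, H, W) with values in [0, 1]."""
    masks = alphas >= thresh
    ious = []
    for a, b in combinations(range(len(masks)), 2):
        inter = np.logical_and(masks[a], masks[b]).sum()
        union = np.logical_or(masks[a], masks[b]).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious)) if ious else 0.0

def difficulty_level(video_alphas, easy_t=0.05, medium_t=0.2):
    """Average per-frame overlap over a clip and map it to a level.
    `video_alphas` has shape (num_frames, num_instances, H, W)."""
    score = float(np.mean([frame_overlap(f) for f in video_alphas]))
    if score < easy_t:
        return "easy"
    if score < medium_t:
        return "medium"
    return "hard"
```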

Authors:

(1) Chuong Huynh, University of Maryland, College Park ([email protected]);

(2) Seoung Wug Oh, Adobe Research (seoh,[email protected]);

(3) Abhinav Shrivastava, University of Maryland, College Park ([email protected]);

(4) Joon-Young Lee, Adobe Research ([email protected]).


This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

