
VEATIC: Video-based Emotion and Affect Tracking in Context Dataset: VEATIC Dataset

Too Long; Didn't Read

In this paper, researchers introduce the VEATIC dataset for human affect recognition, addressing limitations of existing datasets and enabling context-based inference of emotion.

This paper is available on arxiv under CC 4.0 license.


(1) Zhihang Ren, University of California, Berkeley; equal contribution (Email: [email protected]);

(2) Jefferson Ortega, University of California, Berkeley; equal contribution (Email: [email protected]);

(3) Yifan Wang, University of California, Berkeley; equal contribution (Email: [email protected]);

(4) Zhimin Chen, University of California, Berkeley (Email: [email protected]);

(5) Yunhui Guo, University of Texas at Dallas (Email: [email protected]);

(6) Stella X. Yu, University of California, Berkeley and University of Michigan, Ann Arbor (Email: [email protected]);

(7) David Whitney, University of California, Berkeley (Email: [email protected]).

3. VEATIC Dataset

In this section, we introduce the Video-based Emotion and Affect Tracking in Context Dataset (VEATIC). First, we describe how we obtained all the video clips. Next, we illustrate the data annotation and pre-processing procedures. Finally, we report important dataset statistics and visualize data analysis results.

3.1. Video Clips Acquisition

All video clips used in the dataset were acquired from YouTube, an online video-sharing website, and were selected on the basis that the emotions/affect of the characters in the clips should vary across time. In total, the VEATIC dataset contains 124 video clips: 104 from Hollywood movies, 15 from home videos, and 5 from documentaries or reality TV shows. Sample frames from the VEATIC dataset are shown in Figure 2. These videos contain zero to multiple interacting characters. All sound was removed from the videos so that observers only had access to visual information when tracking the emotion of the target character.

Figure 2. Overview of video frames in VEATIC. We sampled 4 key frames from each of 5 videos in our dataset. Unlike other datasets whose video clips come from a single source, the clips in VEATIC come from different sources, including Hollywood movies, documentaries, and homemade videos; models trained on our dataset should therefore generalize better. On the visual side, VEATIC contains varied context information, including different backgrounds, lighting conditions, and character interactions, making the dataset more representative of daily life. Finally, the emotion/affect of the selected character varies considerably within each clip, which makes modeling a character's affect in VEATIC more challenging.

3.2. Data Annotation and Pre-processing

In total, we had 192 observers who participated in the annotation of the videos in the dataset. All participants provided signed consent in accordance with the guidelines and regulations of the UC Berkeley Institutional Review Board and all experimental procedures were approved.

Participants watched and rated a total of 124 videos in the dataset. To prevent observers from getting fatigued, we split the annotation procedure into two 1-hour-and-30-minute sessions. Before annotating any videos, participants were shown a printed version of the valence-arousal affect rating grid with example emotions labeled at different locations of the grid according to the ratings provided by Bradley and Lang (1999) [6]. Annotators were instructed to familiarize themselves with the dimensions and the sample word locations, which they would later use during annotation. After familiarizing themselves with the affect rating grid, participants completed a two-minute practice annotation in which they continuously tracked the valence and arousal of a target character in a video (Figure 3b). Annotators tracked the target character's valence and arousal by continuously moving their mouse pointer in real-time within the 2D valence-arousal grid, whose coordinates map to valence and arousal ratings in the range [−1, 1]. To control for potential motor biases, we counterbalanced the valence-arousal dimensions between participants: half of the annotators had valence on the x-axis and arousal on the y-axis, and the other half had the dimensions flipped so that arousal was on the x-axis and valence was on the y-axis. Once observers finished the practice annotation session, they began annotating the videos in the dataset.
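The grid-to-rating mapping and axis counterbalancing described above can be sketched as follows. This is a minimal illustration, not the authors' experiment code: the function name, the square-grid geometry, and the pixel parameters are all assumptions.

```python
def pointer_to_rating(x_px, y_px, grid_left, grid_top, grid_size, arousal_on_x=False):
    """Map a mouse position inside a square rating grid to (valence, arousal)
    in [-1, 1]. `arousal_on_x` flips the axes for the counterbalanced group."""
    # Normalize pixel coordinates to [0, 1] within the grid, then rescale to [-1, 1].
    u = (x_px - grid_left) / grid_size * 2.0 - 1.0
    v = (y_px - grid_top) / grid_size * 2.0 - 1.0
    v = -v  # screen y grows downward; flip so "up" means a higher rating
    # Half of the annotators had the dimensions swapped.
    if arousal_on_x:
        arousal, valence = u, v
    else:
        valence, arousal = u, v
    # Clamp in case the pointer briefly leaves the grid.
    clamp = lambda r: max(-1.0, min(1.0, r))
    return clamp(valence), clamp(arousal)
```

A pointer at the grid center maps to neutral affect (0, 0), and the top-right corner maps to (1, 1) for the default axis assignment.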

Figure 3. User interface used for video annotation. a) Participants were first shown the target character and were reminded of the task instructions before the start of each video. b) The overlaid valence and arousal grid that was present while observers annotated the videos. Observers were instructed to continuously rate the emotion of the target character in the video in real-time. If observers did not move their mouse for more than 10 seconds, the response rating grid would flash to remind the observer to continuously rate the emotion.

Before starting the annotations, participants were shown an image with the target character circled (Figure 3a), informing them which character to track when the video began. They then annotated the video clips in real-time. At the end of each video annotation, participants reported their familiarity with the clip on a 1-5 discrete Likert scale ranging over "Not familiar", "Slightly familiar", "Somewhat familiar", "Moderately familiar", and "Extremely familiar". Participants were also asked about their level of enjoyment while watching the clip, rated on a 1-9 discrete Likert scale from 1 (Not Enjoyable) to 9 (Extremely Enjoyable). Additionally, to keep participants from becoming bored, the 124 video clips were split across the two annotation sessions, and participants rated the clips in the two sessions separately.

During each trial, we assessed whether participants were paying attention by tracking how long they kept the mouse pointer at any single location. If the duration exceeded 10 seconds, the affect rating grid would begin to fluctuate, reminding participants to continue tracking the emotion of the target character. To assess whether there were any noisy annotators in our dataset, we computed each annotator's agreement with the consensus by calculating the Pearson correlation between that annotator's ratings and the leave-one-out consensus (the aggregate of all responses except the current annotator's) for each video. Only one annotator had a correlation lower than 0.2 with the leave-one-out consensus across all videos. Since only one annotator fell below our threshold, we kept that annotator in the dataset so as not to remove potentially important alternative annotations of the videos.
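The leave-one-out screening step can be sketched in a few lines of NumPy. This is a minimal sketch under the assumption that ratings for one video are stored as an annotator-by-frame array; the function name is hypothetical.

```python
import numpy as np

def leave_one_out_agreement(ratings):
    """ratings: (n_annotators, n_frames) array of valence or arousal traces
    for one video. Returns each annotator's Pearson correlation with the
    mean trace of all *other* annotators (the leave-one-out consensus)."""
    n_annotators = ratings.shape[0]
    agreements = []
    for i in range(n_annotators):
        # Consensus excludes the current annotator's own ratings.
        consensus = np.delete(ratings, i, axis=0).mean(axis=0)
        r = np.corrcoef(ratings[i], consensus)[0, 1]
        agreements.append(r)
    return np.array(agreements)
```

Annotators whose agreement falls below a threshold (0.2 in the paper) would then be flagged; as described above, the single flagged annotator was nevertheless retained.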

Figure 4. Visualization of sample mean ratings of valence and arousal for specific video clips with the zoom-in view of the selected character. We show key frames related to specific mean ratings of valence and arousal. Corresponding frames and ratings are marked the same color.

3.3. Visualization and Data Analysis

Figure 4 shows sample mean ratings and key frames from 2 different video clips. Both valence and arousal span a wide range of ratings. Moreover, the figure shows that context information, spatial and/or temporal, plays an important role in emotion recognition tasks. In the valence example (upper panel), without the temporal and/or spatial context of the fight, it would be hard to tell whether the character (the woman) in the last frame (yellow) is happily surprised or astonished. In the arousal example (lower panel), even without seeing the selected character's face, observers can easily and consistently infer the character's arousal from the intense context.

Figure 5. Example valence and arousal ratings for a single video (video 47). Transparent gray lines indicate individual subject ratings and the green line is the average rating across participants.

Figure 5 illustrates sample valence and arousal ratings of all participants for a single video in our dataset. Individual subjects' ratings (gray lines) followed the consensus rating across participants (green line) for both valence and arousal. The dense overlap of gray lines around the green consensus line indicates agreement across a wide range of observers. Additionally, we investigated how observers' responses varied across videos by calculating the standard deviation across observers for each video. The variance between observers was small for both dimensions: valence had an average standard deviation of 0.248 and a median of 0.222, and arousal had an average standard deviation of 0.248 and a median of 0.244, comparable to the valence and arousal rating variance in EMOTIC [32].

The distribution of the valence and arousal ratings across all of our videos is shown in Figure 6. Individual participant ratings were distributed fully across both valence and arousal dimensions, which highlights the diversity of the VEATIC dataset. We also collected familiarity and enjoyment ratings for each video across participants (shown in Figure 7). Observers were unfamiliar with the videos used in the dataset: the average familiarity rating was 1.61 for video IDs 0-97. Additionally, observers gave an average enjoyment rating of 4.98 for video IDs 0-97, indicating that they moderately enjoyed watching and annotating the video clips. Familiarity and enjoyment ratings were not collected for video IDs 98-123, as the annotations for these videos were collected at an earlier point during data collection that did not include these ratings.
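The rating distribution in Figure 6 is built by binning valence-arousal pairs into 0.02 intervals over the [−1, 1] grid. A minimal sketch of that binning, with a hypothetical function name:

```python
import numpy as np

def affect_histogram(valence, arousal, n_bins=100):
    """Bin paired valence/arousal ratings over [-1, 1] x [-1, 1].
    With n_bins=100, each bin spans a 0.02 interval, as in Figure 6."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    counts, _, _ = np.histogram2d(valence, arousal, bins=[edges, edges])
    return counts  # (n_bins, n_bins) array of data-point counts per bin
```

Plotting `counts` as a heatmap, with per-video averages overlaid as markers by source type, reproduces the layout described in the Figure 6 caption.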

Figure 6. Distribution of valence and arousal ratings across participants. Individual white dots represent the average valence and arousal of the continuous ratings for each video clip for Hollywood movies. Blue squares and green triangles represent the average valence and arousal for documentaries and home videos, respectively. Ratings were binned into 0.02 intervals and the total number of data points was counted within each bin.

Figure 7. Familiarity and enjoyment ratings across videos for video IDs 0-97. Vertical black lines indicate 1 SD.

Table 2 below summarizes the basic statistics of the VEATIC dataset. In a nutshell, VEATIC offers a long total video clip duration and a variety of video sources covering a wide range of contexts and emotional conditions. Moreover, compared to previous datasets, we recruited far more participants to annotate the ratings.

Table 2. Statistics of VEATIC Dataset.
