Authors:
(1) Zhihang Ren, University of California, Berkeley and these authors contributed equally to this work (Email: [email protected]);
(2) Jefferson Ortega, University of California, Berkeley and these authors contributed equally to this work (Email: [email protected]);
(3) Yifan Wang, University of California, Berkeley and these authors contributed equally to this work (Email: [email protected]);
(4) Zhimin Chen, University of California, Berkeley (Email: [email protected]);
(5) Yunhui Guo, University of Texas at Dallas (Email: [email protected]);
(6) Stella X. Yu, University of California, Berkeley and University of Michigan, Ann Arbor (Email: [email protected]);
(7) David Whitney, University of California, Berkeley (Email: [email protected]).
Recently, several datasets have provided frames with both facial and context information, such as CAER [33] and EMOTIC [32]. CAER [33] is a video-based dataset that contains categorical labels for each video frame, and EMOTIC [32] is an image-based dataset containing both categorical expression labels and continuous valence-arousal-dominance ratings. Unlike these datasets, our dataset is video-based and contains continuous valence and arousal ratings. A detailed comparison between our dataset and previous datasets can be found in Table 1.
Building on these emotion datasets, studies have begun to focus on inferring emotion automatically. Human affect can be inferred from many modalities, such as audio [70, 68, 65], visual [40, 54, 55, 37], and text [68, 22]. For visual inputs in particular, there are three major tasks.
The valence-arousal estimation task aims to predict the valence and arousal of each image/frame [71, 69, 29, 30]; the expression recognition task focuses on classifying the emotional category of each image/frame [66, 57, 67]; and the action unit (AU) detection task aims to detect facial muscle actions in each image/frame [25, 56, 35, 64]. Currently, most proposed methods rely heavily on the facial area to infer the emotional state. Indeed, the facial area contains rich information about the human emotional state. However, contextual factors also provide information that is essential for humans to correctly infer and perceive the emotional states of others [8, 9, 10]. Several studies [33, 32, 40] have started to incorporate context information as a source of affect inference. In this study, we also adopt both facial and context information to address the new task, i.e., inferring the valence and arousal of each video frame, as illustrated in the sketch below.
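As a concrete illustration of combining facial and context information for frame-level valence-arousal regression, the following is a minimal sketch of a two-branch model that fuses features from a cropped face and the full frame. The module names, backbones, and dimensions are illustrative assumptions, not the architecture used in this work.

```python
# Illustrative two-branch valence-arousal regressor fusing face and context
# features. Backbones and sizes are placeholders, not the paper's model.
import torch
import torch.nn as nn
from torchvision import models

class FaceContextRegressor(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # One CNN backbone for the cropped face, one for the full frame (context).
        self.face_branch = models.resnet18(weights=None)
        self.face_branch.fc = nn.Identity()
        self.context_branch = models.resnet18(weights=None)
        self.context_branch.fc = nn.Identity()
        # Fuse the two feature vectors and regress valence and arousal.
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # [valence, arousal]
        )

    def forward(self, face, context):
        # face, context: (batch, 3, H, W) image tensors
        fused = torch.cat(
            [self.face_branch(face), self.context_branch(context)], dim=1
        )
        return self.head(fused)

model = FaceContextRegressor()
va = model(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224))
print(va.shape)  # torch.Size([4, 2])
```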
To infer the affect of a person, we usually need to deal with temporal information from audio segments, video frames, or words. Many studies [68, 69, 29, 30] have utilized long short-term memory (LSTM) [23], gated recurrent units (GRU) [11], or recurrent neural networks (RNN) [24, 50] to process temporal information. With the emergence of the Vision Transformer (ViT) [14], attention has shifted: many video understanding tasks [19, 1, 36] have utilized ViT for temporal modeling and achieved state-of-the-art performance. Our baseline method also adopts ViT to process the temporal information in video clips.
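The sketch below illustrates the general idea of transformer-based temporal modeling over per-frame features; the layer sizes, positional embedding, and prediction head are placeholder assumptions rather than the baseline configuration described in this paper.

```python
# Illustrative transformer encoder over per-frame features for temporal
# modeling; hyperparameters are placeholders, not the paper's baseline.
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    def __init__(self, feat_dim=512, n_heads=8, n_layers=2, max_len=64):
        super().__init__()
        # Learned positional embeddings so attention can use frame order.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 2)  # per-frame [valence, arousal]

    def forward(self, frame_feats):
        # frame_feats: (batch, n_frames, feat_dim), e.g. CNN features per frame
        x = frame_feats + self.pos_embed[:, : frame_feats.size(1)]
        return self.head(self.encoder(x))

model = TemporalTransformer()
out = model(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 2])
```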
This paper is available on arxiv under CC 4.0 license.