Concatenated Masked Autoencoders as Spatial-Temporal Learner: Abstract & Intro by @concatenation


Too Long; Didn't Read

In this paper, we introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Zhouqiang Jiang, Meetyou AI Lab,

(2) Bowen Wang, Institute for Datability Science, Osaka University,

(3) Tong Xiang, Meetyou AI Lab,

(4) Zhaofeng Niu, Department of Computer Science, Qufu Normal University,

(5) Hong Tang, Department of Information Engineering, East China Jiaotong University,

(6) Guangshun Li, Department of Computer Science, Qufu Normal University,

(7) Liangzhi Li, Meetyou AI Lab.

Abstract

Learning representations from videos requires understanding continuous motion and visual correspondences between frames. In this paper, we introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning. For the input sequence of video frames, CatMAE keeps the initial frame unchanged while applying substantial masking (95%) to subsequent frames. The encoder in CatMAE encodes the visible patches of each frame individually; subsequently, for each masked frame, the decoder leverages visible patches from both previous and current frames to reconstruct the original image. Our proposed method enables the model to estimate the motion information between visible patches, match the correspondences between preceding and succeeding frames, and ultimately learn the evolution of scenes. Furthermore, we propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model’s reconstruction targets. This further encourages the model to utilize continuous motion details and correspondences to complete the reconstruction, thereby enhancing the model’s capabilities. Compared to the most advanced pre-training methods, CatMAE achieves leading performance on video segmentation and action recognition tasks. Code is available at https://github.com/minhoooo1/CatMAE.


Fig. 1: Visualizations on the Kinetics-400 [7] validation set (masking rate 90%). For each video sequence, we sample 6 frames with a frame gap of 4. Each subfigure displays the original frames (top), masked future frames (middle), and CatMAE reconstruction results (bottom).

I. INTRODUCTION

In the realm of image pre-training, Masked Autoencoders (MAE) [1] have demonstrated their effectiveness in learning visual representations by reconstructing missing patches from randomly masked input images. Recent research extends this paradigm to video pre-training [2], [3], [4], [5]. Existing studies consider the spatio-temporal specificity of masking strategies in order to reduce inductive bias. However, they either overlook the differences between the spatial and temporal dimensions [6] or neglect the ability to model continuous motion. This leads to a dilemma: the learned representations struggle to simultaneously capture both continuous motion information and the correspondences between video frames.


Properly handling the temporal dimension in videos is essential to effectively utilize the masking and reconstruction paradigm. VideoMAE [2] and MAE-ST [3] expand 2D image patches into 3D cubes in videos, applying a self-supervised mask reconstruction pipeline on these cubes. However, the semantics of video frames vary slowly over time [8], creating temporal redundancy, which increases the risk of reconstructing missing cubes solely from the spatial-temporal neighborhood. Therefore, to prevent the model from learning shortcuts by exploiting this leaked information during reconstruction, a very high mask rate is considered a straightforward universal solution, which also significantly reduces computational cost [3]. The pre-trained models from these methods demonstrate outstanding transferability on action recognition tasks.
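For concreteness, the snippet below is a minimal sketch of one common variant of this idea, tube masking at a very high ratio, where a single spatial mask is repeated along time so neighboring frames cannot leak co-located patches. The function name and the exact ratio are illustrative assumptions, not the cited implementations.

```python
# Minimal sketch (not the cited papers' code) of high-ratio tube masking:
# one 2D spatial mask is sampled per clip and repeated along the time axis,
# so a very high ratio (e.g. 90%) is needed to keep the model from simply
# copying co-located patches from neighboring frames.
import torch

def sample_tube_mask(num_frames: int, num_patches: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Return a (num_frames, num_patches) boolean mask; True = masked."""
    num_masked = int(num_patches * mask_ratio)
    # Randomly choose which spatial patches to hide, identically for every frame.
    noise = torch.rand(num_patches)
    masked_ids = noise.argsort()[:num_masked]
    frame_mask = torch.zeros(num_patches, dtype=torch.bool)
    frame_mask[masked_ids] = True
    return frame_mask.unsqueeze(0).expand(num_frames, -1)

mask = sample_tube_mask(num_frames=16, num_patches=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.float().mean().item())  # torch.Size([16, 196]), ~0.9
```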


Yet, utilizing 3D cube masking in the spatial-temporal dimension is sub-optimal for learning inter-frame correspondences; for instance, Image MAEs outperform VideoMAE in video segmentation tasks [9]. Therefore, SiamMAE [4] proposes to use an asymmetric mask strategy for reconstructing future frames [10], [11] to learn inter-frame correspondences, and shows superiority in fine-grained correspondence tasks such as video object segmentation [9], video part segmentation [12], and pose tracking [13].


Continuous motion information and long-term correspondences usually span an extended period within a video. As such, predicting and reconstructing future frames over a longer interval with SiamMAE's asymmetric masking strategy is inherently hard because of the ambiguity of distant frames. In extreme cases, the model may not be able to model any effective motion information, as the scenes in the two sampled frames could be entirely different.


To capture continuous motion information and long-term correspondences in videos, we propose Concatenated Masked AutoEncoders (CatMAE), which uses a concatenated information channel masking strategy to enhance the learning ability of the encoder. In our method, we first chronologically select N frames from a video clip, keeping all patches of the first frame visible and then performing random masking on the patches of subsequent frames with an extremely high mask ratio. The encoder encodes the visible patches from the N frames separately, and then the decoder reconstructs the masked patches for the N-1 subsequent frames. Note that when reconstructing each frame’s missing patches, the decoder utilizes cross-attention to receive information from the visible patches of previous and current frames. This concatenated information channel masking assists in modeling motion differences and correspondences, eventually estimating the long-term dynamic evolution of video frames. As shown in Fig. 1, CatMAE is able to reconstruct a long video sequence. Furthermore, inspired by the bidirectionality of time, we hypothesize that reconstructing reversed actions can reinforce the understanding of actions, and propose a video reverse data augmentation method, Video-Reverse (ViRe), to enhance representation learning.
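The sketch below illustrates only the input preparation implied by this description: the first frame stays fully visible, each later frame keeps a small fraction of its patches, and ViRe optionally reverses the frame order. Names such as prepare_catmae_inputs are hypothetical, the encoder and cross-attention decoder are omitted, and this is an assumption-laden illustration rather than the released CatMAE code.

```python
# Illustrative sketch (assumptions, not the released CatMAE code) of the
# concatenated masking setup: frame 0 is left fully visible, frames 1..N-1
# are masked at a very high ratio, and Video-Reverse (ViRe) optionally flips
# the temporal order so reversed motion becomes the reconstruction target.
import torch

def prepare_catmae_inputs(frames: torch.Tensor,
                          mask_ratio: float = 0.95,
                          video_reverse: bool = False):
    """frames: (N, num_patches, dim) patch embeddings of N sampled frames.

    Returns the (possibly reversed) frames and an (N, num_patches) boolean
    mask (True = masked) matching the asymmetric scheme described above.
    """
    if video_reverse:                        # ViRe augmentation (assumed toggle)
        frames = frames.flip(dims=[0])       # reverse the temporal order
    num_frames, num_patches, _ = frames.shape
    masks = [torch.zeros(num_patches, dtype=torch.bool)]  # frame 0 fully visible
    num_masked = int(num_patches * mask_ratio)
    for _ in range(num_frames - 1):
        ids = torch.rand(num_patches).argsort()[:num_masked]
        m = torch.zeros(num_patches, dtype=torch.bool)
        m[ids] = True                        # independent random mask per later frame
        masks.append(m)
    return frames, torch.stack(masks)        # (N, P, D), (N, P)

# Example: 3 frames of a 14x14 patch grid with 768-dim embeddings.
frames, masks = prepare_catmae_inputs(torch.randn(3, 196, 768), video_reverse=True)
print(masks.float().mean(dim=1))  # ~[0.00, 0.95, 0.95]
```

In the full pipeline, the visible patches of each frame would be encoded separately and the decoder would reconstruct each masked frame by cross-attending to the visible patches of the previous and current frames.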


In conclusion, CatMAE enriches the exploration of the masking and reconstruction pipeline [1] in the field of self-supervised video representation learning. On video object segmentation tasks, CatMAE based on ViT-S/16 improves by 2% in J&Fm. Additionally, we also achieve leading results in the action recognition task. Experimental results show that our proposed CatMAE and ViRe unlock the full potential of the masking and reconstruction paradigm within the spatial-temporal domain.