In this paper, we introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning.
This paper is available on arXiv under a CC 4.0 license.


(1) Zhouqiang Jiang, Meetyou AI Lab,

(2) Bowen Wang, Institute for Datability Science, Osaka University,

(3) Tong Xiang, Meetyou AI Lab,

(4) Zhaofeng Niu, Department of Computer Science, Qufu Normal University,

(5) Hong Tang, Department of Information Engineering, East China Jiaotong University,

(6) Guangshun Li, Department of Computer Science, Qufu Normal University,

(7) Liangzhi Li, Meetyou AI Lab.


In this paper, we propose CatMAE for self-supervised video representation learning. It uses a concatenated information channel masking strategy to address the limitations of cube masking, and it captures continuous and long-term motion better than asymmetric masking does. Our experimental results show that it outperforms state-of-the-art methods on both video segmentation and action recognition tasks. One distinctive feature of our training pipeline is that reconstruction information from the initial frame is propagated through the entire video sequence. Because this propagation is theoretically unlimited in length, it shows CatMAE's potential to learn long-term video representations. Our future work will focus on extending CatMAE to real-world scenarios involving embodied agents, such as robots.
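
As a rough illustration of the masking idea summarized above, the following minimal sketch builds per-frame patch masks in which the initial frame stays fully visible and every later frame is heavily masked. It is not the authors' implementation: the function name `concat_frame_masks`, the 0.95 mask ratio, the patch count, and the use of NumPy are all assumptions made for this example.

```python
import numpy as np


def concat_frame_masks(num_frames, num_patches, mask_ratio=0.95, seed=0):
    """Return a boolean array of shape (num_frames, num_patches).

    True marks a masked (hidden) patch. Frame 0 is left fully visible;
    each later frame gets its own random mask. The ratio is an assumed
    value for illustration, not one taken from this section.
    """
    rng = np.random.default_rng(seed)
    masks = np.zeros((num_frames, num_patches), dtype=bool)
    num_masked = int(round(mask_ratio * num_patches))
    for t in range(1, num_frames):
        # Sample distinct patch indices to hide in frame t.
        hidden = rng.choice(num_patches, size=num_masked, replace=False)
        masks[t, hidden] = True
    return masks


# Hypothetical setup: 3 frames, 196 patches each (a 14 x 14 grid).
masks = concat_frame_masks(num_frames=3, num_patches=196)
visible_per_frame = (~masks).sum(axis=1)
print(visible_per_frame)
```

In this sketch, the visible patches of all frames would be concatenated and passed to the encoder, so each heavily masked frame has to be reconstructed mostly from information carried forward from earlier frames.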


