Multitask Contrastive Learning for Self-Supervised Video Representation

doi:10.12677/CSA.2023.133041

Authors: 单东风, 于磊, 骆文杰, 熊思璇, 刘家仁, 吴克伟 (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui)
Keywords: Self-Supervised; Spatial Feature; Temporal Feature; Multitask Contrastive Learning Method; Spatiotemporal Self-Attention

Abstract: Most existing self-supervised methods rely on a single spatial or temporal pretext task. A single pretext task provides only one supervisory signal from unlabeled data, which is insufficient to capture the difference between spatial and temporal features in video representation learning. In this paper, we propose a multitask contrastive learning method, an attentive spatiotemporal contrastive learning network, that learns discriminative spatiotemporal features with spatiotemporal self-attention through contrastive learning across multiple spatial and temporal pretext tasks. The spatial pretext tasks, spatial rotation and spatial jigsaw, learn different spatial features. The temporal pretext tasks, temporal order and temporal pace, learn different temporal features. We represent each video by multiple features, one per pretext task, and design a pretext-task-based contrastive loss to separate the spatial and temporal features learned from the same video. This loss encourages different pretext tasks to learn dissimilar features and the same pretext task to learn similar features, so that discriminative features are learned for each pretext task within a video. Experiments show that the method outperforms existing self-supervised learning methods on action recognition on the UCF-101 and HMDB-51 datasets.
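The idea behind the pretext-task-based contrastive loss can be illustrated with a small sketch. The abstract does not give the exact formulation, so the code below assumes an InfoNCE-style objective with cosine similarity and a temperature; the function names `cosine` and `task_contrastive_loss`, the temperature value, and the toy 2-D features are all hypothetical and are not taken from the paper. Features that share a pretext task are treated as positives, and features from other pretext tasks of the same video as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def task_contrastive_loss(features, tasks, temperature=0.1):
    """Assumed InfoNCE-style loss over the features of one video.

    features: list of feature vectors, one per (clip, pretext task) pair.
    tasks:    list of pretext-task labels aligned with `features`.
    Pairs with the same label are positives; all other pairs are negatives.
    """
    n = len(features)
    total, count = 0.0, 0
    for i in range(n):
        # Exponentiated similarities from anchor i to every other feature.
        sims = [
            math.exp(cosine(features[i], features[j]) / temperature) if j != i else 0.0
            for j in range(n)
        ]
        denom = sum(sims)
        for j in range(n):
            if j != i and tasks[j] == tasks[i]:
                total += -math.log(sims[j] / denom)
                count += 1
    return total / count if count else 0.0


# Toy example: two clips per task, for one spatial and one temporal pretext task.
tasks = ["rotation", "rotation", "pace", "pace"]
# Same-task features are close, different-task features are far apart.
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same-task features are far apart, different-task features are close.
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_separated = task_contrastive_loss(separated, tasks)
loss_mixed = task_contrastive_loss(mixed, tasks)
```

Under these assumptions, `loss_separated` comes out lower than `loss_mixed`, which matches the stated goal: the loss is small when each pretext task yields its own cluster of features and large when features from different pretext tasks overlap.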

Cite this article: 单东风, 于磊, 骆文杰, 熊思璇, 刘家仁, 吴克伟. Multitask Contrastive Learning for Self-Supervised Video Representation [J]. Computer Science and Application, 2023, 13(3): 433-443. https://doi.org/10.12677/CSA.2023.133041
