Uncertainty-Aware Adaptive Pseudo-Labeling for Referring Video Object Segmentation
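DOI: 10.12677/mos.2025.142147. Supported by the National Natural Science Foundation of China.
Authors: Zhang Shiming, Chen Zhiqian, Mi Jinpeng*: Institute of Machine Intelligence, University of Shanghai for Science and Technology, Shanghai
Keywords: Referring Video Object Segmentation, Pseudo-Labeling, Uncertainty-Aware Refinement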
Abstract: Referring video object segmentation (RVOS) is an emerging multimodal task that aims to segment target regions in video clips by understanding the semantics of a given referring expression. However, the annotations of the benchmark datasets are collected in a semi-supervised manner and provide ground-truth object masks only for the first frame of each video. To exploit the knowledge hidden in the unlabeled data within a more integrated framework, we introduce online pseudo-labeling for RVOS. Specifically, checkpoints learned on the fly in previous training epochs serve as the teacher model, which generates pseudo-labels on the unlabeled video frames; these pseudo-labels then augment the training data and supervise the subsequent training stage. To avoid the confusion introduced by pseudo-labels, we propose an uncertainty-aware refinement strategy that adaptively rectifies the generated pseudo-labels according to the model's prediction confidence. We conduct extensive experiments on the benchmark datasets Refer-YouTube-VOS and Refer-DAVIS17 to validate the proposed approach. The results show that our model is competitive with state-of-the-art models.
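The idea of confidence-based pseudo-label refinement can be illustrated with a minimal sketch. This is not the paper's implementation: the entropy-based uncertainty measure, the `max_entropy` threshold, and the function names `binary_entropy`, `refine_pseudo_label`, and `masked_bce` are all assumptions chosen for illustration. The sketch binarizes the teacher's foreground probabilities into a pseudo mask, flags pixels whose prediction entropy is low enough to trust, and computes a student loss over the trusted pixels only.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy (in bits) of a Bernoulli foreground probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def refine_pseudo_label(probs, max_entropy=0.5):
    """Turn teacher probabilities into a pseudo mask plus a per-pixel keep flag.

    probs: 2D list of foreground probabilities predicted by the teacher.
    max_entropy: hypothetical uncertainty threshold (an assumption);
                 pixels with higher entropy are excluded from supervision.
    """
    mask, keep = [], []
    for row in probs:
        mask.append([1 if p >= 0.5 else 0 for p in row])
        keep.append([1 if binary_entropy(p) <= max_entropy else 0 for p in row])
    return mask, keep


def masked_bce(student_probs, mask, keep, eps=1e-7):
    """Binary cross-entropy averaged over the kept (confident) pixels only."""
    total, count = 0.0, 0
    for s_row, m_row, k_row in zip(student_probs, mask, keep):
        for s, m, k in zip(s_row, m_row, k_row):
            if k:
                s = min(max(s, eps), 1.0 - eps)
                total += -(m * math.log(s) + (1 - m) * math.log(1.0 - s))
                count += 1
    return total / count if count else 0.0


# Toy 2x2 "frame": the pixel with probability 0.55 is too uncertain to trust.
teacher_probs = [[0.95, 0.55], [0.05, 0.02]]
student_probs = [[0.80, 0.40], [0.20, 0.10]]
mask, keep = refine_pseudo_label(teacher_probs)
loss = masked_bce(student_probs, mask, keep)
```

In this toy case the ambiguous pixel is dropped from the loss rather than being forced to a hard label, which is one simple way to keep unreliable pseudo-labels from misleading the student.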
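Citation: Zhang Shiming, Chen Zhiqian, Mi Jinpeng. Uncertainty-Aware Adaptive Pseudo-Labeling for Referring Video Object Segmentation[J]. Modeling and Simulation, 2025, 14(2): 236-244. https://doi.org/10.12677/mos.2025.142147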
