3D Human Pose Estimation Method in Video Sequences Based on Neighbor Convolution Transformer
DOI: 10.12677/csa.2026.162052. Supported by research project funding.
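Authors: Pan Shuaijie (潘帅杰): College of Computer Science and Artificial Intelligence, Wenzhou University, Metaverse and Artificial Intelligence Research Center, Wenzhou, Zhejiang; Zhi Yu (智宇), Chen Ang (陈昂)*: College of Computer Science and Artificial Intelligence, Wenzhou University, Metaverse and Artificial Intelligence Research Center, Wenzhou, Zhejiang; Institute of Metaverse and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang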
Keywords: 3D Human Pose Estimation, Spatiotemporal Transformer, Convolution, Axial Multi-Layer Perceptron
Abstract: In recent years, Transformer-based methods have made significant progress in monocular 3D human pose estimation, owing to a self-attention mechanism that effectively captures global features and long-range dependencies. However, most existing approaches focus on building global spatiotemporal dependencies, and their interaction mechanisms lack an explicit inductive bias toward local spatiotemporal structure, particularly the strong correlations between adjacent frames. As a result, the tight, structured temporal relationships among neighboring frames may be under-exploited. To address this, this paper proposes a novel attention architecture, the Neighbor Convolution Transformer (NCFormer), which explicitly models dependencies between neighboring frames through neighbor-frame convolution and an axial multi-layer perceptron. Specifically, NCFormer consists of three core components: (1) a multi-head self-attention module that captures global spatiotemporal dependencies; (2) a neighbor convolution module that applies convolution kernels along the temporal direction to extract relationships among neighboring frames; and (3) an axial multi-layer perceptron that transforms features independently along the temporal and spatial dimensions, avoiding indiscriminate mixing of cross-dimensional information so that the model can focus on dimension-specific patterns. Experiments on two widely used 3D human pose estimation benchmarks, Human3.6M and MPI-INF-3DHP, show that NCFormer achieves highly competitive performance across a range of evaluation settings.
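To make components (2) and (3) of the abstract more concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not specify kernel size, padding, layer depth, activations, or how the modules are combined, so every such choice here is an assumption: the function names `neighbor_conv` and `axial_mlp`, the per-channel (depthwise) kernel, edge padding, the single linear map plus `tanh` per axis, and the residual sum are all hypothetical.

```python
import numpy as np


def neighbor_conv(x, kernel):
    """Per-channel convolution along the time axis (assumed form).

    x:      (T, J, C) features for T frames, J joints, C channels.
    kernel: (K, C) weights over K neighboring frames, K odd.
    Returns an array of shape (T, J, C).
    """
    T = x.shape[0]
    K = kernel.shape[0]
    pad = K // 2
    # Repeat the first and last frame so the output keeps T frames.
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    for k in range(K):
        # Frame t receives frames t-pad .. t+pad, weighted per channel.
        out += xp[k:k + T] * kernel[k]
    return out


def axial_mlp(x, w_time, w_joint):
    """Transform the time axis and the joint axis in separate branches.

    w_time:  (T, T) mixes frames only; joints and channels are untouched.
    w_joint: (J, J) mixes joints only; frames and channels are untouched.
    A single linear map plus tanh stands in for each per-axis MLP.
    """
    t_branch = np.tanh(np.einsum("st,tjc->sjc", w_time, x))
    j_branch = np.tanh(np.einsum("ij,tjc->tic", w_joint, x))
    # Residual sum of the two axis-specific branches (an assumption).
    return x + t_branch + j_branch


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, J, C = 9, 17, 8  # illustrative sizes only
    feats = rng.standard_normal((T, J, C))
    local = neighbor_conv(feats, rng.standard_normal((3, C)) * 0.1)
    mixed = axial_mlp(local,
                      rng.standard_normal((T, T)) * 0.1,
                      rng.standard_normal((J, J)) * 0.1)
    print(mixed.shape)
```

The point of the sketch is the separation of concerns: `neighbor_conv` only ever combines a frame with its immediate temporal neighbors, while each branch of `axial_mlp` mixes along exactly one axis, so temporal and spatial information are not blended in a single dense layer.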
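Citation: Pan Shuaijie, Zhi Yu, Chen Ang. 3D Human Pose Estimation Method in Video Sequences Based on Neighbor Convolution Transformer [J]. Computer Science and Application, 2026, 16(2): 201-213. https://doi.org/10.12677/csa.2026.162052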
