Robot Grasping Slip Detection Based on a Cross-Modal Spatiotemporal Cross-Attention Mechanism
Abstract: Slip detection is a key task in robotics. Robots need to use multimodal information for feature extraction, information fusion and interaction, and dexterous manipulation. To this end, a multimodal fusion model based on a cross-modal spatiotemporal cross-attention mechanism is proposed for slip detection. The model uses spatiotemporal attention to learn the physical features reflected in the feedback of multimodal sensors, then fuses the learned visual and tactile spatiotemporal features through cross-modal cross-attention. Finally, a multilayer perceptron (MLP) predicts the slip detection result. Data collection, model training, and validation are carried out with a 7-DOF XArm robotic arm, a D455 camera, and XELA tactile sensors. The results show that the model reaches a slip detection accuracy of 97.8%, which indicates that it has research and practical value for the reliable, smooth execution of robotic grasping tasks.
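The fusion step named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the feature dimension, the number of tokens, the random (untrained) weights, the mean pooling, and every function and parameter name below are assumptions chosen only to show how visual features can attend to tactile features, and vice versa, before an MLP head produces slip / no-slip probabilities.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Queries come from one modality; keys and values come from the other.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each query row sums to 1
    return weights @ v, weights

def fuse_and_classify(visual, tactile, params):
    # Visual tokens attend to tactile tokens, and tactile to visual.
    v_att, _ = cross_attention(visual, tactile, *params["v2t"])
    t_att, _ = cross_attention(tactile, visual, *params["t2v"])
    # Assumption: mean-pool tokens, then concatenate both directions.
    fused = np.concatenate([v_att.mean(axis=0), t_att.mean(axis=0)])
    hidden = np.maximum(0.0, fused @ params["w1"] + params["b1"])  # ReLU
    logits = hidden @ params["w2"] + params["b2"]
    return softmax(logits)                    # [p(no slip), p(slip)]

# Hypothetical sizes: 8 spatiotemporal tokens per modality, 16-dim features.
rng = np.random.default_rng(0)
d, n_tokens, n_hidden = 16, 8, 32

def proj():
    # Random projection matrix standing in for trained weights.
    return rng.normal(size=(d, d)) / np.sqrt(d)

params = {
    "v2t": (proj(), proj(), proj()),
    "t2v": (proj(), proj(), proj()),
    "w1": rng.normal(size=(2 * d, n_hidden)) / np.sqrt(2 * d),
    "b1": np.zeros(n_hidden),
    "w2": rng.normal(size=(n_hidden, 2)) / np.sqrt(n_hidden),
    "b2": np.zeros(2),
}

visual = rng.normal(size=(n_tokens, d))    # stand-in for visual features
tactile = rng.normal(size=(n_tokens, d))   # stand-in for tactile features

probs = fuse_and_classify(visual, tactile, params)
_, weights = cross_attention(visual, tactile, *params["v2t"])
print(probs)
```

With untrained weights the printed probabilities carry no meaning; the sketch only shows the data flow from per-modality features through bidirectional cross-attention to a two-class MLP output.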
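Cite this article: Gu, X. (2024) Robot Grasping Slip Detection Based on a Cross-Modal Spatiotemporal Cross-Attention Mechanism. Computer Science and Application, 14(4), 1-12. https://doi.org/10.12677/csa.2024.144071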
