A YOLOv5 Model with a Hybrid Attention Mechanism and Its Application to Transparent Product Detection in E-Commerce
Abstract: With the growth of e-commerce, detecting transparent or highly reflective products (e.g., glass bottles and plastic-packaged cosmetics) has become increasingly important. Because of their optical properties, conventional detectors such as YOLOv5 often suffer from weak feature extraction, imprecise localization, and high miss rates on such objects, which constrains operational efficiency and user experience. To address this, we propose an improved YOLOv5 model. Building on YOLOv5s, we integrate a lightweight hybrid attention mechanism, CLEAR-Attn (Channel-Linear External-SimAM Attention with Residuals), into both the backbone and the neck to improve robustness for transparent product detection. CLEAR-Attn combines External Attention (EA) with SimAM: EA uses two linear projections with double normalization to model an external memory and capture long-range dependencies, while SimAM, applied to the EA output, strengthens responses along object boundaries and specular highlights. A residual connection then produces the final features, stabilizing training and preserving information. We validate the approach on a Trans10K transparent-product subset, using the same training strategy as the baseline. With only a small increase in parameters, the model improves mAP50 by about 5 percentage points over the YOLOv5s baseline, reaching 96%, and remains particularly stable in scenes with strong reflections, low contrast, and small objects. Further application analysis indicates that the technique can be applied to e-commerce image search, product recognition in smart vending cabinets, parcel sorting in automated warehouses, and inventory auditing, improving recognition accuracy and the level of operational automation.
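The data flow described in the abstract (EA with double normalization, followed by SimAM, followed by a residual connection) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the memory size `S = 16`, the random memories `m_k` and `m_v`, and the choice to add the residual from the block input are all assumptions made for illustration. The EA and SimAM steps follow the formulations in the cited EA and SimAM papers, and the convolution layers a YOLOv5 block would wrap around them are omitted.

```python
import numpy as np

def external_attention(x, m_k, m_v, eps=1e-9):
    """External Attention on a flattened feature map.
    x: (N, C) with N = H*W positions; m_k, m_v: (S, C) external memories
    (learned in practice, random here)."""
    logits = x @ m_k.T                                      # (N, S)
    logits = logits - logits.max(axis=0, keepdims=True)     # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)           # softmax over the N positions
    attn = attn / (eps + attn.sum(axis=1, keepdims=True))   # L1 norm over the S memory slots
    return attn @ m_v, attn                                 # output is (N, C)

def simam(x, lam=1e-4):
    """Parameter-free SimAM weighting, computed per channel over positions.
    x: (N, C)."""
    n = x.shape[0] - 1
    d = (x - x.mean(axis=0, keepdims=True)) ** 2
    inv_energy = d / (4.0 * (d.sum(axis=0, keepdims=True) / n + lam)) + 0.5
    return x * (1.0 / (1.0 + np.exp(-inv_energy)))          # sigmoid gate

def clear_attn_sketch(feat, m_k, m_v):
    """Hypothetical wiring: y = x + SimAM(EA(x)). feat: (C, H, W)."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T                            # (N, C)
    ea_out, _ = external_attention(x, m_k, m_v)
    y = x + simam(ea_out)                                   # residual connection
    return y.T.reshape(c, h, w)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))       # toy feature map: C=8, H=W=4
m_k = rng.standard_normal((16, 8)) * 0.1    # assumed memory size S=16
m_v = rng.standard_normal((16, 8)) * 0.1
out = clear_attn_sketch(feat, m_k, m_v)
print(out.shape)                            # (8, 4, 4)
```

Because the external memories are shared across all positions, the cost of EA grows linearly with the number of positions, and SimAM adds no learnable parameters, which is consistent with the abstract's claim of only a small parameter increase.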
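Cite this article: Zhang, W. (2025) A YOLOv5 Model with a Hybrid Attention Mechanism and Its Application to Transparent Product Detection in E-Commerce. E-Commerce Letters (电子商务评论), 14(10), 2045-2053. https://doi.org/10.12677/ecl.2025.14103364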
