YOLO11-Swin: A Target Detection Model for Complex Underwater Environments
Abstract: Underwater object detection is essential to marine resource development and ecological monitoring, but the low contrast, color distortion, and cluttered backgrounds of underwater images make accurate detection difficult. To overcome the limitations of traditional methods in feature extraction and small-object recognition, this paper proposes YOLO11-Swin, a detection model that deeply integrates the Swin Transformer with the YOLO11 architecture. The model uses the Swin Transformer as its backbone feature extractor; its hierarchical design and shifted-window self-attention capture global contextual dependencies and strengthen the representation of blurred and occluded objects. In the feature fusion stage, a Cross-layer Feature Aggregation (CFA) mechanism uses global pooling and adaptive weight calculation to guide information exchange among feature maps of different scales, addressing semantic gaps and scale mismatches in the feature pyramid. In addition, a Convolutional Block Attention Module (CBAM) is embedded at the output of each feature level; its serial channel and spatial attention sub-modules adaptively refine feature responses, highlighting object regions and suppressing background noise. To address the imbalance between positive and negative samples in underwater datasets, the model adopts Focal Loss as the classification loss, which focuses training on hard samples and improves convergence speed and robustness. On the URPC dataset, YOLO11-Swin reaches an mAP@50 of 75.54%, an improvement of 9.42% over the baseline YOLO11 model. In particular, the Average Precision (AP) for small objects (e.g., scallops) improves by 10.16% and Recall increases by 4.55%, supporting the effectiveness of the proposed model in complex underwater environments.
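As an illustration of how Focal Loss concentrates training on hard samples, the following minimal sketch implements the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). It is not the authors' implementation: the function names are hypothetical, and the defaults alpha = 0.25 and gamma = 2.0 are the values commonly used with Focal Loss, assumed here because the abstract does not state the settings used in YOLO11-Swin.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction.

    p: predicted probability of the positive class; y: label (1 or 0).
    """
    p_t = p if y == 1 else 1.0 - p          # probability given to the true class
    return -math.log(min(max(p_t, eps), 1.0))


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one prediction (illustrative sketch).

    alpha and gamma are assumed defaults, not values taken from the paper.
    """
    p_t = p if y == 1 else 1.0 - p          # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)           # clamp to avoid log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) samples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (confident, correct) versus a hard positive (nearly missed)
easy = focal_loss(0.95, 1)
hard = focal_loss(0.10, 1)

# Fraction of the plain cross-entropy loss that each sample keeps
easy_ratio = easy / cross_entropy(0.95, 1)
hard_ratio = hard / cross_entropy(0.10, 1)
print(f"easy sample keeps {easy_ratio:.6f} of its cross-entropy loss")
print(f"hard sample keeps {hard_ratio:.6f} of its cross-entropy loss")
```

With gamma set to 0 and alpha set to 1, the positive-class loss reduces to ordinary cross-entropy, so the modulating factor (1 - p_t)^gamma is the only source of the re-weighting toward hard samples.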
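Citation: Zheng Guanghai, Zhang Qian, Zhang Wei. YOLO11-Swin: A Target Detection Model for Complex Underwater Environments [J]. Computer Science and Application, 2026, 16(1): 374-387. https://doi.org/10.12677/csa.2026.161031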
