Research on Time-Frequency Dual-Domain Attention Fusion in YOLOv12 for Classroom Micro-Expression Recognition
Abstract: In classroom teaching, vision-based recognition of students' facial expressions offers useful guidance for assessing learning states and improving teaching effectiveness. The baseline YOLOv12 algorithm has two limitations on this task: insufficient detection accuracy for small targets, and sensitivity to environmental interference such as low illumination and complex background noise. To address them, this study proposes an improved architecture with multi-dimensional feature enhancement. In the backbone, an MCAM-A2C2f composite module combines a three-dimensional attention mechanism over the channel, height, and width dimensions with dynamic feature aggregation, enabling cross-dimensional feature interaction and reinforcing key information. In the detection head, a DASM-C3K2 hybrid architecture uses a spatial semantic attention (SSA) mechanism to fuse local micro-expression features with global context, and a frequency-domain sensitivity attention (FSA) module to capture high-frequency expression details; together they form a time-frequency dual-domain feature representation. In addition, an adaptive threshold focal loss (ATFL) is introduced in place of the conventional cross-entropy loss. By dynamically adjusting the weights of easy and hard samples, it substantially improves the model's robustness to environmental variation. On a self-built classroom expression dataset, YOLOv12m_MCAM improves mAP by 0.4%, YOLOv12m_DSAM improves detection accuracy by 1.5%, and YOLOv12m_ATFL improves performance by 6.2%. Among the combined variants, YOLOv12m_(MCAM & DSAM) improves by 1.3%, YOLO_(ATFL + MCAM) by 1.7%, YOLO_(ATFL + DSAM) by 0.1%, and YOLOv12m_(ATFL + MCAM & DSAM) by 0.5%.
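The re-weighting of easy and hard samples mentioned in the abstract can be illustrated with a small sketch. The first two functions follow the standard cross-entropy and focal loss definitions (Lin et al., 2017). The third, `batch_threshold_focal_loss`, is a hypothetical simplification written only for illustration: it assumes the threshold is the batch mean of the true-class probability and that the two exponents `gamma_easy` and `gamma_hard` are fixed constants. It is not the ATFL formulation used in the paper.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def cross_entropy(p_t: float) -> float:
    """Cross-entropy for one sample, given the probability p_t of the true class."""
    p_t = min(max(p_t, EPS), 1.0)
    return -math.log(p_t)


def focal_loss(p_t: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Standard focal loss: cross-entropy scaled by alpha * (1 - p_t) ** gamma."""
    p_t = min(max(p_t, EPS), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def batch_threshold_focal_loss(p_ts, gamma_easy: float = 2.0, gamma_hard: float = 0.5) -> float:
    """Hypothetical thresholded variant (an assumption, not the paper's ATFL).

    The threshold is recomputed per batch as the mean of p_t. Samples at or
    above it count as easy and get the larger exponent, so they are
    down-weighted more strongly; samples below it keep more of their loss.
    """
    threshold = sum(p_ts) / len(p_ts)
    losses = []
    for p in p_ts:
        gamma = gamma_easy if p >= threshold else gamma_hard
        losses.append(focal_loss(p, gamma=gamma))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        print(f"p_t={p}: CE={cross_entropy(p):.4f}  FL={focal_loss(p):.4f}")
    print("batch loss:", batch_threshold_focal_loss([0.9, 0.2]))
```

With `gamma = 2`, a well-classified sample (`p_t = 0.9`) keeps only about 1% of its cross-entropy loss, while a poorly classified one (`p_t = 0.1`) keeps about 81%, which is the sense in which hard samples dominate training.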
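Cite this article: Li Jiahui, Xu Zimo, Wen Botao, Liu Wenqiang, Liu Qiulian. Research on Time-Frequency Dual-Domain Attention Fusion in YOLOv12 for Classroom Micro-Expression Recognition [J]. Software Engineering and Applications, 2025, 14(3): 610-622. https://doi.org/10.12677/sea.2025.143053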
