Design of an Experimental Method for a Smart Classroom Perception System Driven by Multimodal Fusion
Abstract: In the experimental teaching of AI-oriented teacher-education programs, training is often confined to a single modality, the technology is disconnected from real educational scenarios, and students' holistic classroom perception skills are insufficiently cultivated. To address these problems, this paper designs and implements a smart classroom perception system based on audio-visual bimodal fusion, together with the corresponding experiments. Centered on the framework of "technology integration, scenario adaptation, capability validation", the experiments integrate Whisper, YOLO26, an improved Vision Transformer (ViT), and a large language model into a full-process experimental system that performs teaching-phase segmentation, individual student identification and tracking, action recognition, and multimodal data fusion, and outputs a comprehensive classroom perception report. Combined with the experimental procedures and validation described earlier, the results show that the proposed scheme is standardized and feasible and overcomes the limitations of traditional experiments: it helps pre-service teachers consolidate core techniques, strengthens their engineering practice skills, and cultivates classroom perception and scenario-adaptation abilities. The system thus provides a practical platform for training versatile teachers for intelligent education and offers a reference for optimizing experimental teaching systems in related disciplines.
Article citation: Li, Y., Ding, X., Zhang, C. and Han, L. (2026) Design of an Experimental Method for a Smart Classroom Perception System Driven by Multimodal Fusion. Advances in Education, 16(4), 1224-1234. https://doi.org/10.12677/ae.2026.164772
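As an illustration of the bimodal pipeline the abstract describes, the following is a minimal Python sketch of how the speech and vision branches might be wired together. It assumes the openai-whisper and ultralytics packages; the model weights, parameters, and the fuse() placeholder are hypothetical stand-ins (Ultralytics YOLO11 weights substitute for YOLO26 here), not the authors' implementation.

```python
# Minimal sketch of an audio-visual classroom perception pipeline.
# Assumes: pip install openai-whisper ultralytics
# Model names, parameters, and fuse() are illustrative placeholders.
import whisper
from ultralytics import YOLO


def transcribe_audio(audio_path: str) -> list:
    """Speech branch: timestamped transcript segments for phase segmentation."""
    asr = whisper.load_model("base")  # larger checkpoints trade speed for accuracy
    result = asr.transcribe(audio_path)
    return result["segments"]  # each segment carries start, end, and text


def detect_and_track(video_path: str):
    """Vision branch: per-frame person detection with persistent track IDs."""
    detector = YOLO("yolo11n.pt")  # hypothetical stand-in for the YOLO26 weights
    # classes=[0] keeps only the COCO "person" class; persist=True keeps IDs
    # stable across frames so each student can be followed over time.
    return detector.track(source=video_path, classes=[0], persist=True)


def fuse(segments, tracks) -> dict:
    """Hypothetical fusion step: align both branches on the shared time axis
    before handing the merged record to an LLM for report generation."""
    return {"speech": segments, "vision": [r.boxes for r in tracks]}
```

The action-recognition (ViT) stage and the LLM report generator would consume the fused record downstream; both are omitted here because they depend on the improved ViT and prompt design specific to the system described in the paper.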
