A Novel Engagement Recognition Network Fusing Facial Appearance and Multi-Behavioral Features
Abstract: In online learning environments, engagement is an important measure of the user's learning experience. Improving the accuracy of engagement recognition helps instructors get timely feedback on their courses and improves that experience. However, most existing video-based engagement recognition methods use only the user's facial appearance. Fine-grained behavioral cues such as head pose, eye gaze, and blink rate are also closely related to engagement, but previous work on engagement recognition has not considered these features together. In this paper, we therefore propose a novel engagement recognition model that combines facial features extracted by a deep residual network (ResNet) with behavioral features captured by OpenFace. These features are fed into a temporal convolutional network (TCN), which analyzes spatio-temporal changes across video frames to recognize the level of engagement. Our model is trained on DAiSEE, a large publicly available engagement detection dataset, and achieves 61.4% top-1 accuracy on four-class engagement classification. The experimental results show that our method outperforms the state-of-the-art engagement recognition methods on DAiSEE.
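The pipeline described in the abstract (per-frame facial and behavioral features, fused and passed through a TCN to a four-class output) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes early fusion by concatenation, causal dilated convolutions in the style of generic TCNs with residual connections omitted, and classification from the last time step. All dimensions, the random weights, and the function names (`fuse_features`, `causal_dilated_conv`, `predict_engagement`) are hypothetical, and random vectors stand in for real ResNet and OpenFace outputs.

```python
import math
import random


def fuse_features(face, behavior):
    """Early fusion (an assumption): concatenate both vectors of every frame."""
    if len(face) != len(behavior):
        raise ValueError("both streams need one feature vector per frame")
    return [f + b for f, b in zip(face, behavior)]


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution over time followed by ReLU.

    x:       T frames, each a list of C_in values
    weights: C_out filters, each K taps of C_in values
    Tap k of the output at frame t reads frame t - k * dilation;
    frames before the start of the clip count as zeros.
    """
    out = []
    for t in range(len(x)):
        row = []
        for taps in weights:
            acc = 0.0
            for k, tap in enumerate(taps):
                src = t - k * dilation
                if src >= 0:
                    acc += sum(w * v for w, v in zip(tap, x[src]))
            row.append(max(acc, 0.0))
        out.append(row)
    return out


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_engagement(face, behavior, conv_layers, classifier):
    """Return probabilities for the four engagement levels."""
    h = fuse_features(face, behavior)
    for weights, dilation in conv_layers:
        h = causal_dilated_conv(h, weights, dilation)
    last = h[-1]  # the final frame sees the longest history
    logits = [sum(w * v for w, v in zip(row, last)) for row in classifier]
    return softmax(logits)


# Hypothetical sizes and random weights, for illustration only.
T, FACE_DIM, BEHAVIOR_DIM, HIDDEN, CLASSES = 16, 8, 5, 6, 4
rng = random.Random(0)


def rand(*shape):
    """Nested lists of small Gaussian values with the given shape."""
    if len(shape) == 1:
        return [rng.gauss(0.0, 0.3) for _ in range(shape[0])]
    return [rand(*shape[1:]) for _ in range(shape[0])]


face = rand(T, FACE_DIM)          # stand-in for per-frame ResNet features
behavior = rand(T, BEHAVIOR_DIM)  # stand-in for per-frame OpenFace features
fused = fuse_features(face, behavior)
conv_layers = [
    (rand(HIDDEN, 2, FACE_DIM + BEHAVIOR_DIM), 1),
    (rand(HIDDEN, 2, HIDDEN), 2),
    (rand(HIDDEN, 2, HIDDEN), 4),
]
classifier = rand(CLASSES, HIDDEN)
probs = predict_engagement(face, behavior, conv_layers, classifier)
print([round(p, 3) for p in probs])
```

With a kernel size of 2 and dilations 1, 2, and 4, the last output frame depends on 1 + (1 + 2 + 4) = 8 input frames, which shows how stacking dilated layers widens the temporal context without increasing the kernel size.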
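Cite this article: Lu Yubo, Zhan Yinwei, Yang Zhuo, Li Xuecong. A Novel Engagement Recognition Network Fusing Facial Appearance and Multi-Behavioral Features[J]. Computer Science and Application, 2022, 12(4): 1163-1174. https://doi.org/10.12677/CSA.2022.124119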
