Research on a Real-Time Action Prediction Method Based on Knowledge Distillation
Abstract: Action recognition is an active topic in computer vision, with applications in human-computer interaction, entertainment, autonomous driving, intelligent video surveillance, and intelligent medical care. Action prediction is a special case of action recognition. Whereas conventional action recognition classifies complete actions, action prediction aims to identify the category of an action as early as possible, before it has been fully executed, so that its likely consequences can be analyzed in support of goals such as accident early warning, intelligent caregiving, and crime prevention. To address real-time action prediction, this paper proposes a multi-stage LSTM method that applies knowledge distillation. The model is a two-stage LSTM that uses both context-aware and action-aware features: the first stage analyzes the action using global, context-aware features, and the second stage combines these context-aware features with action-aware features. To improve prediction at early stages of an action, knowledge distillation transfers knowledge from a teacher model to a student model, and a novel loss function is designed for the overall architecture. Experimental results on the UT-Interaction, JHMDB-21, and UCF-101 datasets show that the proposed method improves action prediction accuracy while running fast enough to meet the real-time requirements of practical applications.
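The abstract does not give the form of the paper's novel loss function, so the following is only a minimal sketch of the standard soft-target formulation of knowledge distillation, not the authors' loss. It assumes a teacher and a student that each output class logits for the same sample. The function names, the temperature `temperature`, and the mixing weight `alpha` are all hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the ground-truth class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a hard-label term and a soft-target term.

    Assumption: this is the generic distillation objective, not the
    loss designed in the paper, whose details the abstract omits.
    """
    # Hard term: the student is compared with the ground-truth label.
    hard = cross_entropy(softmax(student_logits), label)
    # Soft term: the student is pulled toward the teacher's softened output.
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # The temperature**2 factor keeps the soft-term gradients on a comparable scale.
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft
```

When the student's logits match the teacher's, the soft term vanishes and only the weighted hard-label term remains; the loss grows as the two outputs diverge.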
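Cite this article: Wang Xiang (王祥). Research on a Real-Time Action Prediction Method Based on Knowledge Distillation [J]. Computer Science and Application, 2020, 10(5): 927-934. https://doi.org/10.12677/CSA.2020.105095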
