News Text Classification Method Based on Multi-Teacher Knowledge Distillation
DOI: 10.12677/CSA.2023.138150. Supported by the National Natural Science Foundation of China.
Authors: 杜潇鉴*, 吕卫东#, 孙钰华: School of Mathematics and Physics, Lanzhou Jiaotong University, Lanzhou, Gansu
Keywords: Knowledge Distillation, Multiple Teachers, Text Classification, Model Compression
Abstract: Text classification has moved from traditional methods to deep-learning-based methods, and since the introduction of BERT, it and its variants have gradually become the mainstream models in natural language processing. These models, however, require large amounts of memory and computing resources. This paper divides teacher-student networks into two cases according to their structure, isomorphic and heterogeneous, and proposes a different multi-teacher distillation strategy for each. Experiments on the THUCNews dataset show that even when some teachers perform poorly, the classification performance of the student model improves by 3.26% and 3.30% respectively, with a performance loss of 0.79% and 0.78% respectively, indicating that the student comes close to the classification performance of the teacher model. At the same time, the number of parameters is only 2.05% and 2.08% of that of the teacher model, which achieves good model compression.
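The abstract does not spell out the two distillation strategies, so the following is only a minimal sketch of the general idea of multi-teacher knowledge distillation, not the authors' method. It assumes the common soft-label formulation: the teachers' temperature-softened class distributions are combined by a weighted average, and the student is trained on a mix of the KL divergence to that average and the ordinary cross-entropy on the true label. All function names, the weights, `alpha`, and `temperature` are hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def multi_teacher_soft_targets(teacher_logits, weights=None, temperature=2.0):
    """Weighted average of the teachers' softened distributions.

    Assumption: uniform weights unless given; the paper's actual
    weighting scheme is not stated in the abstract.
    """
    n = len(teacher_logits)
    if weights is None:
        weights = [1.0 / n] * n
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0, weights=None):
    """alpha * T^2 * KL(teachers || student) + (1 - alpha) * CE(label)."""
    target = multi_teacher_soft_targets(teacher_logits, weights, temperature)
    student_soft = softmax(student_logits, temperature)
    # KL divergence from the averaged teacher distribution to the student
    kl = sum(t * math.log(t / s)
             for t, s in zip(target, student_soft) if t > 0)
    # Hard-label cross-entropy at temperature 1
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


if __name__ == "__main__":
    teachers = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]]  # made-up logits
    student = [1.0, 0.8, 0.2]
    print(distillation_loss(student, teachers, label=0))
```

Averaging the softened distributions means a single weak teacher is diluted by the others rather than followed directly, which is one plausible reason a student can still benefit when some teachers perform poorly; whether the paper relies on this mechanism is not stated here.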
Cite this article: 杜潇鉴, 吕卫东, 孙钰华. News Text Classification Method Based on Multi-Teacher Knowledge Distillation [J]. Computer Science and Application, 2023, 13(8): 1515-1526. https://doi.org/10.12677/CSA.2023.138150
