A Review on Research Advances of Loss Functions in Knowledge Distillation
Abstract: Knowledge distillation is an efficient technique for model compression and knowledge transfer, and the design of its loss function is one of the core determinants of its performance. The loss function defines the optimization objective that the student model follows when mimicking the teacher model, and the dimension along which knowledge is transferred. This paper systematically surveys research progress on loss functions in knowledge distillation. First, it introduces the classical loss functions based on output responses, such as Kullback-Leibler (KL) divergence and mean squared error. Second, it reviews loss functions based on intermediate-layer feature matching, including attention transfer and hint learning. Third, it summarizes more recent loss functions based on matching relational and structured knowledge, such as similarity-preserving and correlation congruence losses. Finally, it discusses future research trends, identifying adaptive loss combination, task-specific loss customization, and theoretical analysis as important directions. The paper aims to give researchers, and practitioners in particular, a clear reference for selecting and designing loss functions for knowledge distillation.
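Cite this article: Zhao Tongtong. A Review on Research Advances of Loss Functions in Knowledge Distillation [J]. Computer Science and Application, 2026, 16(2): 251-260. https://doi.org/10.12677/csa.2026.162056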
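
As a rough illustration of the output-response losses named in the abstract, the sketch below computes a temperature-softened KL divergence between teacher and student outputs, plus a mean squared error on raw logits. It is a minimal pure-Python sketch, not the formulation used in the paper: the function names, the example logits, and the default temperature of 4.0 are all assumptions made for illustration. The T^2 scaling follows the common convention for softened-target distillation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    The default temperature is an illustrative assumption.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_t, p_s)
        if pt > 0.0
    )
    return temperature ** 2 * kl


def logit_mse_loss(student_logits, teacher_logits):
    """Mean squared error between student and teacher logits."""
    n = len(teacher_logits)
    return sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n


# Hypothetical logits for a single three-class example.
teacher = [5.0, 2.0, -1.0]
student = [3.0, 2.5, 0.0]

print("KL distillation loss:", kd_kl_loss(student, teacher))
print("Logit MSE loss:", logit_mse_loss(student, teacher))
```

In practice such a term is usually combined with the ordinary cross-entropy on ground-truth labels through a weighting coefficient; how that weight and the temperature are chosen depends on the task and is not specified here.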
