|
[1]
|
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., et al. (2025) DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature, 645, 633-638. [Google Scholar] [CrossRef]
|
|
[2]
|
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021) An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. 9th International Conference on Learning Representations, ICLR 2021, 3-7 May 2021, 611-631. https://openreview.net/forum?id=YicbFdNTTy
|
|
[3]
|
Buciluǎ, C., Caruana, R. and Niculescu-Mizil, A. (2006) Model Compression. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, 20-23 August 2006, 535-541. [Google Scholar] [CrossRef]
|
|
[4]
|
Hinton, G., Vinyals, O. and Dean, J. (2015) Distilling the Knowledge in a Neural Network.
|
|
[5]
|
Romero, A., Ballas, N., Kahou, S.E., et al. (2015) FitNets: Hints for Thin Deep Nets. https://arxiv.org/abs/1412.6550
|
|
[6]
|
Zagoruyko, S. and Komodakis, N. (2017) Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. 5th International Conference on Learning Representations, ICLR 2017, Toulon, 24-26 April 2017, 1489-1501. https://openreview.net/forum?id=Sks9_ajex
|
|
[7]
|
Tung, F. and Mori, G. (2019) Similarity-Preserving Knowledge Distillation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, 27-28 October 2019, 1365-1374. [Google Scholar] [CrossRef]
|
|
[8]
|
Hu, J., Shen, L. and Sun, G. (2018) Squeeze-and-Excitation Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-22 June 2018, 7132-7141. [Google Scholar] [CrossRef]
|
|
[9]
|
Zhou, Z., Zhuge, C., Guan, X., et al. (2020) Channel Distillation: Channel-Wise Attention for Knowledge Distillation. https://arxiv.org/abs/2006.01683
|
|
[10]
|
Yim, J., Joo, D., Bae, J. and Kim, J. (2017) A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 21-26 July 2017, 7130-7138. [Google Scholar] [CrossRef]
|
|
[11]
|
Peng, B., Jin, X., Li, D., Zhou, S., Wu, Y., Liu, J., et al. (2019) Correlation Congruence for Knowledge Distillation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, 27-28 October 2019, 5006-5015. [Google Scholar] [CrossRef]
|
|
[12]
|
Wang, K., Vicol, P., Lucas, J., et al. (2018) Adversarial Distillation of Bayesian Neural Network Posteriors. https://arxiv.org/abs/1806.10317
|
|
[13]
|
Tian, Y., Krishnan, D. and Isola, P. (2020) Contrastive Representation Distillation. ICLR. https://openreview.net/forum?id=SkgpBJrtvS
|
|
[14]
|
Zhao, B., Cui, Q., Song, R., Qiu, Y. and Liang, J. (2022) Decoupled Knowledge Distillation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, 18-24 June 2022, 11943-11952. [Google Scholar] [CrossRef]
|
|
[15]
|
Sun, S., Ren, W., Li, J., Wang, R. and Cao, X. (2024) Logit Standardization in Knowledge Distillation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 16-22 June 2024, 15731-15740. [Google Scholar] [CrossRef]
|
|
[16]
|
Touvron, H., Cord, M., Douze, M., et al. (2020) Training Data-Efficient Image Transformers & Distillation through Attention. https://arxiv.org/abs/2012.12877
|
|
[17]
|
Yang, Z., Li, Z., Zeng, A., Li, Z., Yuan, C. and Li, Y. (2024) ViTKD: Feature-Based Knowledge Distillation for Vision Transformers. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, 16-22 June 2024, 1379-1388. [Google Scholar] [CrossRef]
|
|
[18]
|
Agarwal, R., Vieillard, N., Zhou, Y., et al. (2024) On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, 7-11 May 2024, 9249-9266. https://openreview.net/forum?id=3zKtaqxLhW
|
|
[19]
|
Wang, W., Wei, F., Dong, L., et al. (2020) MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. https://arxiv.org/abs/2002.10957
|
|
[20]
|
Chi, Z., Zheng, T., Li, H., et al. (2023) NormKD: Normalized Logits for Knowledge Distillation. https://arxiv.org/abs/2308.00520
|
|
[21]
|
Zhang, W., Liu, D., Cai, W. and Ma, C. (2024) Cross-View Consistency Regularisation for Knowledge Distillation. Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, 28 October-1 November 2024, 2011-2020. [Google Scholar] [CrossRef]
|