Enhancing the Generalization Ability of Learning-Based Optimizers through GRU and Multi-Head Attention Mechanisms
Abstract: In recent years, interest in using machine learning, and deep learning in particular, to solve mathematical problems has grown steadily. Learning to Optimize, an approach that uses deep learning to solve optimization problems, has attracted increasing attention. Most current work still relies on an LSTM model alone. Although an LSTM captures historical information effectively, it does not model the interaction between pieces of information sufficiently. We therefore apply a multi-head attention mechanism to the hidden-layer outputs to strengthen the fusion of information, and we replace the LSTM with a lightweight GRU, so that the number of model parameters is actually reduced. Experimental results show that the resulting algorithm not only converges faster but also exhibits strong generalization ability.
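To illustrate why swapping the LSTM for a GRU can lower the parameter count, the sketch below counts the trainable parameters of a single recurrent layer under a simplified convention (one weight matrix over the concatenated input and hidden state, plus one bias vector, per gate). The function names and the sizes `input_size=2`, `hidden_size=20` are hypothetical choices for illustration, not values taken from the paper, and the parameters added by the multi-head attention layer depend on its configuration and are not counted here.

```python
def gate_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one gate: weights over [input, hidden] plus one bias vector."""
    return hidden_size * (input_size + hidden_size) + hidden_size


def lstm_params(input_size: int, hidden_size: int) -> int:
    """An LSTM layer has four such blocks: input, forget, output gates and the cell candidate."""
    return 4 * gate_params(input_size, hidden_size)


def gru_params(input_size: int, hidden_size: int) -> int:
    """A GRU layer has three such blocks: reset gate, update gate and the candidate state."""
    return 3 * gate_params(input_size, hidden_size)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    d, h = 2, 20
    print("LSTM parameters:", lstm_params(d, h))
    print("GRU parameters: ", gru_params(d, h))
```

Under this convention a GRU layer always has three quarters of the parameters of an LSTM layer of the same size. Some frameworks store two bias vectors per gate, which changes the totals but not the 3:4 ratio.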
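Cite this article: Liu Xiang. Enhancing the Generalization Ability of Learning-Based Optimizers through GRU and Multi-Head Attention Mechanisms [J]. Computer Science and Application, 2026, 16(1): 205-214. https://doi.org/10.12677/csa.2026.161017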
