Ada_Nesterov Momentum Algorithm: The Nesterov Momentum Algorithm with Adaptive Learning Rate
Abstract: Nesterov momentum improves the descent direction of gradient descent, but it applies a single learning rate to all parameters, and that rate must be set by hand. Adadelta adapts the learning rate automatically and gives each parameter dimension its own rate. We therefore first derive a per-dimension learning rate formula from Adadelta, then substitute it into Nesterov momentum, which yields the Ada_Nesterov momentum algorithm. Two experiments were designed to validate the proposed algorithm. The results show that, with the momentum parameter set to 0.5, Ada_Nesterov momentum on the VggNet_16 architecture with the CIFAR_100 dataset achieved the highest validation accuracy, the lowest loss, and the fastest convergence. Ada_Nesterov momentum thus improves on Nesterov momentum by adding an adaptive learning rate.
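The two-step construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's formula: the abstract does not give the derived learning rate, so the sketch assumes the standard Adadelta ratio RMS[step] / RMS[gradient] as the per-dimension rate and the common look-ahead form of Nesterov momentum. The names `ada_nesterov_step`, `minimize`, `quadratic_loss`, and `quadratic_grad`, the defaults `rho=0.95` and `eps=1e-6`, and the toy quadratic objective are all hypothetical.

```python
import math


def ada_nesterov_step(theta, velocity, avg_sq_grad, avg_sq_step, grad_fn,
                      momentum=0.5, rho=0.95, eps=1e-6):
    """One hypothetical Ada_Nesterov update; all lists are modified in place."""
    # Nesterov look-ahead: evaluate the gradient at theta + momentum * velocity.
    lookahead = [t + momentum * v for t, v in zip(theta, velocity)]
    grad = grad_fn(lookahead)
    for i, g in enumerate(grad):
        # Running average of squared gradients, as in Adadelta.
        avg_sq_grad[i] = rho * avg_sq_grad[i] + (1.0 - rho) * g * g
        # Assumed per-dimension learning rate: RMS[step] / RMS[gradient].
        lr = math.sqrt(avg_sq_step[i] + eps) / math.sqrt(avg_sq_grad[i] + eps)
        step = lr * g
        # Running average of squared gradient steps. Assumption: the momentum
        # term is left out of this accumulator.
        avg_sq_step[i] = rho * avg_sq_step[i] + (1.0 - rho) * step * step
        # Momentum update with the adaptive step replacing a fixed learning rate.
        velocity[i] = momentum * velocity[i] - step
        theta[i] += velocity[i]


def quadratic_loss(p):
    """Toy objective (not from the paper): x^2 + 10 * y^2."""
    return p[0] ** 2 + 10.0 * p[1] ** 2


def quadratic_grad(p):
    """Gradient of the toy objective."""
    return [2.0 * p[0], 20.0 * p[1]]


def minimize(start, steps=1000, momentum=0.5):
    """Run the sketch on the toy objective and return the final parameters."""
    theta = list(start)
    n = len(theta)
    velocity = [0.0] * n
    avg_sq_grad = [0.0] * n
    avg_sq_step = [0.0] * n
    for _ in range(steps):
        ada_nesterov_step(theta, velocity, avg_sq_grad, avg_sq_step,
                          quadratic_grad, momentum=momentum)
    return theta


if __name__ == "__main__":
    start = [3.0, -2.0]
    end = minimize(start)
    print("loss before:", quadratic_loss(start))
    print("loss after: ", quadratic_loss(end))
```

No global learning rate appears anywhere in the sketch; only the momentum parameter (0.5 here, matching the value reported in the abstract) and the Adadelta-style constants remain to be chosen.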
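Cite this article: Xibin Jia, Jiashuai Shi. Ada_Nesterov Momentum Algorithm: The Nesterov Momentum Algorithm with Adaptive Learning Rate [J]. Computer Science and Application, 2019, 9(2): 351-358. https://doi.org/10.12677/CSA.2019.92040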
