An Adaptive and Momental Bound Method for Stochastic Variance Reduced Gradient
DOI: 10.12677/AAM.2022.1112952
Authors: 朱桂勇, Xingzhi College, Zhejiang Normal University, Jinhua, Zhejiang; 雷家林, Zhejiang Normal University, Jinhua, Zhejiang
Keywords: Machine Learning, Adaptive Learning Rate, Momental Bound, Mini-Batch, Stochastic Variance Reduced Gradient (SVRG)
Abstract: The training of many machine learning models reduces to solving an optimization problem, and the convergence speed of the classical gradient descent algorithm and its variants is often unsatisfactory. Building on the stochastic variance reduced gradient (SVRG) algorithm, this paper combines an adaptive learning rate with a momental bound and adopts a batch strategy: a large batch of samples replaces the full dataset when computing the global (snapshot) gradient, while a small mini-batch is used for the iterative parameter updates. This yields a batch adaptive variance reduction algorithm with a momental bound, whose convergence is also discussed. Numerical experiments on MNIST verify the effectiveness of the algorithm and examine how the experimental parameters affect its stability.
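To make the procedure described in the abstract concrete, below is a minimal Python/NumPy sketch of one plausible reading of the method: SVRG-style variance-reduced mini-batch gradients, a large-batch snapshot gradient in place of the full dataset, Adam-style adaptive moments, and an AdaMod-style momental bound on the step size. The function name madam_svrg, the grad_f callback, and all hyper-parameter defaults are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

def madam_svrg(grad_f, w0, data_idx, n_outer=20, n_inner=100,
               large_batch=1024, mini_batch=32, alpha=1e-3,
               beta1=0.9, beta2=0.999, beta3=0.999, eps=1e-8, seed=0):
    # Sketch only: grad_f(w, idx) must return the average gradient of the
    # losses indexed by idx at parameters w, as an array shaped like w.
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)        # first moment of the variance-reduced gradient
    s = np.zeros_like(w)        # second moment
    eta_bar = np.zeros_like(w)  # running average of step sizes (momental bound)
    t = 0
    for _ in range(n_outer):
        # A large batch stands in for the full dataset in the snapshot gradient.
        idx_large = rng.choice(data_idx, size=large_batch, replace=False)
        w_snap = w.copy()
        g_snap = grad_f(w_snap, idx_large)
        for _ in range(n_inner):
            t += 1
            idx_small = rng.choice(data_idx, size=mini_batch, replace=False)
            # SVRG-style variance-reduced gradient on the mini-batch.
            v = grad_f(w, idx_small) - grad_f(w_snap, idx_small) + g_snap
            # Adam-style adaptive moments with bias correction.
            m = beta1 * m + (1 - beta1) * v
            s = beta2 * s + (1 - beta2) * v ** 2
            m_hat = m / (1 - beta1 ** t)
            s_hat = s / (1 - beta2 ** t)
            eta = alpha / (np.sqrt(s_hat) + eps)
            # Momental bound: clip the adaptive step size by its moving average.
            eta_bar = beta3 * eta_bar + (1 - beta3) * eta
            eta = np.minimum(eta, eta_bar)
            w = w - eta * m_hat
    return w

In use, grad_f would wrap the model's loss gradients, for example the cross-entropy loss of a classifier on MNIST mini-batches; only the snapshot gradient touches the large batch, so the per-iteration cost stays close to that of plain mini-batch SGD.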
Citation: 朱桂勇, 雷家林 (2022) An Adaptive and Momental Bound Method for Stochastic Variance Reduced Gradient. Advances in Applied Mathematics, 11(12), 9026-9038. https://doi.org/10.12677/AAM.2022.1112952
