[1] Bottou, L. (2010) Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier, Y. and Saporta, G., Eds., Proceedings of COMPSTAT’2010, Physica-Verlag HD, 177-186. https://doi.org/10.1007/978-3-7908-2604-3_16
[2] Ruder, S. (2016) An Overview of Gradient Descent Optimization Algorithms. arXiv: 1609.04747
[3] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998) Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86, 2278-2324. https://doi.org/10.1109/5.726791
[4] Bubeck, S. (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8, 231-357. https://doi.org/10.1561/2200000050
[5] Bottou, L., Curtis, F.E. and Nocedal, J. (2018) Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60, 223-311. https://doi.org/10.1137/16M1080173
[6] Nesterov, Y. (2013) Gradient Methods for Minimizing Composite Functions. Mathematical Programming, 140, 125-161. https://doi.org/10.1007/s10107-012-0629-5
[7] Robbins, H. and Monro, S. (1951) A Stochastic Approximation Method. Annals of Mathematical Statistics, 22, 400-407. https://doi.org/10.1214/aoms/1177729586
[8] Li, M., Zhang, T., Chen, Y., et al. (2014) Efficient Mini-Batch Training for Stochastic Optimization. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 24-27 August 2014, 661-670.
[9] Qian, N. (1999) On the Momentum Term in Gradient Descent Learning Algorithms. Neural Networks, 12, 145-151. https://doi.org/10.1016/S0893-6080(98)00116-6
[10] Nesterov, Y. (1983) A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k²). Soviet Mathematics Doklady, 27, 372-376.
[11] Botev, A., Lever, G. and Barber, D. (2017) Nesterov’s Accelerated Gradient and Momentum as Approximations to Regularised Update Descent. 2017 International Joint Conference on Neural Networks, Anchorage, AK, 14-19 May 2017, 1899-1903.
[12] Duchi, J., Hazan, E. and Singer, Y. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159.
[13] Tieleman, T. and Hinton, G. (2012) Lecture 6.5-RMSProp: Divide the Gradient by a Running Average of Its Recent Magnitude. COURSERA: Neural Networks for Machine Learning, 4, 26-30.
[14] Kingma, D. and Ba, J. (2015) Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations, San Diego, 7-9 May 2015, 1-13.
[15] Ding, J., Ren, X., Luo, R., et al. (2019) An Adaptive and Momental Bound Method for Stochastic Learning. arXiv: 1910.12249
[16] Schmidt, M., Le Roux, N. and Bach, F. (2017) Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming, 162, 83-112. https://doi.org/10.1007/s10107-016-1030-6
[17] Shalev-Shwartz, S. and Zhang, T. (2013) Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. Journal of Machine Learning Research, 14, 567-599.
[18] Johnson, R. and Zhang, T. (2013) Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction. Proceedings of the 27th International Conference on Neural Information Processing Systems, Lake Tahoe, 5-10 December 2013, 315-323.
[19] Babanezhad, R., Ahmed, M.O., Virani, A., et al. (2015) Stop Wasting My Gradients: Practical SVRG. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2, 2251-2259.
[20] Qiu, X.P. (2020) Neural Networks and Deep Learning. China Machine Press, Beijing, 160-169.