An Overview of Deep Learning Optimization Methods and Learning Rate Attenuation Methods
DOI: 10.12677/HJDM.2018.84020 (Supported by the National Natural Science Foundation of China)
Authors: 冯宇旭*, 李裕梅 (School of Science, Beijing Technology and Business University, Beijing)
Keywords: Deep Learning, Optimizer, Gradient Descent, Adagrad, RMSProp, Adadelta, Adam, Learning Rate Decay
Abstract: Deep learning is now one of the core techniques in machine learning; it has matured in image recognition, machine translation, natural language processing and other fields and has produced strong results. This paper traces the development of optimizers for deep learning models, introducing the commonly used methods of gradient descent, gradient descent with momentum, Adagrad, RMSProp, Adadelta, Adam, Nadam and ANGD, and summarizes the learning rate decay schedules of piecewise constant decay, polynomial decay, exponential decay, natural exponential decay, cosine decay, linear cosine decay and noisy linear cosine decay. It also discusses the open problems of deep learning at the current stage and its future development trends, providing researchers new to deep learning with relatively complete optimization study material and literature support.
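The abstract above only names the optimizers and decay schedules covered by the survey; as a quick orientation for readers new to the topic, the short NumPy sketch below illustrates the textbook update rules for plain gradient descent, gradient descent with momentum, and Adam, together with one common form of exponential learning-rate decay. It is not code from the paper, and the function names and default hyperparameters (lr, mu, beta1, beta2, eps, decay_rate) are illustrative assumptions, not prescribed values.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain gradient descent: move the parameters against the gradient.
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, mu=0.9):
    # Momentum: accumulate an exponentially weighted velocity of past gradients,
    # then move the parameters along that velocity.
    v = mu * v - lr * grad
    return w + v, v

def adam_step(w, grad, m, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: bias-corrected estimates of the first moment (m) and second moment (s)
    # of the gradient give a per-parameter adaptive step size. t counts steps from 1.
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s

def exponential_decay(lr0, step, decay_rate=0.96, decay_steps=1000):
    # One common exponential decay schedule: lr = lr0 * decay_rate ** (step / decay_steps).
    return lr0 * decay_rate ** (step / decay_steps)
```

In a training loop, one of the step functions would be called once per mini-batch with the current gradient, while the schedule function recomputes the learning rate from the global step before each update.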
Citation: Feng, Y.X. and Li, Y.M. (2018) An Overview of Deep Learning Optimization Methods and Learning Rate Attenuation Methods. Hans Journal of Data Mining, 8, 186-200. https://doi.org/10.12677/HJDM.2018.84020
