|
[1]
|
Cauchy, A. (1847) Méthode générale pour la résolution des systèmes d’équations simultanées. Comp. Rend. Sci. Paris, 25, 536-538.
|
|
[2]
|
Bottou, L. (1998) Online Learning and Stochastic Approximations. On-Line Learning in Neural Networks, 17, 142.
|
|
[3]
|
Sutton, R.S. (1986) Two Problems with Backpropagation and Other Steepest-Descent Learning Procedures for Networks. In: Proceedings of the 8th Annual Conference of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 823-831.
|
|
[4]
|
Levenberg, K. (1944) A Method for the Solution of Certain Non-Linear Problems in Least Squares. Quarterly of Applied Mathematics, 2, 164-168.
|
|
[5]
|
Marquardt, D.W. (1963) An Algorithm for Least-Squares Estimation of Nonlinear Parameters. Journal of the Society for Industrial and Applied Mathematics, 11, 431-441.
|
|
[6]
|
Møller, M.F. (1993) Efficient Training of Feed-Forward Neural Networks. DAIMI Report Series, 22, No. 464.
|
|
[7]
|
Le Roux, N., Bengio, Y. and Fitzgibbon, A. (2011) 15. Improving First and Second-Order Methods by Modeling Uncertainty. In: Sra, S., Nowozin, S. and Wright, S.J., Eds., Optimization for Machine Learning, The MIT Press, Cambridge, MA, 403.
|
|
[8]
|
LeCun, Y., Boser, B., Denker, J.S., et al. (1989) Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1, 541-551.
|
|
[9]
|
LeCun, Y.A., Bottou, L., Orr, G.B., et al. (2012) Efficient Backprop. In: Neural Networks: Tricks of the Trade, Springer Berlin Heidelberg, 9-48.
|
|
[10]
|
Nesterov, Y. (1983) A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k^2). Soviet Mathematics Doklady, 27, 372-376.
|
|
[11]
|
Nesterov, Y. (2013) Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media.
|
|
[12]
|
Sutskever, I., Martens, J., Dahl, G., et al. (2013) On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 28, 1139-1147.
|
|
[13]
|
Jacobs, R.A. (1988) Increased Rates of Convergence through Learning Rate Adaptation. Neural Networks, 1, 295-307.
|
|
[14]
|
Duchi, J., Hazan, E. and Singer, Y. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159.
|
|
[15]
|
Dean, J., Corrado, G., Monga, R., et al. (2012) Large Scale Distributed Deep Networks. Proceedings of the 25th International Conference on Neural Information Processing Systems, 1, 1223-1231.
|
|
[16]
|
Pennington, J., Socher, R. and Manning, C.D. (2014) GloVe: Global Vectors for Word Representation. EMNLP, 14, 1532-1543.
|
|
[17]
|
Zeiler, M.D. (2012) ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
|
|
[18]
|
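Zhang, H. (2018) Research and Improvement of Optimization Algorithms in Deep Learning. Master's Thesis, Beijing University of Posts and Telecommunications, Beijing. (In Chinese)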
|
|
[19]
|
Glorot, X., Bordes, A. and Bengio, Y. (2011) Deep Sparse Rectifier Neural Networks. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, PMLR 15, 315-323.
|