[1] Kawaguchi, K. and Lu, H.H. (2020) Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 108.
[2] Shalev-Shwartz, S. and Ben-David, S. (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9781107298019
[3] Taheri, H., Pedarsani, R. and Thrampoulidis, C. (2021) Fundamental Limits of Ridge-Regularized Empirical Risk Minimization in High Dimensions. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 130.
[4] Shalev-Shwartz, S. and Srebro, N. (2008) SVM Optimization: Inverse Dependence on Training Set Size. Proceedings of the 25th International Conference on Machine Learning, Helsinki, 5-9 June 2008. https://doi.org/10.1145/1390156.1390273
[5] Bottou, L. (2010) Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier, Y. and Saporta, G., Eds., Proceedings of COMPSTAT'2010, Physica-Verlag HD, 177-186. https://doi.org/10.1007/978-3-7908-2604-3_16
[6] Bottou, L., Curtis, F.E. and Nocedal, J. (2018) Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60, 223-311. https://doi.org/10.1137/16M1080173
[7] Mokhtari, A. and Ribeiro, A. (2013) A Dual Stochastic DFP Algorithm for Optimal Resource Allocation in Wireless Systems. 2013 IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Darmstadt, 16-19 June 2013, 21-25. https://doi.org/10.1109/SPAWC.2013.6612004
[8] Couillard, O. (2020) Fast and Flexible Optimization of Power Allocation in Wireless Communication Systems Using Neural Networks. McGill University, Montreal, Canada.
[9] Robbins, H. and Monro, S. (1951) A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22, 400-407. https://doi.org/10.1214/aoms/1177729586
[10] Le Roux, N., Schmidt, M. and Bach, F. (2012) A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets. arXiv preprint arXiv:1202.6258.
[11] Schmidt, M., Le Roux, N. and Bach, F. (2017) Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming, 162, 83-112. https://doi.org/10.1007/s10107-016-1030-6
[12] Defazio, A., Bach, F. and Lacoste-Julien, S. (2014) SAGA: A Fast Incremental Gradient Method with Support for Non-Strongly Convex Composite Objectives. In: Advances in Neural Information Processing Systems 27 (NIPS 2014).
[13] Kulunchakov, A. (2020) Stochastic Optimization for Large-Scale Machine Learning: Variance Reduction and Acceleration. Grenoble Alpes University, France.
[14] Wang, C., et al. (2013) Variance Reduction for Stochastic Gradient Optimization. In: Advances in Neural Information Processing Systems 26 (NIPS 2013).
[15] Shen, Z., et al. (2016) Adaptive Variance Reducing for Stochastic Gradient Descent. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, July 2016, 1990-1996.
[16] Konečný, J. and Richtárik, P. (2017) Semi-Stochastic Gradient Descent Methods. Frontiers in Applied Mathematics and Statistics, 3, Article 9. https://doi.org/10.3389/fams.2017.00009
[17] Shang, F., et al. (2021) Efficient Asynchronous Semi-Stochastic Block Coordinate Descent Methods for Large-Scale SVD. IEEE Access, 9, 126159-126171. https://doi.org/10.1109/ACCESS.2021.3094282
[18] Duchi, J., Hazan, E. and Singer, Y. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159.
[19] Tieleman, T. and Hinton, G. (2012) Lecture 6.5-rmsprop: Divide the Gradient by a Running Average of Its Recent Magnitude. COURSERA: Neural Networks for Machine Learning, 4, 26-31.
[20] Kingma, D.P. and Ba, J. (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
[21] Mokhtari, A. and Ribeiro, A. (2013) Regularized Stochastic BFGS Algorithm. 2013 IEEE Global Conference on Signal and Information Processing, Austin, TX, 3-5 December 2013, 1109-1112. https://doi.org/10.1109/GlobalSIP.2013.6737088
[22] Byrd, R.H., et al. (2016) A Stochastic Quasi-Newton Method for Large-Scale Optimization. SIAM Journal on Optimization, 26, 1008-1031. https://doi.org/10.1137/140954362
[23] Liu, D.C. and Nocedal, J. (1989) On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 45, 503-528. https://doi.org/10.1007/BF01589116
[24] Moritz, P., Nishihara, R. and Jordan, M. (2016) A Linearly-Convergent Stochastic L-BFGS Algorithm. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 51.
[25] Gower, R., Goldfarb, D. and Richtárik, P. (2016) Stochastic Block BFGS: Squeezing More Curvature Out of Data. International Conference on Machine Learning, New York, June 2016.
[26] Fletcher, R. and Reeves, C.M. (1964) Function Minimization by Conjugate Gradients. The Computer Journal, 7, 149-154. https://doi.org/10.1093/comjnl/7.2.149
[27] Andrei, N. (2013) On Three-Term Conjugate Gradient Algorithms for Unconstrained Optimization. Applied Mathematics and Computation, 219, 6316-6327. https://doi.org/10.1016/j.amc.2012.11.097
[28] Yao, S.W., et al. (2020) A Class of Globally Convergent Three-Term Dai-Liao Conjugate Gradient Methods. Applied Numerical Mathematics, 151, 354-366. https://doi.org/10.1016/j.apnum.2019.12.026
[29] Dai, Y.H. and Liao, L.Z. (2001) New Conjugacy Conditions and Related Nonlinear Conjugate Gradient Methods. Applied Mathematics and Optimization, 43, 87-101. https://doi.org/10.1007/s002450010019
[30] Babaie-Kafaki, S. and Ghanbari, R. (2014) A Descent Family of Dai-Liao Conjugate Gradient Methods. Optimization Methods and Software, 29, 583-591. https://doi.org/10.1080/10556788.2013.833199
[31] Andrei, N. (2015) A New Three-Term Conjugate Gradient Algorithm for Unconstrained Optimization. Numerical Algorithms, 68, 305-321. https://doi.org/10.1007/s11075-014-9845-9
[32] Yao, S.W., et al. (2020) A Class of Globally Convergent Three-Term Dai-Liao Conjugate Gradient Methods. Applied Numerical Mathematics, 151, 354-366. https://doi.org/10.1016/j.apnum.2019.12.026
[33] Powell, M.J.D. (1977) Restart Procedures for the Conjugate Gradient Method. Mathematical Programming, 12, 241-254. https://doi.org/10.1007/BF01593790
[34] Jiang, X.Z., et al. (2021) An Improved Polak-Ribière-Polyak Conjugate Gradient Method with an Efficient Restart Direction. Computational and Applied Mathematics, 40, Article No. 174. https://doi.org/10.1007/s40314-021-01557-9
[35] Zoutendijk, G. (1966) Nonlinear Programming: A Numerical Survey. SIAM Journal on Control, 4, 194-210. https://doi.org/10.1137/0304019