Sliding-Window Based Adaptive Learning Rate Optimization Method
DOI: 10.12677/aam.2026.154173. Supported by a scientific research project grant.
Authors: Jin Weiyi, School of Mathematics and Statistics, Nanning Normal University, Nanning, Guangxi; Lu Sha*, Guangxi Center for Applied Mathematics, Nanning Normal University, Nanning, Guangxi
Keywords: Deep Learning, Adaptive Learning Rate, Sliding Window, Variance Estimation, Gradient Optimization
Abstract: To address parameter update deviation caused by gradient noise and parameter oscillation in high-curvature regions during deep neural network training, this study proposes SW-Adam, an adaptive learning-rate optimization method that responds rapidly to local variations in gradient noise under stochastic optimization. The method combines two coordinated mechanisms: sliding-window variance estimation and variance-aware learning-rate adjustment. On the one hand, the Welford online algorithm assigns a uniform weight of 1/k to the gradients within the window, reducing the response delay of the variance estimate from O(1/(1−β)) under the conventional exponential moving average (EMA), roughly 1000 steps, to O(k), where k is the window length and is typically set to 50. On the other hand, an exponential decay function establishes a negative correlation between the learning rate and the variance estimate, so that the learning rate is automatically decreased in high-noise regions to suppress oscillation and increased in low-noise regions to accelerate convergence. Preliminary numerical experiments on the 60-dimensional noisy rotated Rastrigin function show that SW-Adam reaches a final function value of 130.80, a 77.6% reduction relative to AdamW and an 88.1% reduction relative to Shampoo; the coefficient of variation of the variance estimate is 8.6% lower than Adam's; although SW-Adam's per-step computation time is slightly higher than Adam's, it requires fewer total iterations to reach the same objective value; and a Wilcoxon rank-sum test confirms that the performance differences relative to all baseline algorithms are statistically significant (p < 0.001). Ablation experiments show that each of the three components contributes independently: the sliding-window variance estimation is the primary contributor (performance degrades by 202.3% when it is removed), while the variance-aware exponential decay and learning-rate clipping provide the essential signal transformation and boundary stabilization, respectively. On the CIFAR-10 image classification task with ResNet-18, SW-Adam attains a test accuracy of 93.47% with a generalization gap of 3.21 percentage points, demonstrating its generalization capability on practical deep learning tasks. The sliding-window variance estimation mechanism reflects local gradient noise levels well and works synergistically with the variance-aware learning-rate adjustment strategy, improving convergence performance while maintaining computational efficiency.
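The abstract describes two cooperating mechanisms: a uniform-weight (1/k) variance estimate over a sliding window of recent gradients, maintained with Welford-style online updates, and a learning-rate multiplier that decays exponentially with that estimate and is clipped for stability. The sketch below is a minimal illustration of these two ideas only; the class and function names, the hyperparameters (gamma, lo, hi), and the use of the gradient norm as the windowed statistic are assumptions for illustration, not the paper's exact SW-Adam update rule.

```python
from collections import deque
import math


class WelfordWindowVariance:
    """Sliding-window variance with uniform weight 1/k per sample,
    maintained incrementally with Welford-style add/remove updates."""

    def __init__(self, window_size: int = 50):
        self.k = window_size
        self.window = deque()
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> float:
        # Welford insertion step.
        self.window.append(x)
        n = len(self.window)
        delta = x - self.mean
        self.mean += delta / n
        self.m2 += delta * (x - self.mean)
        # Evict the oldest sample once the window exceeds k (deletion step),
        # so the estimate reacts within O(k) steps rather than O(1/(1 - beta)).
        if n > self.k:
            old = self.window.popleft()
            n -= 1
            old_mean = self.mean
            self.mean = (old_mean * (n + 1) - old) / n
            self.m2 -= (old - old_mean) * (old - self.mean)
        # Uniform 1/k weighting: plain (population) variance over the window.
        return max(self.m2 / len(self.window), 0.0)


def variance_aware_lr(base_lr: float,
                      variance: float,
                      gamma: float = 1.0,
                      lo: float = 0.1,
                      hi: float = 2.0) -> float:
    """Scale the base learning rate by an exponentially decaying function of the
    variance estimate, clipped to [lo, hi]; gamma, lo and hi are assumed values."""
    scale = max(lo, hi * math.exp(-gamma * variance))
    return base_lr * scale
```

In a hypothetical training loop, the per-step gradient norm (or a per-parameter gradient statistic) would be fed to `update`, and the returned variance would rescale an Adam-style step via `variance_aware_lr`; the window length of 50 matches the value quoted in the abstract, while the clipping bounds play the boundary-stabilization role attributed to learning-rate clipping in the ablation study.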
Citation: Jin, W.Y. and Lu, S. (2026) Sliding-Window Based Adaptive Learning Rate Optimization Method. Advances in Applied Mathematics, 15(4), 450-463. https://doi.org/10.12677/aam.2026.154173

References

[1] Qiu, X.P. (2020) Neural Networks and Deep Learning. China Machine Press, Beijing. (In Chinese)
[2] Duchi, J., Hazan, E. and Singer, Y. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159.
[3] Hinton, G., Srivastava, N. and Swersky, K. (2012) Neural Networks for Machine Learning Lecture 6a: Overview of Mini-Batch Gradient Descent. University of Toronto.
[4] Kingma, D.P. and Ba, J. (2015) Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, 7-9 May 2015, 1-15.
[5] Anil, R., Gupta, V., Koren, T., et al. (2021) Scalable Second Order Optimization for Deep Learning.
[6] Chen, X., Dong, X., Hsieh, C., Huang, D., Le, Q.V., Liang, C., et al. (2023) Symbolic Discovery of Optimization Algorithms. Advances in Neural Information Processing Systems, 36, 49205-49233.
[7] Shazeer, N. and Stern, M. (2018) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, 10-15 July 2018, 4596-4604.
[8] Liu, L.Y., Jiang, H.M., He, P.C., et al. (2020) On the Variance of the Adaptive Learning Rate and beyond. Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, 26-30 April 2020, 1-13.
[9] Loshchilov, I. and Hutter, F. (2019) Decoupled Weight Decay Regularization. Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, 6-9 May 2019, 2-15.
[10] Schmidt, R., Schneider, F. and Hennig, P. (2021) Descending through a Crowded Valley: Benchmarking Deep Learning Optimizers. Proceedings of the 38th International Conference on Machine Learning (ICML), Online, 18-24 July 2021, 10-12.
[11] Zarghani, A. and Abedi, S. (2025) Designing Adaptive Algorithms Based on Reinforcement Learning for Dynamic Optimization of Sliding Window Size in Multi-Dimensional Data Streams. arXiv Preprint arXiv:2507.06901.
[12] Simsekli, U., Sagun, L. and Gurbuzbalaban, M. (2019) A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, 9-15 June 2019, 5827-5837.
[13] Reddi, S.J., Kale, S. and Kumar, S. (2018) On the Convergence of Adam and Beyond. Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, 30 April-3 May 2018.
[14] Welford, B.P. (1962) Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 4, 419-420.
[15] Rastrigin, L.A. (1974) Systems of Extremal Control. Nauka, Moscow.