1/t-Polyak Stepsize for the Stochastically Controlled Stochastic Gradient Algorithm

doi:10.12677/aam.2024.133095


Author: Chenchen Liu, School of Science, Hebei University of Technology, Tianjin
Keywords: Finite-Sum Optimization; Stochastic Algorithms; Variance Reduction; 1/t-Band Stepsize

Abstract: Stochastic gradient descent has become a popular algorithm for solving large-scale finite-sum optimization problems. However, the variance of its stochastic gradients causes the iterates to oscillate. The stochastically controlled stochastic gradient (SCSG) algorithm reduces this variance, but it places strong restrictions on the stepsize. To widen the range of admissible stepsizes for SCSG, we combine the 1/t-band stepsize with the Polyak stepsize to obtain the 1/t-Polyak stepsize, and we use it within SCSG to obtain the SCSGP algorithm. We establish linear convergence of SCSGP for strongly convex problems. Numerical experiments on regularized logistic regression show that SCSGP outperforms SCSG and, during the first few effective passes, other stochastic gradient algorithms.

Cite this article: Liu, C.C. (2024) 1/t-Polyak Stepsize for the Stochastically Controlled Stochastic Gradient Algorithm. Advances in Applied Mathematics, 13(3), 1008-1017. https://doi.org/10.12677/aam.2024.133095

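1. Introduction

Consider the finite-sum optimization problem

$\min_{x \in ℜ^{d}} f (x) = \frac{1}{n} \sum_{i = 1}^{n} f_{i} (x)$, (1)

where each component function $f_{i} (x)$ is continuously differentiable and $f (x)$ is assumed to be strongly convex. Many optimization problems in machine learning satisfy these conditions, for example $l_{2}$-regularized logistic regression and $l_{2}$-regularized least-squares regression [1] [2] [3].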

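When the data set is very large, stochastic gradient descent (SGD) [4] is the mainstream algorithm for solving problem (1). It replaces the full gradient with a stochastic gradient and iterates

$x_{t + 1} = x_{t} - η_{t} \nabla f_{i_{t}} (x_{t})$,

where $η_{t} > 0$ is the stepsize and $\nabla f_{i_{t}} (x_{t})$ is the gradient of the component function $f_{i_{t}} (x)$ at $x_{t}$. Because of the variance of the stochastic gradient $\nabla f_{i_{t}} (x_{t})$ about the full gradient $\nabla f (x_{t})$, SGD attains only a sublinear convergence rate even under strong convexity [5]. The stochastic variance reduced gradient (SVRG) algorithm [6] reduces this variance by means of an outer and an inner loop, but it computes a full gradient in every outer loop and runs many inner iterations, which is expensive when the data set is very large. To alleviate this, SCSG [7] draws the number of inner iterations from a geometric distribution and computes only a batch gradient in the outer loop,

$\tilde{g} = \frac{1}{| I_{t} |} \sum_{i \in I_{t}} \nabla f_{i} (\tilde{x})$,

where $I_{t} \subset [n]$, $| I_{t} |$ is the batch size of $I_{t}$, and $\tilde{x}$ is the snapshot point set in the outer loop. In the inner loop, SCSG updates the gradient estimator in the same way as SVRG:

$g_{t} = \nabla f_{i_{t}} (x_{t}) - \nabla f_{i_{t}} (\tilde{x}) + \tilde{g}$.

Under strong convexity, SCSG with a fixed batch size converges linearly to a neighborhood of the solution. SCSG is suited to large-scale ($n \in [10^{4}, 10^{9}]$), low-accuracy ($ε \in [10^{- 4}, 10^{- 2}]$) optimization problems [7] [8] [9], for which it reaches the target accuracy within a small number of effective passes.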
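As a minimal sketch of the estimator $g_{t}$ above, the following Python code builds it for a least-squares problem. The component function $f_{i}(x) = \frac{1}{2}(a_{i}^{T} x - y_{i})^{2}$, the synthetic data, and all function names are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def component_grad(A, y, x, i):
    """Gradient of f_i(x) = 0.5 * (a_i^T x - y_i)^2 (an illustrative choice)."""
    return (A[i] @ x - y[i]) * A[i]

def batch_grad(A, y, x, idx):
    """Average of the component gradients over the index set idx."""
    residual = A[idx] @ x - y[idx]
    return A[idx].T @ residual / len(idx)

def scsg_estimator(A, y, x, snapshot, g_tilde, i):
    """g_t = grad f_i(x) - grad f_i(snapshot) + g_tilde."""
    return component_grad(A, y, x, i) - component_grad(A, y, snapshot, i) + g_tilde

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
y = rng.normal(size=n)
snapshot = rng.normal(size=d)

# Outer loop: batch gradient over a random subset I_t of size 50.
I_t = rng.choice(n, size=50, replace=False)
g_tilde = batch_grad(A, y, snapshot, I_t)

# At the snapshot the two component gradients cancel, so g_t equals g_tilde.
g_at_snapshot = scsg_estimator(A, y, snapshot, snapshot, g_tilde, i=3)

# Averaging g_t over all i gives grad f(x) + e, with e = g_tilde - grad f(snapshot).
x = snapshot + 0.1 * rng.normal(size=d)
all_idx = np.arange(n)
mean_estimator = np.mean(
    [scsg_estimator(A, y, x, snapshot, g_tilde, i) for i in all_idx], axis=0
)
bias = g_tilde - batch_grad(A, y, snapshot, all_idx)
```

The last two quantities illustrate why a batch gradient, unlike the full gradient used by SVRG, leaves a bias term that the analysis has to control.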

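The stepsize is a key factor in the convergence of stochastic gradient algorithms. Very small constant stepsizes and diminishing stepsizes both make the algorithm converge slowly, and tuning a constant stepsize by hand is time-consuming [10] [11] [12]. The Polyak stepsize [13] computes the stepsize automatically from the function values and gradients generated during the iteration, which avoids manual tuning. It is given by

$η_{t} = 2 \frac{f (x_{t}) - f^{*}}{{‖ \nabla f (x_{t}) ‖}^{2}}$,

where $f^{*}$ is the minimum value of $f (x)$. To combine the Polyak stepsize with stochastic gradient algorithms, Loizou et al. [14] proposed the stochastic Polyak stepsize (SPS):

$η_{t} = 2 \frac{f_{i_{t}} (x_{t}) - f_{i_{t}}^{*}}{{‖ \nabla f_{i_{t}} (x_{t}) ‖}^{2}}$,

where $f_{i_{t}}^{*}$ is the minimum value of $f_{i_{t}} (x)$. SGD with the SPS stepsize performs better numerically than SGD with a fixed stepsize. When $f_{i_{t}}^{*}$ is hard to compute, it can be replaced by a lower bound $l_{i_{t}}^{*} \leq f_{i_{t}}^{*}$ [15]. Recently, Wang and Yuan [16] introduced the 1/t-band stepsize, which allows the stepsize to vary within a band:

$\frac{m}{t} \leq η_{t} \leq \frac{M}{t}$, $\forall t \geq 1$,

where $m \leq M$ are positive constants. Clearly, the diminishing stepsize $η_{t} = η_{0} / t$ with $m \leq η_{0} \leq M$ is a special case of the 1/t-band stepsize.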
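The two ingredients can be illustrated with a short Python sketch. The quadratic test function $f(x) = \frac{1}{2}‖x‖^{2}$ and the helper names `polyak_stepsize` and `in_band` are assumptions chosen for illustration; the SPS formula has the same form with $f_{i_{t}}$ in place of $f$.

```python
import numpy as np

def polyak_stepsize(f_val, f_star, grad):
    """eta = 2 * (f(x) - f*) / ||grad f(x)||^2."""
    return 2.0 * (f_val - f_star) / float(np.dot(grad, grad))

def in_band(eta, t, m, M):
    """Check the 1/t-band condition m/t <= eta <= M/t for t >= 1."""
    return m / t <= eta <= M / t

# Illustrative quadratic f(x) = 0.5 * ||x||^2 with minimum value f* = 0.
x = np.array([3.0, -4.0])
f_val = 0.5 * float(x @ x)      # 12.5
grad = x.copy()                 # grad f(x) = x, squared norm 25

eta = polyak_stepsize(f_val, 0.0, grad)   # 2 * 12.5 / 25 = 1.0
x_next = x - eta * grad                   # lands on the minimizer 0

# The diminishing stepsize eta_0 / t lies in the band whenever m <= eta_0 <= M.
eta0, m, M = 0.5, 0.1, 1.0
band_ok = all(in_band(eta0 / t, t, m, M) for t in range(1, 101))
```

On this particular quadratic the Polyak stepsize reaches the minimizer in one step; in general it only adapts the stepsize to the local ratio between the optimality gap and the squared gradient norm.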

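Motivated by the 1/t-band stepsize and the Polyak stepsize, we propose the 1/t-Polyak stepsize and combine it with SCSG to obtain a new algorithm, SCSGP. Under strong convexity and smoothness, SCSGP with varying batch sizes attains a linear convergence rate. Numerical experiments show that SCSGP outperforms SCSG and other stochastic gradient algorithms.

The rest of the paper is organized as follows. Section 2 introduces the 1/t-Polyak stepsize and describes the SCSGP algorithm. Section 3 gives the convergence analysis. Section 4 reports the numerical experiments. Section 5 concludes.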

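2. The 1/t-Polyak Stepsize and the SCSGP Algorithm

First, we combine the stochastic version of the Polyak stepsize with the 1/t-band stepsize to obtain the 1/t-Polyak stepsize:

$η_{t} = \begin{cases} m / t, & {\bar{η}}_{t}^{P} \leq m / t; \\ {\bar{η}}_{t}^{P}, & m / t < {\bar{η}}_{t}^{P} < M / t; \\ M / t, & {\bar{η}}_{t}^{P} \geq M / t, \end{cases}$ (2)

where

${\bar{η}}_{t}^{P} = c_{B} \frac{f_{I_{t}} ({\tilde{x}}_{t - 1}) - l_{I_{t}}^{*}}{{‖ \nabla f_{I_{t}} ({\tilde{x}}_{t - 1}) ‖}^{2}}$,

$l_{I_{t}}^{*}$ is the minimum value of the batch function $f_{I_{t}} (x)$ (or a lower bound on it), and $c_{B}$ is a user-specified constant that adjusts the range of the stepsize. For nonnegative loss functions, the definition of the batch function allows the choice $l_{I_{t}}^{*} = 0$. Combining the 1/t-Polyak stepsize with SCSG gives the SCSGP algorithm; see Algorithm 1.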
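Stepsize (2) is a projection of ${\bar{η}}_{t}^{P}$ onto the interval $[m / t, M / t]$. A minimal Python sketch follows; the function names are hypothetical, and the sketch assumes $0 < m \leq M$ and a nonzero batch gradient.

```python
import numpy as np

def polyak_ratio(f_batch, lower_bound, grad_batch, c_B):
    """bar_eta = c_B * (f_I(x) - l_I) / ||grad f_I(x)||^2 (grad_batch must be nonzero)."""
    return c_B * (f_batch - lower_bound) / float(np.dot(grad_batch, grad_batch))

def one_over_t_polyak(bar_eta, t, m, M):
    """Stepsize (2): clip bar_eta to the band [m/t, M/t], assuming 0 < m <= M."""
    return min(max(bar_eta, m / t), M / t)

# With m = 0.1, M = 1.0 and t = 4 the band is [0.025, 0.25].
small = one_over_t_polyak(0.01, 4, 0.1, 1.0)   # below the band: returns m / t
inside = one_over_t_polyak(0.10, 4, 0.1, 1.0)  # inside the band: returned unchanged
large = one_over_t_polyak(5.00, 4, 0.1, 1.0)   # above the band: returns M / t
```

Because the result always lies in the band, the bounds $m / t \leq η_{t} \leq M / t$ used in the convergence analysis hold regardless of how ${\bar{η}}_{t}^{P}$ behaves.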

In the $t$-th outer loop of SCSGP, the number of inner iterations $N_{t} \sim \mathrm{Geom} (γ_{t})$ is a nonnegative geometric random variable with distribution $P (N_{t} = k) = γ_{t}^{k} (1 - γ_{t})$, $\forall k = 0, 1, \dots$. Note that $γ_{t}$ is fixed in SCSG but varies with the outer iteration in SCSGP. If $γ_{t} = B_{t} / (B_{t} + b_{t})$, then

$Ε_{N_{t} \sim \mathrm{Geom} (γ_{t})} N_{t} = \frac{γ_{t}}{1 - γ_{t}} = \frac{B_{t}}{b_{t}}$,

where $Ε_{N_{t}}$ denotes expectation with respect to $N_{t}$. This property plays an important role in the analysis below. Moreover, $v_{k}^{(t)}$ is a biased estimator of the gradient $\nabla f (x_{k}^{(t)})$:

$Ε_{{\tilde{I}}_{k}} v_{k}^{(t)} = \nabla f (x_{k}^{(t)}) - \nabla f ({\tilde{x}}_{t - 1}) + \nabla f_{I_{t}} ({\tilde{x}}_{t - 1}) = \nabla f (x_{k}^{(t)}) + e_{t}$, (3)

where $e_{t} = \nabla f_{I_{t}} ({\tilde{x}}_{t - 1}) - \nabla f ({\tilde{x}}_{t - 1})$.
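Since the listing of Algorithm 1 is not reproduced in the text, the following Python code is only a hedged reconstruction of an SCSGP-style loop from the description above. It is not the author's listing. The callables `loss` and `grad`, the batch schedule $B_{t} = \min(B_{0} t^{3/2}, n)$, $b_{t} = \min(t^{1/2}, n)$, the choice $l_{I_{t}}^{*} = 0$, sampling without replacement, and the synthetic least-squares data are all assumptions.

```python
import numpy as np

def scsgp_sketch(loss, grad, n, x0, T, B0, m, M, c_B, seed=0):
    """Hedged sketch of an SCSGP-style loop; loss(x, idx) and grad(x, idx) are
    hypothetical callables returning the batch average over the indices idx."""
    rng = np.random.default_rng(seed)
    snapshot = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        B_t = int(min(B0 * t ** 1.5, n))          # outer batch size (assumes B0 >= 1)
        b_t = max(1, int(min(t ** 0.5, n)))       # inner mini-batch size
        I_t = rng.choice(n, size=B_t, replace=False)
        g_batch = grad(snapshot, I_t)

        # 1/t-Polyak stepsize (2) with the lower bound l = 0 (nonnegative loss).
        sq_norm = float(g_batch @ g_batch)
        bar_eta = c_B * loss(snapshot, I_t) / sq_norm if sq_norm > 0 else M / t
        eta = min(max(bar_eta, m / t), M / t)

        # N_t ~ Geom(gamma_t) on {0, 1, ...} with gamma_t = B_t / (B_t + b_t).
        # NumPy's geometric variate is supported on {1, 2, ...}, hence the shift.
        gamma_t = B_t / (B_t + b_t)
        N_t = rng.geometric(1.0 - gamma_t) - 1

        x = snapshot.copy()
        for _ in range(N_t):
            J = rng.choice(n, size=b_t, replace=False)
            v = grad(x, J) - grad(snapshot, J) + g_batch   # biased estimator, see (3)
            x = x - eta * v
        snapshot = x
    return snapshot

# Usage on synthetic noise-free least squares with unit-norm rows (illustrative only).
data_rng = np.random.default_rng(1)
A = data_rng.normal(size=(500, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
y = A @ data_rng.normal(size=5)
loss = lambda x, idx: 0.5 * float(np.mean((A[idx] @ x - y[idx]) ** 2))
grad = lambda x, idx: A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)
x_hat = scsgp_sketch(loss, grad, n=500, x0=np.zeros(5), T=10,
                     B0=8, m=0.05, M=0.5, c_B=1.0)
```

The parameter values in the usage lines are arbitrary and are not tuned to satisfy the conditions of the convergence analysis.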

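3. Convergence Analysis

Because the geometric random variable $N_{t}$ plays a central role in the convergence analysis, we first state the following key lemma.

Lemma 1 [8] Let $N_{t} \sim \mathrm{Geom} (γ_{t})$ with $γ_{t} > 0$. Then for any sequence $\{D_{n}\}$ satisfying $Ε | D_{N_{t}} | < \infty$,

$Ε (D_{N_{t}} - D_{N_{t} + 1}) = (\frac{1}{γ_{t}} - 1) (D_{0} - Ε D_{N_{t}})$,

where $Ε$ denotes expectation over all random variables.

Let $γ = \min_{t} γ_{t}$. Then for any $t \in [T]$,

$Ε (D_{N_{t}} - D_{N_{t} + 1}) \leq (\frac{1}{γ} - 1) (D_{0} - Ε D_{N_{t}})$. (4)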
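The identity in Lemma 1 and the expectation $Ε N_{t} = B_{t} / b_{t}$ can be checked numerically. The sketch below truncates the series at a large index; the test sequence $D_{n} = 1 / (1 + n)$ and the values $B_{t} = 8$, $b_{t} = 1$ are arbitrary assumptions.

```python
import math

def geom_pmf(k, gamma):
    """P(N = k) = gamma^k * (1 - gamma) for k = 0, 1, ..."""
    return gamma ** k * (1.0 - gamma)

def expect(g, gamma, k_max=5000):
    """Truncated expectation E[g(N)]; the tail beyond k_max is negligible here."""
    return sum(g(k) * geom_pmf(k, gamma) for k in range(k_max))

B, b = 8, 1
gamma = B / (B + b)
D = lambda n: 1.0 / (1.0 + n)          # an arbitrary bounded test sequence

mean_N = expect(lambda k: float(k), gamma)                 # compare with B / b = 8
lhs = expect(lambda k: D(k) - D(k + 1), gamma)             # E(D_N - D_{N+1})
rhs = (1.0 / gamma - 1.0) * (D(0) - expect(D, gamma))      # (1/gamma - 1)(D_0 - E D_N)
identity_holds = math.isclose(lhs, rhs, rel_tol=1e-9)
```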

To apply (4), we must show that the sequences $\{D_{n}\}$ involved satisfy $Ε | D_{N_{t}} | < \infty$. The following lemma guarantees this.

Lemma 2 Suppose that each $f_{i} (x)$ is L-smooth, and let $\frac{M L}{t} \leq \frac{1}{3} {(\frac{b_{t}}{B_{t}})}^{2 / 3}$ and $B_{t} \geq 8 b_{t}$. Then for any $t \geq 1$, $Ε {‖ {\tilde{x}}_{t} - {\tilde{x}}_{t - 1} ‖}^{2} < \infty$, $Ε [f ({\tilde{x}}_{t}) - f^{*}] < \infty$, $Ε {‖ \nabla f ({\tilde{x}}_{t}) ‖}^{2} < \infty$, $Ε | 〈 e_{t}, {\tilde{x}}_{t} - {\tilde{x}}_{t - 1} 〉 | < \infty$, and $Ε | 〈 e_{t}, \nabla f ({\tilde{x}}_{t}) 〉 | < \infty$.

Proof: By the L-smoothness of $f_{i} (x)$ and (3),

$\begin{matrix} Ε_{{\tilde{I}}_{k}} f (x_{k + 1}^{(t)}) \leq f (x_{k}^{(t)}) - η_{t} 〈 Ε_{{\tilde{I}}_{k}} v_{k}^{(t)}, \nabla f (x_{k}^{(t)}) 〉 + \frac{L η_{t}^{2}}{2} Ε_{{\tilde{I}}_{k}} {‖ v_{k}^{(t)} ‖}^{2} \\ = f (x_{k}^{(t)}) - η_{t} {‖ \nabla f (x_{k}^{(t)}) ‖}^{2} - η_{t} 〈 e_{t}, \nabla f (x_{k}^{(t)}) 〉 + \frac{L η_{t}^{2}}{2} Ε_{{\tilde{I}}_{k}} {‖ v_{k}^{(t)} ‖}^{2} \\ \leq f (x_{k}^{(t)}) - η_{t} (1 - L η_{t}) {‖ \nabla f (x_{k}^{(t)}) ‖}^{2} - η_{t} 〈 e_{t}, \nabla f (x_{k}^{(t)}) 〉 + \frac{L^{3} η_{t}^{2}}{2 b_{t}} {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} + L η_{t}^{2} {‖ e_{t} ‖}^{2}, \end{matrix}$ (5)

where the last inequality uses Lemma B.2 of [8]. Since $2 〈 a, b 〉 \leq \frac{{‖ a ‖}^{2}}{c} + c {‖ b ‖}^{2}$ for any $c > 0$, taking $c = 2$ gives

$η_{t} 〈 e_{t}, - \nabla f (x_{k}^{(t)}) 〉 \leq \frac{1}{4} η_{t} {‖ \nabla f (x_{k}^{(t)}) ‖}^{2} + η_{t} {‖ e_{t} ‖}^{2}$. (6)

Since $η_{t} \leq \frac{M}{t}$, $\frac{M L}{t} \leq \frac{1}{3} {(\frac{b_{t}}{B_{t}})}^{2 / 3}$, and $B_{t} \geq 8 b_{t}$, we have $\frac{3}{4} - L η_{t} > 0$. From (5) and (6) we obtain

$\begin{matrix} {‖ \nabla f (x_{k}^{(t)}) ‖}^{2} \leq \frac{1}{η_{t} (\frac{3}{4} - L η_{t})} (f (x_{k}^{(t)}) - Ε_{{\tilde{I}}_{k}} f (x_{k + 1}^{(t)})) + \frac{1 + L η_{t}}{\frac{3}{4} - L η_{t}} {‖ e_{t} ‖}^{2} \\ + \frac{L^{3} η_{t}}{2 b_{t} (\frac{3}{4} - L η_{t})} {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} . \end{matrix}$ (7)

Noting that $x_{k + 1}^{(t)} = x_{k}^{(t)} - η_{t} v_{k}^{(t)}$, a derivation similar to that of (5) gives

$\begin{matrix} Ε_{{\tilde{I}}_{k}} {‖ x_{k + 1}^{(t)} - x_{0}^{(t)} ‖}^{2} = {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} - 2 η_{t} 〈 Ε_{{\tilde{I}}_{k}} v_{k}^{(t)}, x_{k}^{(t)} - x_{0}^{(t)} 〉 + η_{t}^{2} Ε_{{\tilde{I}}_{k}} {‖ v_{k}^{(t)} ‖}^{2} \\ = {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} - 2 η_{t} 〈 \nabla f (x_{k}^{(t)}), x_{k}^{(t)} - x_{0}^{(t)} 〉 - 2 η_{t} 〈 e_{t}, x_{k}^{(t)} - x_{0}^{(t)} 〉 + η_{t}^{2} Ε_{{\tilde{I}}_{k}} {‖ v_{k}^{(t)} ‖}^{2} \\ \leq (1 + \frac{η_{t}^{2} L^{2}}{b_{t}}) {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} - 2 η_{t} 〈 \nabla f (x_{k}^{(t)}), x_{k}^{(t)} - x_{0}^{(t)} 〉 - 2 η_{t} 〈 e_{t}, x_{k}^{(t)} - x_{0}^{(t)} 〉 \\ + 2 η_{t}^{2} {‖ \nabla f (x_{k}^{(t)}) ‖}^{2} + 2 η_{t}^{2} {‖ e_{t} ‖}^{2} . \end{matrix}$ (8)

Applying $2 〈 a, b 〉 \leq \frac{{‖ a ‖}^{2}}{c} + c {‖ b ‖}^{2}$ again with $c = \frac{b_{t}}{8 B_{t}}$, $b = x_{k}^{(t)} - x_{0}^{(t)}$, and $a = - η_{t} \nabla f (x_{k}^{(t)})$ or $a = - η_{t} e_{t}$, we obtain

$〈 - 2 η_{t} \nabla f (x_{k}^{(t)}), x_{k}^{(t)} - x_{0}^{(t)} 〉 \leq \frac{8 η_{t}^{2} B_{t}}{b_{t}} {‖ \nabla f (x_{k}^{(t)}) ‖}^{2} + \frac{b_{t}}{8 B_{t}} {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2}$,

$〈 - 2 η_{t} e_{t}, x_{k}^{(t)} - x_{0}^{(t)} 〉 \leq \frac{8 η_{t}^{2} B_{t}}{b_{t}} {‖ e_{t} ‖}^{2} + \frac{b_{t}}{8 B_{t}} {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2}$.

Substituting these inequalities and (7) into (8) gives

$\begin{array}{l} Ε_{{\tilde{I}}_{k}} {‖ x_{k + 1}^{(t)} - x_{0}^{(t)} ‖}^{2} \leq (1 + \frac{b_{t}}{4 B_{t}} + \frac{3 η_{t}^{2} L^{2} / 2 + 8 η_{t}^{3} L^{3} B_{t} / b_{t}}{2 b_{t} (3 / 4 - η_{t} L)}) {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} + (2 η_{t}^{2} + \frac{8 η_{t}^{2} B_{t}}{b_{t}}) (1 + \frac{1 + η_{t} L}{3 / 4 - η_{t} L}) {‖ e_{t} ‖}^{2} \\ + \frac{2 η_{t} + 8 η_{t} B_{t} / b_{t}}{3 / 4 - η_{t} L} (f (x_{k}^{(t)}) - Ε_{{\tilde{I}}_{k}} f (x_{k + 1}^{(t)})) . \end{array}$ (9)

From $η_{t} L \leq \frac{1}{3} {(\frac{b_{t}}{B_{t}})}^{2 / 3}$ and $B_{t} \geq 8 b_{t}$ we obtain

$\begin{matrix} \frac{3 η_{t}^{2} L^{2} / 2 + 8 η_{t}^{3} L^{3} B_{t} / b_{t}}{2 b_{t} (3 / 4 - η_{t} L)} \leq \frac{(1 / 6) \times {(b_{t} / B_{t})}^{4 / 3} + (8 / 27) \times (b_{t} / B_{t})}{2 b_{t} (3 / 4 - (1 / 3) \times {(b_{t} / B_{t})}^{2 / 3})} \\ \leq \frac{1 / (12 B_{t}) + 8 / (27 B_{t})}{2 (3 / 4 - 1 / 12)} = \frac{41}{144 B_{t}} \leq \frac{7 b_{t}}{24 B_{t}} . \end{matrix}$

Combining this with (9) gives

$\begin{matrix} Ε_{{\tilde{I}}_{k}} {‖ x_{k + 1}^{(t)} - x_{0}^{(t)} ‖}^{2} \leq (1 + \frac{b_{t}}{4 B_{t}} + \frac{7 b_{t}}{24 B_{t}}) {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} + (2 + \frac{8 B_{t}}{b_{t}}) (1 + \frac{1 + (1 / 3) \times {(b_{t} / B_{t})}^{2 / 3}}{3 / 4 - (1 / 3) \times {(b_{t} / B_{t})}^{2 / 3}}) η_{t}^{2} {‖ e_{t} ‖}^{2} \\ + \frac{2 η_{t} + 8 η_{t} B_{t} / b_{t}}{3 / 4 - (1 / 3) \times {(b_{t} / B_{t})}^{2 / 3}} (f (x_{k}^{(t)}) - Ε_{{\tilde{I}}_{k}} f (x_{k + 1}^{(t)})) \\ \leq (1 + \frac{13 b_{t}}{24 B_{t}}) {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} + (\frac{21}{4} + \frac{21 B_{t}}{b_{t}}) η_{t}^{2} {‖ e_{t} ‖}^{2} + (3 + \frac{12 B_{t}}{b_{t}}) η_{t} (f (x_{k}^{(t)}) - Ε_{{\tilde{I}}_{k}} f (x_{k + 1}^{(t)})) . \end{matrix}$ (10)

To bound $Ε [f (x_{k}^{(t)}) - f^{*}]$ and $Ε {‖ x_{k + 1}^{(t)} - x_{0}^{(t)} ‖}^{2}$, define

$G_{k}^{(t)} = (3 + \frac{12 B_{t}}{b_{t}}) η_{t} Ε [f (x_{k}^{(t)}) - f^{*}] + Ε {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} .$

Taking the total expectation of (10) gives

$\begin{matrix} G_{k + 1}^{(t)} \leq G_{k}^{(t)} + \frac{13 b_{t}}{24 B_{t}} Ε {‖ x_{k}^{(t)} - x_{0}^{(t)} ‖}^{2} + (\frac{21}{4} + \frac{21 B_{t}}{b_{t}}) η_{t}^{2} Ε {‖ e_{t} ‖}^{2} \\ \leq (1 + \frac{13 b_{t}}{24 B_{t}}) (G_{k}^{(t)} + (\frac{21}{4} + \frac{21 B_{t}}{b_{t}}) η_{t}^{2} Ε {‖ e_{t} ‖}^{2}) \\ \leq {(1 + \frac{13 b_{t}}{24 B_{t}})}^{k} (G_{0}^{(t)} + (\frac{21}{4} + \frac{21 B_{t}}{b_{t}}) η_{t}^{2} Ε {‖ e_{t} ‖}^{2}) . \end{matrix}$

Since $N_{t} \sim \mathrm{Geom} (\frac{B_{t}}{B_{t} + b_{t}})$, we have

$P (N_{t} = k) = \frac{b_{t}}{B_{t} + b_{t}} {(\frac{B_{t}}{B_{t} + b_{t}})}^{k} \leq {(\frac{B_{t}}{B_{t} + b_{t}})}^{k},$

$Ε {(1 + \frac{13 b_{t}}{24 B_{t}})}^{N_{t}} \leq \sum_{k \geq 0} {(\frac{24 B_{t} + 13 b_{t}}{24 B_{t}})}^{k} {(\frac{B_{t}}{B_{t} + b_{t}})}^{k} = \sum_{k \geq 0} {(\frac{24 B_{t} + 13 b_{t}}{24 B_{t} + 24 b_{t}})}^{k} = \frac{24 B_{t} + 24 b_{t}}{11 b_{t}} .$
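The closed form of the geometric series in the last display can be verified with exact rational arithmetic. In the sketch below the values of $B_{t}$ and $b_{t}$ are arbitrary assumptions satisfying $B_{t} \geq 8 b_{t}$, and `series_bound` is a hypothetical helper name.

```python
from fractions import Fraction

def series_bound(B, b):
    """Closed form of sum_{k>=0} r^k with r = (24B + 13b) / (24B + 24b) < 1."""
    r = Fraction(24 * B + 13 * b, 24 * B + 24 * b)
    return 1 / (1 - r)

B, b = 64, 2   # arbitrary positive integers with B >= 8b
closed_form = series_bound(B, b)
claimed = Fraction(24 * B + 24 * b, 11 * b)   # right-hand side of the display
```

The ratio $r$ is strictly less than 1 for all positive $B_{t}$ and $b_{t}$, which is what makes the series converge.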

Hence

$Ε G_{N_{t}}^{(t)} \leq \frac{24 B_{t} + 24 b_{t}}{11 b_{t}} (G_{0}^{(t)} + (\frac{21}{4} + \frac{21 B_{t}}{b_{t}}) η_{t}^{2} Ε {‖ e_{t} ‖}^{2}),$

that is,

$\begin{array}{l} (3 + \frac{12 B_{t}}{b_{t}}) η_{t} Ε [f (x_{N_{t}}^{(t)}) - f^{*}] + Ε {‖ x_{N_{t}}^{(t)} - x_{0}^{(t)} ‖}^{2} \\ \leq \frac{24 B_{t} + 24 b_{t}}{11 b_{t}} ((3 + \frac{12 B_{t}}{b_{t}}) η_{t} Ε [f (x_{0}^{(t)}) - f^{*}] + (\frac{21}{4} + \frac{21 B_{t}}{b_{t}}) η_{t}^{2} Ε {‖ e_{t} ‖}^{2}) . \end{array}$

Replacing $x_{N_{t}}^{(t)}$ by ${\tilde{x}}_{t}$ and $x_{0}^{(t)}$ by ${\tilde{x}}_{t - 1}$, and using $Ε {‖ e_{t} ‖}^{2} < \infty$ from Lemma B.3 of [8], we conclude that $Ε {‖ {\tilde{x}}_{t} - {\tilde{x}}_{t - 1} ‖}^{2} < \infty$ and $Ε [f ({\tilde{x}}_{t}) - f^{*}] < \infty$. By (7), $Ε {‖ \nabla f ({\tilde{x}}_{t}) ‖}^{2} < \infty$. Finally, the inequality $2 〈 a, b 〉 \leq \frac{{‖ a ‖}^{2}}{c} + c {‖ b ‖}^{2}$ yields $Ε | 〈 e_{t}, {\tilde{x}}_{t} - {\tilde{x}}_{t - 1} 〉 | < \infty$ and $Ε | 〈 e_{t}, \nabla f ({\tilde{x}}_{t}) 〉 | < \infty$. This completes the proof.

We now analyze the linear convergence rate of SCSGP under strong convexity.

Theorem 1 Suppose that each $f_{i} (x)$ is L-smooth and $f (x)$ is $μ$-strongly convex, and let $b_{t} = t^{1 / 2}$ and $B_{t} = B_{0} t^{3 / 2}$. Then

$Ε [f ({\tilde{x}}_{T}) - f^{*}] \leq c_{0}^{T} Δ_{f} + \frac{15 M Η^{*}}{8 μ m B_{0}} J (B_{t} < n),$

where $c_{0} = \frac{3}{3 + 2 μ m B_{0}}$, $Δ_{f} = f ({\tilde{x}}_{0}) - f^{*}$, $Η^{*} = \sup_{x} \frac{1}{n} \sum_{i = 1}^{n} {‖ \nabla f_{i} (x) - \nabla f (x) ‖}^{2}$, and $J (B_{t} < n) = \begin{cases} 1, & B_{t} < n; \\ 0, & \text{otherwise} . \end{cases}$

Proof: By Lemma B.3 and equation (20) of [8],

$\begin{array}{l} \frac{η_{t} B_{t}}{b_{t}} (2 - \frac{2 b_{t}}{B_{t}} - 2 η_{t} L - \frac{b_{t}^{3}}{b_{t}^{3} - η_{t}^{2} L^{2} b_{t} B_{t} - η_{t}^{3} L^{3} B_{t}^{2}}) Ε {‖ \nabla f ({\tilde{x}}_{t}) ‖}^{2} \\ \leq 2 Ε [f ({\tilde{x}}_{t - 1}) - f ({\tilde{x}}_{t})] + \frac{2 η_{t}}{b_{t}} (1 + η_{t} L + \frac{b_{t}}{B_{t}}) Η^{*} J (B_{t} < n) . \end{array}$ (11)

From $\frac{m}{t} \leq η_{t} \leq \frac{M}{t}$, $\frac{M L}{t} \leq \frac{1}{3} {(\frac{b_{t}}{B_{t}})}^{2 / 3}$, and $B_{t} \geq 8 b_{t}$ we obtain

$\begin{matrix} 2 - \frac{2 b_{t}}{B_{t}} - 2 η_{t} L - \frac{b_{t}^{3}}{b_{t}^{3} - η_{t}^{2} L^{2} b_{t} B_{t} - η_{t}^{3} L^{3} B_{t}^{2}} \geq 2 - \frac{2 b_{t}}{B_{t}} - \frac{2 M L}{t} - \frac{t^{3} b_{t}^{3}}{t^{3} b_{t}^{3} - M^{2} t L^{2} b_{t} B_{t} - M^{3} L^{3} B_{t}^{2}} \\ \geq 2 - 2 \times \frac{1}{8} - 2 \times \frac{1}{3} \times \frac{1}{4} - \frac{1}{1 - 1 / (18 b_{t}) - 1 / (27 b_{t})} \geq \frac{73}{108} \geq \frac{2}{3}, \end{matrix}$

where the third inequality uses $b_{t} \geq 1$. In addition,

$1 + η_{t} L + \frac{b_{t}}{B_{t}} \leq 1 + \frac{M L}{t} + \frac{b_{t}}{B_{t}} \leq 1 + \frac{1}{12} + \frac{1}{8} = \frac{29}{24} \leq \frac{5}{4}$.

Substituting these two bounds into (11) and using $\frac{m}{t} \leq η_{t} \leq \frac{M}{t}$ once more gives

$\frac{B_{t}}{t b_{t}} Ε {‖ \nabla f ({\tilde{x}}_{t}) ‖}^{2} \leq \frac{3}{m} Ε [f ({\tilde{x}}_{t - 1}) - f ({\tilde{x}}_{t})] + \frac{15 M}{4 m t b_{t}} Η^{*} J (B_{t} < n) .$ (12)

Since $f$ is $μ$-strongly convex,

$\frac{μ}{2} {‖ {\tilde{x}}_{t} - x^{*} ‖}^{2} \leq 〈 \nabla f ({\tilde{x}}_{t}), {\tilde{x}}_{t} - x^{*} 〉 + f^{*} - f ({\tilde{x}}_{t}) \leq \frac{{‖ \nabla f ({\tilde{x}}_{t}) ‖}^{2}}{2 μ} + \frac{μ {‖ {\tilde{x}}_{t} - x^{*} ‖}^{2}}{2} + f^{*} - f ({\tilde{x}}_{t}),$

where the second inequality uses $2 〈 a, b 〉 \leq \frac{{‖ a ‖}^{2}}{c} + c {‖ b ‖}^{2}$ with $c = μ$. Rearranging gives

${‖ \nabla f ({\tilde{x}}_{t}) ‖}^{2} \geq 2 μ (f ({\tilde{x}}_{t}) - f^{*}) .$

Substituting this into (12) gives

$(3 t b_{t} + 2 μ m B_{t}) Ε [f ({\tilde{x}}_{t}) - f^{*}] \leq 3 t b_{t} Ε [f ({\tilde{x}}_{t - 1}) - f^{*}] + \frac{15}{4} M Η^{*} J (B_{t} < n) .$

Substituting $b_{t} = t^{1 / 2}$ and $B_{t} = B_{0} t^{3 / 2}$ and dividing both sides by $3 t^{3 / 2} + 2 μ m B_{0} t^{3 / 2}$ gives

$\begin{matrix} Ε [f ({\tilde{x}}_{t}) - f^{*}] \leq (\frac{3}{3 + 2 μ m B_{0}}) Ε [f ({\tilde{x}}_{t - 1}) - f^{*}] + \frac{15 M Η^{*} J (B_{t} < n)}{4 (3 + 2 μ m B_{0}) t^{3 / 2}} \\ \leq (\frac{3}{3 + 2 μ m B_{0}}) Ε [f ({\tilde{x}}_{t - 1}) - f^{*}] + \frac{15 M Η^{*} J (B_{t} < n)}{4 (3 + 2 μ m B_{0})} . \end{matrix}$

where the last inequality holds because $t \geq 1$. This can be rewritten as

$Ε [f ({\tilde{x}}_{t}) - f^{*}] - \frac{15 M Η^{*} J (B_{t} < n)}{8 μ m B_{0}} \leq (\frac{3}{3 + 2 μ m B_{0}}) (Ε [f ({\tilde{x}}_{t - 1}) - f^{*}] - \frac{15 M Η^{*} J (B_{t} < n)}{8 μ m B_{0}}) .$ (13)

Applying (13) recursively for $t = T, \dots, 1$ gives

$Ε [f ({\tilde{x}}_{T}) - f^{*}] - \frac{15 M Η^{*} J (B_{t} < n)}{8 μ m B_{0}} \leq {(\frac{3}{3 + 2 μ m B_{0}})}^{T} (Ε [f ({\tilde{x}}_{0}) - f^{*}] - \frac{15 M Η^{*} J (B_{t} < n)}{8 μ m B_{0}}) \leq c_{0}^{T} Δ_{f} .$ (14)

Rearranging (14) completes the proof.
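The last step of the proof can be illustrated numerically. The sketch below iterates the equality case of the recursion preceding (13) and compares the result with the bound of Theorem 1. It assumes $J (B_{t} < n) = 1$ for every $t$, and all parameter values and function names are arbitrary illustrative choices.

```python
def worst_case_gap(T, delta_f, mu, m, M, B0, H_star):
    """Iterate a_t = c0 * a_{t-1} + d / t^{3/2}, the equality case of the recursion."""
    c0 = 3.0 / (3.0 + 2.0 * mu * m * B0)
    d = 15.0 * M * H_star / (4.0 * (3.0 + 2.0 * mu * m * B0))
    a = delta_f
    for t in range(1, T + 1):
        a = c0 * a + d / t ** 1.5
    return a

def theorem_bound(T, delta_f, mu, m, M, B0, H_star):
    """c0^T * Delta_f + 15 M H* / (8 mu m B0), taking J(B_t < n) = 1."""
    c0 = 3.0 / (3.0 + 2.0 * mu * m * B0)
    return c0 ** T * delta_f + 15.0 * M * H_star / (8.0 * mu * m * B0)

params = dict(delta_f=10.0, mu=0.1, m=0.5, M=1.0, B0=8.0, H_star=1.0)
gaps = [worst_case_gap(T, **params) for T in (1, 5, 20, 50)]
bounds = [theorem_bound(T, **params) for T in (1, 5, 20, 50)]
```

The first term of the bound decays geometrically with rate $c_{0}$, while the second term is the fixed point of the relaxed recursion and gives the radius of the neighborhood.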

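4. Numerical Experiments

Figure 1. Comparison of different stochastic gradient algorithms

Consider the regularized logistic regression problem

$f (x) = \frac{1}{n} \sum_{i = 1}^{n} \log (1 + \exp (- b_{i} a_{i}^{T} x)) + \frac{1}{2 n} {‖ x ‖}^{2},$

where ${\{(a_{i}, b_{i})\}}_{i = 1}^{n} \subset ℜ^{d} \times \{- 1, 1\}$ is the given training set. As pointed out in [7], setting the number of inner iterations $N_{t}$ to its expected value improves the stability of SCSG, and drawing ${\tilde{I}}_{k}$ from $I_{t}$ reduces the computational cost. In the experiments we therefore set $N_{t} = ⌊ B_{t} / b_{t} ⌋$ (the floor of the expectation of the geometric random variable $N_{t}$) and draw ${\tilde{I}}_{k}$ from $I_{t}$, where $⌊ \cdot ⌋$ denotes rounding down. To assess the effectiveness of SCSGP, we compare SCSGP, SCSG, SVRG, SVRGBB, and SGD. Specifically, SVRG uses the mini-batch size $b_{t} \equiv 1$; SCSG uses $B_{t} \equiv 0.05 n$, $b_{t} \equiv 1$, $N_{t} = ⌊ B_{t} / b_{t} ⌋$; SCSGP uses $B_{t} = ⌊ B_{0} t^{3 / 2} \land n ⌋$, $b_{t} = ⌊ t^{1 / 2} \land n ⌋$, $N_{t} = ⌊ B_{t} / b_{t} ⌋$. Table 1 lists four standard data sets from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The comparison uses the parameter values in Table 2, and Figure 1 shows the optimality gap against the number of effective passes. SCSGP clearly outperforms SCSG, and during the first few effective passes SCSGP gives better numerical results than the other stochastic gradient algorithms.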
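As a sketch of the experimental setup, the following Python code implements the regularized logistic objective, its gradient, and the SCSGP batch schedule described above. The function names and the value $B_{0} = 8$ are assumptions for illustration; the actual parameter values are those of Table 2, which is not reproduced here.

```python
import math
import numpy as np

def logistic_objective(x, A, b):
    """f(x) = mean(log(1 + exp(-b_i * a_i^T x))) + ||x||^2 / (2n)."""
    n = A.shape[0]
    margins = b * (A @ x)
    # logaddexp(0, -z) evaluates log(1 + exp(-z)) without overflow.
    return float(np.mean(np.logaddexp(0.0, -margins)) + (x @ x) / (2.0 * n))

def logistic_gradient(x, A, b):
    """Gradient of logistic_objective with respect to x."""
    n = A.shape[0]
    margins = b * (A @ x)
    weights = -b / (1.0 + np.exp(margins))   # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z))
    return A.T @ weights / n + x / n

def scsgp_schedule(t, n, B0):
    """B_t = floor(min(B0 * t^{3/2}, n)), b_t = floor(min(t^{1/2}, n)), N_t = floor(B_t / b_t)."""
    B_t = math.floor(min(B0 * t ** 1.5, n))
    b_t = math.floor(min(t ** 0.5, n))
    return B_t, b_t, B_t // b_t

# At x = 0 every margin is zero, so the objective equals log(2).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
b = rng.choice([-1.0, 1.0], size=20)
value_at_zero = logistic_objective(np.zeros(3), A, b)
first_outer_loop = scsgp_schedule(1, n=1000, B0=8)   # (B_1, b_1, N_1)
```

Under this schedule $B_{t} / b_{t}$ grows linearly in $t$ until $B_{t}$ is capped at $n$, so later outer loops run more inner iterations per batch gradient.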

Table 1. The information of data sets


Table 2. Parameters used for experiments


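5. Conclusion

Building on the Polyak stepsize and the 1/t-band stepsize, we proposed the 1/t-Polyak stepsize and combined it with SCSG to obtain the SCSGP algorithm. When the objective function is strongly convex and smooth, SCSGP converges linearly. In numerical experiments on regularized logistic regression, SCSGP outperformed the other stochastic gradient algorithms during the first few effective passes.

References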

[1]	Kasiviswanathan, S.P. and Jin, H. (2016) Efficient Private Empirical Risk Minimization for High-Dimensional Learning. International Conference on Machine Learning, 48, 488-497.
[2]	Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2017) Imagenet Classification with Deep Convolutional Neural Networks. Communications of the ACM, 60, 84-90. https://doi.org/10.1145/3065386
[3]	Sutskever, I., Martens, J., Dahl, G., et al. (2013) On the Importance of Initialization and Momentum in Deep Learning. International Conference on Machine Learning, 28, 1139-1147.
[4]	Robbins, H. and Monro, S. (1951) A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22, 400-407. https://doi.org/10.1214/aoms/1177729586
[5]	Bottou, L., Curtis, F.E. and Nocedal, J. (2018) Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60, 223-311. https://doi.org/10.1137/16M1080173
[6]	Johnson, R. and Zhang, T. (2013) Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction. Advances in Neural Information Processing Systems, 1, 315-323.
[7]	Lei, L. and Jordan, M. (2017) Less than a Single Pass: Stochastically Controlled Stochastic Gradient. Artificial Intelligence and Statistics, 54, 148-156.
[8]	Lei, L., Ju, C., Chen, J., et al. (2017) Non-Convex Finite-Sum Optimization via SCSG Methods. Advances in Neural Information Processing Systems, 11, 2345-2355.
[9]	Lei, L. and Jordan, M.I. (2020) On the Adaptivity of Stochastic Gradient-Based Optimization. SIAM Journal on Optimization, 30, 1473-1500. https://doi.org/10.1137/19M1256919
[10]	Gower, R.M., Loizou, N., Qian, X., et al. (2019) SGD: General Analysis and Improved Rates. International Conference on Machine Learning, 97, 5200-5209.
[11]	Ghadimi, S. and Lan, G. (2013) Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 23, 2341-2368. https://doi.org/10.1137/120880811
[12]	Rakhlin, A., Shamir, O. and Sridharan, K. (2011) Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. arXiv: 1109.5647.
[13]	Polyak, B.T. (1987) Introduction to Optimization. Optimization Software. Publications Division, New York.
[14]	Loizou, N., Vaswani, S., Laradji, I.H., et al. (2021) Stochastic Polyak Step-Size for SGD: An Adaptive Learning Rate for Fast Convergence. International Conference on Artificial Intelligence and Statistics, 130, 1306-1314.
[15]	Orvieto, A., Lacoste-Julien, S. and Loizou, N. (2022) Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution. Advances in Neural Information Processing Systems, 35, 26943-26954.
[16]	Wang, X. and Yuan, Y. (2023) On the Convergence of Stochastic Gradient Descent with Bandwidth-Based Step Size. Journal of Machine Learning Research, 24, 1-49.
