# 基于正则化风险函数模型的束方法The Bundle Methods Based on the Regularized Risk Minimization Model

DOI: 10.12677/AAM.2019.87144, PDF, HTML, XML, 下载: 407  浏览: 1,616

Abstract: Based on the regularized risk function model (SRM) of machine learning (ML), this paper combines the bundle method for solving non-smooth functions, presents an algorithm for solving empirical risk function model. The objective function is approximated by the cut-plane model, and the step size is obtained by the inexact line search. Under appropriate assumptions, the global convergence and convergence speed of the algorithm are analyzed.

1. 引言

$\underset{w}{\mathrm{min}}J\left(w\right):=\lambda \Omega \left(w\right)+{R}_{emp}\left(w\right),$ (1.1)

2. 基础知识

$J\left(w\right)\ge {J}_{t}^{CP}\left(w\right):=\underset{1\le i\le t}{\mathrm{max}}\left\{J\left({w}_{i-1}\right)+〈w-{w}_{i-1},{s}_{i}〉\right\}.$ (2.2)

${w}_{t}:=\underset{w}{\mathrm{arg}\mathrm{min}}{J}_{t}^{CP}\left(w\right).$ (2.3)

${w}_{t}:=\underset{w}{\mathrm{arg}\mathrm{min}}\left\{\frac{{\zeta }_{t}}{2}{‖w-{\stackrel{^}{w}}_{t-1}‖}^{2}+{J}_{t}^{CP}\left(w\right)\right\}.$ (2.4)

3. 一个新的最小化正则风险函数算法(BMRM)

BMRM算法具体如下：

$\begin{array}{l}{a}_{t}\in {\partial }_{\omega }{R}_{emp}\left({w}_{t-1}\right),\\ {b}_{t}:={R}_{emp}\left({w}_{t-1}\right)-〈{w}_{t-1},{a}_{t}〉,\end{array}$

a) BMRM算法是用一个分段线性下界来接近 ${R}_{emp}\left(w\right)$ 而不是 $J\left(w\right)$

b) 在邻近束方法中，邻近项( $\frac{\zeta }{2}{‖w-{\stackrel{^}{w}}_{t}‖}^{2}$ )被正则化项 $\Omega \left(w\right)$ 替代，因此，就不需要调整邻近参数了。

$\begin{array}{l}{a}_{t}\in {\partial }_{\omega }{R}_{emp}\left({w}_{t-1}^{c}\right),\\ {b}_{t}:={R}_{emp}\left({w}_{t-1}^{c}\right)-〈{w}_{t-1}^{c},{a}_{t}〉,\end{array}$

$\begin{array}{l}{\eta }_{t}:=\underset{\eta \in R}{\mathrm{arg}\mathrm{min}}J\left({w}_{t-1}^{b}+\eta \left({w}_{t}-{w}_{t-1}^{b}\right)\right)\\ {w}_{t}^{b}:={w}_{t-1}^{b}+{\eta }_{t}\left({w}_{t}-{w}_{t-1}^{b}\right)\\ {w}_{t}^{c}:=\left(1-\theta \right){w}_{t}^{b}+\theta {w}_{t}\end{array}$

4. 改进的非精确线搜索BMRM算法

$\begin{array}{l}{a}_{t}\in {\partial }_{\omega }{R}_{emp}\left({w}_{t-1}^{c}\right),\\ {b}_{t}:={R}_{emp}\left({w}_{t-1}^{c}\right)-〈{w}_{t-1}^{c},{a}_{t}〉,\end{array}$

$\begin{array}{l}\text{if}\text{\hspace{0.17em}}J\left({w}_{t-1}^{b}+{\beta }^{m}\left({w}_{t}-{w}_{t-1}^{b}\right)\right)\le J\left({w}_{t-1}^{b}\right)+\sigma {\beta }^{m}\left({a}_{t}+\lambda {\Omega }^{\prime }\left(w\right)\right)\left({w}_{t}-{w}_{t-1}^{b}\right)\\ {\eta }_{t}={\beta }^{m}\\ m:=m+1\end{array}$

$\begin{array}{l}{w}_{t}^{b}:={w}_{t-1}^{b}+{\eta }_{t}\left({w}_{t}-{w}_{t-1}^{b}\right)\\ {w}_{t}^{c}:=\left(1-\theta \right){w}_{t}^{b}+\theta {w}_{t}\end{array}$

${\Omega }^{\ast }\left(\mu \right):=\underset{w\in W}{\mathrm{sup}}〈w,\mu 〉-\Omega \left(w\right).$

${w}_{t}:=\underset{w\in {R}^{d}}{\mathrm{arg}\mathrm{min}}\left\{{J}_{t}\left(w\right):=\underset{1\le i\le t}{\mathrm{max}}〈w,{a}_{i}〉+{b}_{i}+\lambda \Omega \left(w\right)\right\}$ (4.1)

${\alpha }_{t}=\underset{\alpha \in {R}^{t}}{\mathrm{arg}\mathrm{max}}\left\{{J}_{t}^{\ast }\left(\alpha \right):=-\lambda {\Omega }^{\ast }\left(-{\lambda }^{-1}A\alpha \right)+{\alpha }^{\text{T}}b|\alpha \ge 0,{‖\alpha ‖}_{1}=1\right\}$ (4.2)

$\begin{array}{l}\underset{w,\xi }{\mathrm{min}}\lambda \Omega \left(w\right)+\xi \\ \text{st}.\text{ }\text{ }\text{ }\text{ }\xi \ge 〈w,{a}_{i}〉+{b}_{i}\end{array}$

$\left\{\begin{array}{l}L\left(w,\xi ,\alpha \right)=\lambda \Omega \left(w\right)+\xi -{\alpha }^{\text{T}}\left(\xi {1}_{t}-{A}^{\text{T}}w-b\right)\\ \alpha \ge 0\end{array}$ (4.3)

${\mathrm{max}}_{w}〈w,-{\lambda }^{-1}A\alpha 〉-\Omega \left(w\right)={\Omega }^{\ast }\left(-{\lambda }^{-1}A\alpha \right).$

${\alpha }_{t}=\underset{\alpha \in {R}^{t}}{\mathrm{arg}\mathrm{max}}\left\{-\frac{1}{2\lambda }{\alpha }^{\text{T}}{A}^{\text{T}}A\alpha +{\alpha }^{\text{T}}b|\alpha \ge 0,{‖\alpha ‖}_{1}=1\right\}$

5. 收敛性分析

${\epsilon }_{t}-{\epsilon }_{t+1}\ge \frac{{\epsilon }_{t}}{2}\mathrm{min}\left(1,\frac{\lambda {\epsilon }_{t}}{4{G}^{2}{H}^{\ast }}\right).$

$n\le {\mathrm{log}}_{2}\frac{\lambda J\left(0\right)}{{G}^{2}{H}^{\ast }}+\frac{8{G}^{2}{H}^{\ast }}{\lambda \epsilon }-1$

$J\left({w}_{t}^{c}\right)={J}_{t+1}\left({w}_{t}^{c}\right)\text{=}\lambda \Omega \left({w}_{t}^{c}\right)+〈{a}_{t+1},{w}_{t}^{c}〉+{b}_{t+1}\ge J\left({w}_{t}^{b}\right).$ (5.1)

$\Omega \left(\left(1-\theta \right){w}_{t}^{b}+\theta {w}_{t}\right)\le \left(1-\theta \right)\Omega \left({w}_{t}^{b}\right)+\theta \Omega \left({w}_{t}\right),$

$\theta \left(\Omega \left({w}_{t}^{b}\right)-\Omega \left({w}_{t}\right)\right)\le \Omega \left({w}_{t}^{b}\right)-\Omega \left({w}_{t}^{c}\right),$

$\begin{array}{l}\lambda \theta \text{\hspace{0.17em}}\Omega \left({w}_{t}^{b}\right)\text{+}\theta {R}_{emp}\left({w}_{t}^{b}\right)-\lambda \theta \text{\hspace{0.17em}}\Omega \left({w}_{t}\right)-\theta {R}_{t}\left({w}_{t}\right)\\ \le \lambda \Omega \left({w}_{t}^{b}\right)+{R}_{emp}\left({w}_{t}^{b}\right)-\lambda \Omega \left({w}_{t}^{c}\right)-\left(1-\theta \right){R}_{emp}\left({w}_{t}^{b}\right)-\theta {R}_{t}\left({w}_{t}\right),\end{array}$

$\theta {\epsilon }_{t}\le J\left({w}_{t}^{b}\right)-\lambda \Omega \left({w}_{t}^{c}\right)-\left(1-\theta \right){R}_{emp}\left({w}_{t}^{b}\right)-\theta {R}_{t}\left({w}_{t}\right),$ (5.2)

$〈{a}_{t+1},{w}_{t}^{c}〉+{b}_{t+1}\ge J\left({w}_{t}^{b}\right)-\lambda \Omega \left({w}_{t}^{c}\right)\ge \left(1-\theta \right){R}_{emp}\left({w}_{t}^{b}\right)+\theta {R}_{t}\left({w}_{t}\right)+\theta {\epsilon }_{t}$

${w}_{t}^{c}=\left(1-\theta \right){w}_{t}^{b}+\theta {w}_{t}$ 可得：

$\left(1-\theta \right)〈{a}_{t+1},{w}_{t}^{b}〉+\theta 〈{a}_{t+1},{w}_{t}〉+{b}_{t+1}\ge J\left({w}_{t}^{b}\right)-\lambda \Omega \left({w}_{t}^{c}\right)\ge \left(1-\theta \right){R}_{emp}\left({w}_{t}^{b}\right)+\theta {R}_{t}\left({w}_{t}\right)+\theta {\epsilon }_{t},$

$\left(1-\theta \right)\left(〈{a}_{t+1},{w}_{t}^{b}〉-{R}_{emp}\left({w}_{t}^{b}\right)\right)+\theta \left(〈{a}_{t+1},{w}_{t}〉-{R}_{t}\left({w}_{t}\right)\right)+{b}_{t+1}\ge J\left({w}_{t}^{b}\right)-\lambda \Omega \left({w}_{t}^{c}\right)\ge \theta {\epsilon }_{t}.$ (5.3)

${R}_{emp}\left({w}_{t}^{b}\right)\ge 〈{w}_{t}^{b},{a}_{t+1}〉+{b}_{t+1}$ (5.4)

$\left(1-\theta \right)\left(-{b}_{t+1}\right)+\theta \left(〈{w}_{t}^{b},{a}_{t+1}〉-{R}_{t}\left({w}_{t}\right)\right)+{b}_{t+1}\ge \theta {\epsilon }_{t}$

$〈{w}_{t}^{b},{a}_{t+1}〉+{b}_{t+1}\ge {R}_{t}\left({w}_{t}\right)+{\epsilon }_{t}$

${R}_{t+1}\left({w}_{t}\right)=\mathrm{max}\left(〈{w}_{t},{a}_{t+1}〉+{b}_{t+1},{R}_{t}\left({w}_{t}\right)\right)=〈{w}_{t},{a}_{t+1}〉+{b}_{t+1}$

${J}_{t+1}\left({w}_{t}\right)=\lambda \Omega \left({w}_{t}\right)+{R}_{t+1}\left(wt\right)$

$\begin{array}{c}{\epsilon }_{t}-{\epsilon }_{t+1}=J\left({w}_{t}^{b}\right)-{J}_{t}\left({w}_{t}\right)-J\left({w}_{t+1}^{b}\right)+{J}_{t+1}\left({w}_{t+1}\right)\\ =J\left({w}_{t}^{b}\right)-J\left({w}_{t+1}^{b}\right)+{J}_{t+1}\left({w}_{t+1}\right)-{J}_{t}\left({w}_{t}\right)\\ \ge {J}_{t+1}\left({w}_{t+1}\right)-{J}_{t}\left({w}_{t}\right).\end{array}$

${\left[-{\alpha }_{t},1\right]}^{\text{T}}{\stackrel{¯}{A}}^{\text{T}}\stackrel{¯}{A}\left[-{\alpha }_{t},1\right]\le 4{G}^{2}.$

$\begin{array}{c}{\left[-{\alpha }_{t},1\right]}^{\text{T}}{\stackrel{¯}{A}}^{\text{T}}\stackrel{¯}{A}\left[-{\alpha }_{t},1\right]={‖{\partial }_{w}\lambda \Omega \left({w}_{t}\right)+{a}_{t+1}‖}^{2}\\ ={‖{\partial }_{w}\lambda \Omega \left({w}_{t}\right)‖}^{2}+2{\partial }_{w}\lambda \Omega {\left({w}_{t}\right)}^{\text{T}}{a}_{t+1}+{‖{a}_{t+1}‖}^{2}\\ \le 4{G}^{2}.\end{array}$

${\rho }_{t}-{\rho }_{t+1}\ge c{\left({\rho }_{t}\right)}^{2}$

${\rho }_{t}\le \frac{1}{c\left(t-1+\frac{1}{{\rho }_{1}c}\right)}.$

$n\le \frac{8{G}^{2}{H}^{\ast }}{\lambda \epsilon }$ 步。

${\epsilon }_{t}-{\epsilon }_{t\text{+}1}\ge \frac{\lambda {\epsilon }_{t}^{\text{2}}}{8{G}^{2}{H}^{\ast }}$

$n\le \frac{1}{c\epsilon }=\frac{8{G}^{2}{H}^{\ast }}{\lambda \epsilon }.$

 [1] 周志华. 机器学习[M]. 北京: 清华大学出版社, 2016. [2] 李航. 统计机器学习[M]. 北京: 清华大学出版社, 2012. [3] 袁亚湘, 孙文瑜. 最优化理论与方法[M]. 武汉: 科学出版社, 2002. [4] Kelley, J.E. (1960) The Cutting-Plane Method for Solving Convex Programs. Journal of the Society for Industrial & Applied Mathematics, 8, 703-712. https://doi.org/10.1137/0108053 [5] 陈韵梅, 张维. 基于近似一阶信息的加速的Bundle Level算法[J]. 中国科学: 数学, 2017(10): 1119-1142. [6] Teo, C.H., Viswanathan, S.V.N., Smola, A.J. and Le, Q. (2010) Bundle Methods for Regularized Risk Minimization, Journal of Machine Learning Research, 11, 311-365. [7] Shalev-Shwartz and Singer, Y. (2006) Online Learning Meets Optimization in the Dual. In: Simon, H.U. and Lugosi, G., Eds., Computational Learning Theory, Springer, Berlin, 423-437. https://doi.org/10.1007/11776420_32 [8] Abe, N., Takeuchi, J. and Warmuth, M.K. (2001) Polynomial Learn Ability of Sto-chastic Rules with Respect to the KL-Divergence and Quadratic Distance. IEICE Transactions on Information and Systems, 84, 299-316.