Distributed Support Vector Regression of Massive Datasets
Abstract: Large-scale data poses new challenges for traditional statistical inference. When a massive dataset exceeds the capacity of a single computer, it cannot be held in memory, and the computation may take a very long time to return results. To address these problems, this paper studies distributed estimation for support vector regression. We first use a smoothing technique to develop a smoothed support vector regression (S-SVR) estimator. Then, following the divide-and-conquer idea, we propose a divide-and-conquer support vector regression algorithm (DC-SVR) that applies the S-SVR estimator to massive datasets and addresses both the memory limitation and the computing time. The tuning parameters of DC-SVR are obtained by combining grid search with cross-validation, so the method is adaptive: the optimal parameters are selected automatically from the data at hand. Simulation studies under several settings demonstrate the advantages of the proposed distributed estimator, and the results show that the DC-SVR estimator attains smaller errors under the mean absolute deviation and mean squared error criteria.
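The pipeline described in the abstract (smooth the loss, fit on each data block, combine the block estimates, tune by grid search with cross-validation) can be illustrated with a small sketch. The abstract does not state the smoothing function, the aggregation rule, or the optimizer, so everything specific here is an assumption made for illustration: a softplus-smoothed epsilon-insensitive loss with bandwidth `h`, plain gradient descent, simple averaging of block estimates, and the hypothetical helper names `fit_smoothed_svr`, `dc_svr`, `choose_eps`, and `simulate`. It is not the authors' implementation.

```python
import math
import random


def sigmoid(u):
    # Numerically stable logistic function (only exponentiates non-positive values).
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def fit_smoothed_svr(X, y, eps=0.1, h=0.1, lam=1e-3, lr=0.1, n_iter=200):
    """Linear SVR with an assumed softplus-smoothed epsilon-insensitive loss
    plus a ridge penalty, fitted by gradient descent.
    Returns [intercept, coef_1, ..., coef_p]."""
    n, p = len(X), len(X[0])
    theta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            r = yi - predict(theta, xi)
            # Derivative of the smoothed loss with respect to the residual r.
            d = sigmoid((r - eps) / h) - sigmoid((-r - eps) / h)
            grad[0] -= d / n
            for j, v in enumerate(xi):
                grad[j + 1] -= d * v / n
        for j in range(1, p + 1):
            grad[j] += lam * theta[j]  # ridge penalty on the slopes only
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def predict(theta, x):
    return theta[0] + sum(t * v for t, v in zip(theta[1:], x))


def dc_svr(X, y, n_blocks=4, **fit_args):
    """Divide and conquer: fit each block separately, then average the
    block estimates (simple averaging is an assumption of this sketch)."""
    size = len(X) // n_blocks
    estimates = []
    for k in range(n_blocks):
        lo = k * size
        hi = len(X) if k == n_blocks - 1 else lo + size
        # Only one block has to be processed at a time.
        estimates.append(fit_smoothed_svr(X[lo:hi], y[lo:hi], **fit_args))
    return [sum(col) / n_blocks for col in zip(*estimates)]


def mean_abs_dev(theta, X, y):
    return sum(abs(yi - predict(theta, xi)) for xi, yi in zip(X, y)) / len(y)


def choose_eps(X, y, grid=(0.05, 0.1, 0.2), n_folds=3, **fit_args):
    """Grid search over eps with K-fold cross-validation, scored by MAD."""
    best_eps, best_score = None, float("inf")
    for eps in grid:
        score = 0.0
        for f in range(n_folds):
            tr = [i for i in range(len(y)) if i % n_folds != f]
            te = [i for i in range(len(y)) if i % n_folds == f]
            theta = fit_smoothed_svr([X[i] for i in tr], [y[i] for i in tr],
                                     eps=eps, **fit_args)
            score += mean_abs_dev(theta, [X[i] for i in te], [y[i] for i in te])
        if score < best_score:
            best_eps, best_score = eps, score
    return best_eps


def simulate(n, seed=0):
    # Hypothetical toy data: y = 1 + 2*x1 - x2 + Gaussian noise.
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    y = [1.0 + 2.0 * a - 1.0 * b + rng.gauss(0, 0.2) for a, b in X]
    return X, y


X, y = simulate(800)
eps = choose_eps(X[:150], y[:150])         # tune on a small pilot subset
theta = dc_svr(X, y, n_blocks=4, eps=eps)  # [intercept, coef_1, coef_2]
print("chosen eps:", eps)
print("DC estimate:", [round(t, 3) for t in theta])
```

Because each block is fitted independently, the blocks could be processed one after another on a single machine or in parallel on several, which is how a divide-and-conquer scheme of this kind eases memory and run-time limits. Tuning on a pilot subset is a shortcut taken here to keep the example small.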
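Cite this article: Liang Shuna, Zhang Qi. Distributed Support Vector Regression of Massive Datasets. Advances in Applied Mathematics (应用数学进展), 2022, 11(4): 1876-1889. https://doi.org/10.12677/AAM.2022.114205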
