Application and Comparison of Model Selection Methods under Sparse Models
DOI: 10.12677/SA.2018.75062
Authors: Saiyu Dong, Xinmin Li* (School of Mathematics and Statistics, Qingdao University, Qingdao, Shandong); Kangshuo Li (College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong)
Keywords: Model Selection; Variable Selection; Sparse Model
Abstract: Model selection is an active topic in statistical research. With the advent of the big-data era, data dimensionality keeps growing, and fields such as economics and finance, biostatistics, and image processing place increasing demands on model selection. At the same time, sparse models play an increasingly important role in machine learning, as they help avoid overfitting. This paper studies model selection for the multiple linear regression model, summarizing three penalized model selection methods, namely ridge regression, LASSO, and SCAD, together with Bayesian model selection. Through data simulation and a real-data example, the four methods are then analyzed and compared under the premise of model sparsity. The analysis shows that when the model is strongly sparse, ridge regression performs well, while the SCAD method removes unimportant variables effectively and yields results differing only slightly from those of ridge regression; this favorable property can be exploited in practical applications.
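The qualitative difference the abstract draws on, that LASSO can set irrelevant coefficients exactly to zero while ridge regression only shrinks them, can be illustrated with a minimal sketch. This is not the paper's actual simulation design (the sample size, coefficients, and penalty levels below are illustrative assumptions), and SCAD is not available in scikit-learn, so only the ridge and LASSO behavior is shown:

```python
# Minimal sketch (not the paper's simulation design): fit ridge and LASSO
# to data from a sparse linear model and count exact-zero coefficients.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 200, 20
beta = np.zeros(p)
beta[:3] = [3.0, 1.5, 2.0]          # only 3 of 20 coefficients are nonzero
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Penalty strengths (alpha) chosen for illustration only.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# LASSO's L1 penalty zeroes out many irrelevant coefficients;
# ridge's L2 penalty merely shrinks them toward zero.
print("ridge exact zeros:", int(np.sum(ridge.coef_ == 0.0)))
print("lasso exact zeros:", int(np.sum(lasso.coef_ == 0.0)))
```

Under this setup the LASSO fit produces many exact zeros among the 17 irrelevant coefficients while keeping the three true signals, whereas the ridge fit leaves all 20 coefficients nonzero; this is the variable-removal property the paper attributes to sparsity-inducing penalties such as LASSO and SCAD.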
Citation: Dong, S.Y., Li, K.S. and Li, X.M. (2018) Application and Comparison of Model Selection Methods under Sparse Models. Statistics and Application, 7(5), 533-541. https://doi.org/10.12677/SA.2018.75062

References

[1] Hoerl, A.E. and Kennard, R.W. (1970) Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12, 55-67.
[2] Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.
[3] Fan, J. and Li, R. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360.
[4] Luo, S. and Chen, Z. (2013) Extended BIC for Linear Regression Models with Diverging Number of Relevant Features and High or Ultra-High Feature Spaces. Journal of Statistical Planning and Inference, 143, 494-504.
[5] Cho, H. and Fryzlewicz, P. (2011) High Dimensional Variable Selection via Tilting. Journal of the Royal Statistical Society, Series B, 74, 593-622.
[6] Fan, J. and Lv, J. (2008) Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society, Series B, 70, 849-911.
[7] Fan, J. and Lv, J. (2010) A Selective Overview of Variable Selection in High Dimensional Feature Space. Statistica Sinica, 20, 101-148.
[8] Zhang, K., Yin, F. and Xiong, S. (2014) Comparisons of Penalized Least Squares Methods by Simulations. arXiv:1405.1796v1 [stat.CO]
[9] Bai, Y. and Tian, M.Z. (2017) Comparison and Application of Several High-Dimensional Variable Selection Methods. Statistics and Decision, No. 22, 11-16. (In Chinese)
[10] Li, J.B., Zhu, Y.Z. and Wang, M.G. (2015) Research on Bayesian Variable Selection and Model Averaging. Statistics and Information Forum, 30(8), 20-24. (In Chinese)
[11] Breiman, L. (1995) Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37, 373-384.
[12] Zou, H. (2006) The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101, 1418-1429.
[13] Lafaye de Micheaux, P., Drouilhet, R. and Liquet, B. (2015) The R Software: Tutorials and Statistical Analysis, from Beginner to Expert. Translated by Pan, D.D., et al., Higher Education Press, Beijing. (In Chinese)