A Comparative Study of Variable Selection Methods for High Dimensional Data Based on Logistic Regression Model
DOI: 10.12677/SA.2019.83062. Supported by research project funding.
Author: Liao Dan*: College of Science, North China University of Technology, Beijing
Keywords: High Dimensional Data; Variable Selection; Logistic Regression Model
Abstract: High-dimensional data has become an active research area in modern big data analysis, and variable selection is widely used to analyze such data. Many high-dimensional variable selection methods have appeared in the literature. To examine the scope of application and the strengths and weaknesses of several influential methods, this paper considers lasso, adaptive lasso, and related variable selection methods for the variable selection problem in the logistic regression model. First, simulation experiments compare the prediction and variable selection performance of the methods in low-dimensional and high-dimensional settings. Then, a further empirical comparison is carried out on a real data set. The results show that, under the same conditions, adaptive lasso outperforms lasso in both model prediction and interpretability.
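The two methods compared in the abstract can be illustrated with a small sketch. This page does not show the paper's own implementation, so everything below is an assumption made for illustration: the proximal gradient solver, the function names (fit_l1_logistic, adaptive_weights, demo), the step size, the penalty level lam, the smoothing constant eps, and the simulated data. Lasso penalizes every coefficient equally, while adaptive lasso rescales each coefficient's penalty by the inverse of an initial estimate, so coefficients that look large at first are shrunk less.

```python
import math
import random


def soft_threshold(z, t):
    """Proximal map of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def fit_l1_logistic(X, y, lam, weights=None, step=0.5, n_iter=300):
    """Proximal gradient descent for
    mean logistic loss + lam * sum_j weights[j] * |beta_j|.
    The intercept is not penalized. The fixed step size is an assumption
    that suits roughly standardized features."""
    n, p = len(X), len(X[0])
    w = [1.0] * p if weights is None else weights
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed label.
        resid = [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n
                for j in range(p)]
        b0 -= step * sum(resid) / n
        beta = [soft_threshold(b - step * g, step * lam * wj)
                for b, g, wj in zip(beta, grad, w)]
    return b0, beta


def adaptive_weights(beta_init, gamma=1.0, eps=1e-6):
    """Adaptive lasso weights 1 / |beta_init_j| ** gamma.
    eps is an illustrative guard against division by zero."""
    return [1.0 / (abs(b) + eps) ** gamma for b in beta_init]


def demo(seed=0, n=100, p=8, lam=0.05):
    """Simulated low-dimensional example (hypothetical settings)."""
    rng = random.Random(seed)
    true_beta = [2.0, -2.0] + [0.0] * (p - 2)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(sum(b * x for b, x in zip(true_beta, row)))
         else 0 for row in X]
    _, beta_lasso = fit_l1_logistic(X, y, lam)
    # Initial estimate: unpenalized fit (lam = 0). When p exceeds n this is
    # ill-posed, and a penalized initial fit would be needed instead.
    _, beta_init = fit_l1_logistic(X, y, 0.0)
    _, beta_adapt = fit_l1_logistic(X, y, lam, weights=adaptive_weights(beta_init))
    for name, beta in (("lasso", beta_lasso), ("adaptive lasso", beta_adapt)):
        print(name, "selected:", [j for j, b in enumerate(beta) if b != 0.0])


if __name__ == "__main__":
    demo()
```

Calling demo() prints the indices of the variables each method keeps on one simulated data set. A single run like this only shows the mechanics and is not evidence for or against the paper's conclusion, which rests on repeated simulations and a real data set.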
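Cite this article: Liao Dan. A Comparative Study of Variable Selection Methods for High Dimensional Data Based on Logistic Regression Model [J]. Statistics and Application, 2019, 8(3): 553-559. https://doi.org/10.12677/SA.2019.83062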
