A Comparative Study of Variable Selection Methods for High-Dimensional Data Based on the Logistic Regression Model
Abstract:
High-dimensional data has become an active research area in modern big data analysis, and variable selection is widely used in analyzing such data. A large number of high-dimensional variable selection methods have appeared in the literature. To examine the scope of application, advantages, and disadvantages of several influential methods, this paper considers variable selection methods such as the lasso and the adaptive lasso for variable selection in the logistic regression model. First, through random simulation experiments, we compare the prediction and variable selection performance of the different methods in low-dimensional and high-dimensional settings. Then, we carry out a further empirical comparison on a real data set. The results show that, under the same conditions, the adaptive lasso has the advantage over the lasso in both model prediction and interpretability.
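
For readers unfamiliar with the two estimators compared here, the following is a sketch of their usual textbook form for logistic regression. The abstract does not state the objective functions, so the notation is an assumption: responses y_i in {0, 1}, predictor vectors x_i of dimension p, tuning parameters lambda and gamma, and an initial coefficient estimate written with a tilde. The paper's exact formulation may differ.

```latex
% Assumed notation (not taken from the paper): n observations, p predictors,
% y_i \in \{0,1\}, x_i \in \mathbb{R}^p, \lambda \ge 0, \gamma > 0.

% Average log-likelihood of the logistic regression model
\[
\ell(\beta_0,\beta) = \frac{1}{n}\sum_{i=1}^{n}
  \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
\]

% Lasso: the same L1 penalty on every coefficient (intercept unpenalized)
\[
(\hat\beta_0, \hat\beta)^{\text{lasso}} = \arg\min_{\beta_0,\beta}
  \Bigl\{ -\ell(\beta_0,\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\}
\]

% Adaptive lasso: coefficient-specific weights built from an initial estimate \tilde\beta
\[
(\hat\beta_0, \hat\beta)^{\text{alasso}} = \arg\min_{\beta_0,\beta}
  \Bigl\{ -\ell(\beta_0,\beta) + \lambda \sum_{j=1}^{p} w_j\,|\beta_j| \Bigr\},
\qquad w_j = \frac{1}{|\tilde\beta_j|^{\gamma}}
\]
```

Setting every w_j to 1 recovers the lasso. Under the weighted penalty, coefficients whose initial estimates are close to zero are penalized more heavily, while large ones are shrunk less. When p exceeds n the unpenalized maximum likelihood estimate is not available, so the initial estimate is commonly taken from a ridge or lasso fit; which choice this study makes is not stated in the abstract.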