Variable Selection of Zero-Inflated Geometric Distribution
DOI: 10.12677/AAM.2021.104135
Authors: Wen Jingrui (文静蕊), Zhao Lihua (赵丽华): College of Mathematics, Taiyuan University of Technology, Jinzhong, Shanxi
Keywords: Zero-Inflated Geometric Regression; Variable Selection; LASSO; SCAD; MCP
Abstract: Count outcomes are common in health services and outcomes research, and they often contain a large proportion of zeros. The zero-inflated geometric regression model is a powerful tool for analyzing counts with more zeros than a geometric distribution can accommodate. In practical modeling, the collected covariates may include redundant variables that are unrelated to the response, as well as variables that are known to be related to the response but whose actual effect is negligible. To handle many, mutually correlated covariates, this paper adds SCAD, MCP, and LASSO penalties to the likelihood function to obtain a penalized objective function for zero-inflated geometric regression, and then uses the EM algorithm for parameter estimation and variable selection. Simulation studies show that the proposed method yields accurate parameter estimates and outperforms the traditional stepwise selection method.
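As a rough illustration of the three penalties named in the abstract, the sketch below writes out the standard LASSO, SCAD, and MCP penalty functions together with a covariate-free zero-inflated geometric log-likelihood. Several things here are assumptions and are not taken from the paper: the parametrization P(Y=0) = pi + (1 - pi) p and P(Y=k) = (1 - pi) p (1 - p)^k for k >= 1, the default tuning constants a = 3.7 and gamma = 3.0, the scaling of the penalty by the sample size n, and the helper names `zig_loglik` and `penalized_objective`. The paper's regression model and its EM updates are not reproduced.

```python
import math


def lasso(t, lam):
    """LASSO penalty: lam * |t|."""
    return lam * abs(t)


def scad(t, lam, a=3.7):
    """SCAD penalty in its usual piecewise form (a > 2 is assumed)."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp(t, lam, gamma=3.0):
    """MCP penalty in its usual piecewise form (gamma > 1 is assumed)."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2 * gamma)
    return gamma * lam * lam / 2


def zig_loglik(y, pi, p):
    """Log-likelihood of counts y under the assumed ZIG parametrization.

    pi is the extra-zero probability and p the geometric success
    probability; both are scalars here (no covariates).
    """
    ll = 0.0
    for k in y:
        if k == 0:
            ll += math.log(pi + (1 - pi) * p)
        else:
            ll += math.log(1 - pi) + math.log(p) + k * math.log(1 - p)
    return ll


def penalized_objective(loglik, coefs, penalty, lam, n):
    """Hypothetical objective to minimize: -loglik + n * sum of penalties."""
    return -loglik + n * sum(penalty(b, lam) for b in coefs)
```

All three penalties grow like lam * |t| near zero, which is what drives small coefficients to exactly zero. SCAD and MCP flatten out for large |t|, so large coefficients are shrunk less than under LASSO.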
Cite this article: Wen, J. and Zhao, L. (2021) Variable Selection of Zero-Inflated Geometric Distribution. Advances in Applied Mathematics, 10(4), 1243-1254. https://doi.org/10.12677/AAM.2021.104135
