Selection of Ultra-High Dimensional Classification Feature Variables Based on GINI Correlation Coefficient
Abstract: We propose a new method, Gini correlation feature screening, for ultrahigh-dimensional discriminant analysis based on the Gini correlation coefficient, and we establish the sure screening property of the procedure under simple assumptions. The procedure has several further desirable characteristics. First, it is robust to heavy-tailed distributions, potential outliers, and small sample sizes in some categories. Second, it is model-free: it requires no specification of a regression model, so it applies to both parametric and nonparametric settings and directly to problems with many categories. Third, its theoretical derivation is simple because the resulting statistics are bounded. Fourth, it is computationally inexpensive because the screening index has a simple structure. Monte Carlo simulations and real data examples demonstrate its finite-sample performance.
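The abstract does not state the screening index itself, so the following is only a minimal sketch under a stated assumption: that the index for each feature is a sample version of the Gini correlation between a numerical variable and a categorical label, taken here as (D - sum_k p_k D_k) / D, where D is the Gini mean difference of the feature over the whole sample, D_k is the Gini mean difference within class k, and p_k is the class proportion. The function names `gini_mean_difference`, `gini_correlation`, and `screen_features`, and the cutoff `d`, are hypothetical and chosen for illustration; the paper's own estimator and threshold may differ.

```python
import numpy as np


def gini_mean_difference(x):
    """U-statistic estimate of E|X1 - X2|, computed from the sorted sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 2:
        return 0.0
    # The sum over pairs i < j of |x_i - x_j| equals
    # sum_i (2i - n + 1) * x_(i) for 0-based order statistics x_(i).
    weights = 2.0 * np.arange(n) - n + 1.0
    return 2.0 * np.dot(weights, x) / (n * (n - 1))


def gini_correlation(x, y):
    """Assumed index: (D - sum_k p_k D_k) / D for feature x and labels y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    total = gini_mean_difference(x)
    if total == 0.0:
        return 0.0  # a constant feature carries no information about y
    within = 0.0
    for label in np.unique(y):
        group = x[y == label]
        within += (group.size / x.size) * gini_mean_difference(group)
    return (total - within) / total


def screen_features(X, y, d):
    """Score every column of X and return the indices of the d largest scores."""
    X = np.asarray(X, dtype=float)
    scores = np.array([gini_correlation(X[:, j], y) for j in range(X.shape[1])])
    top = np.argsort(-scores)[:d]
    return top, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 50
    y = np.arange(n) % 2              # two balanced classes
    X = rng.standard_normal((n, p))   # pure-noise features
    X[:, 3] += 3.0 * y                # feature 3 depends on the class label
    top, scores = screen_features(X, y, d=5)
    print(top, scores[top])
```

Each feature is scored separately and costs one sort overall plus one sort per class, which is in line with the low computational cost the abstract claims for a simple marginal index.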
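Cite this article: Si Meng (司萌), Zhang Junying (张俊英), Zhang Yan (张妍). Selection of Ultra-High Dimensional Classification Feature Variables Based on GINI Correlation Coefficient [J]. Advances in Applied Mathematics (应用数学进展), 2024, 13(3): 967-980. https://doi.org/10.12677/aam.2024.133091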
