An Imbalance-Aware Ensemble Feature Selection Algorithm Based on Multi-Criteria Decision Making
DOI: 10.12677/JSTA.2023.116061 (supported by research project funding)
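Authors: Gang Wang, Liping Ren, Li Fang: College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu; Weilei Xu: Nantong Sizhen Electronic Technology Co., Ltd., Nantong, Jiangsu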
Keywords: Imbalanced Data Classification, Combined Sampling, Multi-Criteria Decision Making, VIKOR Method, Sequential Forward Selection
Abstract: Imbalanced data are prevalent in data mining, and such data are often both high-dimensional and class-imbalanced. The skewed distribution of feature attributes in imbalanced datasets degrades classification performance, while high dimensionality makes learning algorithms very time-consuming. To address these problems, a feature selection method based on combined sampling and ensemble learning is proposed. First, a combined sampling method handles the class imbalance: it focuses on synthesizing minority-class samples and removes noisy samples while keeping the dataset balanced. Ensemble feature selection is then modeled as a multi-criteria decision-making process, and the VIKOR method is used to obtain a feature importance ranking. Next, during sequential forward search over the features, the accuracy of the XGBoost algorithm serves as the criterion for evaluating candidate feature subsets and determining the optimal one. AUC, G-mean, and F-measure are used as evaluation metrics. Experiments on five imbalanced datasets confirm that the proposed algorithm achieves better classification performance and a more robust model.
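The VIKOR ranking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each row is a feature and each column is an importance score from one base feature selector, with higher scores being better. The function name `vikor_rank`, the equal weights, the compromise parameter `v = 0.5`, and the toy scores are all assumptions made for illustration; the abstract does not state which criteria, weights, or parameter values the authors use.

```python
def vikor_rank(scores, weights=None, v=0.5):
    """Rank alternatives (rows) with VIKOR; all criteria (columns) are 'higher is better'.

    Hypothetical helper for illustration only, not the authors' implementation.
    Returns (order, Q): indices sorted from best to worst, and the Q value per row.
    """
    n_criteria = len(scores[0])
    if weights is None:
        # Assumption: equal weights across criteria.
        weights = [1.0 / n_criteria] * n_criteria

    # Best and worst value observed for each criterion.
    best = [max(row[j] for row in scores) for j in range(n_criteria)]
    worst = [min(row[j] for row in scores) for j in range(n_criteria)]

    def regret(row, j):
        # Weighted, normalised distance from the best value of criterion j.
        span = best[j] - worst[j]
        return 0.0 if span == 0 else weights[j] * (best[j] - row[j]) / span

    # S: group utility (sum of regrets); R: individual regret (largest regret).
    S = [sum(regret(row, j) for j in range(n_criteria)) for row in scores]
    R = [max(regret(row, j) for j in range(n_criteria)) for row in scores]

    def normalise(values):
        lo, hi = min(values), max(values)
        return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in values]

    # Q: compromise index; smaller Q means a better alternative.
    Q = [v * s + (1 - v) * r for s, r in zip(normalise(S), normalise(R))]
    order = sorted(range(len(scores)), key=lambda i: Q[i])
    return order, Q


# Toy data: 4 features scored by 3 hypothetical base selectors.
scores = [
    [0.9, 0.8, 0.7],  # feature 0
    [0.1, 0.2, 0.3],  # feature 1
    [0.5, 0.9, 0.6],  # feature 2
    [0.4, 0.4, 0.4],  # feature 3
]
order, Q = vikor_rank(scores)
print(order)  # [0, 2, 3, 1]
```

In a pipeline like the one described, the resulting order would be the sequence in which features are offered to the sequential forward search, with a classifier's accuracy deciding whether each candidate is kept.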
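Citation: Wang, G., Ren, L.P., Fang, L. and Xu, W.L. (2023) An Imbalance-Aware Ensemble Feature Selection Algorithm Based on Multi-Criteria Decision Making. Journal of Sensor Technology and Application, 11(6), 538-549. https://doi.org/10.12677/JSTA.2023.116061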
