A Neighborhood Synthesis Software Defect Prediction Oversampling Method
DOI: 10.12677/SEA.2023.126091
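Authors: Chenhui Ji (纪晨辉), Yingmei Li (李英梅): School of Computer Science and Information Engineering, Harbin Normal University, Harbin, Heilongjiang
Keywords: Software Defect Prediction, Class Imbalance, Oversampling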
Abstract: Class imbalance in the dataset is one of the main challenges in software defect prediction. To address it, this paper proposes a neighborhood synthesis oversampling method. The method takes into account the feature information of the nearest neighbors of each defective-class sample and synthesizes new samples with reference to the features of those surrounding neighbors, which makes the generated samples more diverse. It also generates more samples in regions where defective samples are scarce, to strengthen the model's ability to recognize sparse defective-class samples. Experiments were run on the AEEEM and NASA datasets with the F1 score as the evaluation metric, and the results show that the method outperforms other traditional oversampling algorithms.
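The abstract names two ingredients: new samples are synthesized from the features of a defective sample's neighbors, and sparse regions receive more synthetic samples. The following is a minimal sketch of that general idea, not the authors' algorithm. The function names, the convex-combination rule, the sparsity weight (mean distance to the k nearest minority neighbors), and the parameter `k` are all assumptions made for illustration. A small F1 helper is included because the abstract uses F1 as the evaluation metric.

```python
import math
import random


def k_nearest(idx, points, k):
    """Return (distance, index) pairs for the k points nearest to points[idx]."""
    dists = sorted(
        (math.dist(points[idx], p), j)
        for j, p in enumerate(points) if j != idx
    )
    return dists[:k]


def neighborhood_oversample(minority, n_new, k=3, seed=0):
    """Hypothetical sketch: synthesize n_new samples from minority-class points.

    Assumptions (not taken from the paper): seeds are drawn with probability
    proportional to their mean distance to their k nearest minority neighbors,
    and each new sample lies between the seed and a random convex combination
    of those neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    neighbors = [k_nearest(i, minority, k) for i in range(len(minority))]
    # Larger mean neighbor distance means a sparser region and a larger weight.
    # The small constant keeps the weights valid if all points coincide.
    sparsity = [sum(d for d, _ in nb) / k + 1e-12 for nb in neighbors]

    synthetic = []
    for _ in range(n_new):
        i = rng.choices(range(len(minority)), weights=sparsity)[0]
        # Strictly positive random weights over the seed's neighbors.
        w = [1.0 - rng.random() for _ in range(k)]
        total = sum(w)
        target = [
            sum(wj * minority[j][d] for wj, (_, j) in zip(w, neighbors[i])) / total
            for d in range(len(minority[i]))
        ]
        gap = rng.random()  # how far to move from the seed toward the target
        synthetic.append([x + gap * (t - x) for x, t in zip(minority[i], target)])
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    defects = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
    for sample in neighborhood_oversample(defects, n_new=5, k=2, seed=1):
        print([round(v, 3) for v in sample])
```

Because every synthetic point is a convex combination of existing minority samples, it stays inside their convex hull. A real pipeline would apply the oversampler to the training split only and then compute F1 on an untouched test split.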
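Cite this article: Ji, C.H. and Li, Y.M. (2023) A Neighborhood Synthesis Software Defect Prediction Oversampling Method. Software Engineering and Applications, 12(6), 930-939. https://doi.org/10.12677/SEA.2023.126091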
