Research on Software Defect Prediction Based on PCA-Smote-XGBoost
DOI: 10.12677/sea.2024.133035
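Authors: Zi'an Zeng (曾子安), Yingmei Li (李英梅): School of Computer Science and Information Engineering, Harbin Normal University, Harbin, Heilongjiang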
Keywords: Software Defect Prediction, PCA, Smote, XGBoost
Abstract: As software systems grow more complex, software defect prediction has become an important means of ensuring software quality. This study proposes a software defect prediction model based on PCA-Smote-XGBoost that aims to improve the accuracy and efficiency of defect prediction. Principal Component Analysis (PCA) reduces the dimensionality of the data while retaining 95% of the variance, which cuts the number of features and extracts the key information; Smote oversampling addresses the class imbalance in the data; and the XGBoost algorithm builds the prediction model, whose effectiveness is validated experimentally. On eleven projects drawn from datasets commonly used for software defect prediction, the model achieves higher accuracy (ACC) and F1 than all eight baseline models, and can effectively help software development teams identify potential defect risks.
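The oversampling step and the two reported metrics can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names `smote` and `accuracy_and_f1`, the neighbour count `k`, and the toy data are assumptions made for illustration, and the PCA and XGBoost stages are not shown. In practice these stages are usually taken from libraries such as scikit-learn, imbalanced-learn, and xgboost.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the basic SMOTE idea; hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (ACC, F1) for binary labels, treating `positive` as defective."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    acc = sum(t == p for t, p in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1


# Toy data (assumed): 3 defective modules against 8 clean ones.
defective = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
clean_count = 8
new_samples = smote(defective, n_new=clean_count - len(defective), k=2)
balanced_defective = defective + new_samples  # now as many as the clean class

acc, f1 = accuracy_and_f1([1, 0, 1, 0], [1, 0, 0, 0])
```

Because each synthetic point lies on a segment between two real minority samples, oversampling fills in the minority region rather than duplicating rows. Oversampling is normally applied to the training split only, so that the test set keeps its original class distribution.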
Cite this article: Zeng, Z. and Li, Y. (2024) Research on Software Defect Prediction Based on PCA-Smote-XGBoost. Software Engineering and Applications, 13(3), 346-357. https://doi.org/10.12677/sea.2024.133035
