Research on AdaFocal-XGBoost Integrated Credit Scoring Model Based on Unbalanced Data
DOI: 10.12677/sa.2024.136214. Supported by research project funding.
Authors: Kai Guo, Hong Fan*: Glorious Sun School of Business and Management, Donghua University, Shanghai
Keywords: Credit Scoring, Unbalanced Data, Ensemble Learning
Abstract: With the advent of the big data era, credit risk management has become increasingly important in finance. Credit scoring, its core tool, must cope with rapidly growing volumes of customer credit data and with individual credit profiles that change over time. Traditional credit assessment methods lack adaptability and flexibility, especially on unbalanced data. This paper proposes an AdaFocal-XGBoost ensemble credit scoring model for unbalanced data, aiming to improve the accuracy and adaptability of credit risk prediction. The model combines the computational efficiency of XGBoost with the adaptive loss adjustment of AdaFocalLoss and is optimized specifically for the sample imbalance problem. Experiments on four credit datasets from the UCI database (Australian, German, Japan, and Taiwan region) evaluate the performance of AdaFocal-XGBoost and compare it with a range of other credit scoring models. The results show that AdaFocal-XGBoost outperforms the other models on AUC, accuracy, F1 score, Gmean, KS, precision, and recall, and performs particularly well on severely unbalanced datasets. The study verifies the effectiveness of combining ensemble learning with an adaptive loss function and offers a new solution for credit scoring that can help financial institutions improve financing efficiency and control risk exposure.
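Two of the metrics named in the abstract, Gmean and KS, are less familiar than accuracy, so a minimal sketch of how they are commonly computed may help. This is not the authors' evaluation code. It assumes label 1 marks a default (the minority class) and label 0 a good borrower, and the function names `g_mean` and `ks_statistic` and the toy data are hypothetical, made up for illustration only.

```python
import numpy as np


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == 0] == 0)
    return float(np.sqrt(sensitivity * specificity))


def ks_statistic(y_true, y_score):
    """Largest gap between the empirical score CDFs of the two classes."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.unique(y_score)
    pos_scores = y_score[y_true == 1]
    neg_scores = y_score[y_true == 0]
    cdf_pos = np.array([np.mean(pos_scores <= t) for t in thresholds])
    cdf_neg = np.array([np.mean(neg_scores <= t) for t in thresholds])
    return float(np.max(np.abs(cdf_pos - cdf_neg)))


# Toy, made-up data: 8 good borrowers (0) and 2 defaulters (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.40, 0.90])

# Predictions from a 0.5 cutoff, and from a trivial model that never flags a default.
y_pred = (y_score >= 0.5).astype(int)
always_good = np.zeros_like(y_true)

accuracy_trivial = float(np.mean(always_good == y_true))
gmean_trivial = g_mean(y_true, always_good)
gmean_model = g_mean(y_true, y_pred)
ks_model = ks_statistic(y_true, y_score)

print("accuracy of trivial model:", accuracy_trivial)
print("Gmean of trivial model:", gmean_trivial)
print("Gmean at 0.5 cutoff:", gmean_model)
print("KS of the scores:", ks_model)
```

On this toy sample the trivial model still reaches high accuracy while its Gmean collapses to zero, which is why accuracy alone says little on unbalanced data and why metrics such as Gmean and KS are reported alongside it.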
Citation: Guo, K. and Fan, H. Research on AdaFocal-XGBoost Integrated Credit Scoring Model Based on Unbalanced Data [J]. Statistics and Application, 2024, 13(6): 2204-2214. https://doi.org/10.12677/sa.2024.136214
