Research on Loan Default Models Based on Machine Learning
Abstract: The rapid growth of consumer credit and Internet finance has been accompanied by rising loan default risk, and traditional risk-control methods struggle with high-dimensional, nonlinear data. To improve the accuracy and practicality of default prediction, this study builds a prediction pipeline on a million-record-scale loan dataset from the Kaggle platform, covering feature selection, data preprocessing, model training, and optimization. Highly relevant features are selected using Information Value (IV) and the Population Stability Index (PSI). Data cleaning consists of median imputation of missing values, outlier removal, data encoding, and standardization. Hyperparameters of the logistic regression, decision tree, and random forest models are tuned by random search with cross-validation. The area under the ROC curve (AUC) serves as the primary evaluation metric, with accuracy and precision as supplementary metrics. In the experiments, the random forest model performs best of the three (AUC = 0.7149, accuracy = 0.6565, precision = 0.6514). The study offers technical support for financial institutions seeking to optimize credit decisions and reduce non-performing loan rates, and a reference for engineering machine learning models in fintech applications.
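The abstract names IV and PSI as the feature-screening criteria but gives no formulas. The sketch below shows one common way to compute both; it is not the authors' code. The function names, the bin edges, the toy data, and the smoothing constant `EPS` are all illustrative assumptions. Both quantities share the form sum((p - q) * ln(p / q)) over bins: IV compares the per-bin shares of non-defaulters and defaulters, while PSI compares the per-bin shares of a reference sample and a newer sample.

```python
import math

EPS = 1e-6  # assumed smoothing constant; avoids log(0) when a bin is empty


def bin_shares(values, edges):
    """Return the fraction of values in each bin; edges are ascending inner cut points."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        # The number of edges strictly below v is the index of v's bin.
        counts[sum(1 for e in edges if v > e)] += 1
    return [c / len(values) for c in counts]


def symmetric_divergence(p, q):
    """Compute sum((p_i - q_i) * ln(p_i / q_i)) with shares floored at EPS."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = max(a, EPS), max(b, EPS)
        total += (a - b) * math.log(a / b)
    return total


def information_value(feature, defaulted, edges):
    """IV of one feature: non-defaulter (0) shares versus defaulter (1) shares per bin."""
    good = [x for x, y in zip(feature, defaulted) if y == 0]
    bad = [x for x, y in zip(feature, defaulted) if y == 1]
    return symmetric_divergence(bin_shares(good, edges), bin_shares(bad, edges))


def population_stability_index(expected, actual, edges):
    """PSI of one feature: newer-sample shares versus reference-sample shares per bin."""
    return symmetric_divergence(bin_shares(actual, edges), bin_shares(expected, edges))


# Toy data (hypothetical): one numeric feature, a 0/1 default label, one cut point.
feature = [1, 2, 3, 4, 5, 6, 7, 8]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]
edges = [4.5]

iv = information_value(feature, defaulted, edges)  # equals ln(3) for this toy data
psi_same = population_stability_index(feature, feature, edges)  # identical samples
psi_shifted = population_stability_index(feature, [6, 7, 8, 9, 10, 11, 12, 13], edges)
print(iv, psi_same, psi_shifted)
```

Higher IV indicates a feature that separates defaulters from non-defaulters more strongly, and higher PSI indicates a larger shift in the feature's distribution between the two samples. Any cutoffs applied to either value are a modelling choice that the abstract does not specify.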