# Research on Default Prediction of Online Lending Borrowers Based on Machine Learning

• PP. 40-48   DOI: 10.12677/SSEM.2019.81006

The rapid development of the online lending industry has made the shortcomings of traditional risk control increasingly prominent in terms of the timeliness, comprehensiveness and hierarchy of data. Machine learning, now developing rapidly, enables online lending platforms to build intelligent risk control models on multi-dimensional big data, so that they can assess personal credit status more accurately and reduce default risk more effectively. Based on the borrower loan risk data provided by CCX Credit Technology, this paper builds prediction models with Logistic regression, XGBoost and a neural network (NN) and compares the results. XGBoost is highly flexible: it allows custom optimization objectives and evaluation criteria, and it exposes more parameters with a wide tuning range. As a result, the model built on XGBoost predicts the default of online loan borrowers more accurately. This paper also uses an automated tuning tool to traverse all parameter combinations, which greatly simplifies model tuning.

1. Introduction

2. Literature Review

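Tsang et al. (2014) [4] proposed predicting fraudulent behavior with supervised learning methods such as decision trees. Applying variable selection to real online commerce data, they picked the 10 most explanatory variables and built a decision tree model that was more stable and more accurate than one using all variables. Wang Maoguang et al. (2016) [5] built a risk monitoring model centered on the decision tree algorithm and, by tuning the model parameters, proposed an evaluation criterion that reduces the error rate as far as possible while keeping the overall error rate in check. Wang Chenglong et al. (2016) [6] likewise found that decision tree models offer broad applicability, high accuracy and strong interpretability when explaining the causes of loan default, assigning credit grades and lowering the default rate.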

Table 1. Method of loan default prediction

3. Data Analysis

3.1. Data Description

3.2. Problem Analysis

Figure 1. Solution of the problem

3.3. Data Cleaning

3.4. Feature Engineering

$r=\frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2}}\sqrt{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^{2}}}$
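The correlation coefficient above can be computed directly from its definition. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` and the toy values are assumptions made for illustration and do not come from the CCX Credit Technology data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations.
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


# Hypothetical toy data: a scaled loan amount against a 0/1 default flag.
loan_amount = [1.0, 2.0, 3.0, 4.0, 5.0]
default_flag = [0, 0, 1, 0, 1]
r = pearson_r(loan_amount, default_flag)
```

A value of `r` near 0 would suggest that a feature carries little linear information about the target, which is one common reason to drop it during feature screening.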

Table 2. Correlation analysis

4. Model Construction

4.1. Logistic Regression

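Logistic regression is among the simplest, fastest and most robust analysis methods, and it is highly interpretable. Borrower default prediction, however, should be judged by accuracy, and the linear restriction that Logistic regression places on the relationships between variables makes it hard to reach the best accuracy. Its strengths can still be put to use in modeling: 1) as a baseline, for quickly evaluating the effect of data cleaning and model performance; 2) in a weighted combination with structurally different models, to supplement the accuracy and robustness of the original model.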
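The second use, a weighted combination with a structurally different model, can be sketched as a weighted average of predicted default probabilities. This is only an illustration: the function name `blend_predictions`, the weights and the probability values are hypothetical, and the source does not state which weights, if any, were used.

```python
def blend_predictions(predictions, weights):
    """Weighted average of per-model default probabilities for the same borrowers.

    predictions: one list of probabilities per model, all of the same length.
    weights: one non-negative weight per model; they are normalized here.
    """
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all models must score the same borrowers")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n)
    ]


# Hypothetical scores for three borrowers from two models.
logistic_probs = [0.20, 0.60, 0.90]
xgboost_probs = [0.10, 0.70, 0.80]
blended = blend_predictions([logistic_probs, xgboost_probs], weights=[0.3, 0.7])
```

In practice the weights would be chosen on a validation set rather than fixed by hand.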

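Logistic regression maps a linear combination of the input variables to a probability through the sigmoid function: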

$f\left(x\right)=\frac{1}{1+{\text{e}}^{-x}}$

$y=\frac{1}{1+{\text{e}}^{-\left(\alpha +{\beta }_{1}{x}_{1}+{\beta }_{2}{x}_{2}+\cdots \right)}}$
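The two formulas above translate into a few lines of code. This is a minimal sketch, assuming the coefficients $\alpha$ and $\beta_k$ have already been estimated; the coefficient values and the feature vector below are hypothetical and chosen only to show the arithmetic.

```python
import math


def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)), branched so that math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_default_probability(alpha, betas, xs):
    """y = sigmoid(alpha + beta_1 * x_1 + beta_2 * x_2 + ...)."""
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    linear = alpha + sum(b * x for b, x in zip(betas, xs))
    return sigmoid(linear)


# Hypothetical coefficients and one borrower's (already scaled) features.
alpha = -1.0
betas = [0.8, -0.5]
borrower = [2.0, 1.0]
p = predict_default_probability(alpha, betas, borrower)
```

Here the linear term is $-1.0 + 0.8 \cdot 2.0 - 0.5 \cdot 1.0 = 0.1$, so `p` is slightly above 0.5.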

4.2. XGBoost Algorithm

The objective function of XGBoost is:

$L\left(\theta\right)=\sum_{i=1}^{n}l\left(y_i,\hat{y}_i^{\left(t-1\right)}+f_t\left(x_i\right)\right)+\Omega\left(f_t\right)$

$L\left(\theta\right)=-\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i\in I_j}g_i\right)^{2}}{\lambda+\sum_{i\in I_j}h_i}+\gamma T$
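The simplified objective above can be evaluated for a fixed tree structure once the per-sample derivatives are known. The sketch below assumes the notation of the standard XGBoost derivation, which this text does not spell out: $g_i$ and $h_i$ are the first and second derivatives of the loss $l$ with respect to the previous prediction $\hat{y}_i^{(t-1)}$, $I_j$ is the set of samples falling in leaf $j$, $T$ is the number of leaves, and $\lambda$, $\gamma$ are regularization coefficients. The function names and toy values are hypothetical.

```python
def leaf_weight(g_sum, h_sum, lam):
    """Optimal weight of one leaf under the second-order approximation: -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)


def structure_score(grads, hess, leaf_of, n_leaves, lam=1.0, gamma=0.0):
    """Objective of a fixed tree: -1/2 * sum_j G_j^2 / (lambda + H_j) + gamma * T."""
    G = [0.0] * n_leaves  # per-leaf sums of first derivatives
    H = [0.0] * n_leaves  # per-leaf sums of second derivatives
    for g, h, j in zip(grads, hess, leaf_of):
        G[j] += g
        H[j] += h
    score = -0.5 * sum(G[j] ** 2 / (lam + H[j]) for j in range(n_leaves))
    return score + gamma * n_leaves


# For the logistic loss with current predicted probability p:
# g = p - y and h = p * (1 - p).
labels = [1, 0, 1, 0]
probs = [0.5, 0.5, 0.5, 0.5]
grads = [p - y for p, y in zip(probs, labels)]
hess = [p * (1 - p) for p in probs]

# Compare a single leaf with a split that separates defaulters from non-defaulters.
one_leaf = structure_score(grads, hess, [0, 0, 0, 0], n_leaves=1)
two_leaves = structure_score(grads, hess, [0, 1, 0, 1], n_leaves=2)
```

A lower score means a better structure, so the split is preferred here when `gamma` is 0; a larger `gamma` penalizes the extra leaf and can make the single leaf preferable.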

4.3. Neural Network

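XGBoost starts from the assumption that the variables are fully independent, whereas a neural network starts from the assumption that the variables are bound by complex nonlinear relationships, and it keeps optimizing the network weights to approach the true relationships. The two model structures are highly complementary, so this experiment also includes a neural network model. A basic neural network consists of an input layer, hidden layers and an output layer, with adjacent layers connected by weight matrices. As training samples are fed in, the network learns the weight parameters that reduce the fitting error, building a model from the input data and simulating the output. This paper uses indicators such as the borrower's gender, age, education, number of loans, loan amount and repayment record as the inputs of the neural network model; the output of the model is the borrower's probability of default.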
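The structure described above (input layer, hidden layer, output layer, weight matrices updated to reduce the fitting error) can be illustrated with a very small network. This is a sketch under stated assumptions, not the network used in the paper: the single tanh hidden layer, the sigmoid output, the cross-entropy loss, the learning rate, the class name `TinyNet` and the two toy features are all choices made here for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """Input layer -> one tanh hidden layer -> sigmoid output (default probability)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # W1 connects input to hidden layer, w2 connects hidden layer to output.
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Return the hidden activations and the predicted default probability."""
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        p = sigmoid(sum(w * hk for w, hk in zip(self.w2, h)) + self.b2)
        return h, p

    def train_step(self, x, y, lr):
        """One gradient-descent step on the cross-entropy loss of a single sample."""
        h, p = self.forward(x)
        dz2 = p - y  # gradient w.r.t. the output pre-activation
        for k in range(len(self.w2)):
            # Back-propagate through tanh before w2[k] is updated.
            dz1 = dz2 * self.w2[k] * (1.0 - h[k] ** 2)
            self.w2[k] -= lr * dz2 * h[k]
            for j in range(len(x)):
                self.W1[k][j] -= lr * dz1 * x[j]
            self.b1[k] -= lr * dz1
        self.b2 -= lr * dz2


def mean_loss(net, X, Y):
    """Average cross-entropy loss over a data set."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(X, Y):
        _, p = net.forward(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(X)


# Hypothetical, already scaled features: [number of past loans, share of late repayments].
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
Y = [1, 1, 0, 0]  # 1 = default

net = TinyNet(n_inputs=2, n_hidden=3)
loss_before = mean_loss(net, X, Y)
for _ in range(500):
    for x, y in zip(X, Y):
        net.train_step(x, y, lr=0.5)
loss_after = mean_loss(net, X, Y)
```

Real borrower features such as gender or education would first need to be encoded numerically and scaled before being fed to a network like this.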

4.4. Parameter Optimization

Table 3. List of best parameters

Table 4. List of best parameters

5. Experimental Results

$AUC=\frac{1}{2}\sum_{i=1}^{m-1}\left(x_{i+1}-x_i\right)\left(y_i+y_{i+1}\right)$
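The AUC formula above is the trapezoidal rule applied to consecutive points of a curve. The sketch below assumes, as is usual for ROC analysis but not stated explicitly here, that $x_i$ is the false positive rate and $y_i$ the true positive rate at the $i$-th threshold, with points sorted by increasing $x$. The function names and the toy scores are hypothetical.

```python
def auc_trapezoid(xs, ys):
    """AUC = 1/2 * sum_i (x_{i+1} - x_i) * (y_i + y_{i+1}) over consecutive points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) points")
    return 0.5 * sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) for i in range(len(xs) - 1)
    )


def roc_points(labels, scores):
    """ROC points from lowering the threshold one sample at a time.

    Assumes binary labels (1 = default), both classes present, and no tied scores.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


# Hypothetical predicted default probabilities for four borrowers.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
fpr, tpr = roc_points(labels, scores)
auc = auc_trapezoid(fpr, tpr)  # 0.75: three of the four positive/negative pairs are ranked correctly
```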

Table 5. Comparison of modeling results

6. Conclusion

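[1] Correa Bahnsen, A. (2016) Feature Engineering Strategies for Credit Card Fraud Detection. Expert Systems with Applications, 51, 134-142. https://doi.org/10.1016/j.eswa.2015.12.030

[2] Xiong Zhengde, Liu Zhenxuan, Xiong Yipeng. Research on Default Risk of Internet Finance Customers Based on the Ordered Logistic Model [J]. Systems Engineering, 2017, 35(8): 29-38.

[3] Ruan Sumei, Zhou Zelin. Identification and Prediction of Credit Default in P2P Online Lending Based on the L1-Penalized Logit Model [J]. Finance and Trade Research, 2018, 29(2): 54-63.

[4] Tsang, S. (2014) Detecting Online Auction Shilling Frauds Using Supervised Learning. Expert Systems with Applications, 41, 3027-3040. https://doi.org/10.1016/j.eswa.2013.10.033

[5] Wang Maoguang, Ge Leilei, Zhao Jiangping. Research on Risk Monitoring of Small-Loan Online Lending Platforms Based on the C5.0 Algorithm [J]. Chinese Journal of Management Science, 2016, 24(S1): 345-352.

[6] Wang Chenglong, Chen Cheng. Research on a Credit Rating System for P2P Online Lending Platforms Based on Decision Trees [J]. Rural Finance Research, 2016(12): 45-50.

[7] Desai, V.S., Crook, J.N. and Overstreet, G.A. (1996) A Comparison of Neural Network and Linear Scoring Models in the Credit Union Environment. European Journal of Operational Research, 95, 24-37. https://doi.org/10.1016/0377-2217(95)00246-4

[8] Wu Bin, Ye Jingjing, Dong Min. Research on a Personal Credit Risk Assessment Model for P2P Online Lending: A Hybrid Fruit Fly Neural Network Approach [J]. Friends of Accounting, 2017(21): 32-35.

[9] Li Xin, Dai Yicheng. Research on Credit Risk Assessment of P2P Online Lending Borrowers Based on the BP Neural Network [J]. Wuhan Finance, 2018(2): 33-37.

[10] Fernandez-Delgado, M., Cernadas, E., Barro, S., et al. (2014) Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? The Journal of Machine Learning Research, 15, 3133-3181.

[11] Zhang Ningjing, Gu Xin, Yang Cheng. An Analysis of Indicators of Personal Default Risk Factors in P2P Campus Loans [J]. Finance and Accounting Monthly, 2018(6): 82-89.

[12] Jiang Cuiqing, Wang Ruiya, Ding Yong. A Default Prediction Method for P2P Online Lending Incorporating Soft Information [J]. Chinese Journal of Management Science, 2017, 25(11): 12-21.

[13] Ding Lan, Luo Pinliang. Research on Early Warning of Default Risk in P2P Online Lending Based on a Stacking Ensemble Strategy [J]. Review of Investment Studies, 2017, 36(4): 41-54.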