Insurance Anti-Fraud Forecasting
Abstract: As the insurance industry continues to grow, insurance fraud has become increasingly common. It causes heavy losses for insurers and has a seriously adverse effect on society, so a predictive model is needed to help curb it. This study uses 700 insurance anti-fraud records, each with 39 features and a label indicating whether the record is insurance fraud. First, the data are preprocessed: missing values and duplicate records are handled, date fields are converted, and categorical fields are discretized, to prepare the data for modeling. Because the features may be collinear, variance analysis and correlation analysis are then used to screen out redundant features that are highly correlated with other features, as well as uninformative features that are only weakly correlated with the label. Machine learning models are then built on the processed and screened data to predict insurance fraud. Four common classification models are compared: logistic regression, k-nearest neighbors (KNN), random forest (a Bagging ensemble method), and LightGBM (a Boosting ensemble method). Evaluation on the test set shows that LightGBM has the best overall predictive performance, with a prediction accuracy of 88%, and can serve as a reference model for insurance fraud prediction. Logistic regression and KNN, by contrast, have low prediction accuracy on the fraud cases and classify most records as non-fraudulent, so they are of limited practical use.
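The remark that some models label most records as non-fraudulent is easier to see when overall accuracy is reported next to fraud-class recall. The following is a minimal sketch in plain Python, not the authors' evaluation code. The function name `evaluate`, the 100-record test set, and both prediction vectors are hypothetical and chosen only for illustration; none of the numbers come from the paper.

```python
from collections import Counter


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy plus precision and recall for the positive (fraud) class."""
    counts = Counter(zip(y_true, y_pred))
    tp = sum(n for (t, p), n in counts.items() if t == positive and p == positive)
    fn = sum(n for (t, p), n in counts.items() if t == positive and p != positive)
    fp = sum(n for (t, p), n in counts.items() if t != positive and p == positive)
    tn = sum(n for (t, p), n in counts.items() if t != positive and p != positive)
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is never seen or predicted.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical test set: 100 records, 25 of them fraud (label 1).
y_true = [1] * 25 + [0] * 75

# Hypothetical model A: labels almost everything as non-fraud
# (finds 3 of the 25 fraud records, raises no false alarms).
y_pred_a = [1] * 3 + [0] * 22 + [0] * 75

# Hypothetical model B: finds 20 of the 25 fraud records, with 5 false alarms.
y_pred_b = [1] * 20 + [0] * 5 + [1] * 5 + [0] * 70

report_a = evaluate(y_true, y_pred_a)
report_b = evaluate(y_true, y_pred_b)

for name, report in (("model A", report_a), ("model B", report_b)):
    print(name, {k: round(v, 2) for k, v in report.items()})
```

In this made-up setting, model A still reaches 78% accuracy while detecting only 12% of the fraud records, which is why a per-class metric is worth checking alongside accuracy when fraud is the minority class.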